PostgreSQL的中文全文检索

http://my.oschina .NET/Kenyon/blog/82305

http://www.54chen.com/_linux_/postgresql-bamboo-lucene-part2.html

上一篇介绍了postgresql全文检索的环境和一些示例，http://my.oschina .Net/Kenyon/blog/80904，都是基于其自带的模式，目前版本默认并不支持中文的全文检索，但是我们的实际使用过程中肯定会有用到中文的检索，好在有强大的社区支持，结合第三方工具可以简单实现PG的中文全文检索。

PG的中文全文检索步骤也主要分三步走：
1.将中文分词
2.转换分词,去掉无意义分词
3.按一定顺序排序，建索引加快查询

一、使用到的测试环境与工具
VMWARE 6.0
PostgreSQL 9.1.2
CRF++-0.57 下载地址：http://crfpp.googlecode.com/svn/trunk/doc/index.html
nlpbamboo-1.1.2 下载地址：http://code.google.com/p/nlpbamboo/downloads/list
index.tar.bz2 下载地址：http://code.google.com/p/nlpbamboo/downloads/list

二、部署过程(root用户)
1.先安装CRF

cd CRF++-0.57./configuremakemake install

2.安装nlpbamboo

cd nlpbamboomkdir buildcd buildcmake .. -DCMAKE_BUILD_TYPE=releasemake allmake install

3.下载分词数据库文件
下载index.tar.bz2,解压到/opt/bamboo/index

4.查看
安装完了后，到默认的安装路径下查看软件的安装情况,主要的默认路径
/usr/lib
/usr/include/
/opt/bamboo/

[postgres@localhost ~]$ cd /usr/local/lib[postgres@localhost lib]$ lltotal 788-rw-r--r--. 1 root root 516882 Sep  3 19:57 libcrfpp.a-rwxr-xr-x. 1 root root    952 Sep  3 19:57 libcrfpp.lalrwxrwxrwx. 1 root root     17 Sep  3 19:57 libcrfpp.so -> libcrfpp.so.0.0.0lrwxrwxrwx. 1 root root     17 Sep  3 19:57 libcrfpp.so.0 -> libcrfpp.so.0.0.0-rwxr-xr-x. 1 root root 280760 Sep  3 19:57 libcrfpp.so.0.0.0[postgres@localhost lib]$ cd /usr/lib[postgres@localhost lib]$ ll lib*-rw-r--r--. 1 root root 1027044 Sep  3 20:02 libbamboo.alrwxrwxrwx. 1 root root      14 Sep  3 20:03 libbamboo.so -> libbamboo.so.2-rwxr-xr-x. 1 root root  250140 Sep  3 20:02 libbamboo.so.2lrwxrwxrwx. 1 root root      25 Sep  3 23:56 libcrfpp.a -> /usr/local/lib/libcrfpp.alrwxrwxrwx. 1 root root      26 Sep  3 23:56 libcrfpp.so -> /usr/local/lib/libcrfpp.solrwxrwxrwx. 1 root root      28 Sep  3 23:56 libcrfpp.so.0 -> /usr/local/lib/libcrfpp.so.0[postgres@localhost bamboo]$ cd /opt/bamboo/[postgres@localhost bamboo]$ lltotal 17412drwxr-xr-x. 2 postgres postgres     4096 Sep  3 20:03 bindrwxr-xr-x. 2 postgres postgres     4096 Aug 15 01:52 etcdrwxr-xr-x. 4 postgres postgres     4096 Aug 15 01:52 extsdrwxr-sr-x. 2 postgres postgres     4096 Apr  1  2009 index-rw-r--r--. 1 postgres postgres 17804377 Sep  3 23:52 index.tar.bz2drwxr-xr-x. 2 postgres postgres     4096 Sep  3 20:03 processordrwxr-xr-x. 2 postgres postgres     4096 Aug 15 01:52 template

5.编辑中文检索干扰词汇
编辑该词汇是为了减少一些无意义的词汇被检索出来，比如'a',‘的','得'等

[postgres@localhost tsearch_data]$touch /usr/share/postgresql/8.4/tsearch_data/chinese_utf8.stop[postgres@localhost tsearch_data]$ pwd/home/postgres/share/tsearch_data[postgres@localhost tsearch_data]$ more chinese_utf8.stop 的我我们

6.编译

cd /opt/bamboo/exts/postgres/pg_tokenizemakemake installcd /opt/bamboo/exts/postgres/chinese_parsermakemake install

7.导入分词函数和分词模块

[postgres@localhost ~]$ psqlpostgres=# \i /home/postgres/share/contrib/pg_tokenize.sqlSETCREATE FUNCTIONpostgres=#  \i /home/postgres/share/contrib/chinese_parser.sqlSETCREATE FUNCTIONCREATE FUNCTIONCREATE FUNCTIONCREATE FUNCTIONCREATE TEXT SEARCH PARSERCREATE TEXT SEARCH CONFIGURATIONCREATE TEXT SEARCH DICTIONARYALTER TEXT SEARCH CONFIGURATION

8.安装完图形化展示

9.不同的DB安装中文分词
假如在一台机子上同时有多个DB，则只需要把分词函数和分词模版在新库里导入一下即可。

三、应用
1.使用tokens测试中文分词效果

[postgres@localhost ~]$ psql -p 5432psql (9.1.2)Type "help" for help.postgres=# select tokenize('中文词分浙江人民海量的人');            tokenize            ---------------------------------中文词 分 浙江 人民 海量 的 人(1 row)postgres=# SELECT to_tsvector('chinesecfg', '我爱北京天安门');            to_tsvector           -----------------------------------'北京':3 '天安门':4 '我':1 '爱':2(1 row)postgres=# select tokenize('南京市长江大桥');     tokenize     -------------------南京市 长江 大桥(1 row)postgres=# select tokenize('南京市长');  tokenize ------------南京 市长(1 row)

有一个比较好的分词效果，最明显的是南京市长江大桥，并没有被分成南京,市长,江大桥之类的。

2.使用一个普通测试表，新建一个tsvector列用来存放分词数据

ALTER TABLE t_store_adv add column index_col_ts tsvector;UPDATE t_store_adv SET index_col_ts =to_tsvector('chinesecfg', coalesce(adv_title,'') || ' ' || coalesce(adv_content,''));

3.建立索引

CREATE INDEX t_store_adv_idx ON t_store_adv USING gin(index_col_ts);

4.查询

[postgres@localhost ~]$ psql  -p 5432psql (9.1.2)Type "help" for help.postgres=# select count(1) from t_store_adv; count -------  38803(1 row)postgres=# SELECT count(1) FROM t_store_adv WHERE index_col_ts @@ to_tsquery('南京'); count -------    16(1 row)postgres=# explain SELECT count(1) FROM t_store_adv WHERE index_col_ts @@ to_tsquery('南京');                                      QUERY PLAN                                      -------------------------------------------------------------------------------------- Aggregate  (cost=108.61..108.62 rows=1 width=0)   ->  Bitmap Heap Scan on t_store_adv  (cost=12.21..108.55 rows=27 width=0)         Recheck Cond: (index_col_ts @@ to_tsquery('南京'::text))         ->  Bitmap Index Scan on t_store_adv_idx  (cost=0.00..12.21 rows=27 width=0)               Index Cond: (index_col_ts @@ to_tsquery('南京'::text))(5 rows)--普通的文本检索postgres=# select count(1) from t_store_adv where (adv_content like '%南京%' or adv_title like '%南京%'); count -------    17(1 row)postgres=# explain select count(1) from t_store_adv where (adv_content like '%南京%' or adv_title like '%南京%');                                             QUERY PLAN                                             ---------------------------------------------------------------------------------------------------- Aggregate  (cost=1348.05..1348.06 rows=1 width=0)   ->  Seq Scan on t_store_adv  (cost=0.00..1348.05 rows=1 width=0)         Filter: (((adv_content)::text ~~ '%南京%'::text) OR ((adv_title)::text ~~ '%南京%'::text))(3 rows)

本次测试的数据量不是很大，但从执行计划上可见一斑，所消耗的资源是要少很多的，当然存储会消耗多一点，数据量大的情况下，索引检索的效率也能看出有很大的提升，具体可参考一个例子：http://www.oschina.net/question/96003_19020

四、总结：
示例中略去了使用触发器来更新tsvector列。使用中文全文检索可以有效提升中文检索速度，只是目前还不是内置的，需要借助第三方工具手工安装一下，选择的分词方案也比较多，可以择优选择。

五、参考：
http://www.cnblogs.com/shuaixf/archive/2011/09/10/2173260.html
http://www.54chen.com/_linux_/postgresql-bamboo-lucene-part2.html

本站仅提供存储服务，所有内容均由用户发布，如发现有害或侵权内容，请点击举报。