Source: http://my.oschina.net/Kenyon/blog/82305
Reference: http://www.54chen.com/_linux_/postgresql-bamboo-lucene-part2.html
The previous post (http://my.oschina.net/Kenyon/blog/80904) covered the environment and some examples for PostgreSQL full-text search, all based on the built-in configurations. Current versions do not support Chinese full-text search out of the box, yet real-world use inevitably calls for Chinese retrieval. Fortunately, with strong community support, Chinese full-text search in PG can be implemented fairly easily by combining third-party tools.
1. Install CRF++

cd CRF++-0.57
./configure
make
make install

2. Install nlpbamboo
cd nlpbamboo
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=release
make all
make install

3. Download the segmentation index files
[postgres@localhost ~]$ cd /usr/local/lib
[postgres@localhost lib]$ ll
total 788
-rw-r--r--. 1 root root  516882 Sep  3 19:57 libcrfpp.a
-rwxr-xr-x. 1 root root     952 Sep  3 19:57 libcrfpp.la
lrwxrwxrwx. 1 root root      17 Sep  3 19:57 libcrfpp.so -> libcrfpp.so.0.0.0
lrwxrwxrwx. 1 root root      17 Sep  3 19:57 libcrfpp.so.0 -> libcrfpp.so.0.0.0
-rwxr-xr-x. 1 root root  280760 Sep  3 19:57 libcrfpp.so.0.0.0
[postgres@localhost lib]$ cd /usr/lib
[postgres@localhost lib]$ ll lib*
-rw-r--r--. 1 root root 1027044 Sep  3 20:02 libbamboo.a
lrwxrwxrwx. 1 root root      14 Sep  3 20:03 libbamboo.so -> libbamboo.so.2
-rwxr-xr-x. 1 root root  250140 Sep  3 20:02 libbamboo.so.2
lrwxrwxrwx. 1 root root      25 Sep  3 23:56 libcrfpp.a -> /usr/local/lib/libcrfpp.a
lrwxrwxrwx. 1 root root      26 Sep  3 23:56 libcrfpp.so -> /usr/local/lib/libcrfpp.so
lrwxrwxrwx. 1 root root      28 Sep  3 23:56 libcrfpp.so.0 -> /usr/local/lib/libcrfpp.so.0
[postgres@localhost bamboo]$ cd /opt/bamboo/
[postgres@localhost bamboo]$ ll
total 17412
drwxr-xr-x. 2 postgres postgres     4096 Sep  3 20:03 bin
drwxr-xr-x. 2 postgres postgres     4096 Aug 15 01:52 etc
drwxr-xr-x. 4 postgres postgres     4096 Aug 15 01:52 exts
drwxr-sr-x. 2 postgres postgres     4096 Apr  1  2009 index
-rw-r--r--. 1 postgres postgres 17804377 Sep  3 23:52 index.tar.bz2
drwxr-xr-x. 2 postgres postgres     4096 Sep  3 20:03 processor
drwxr-xr-x. 2 postgres postgres     4096 Aug 15 01:52 template

5. Edit the Chinese stop-word list
[postgres@localhost tsearch_data]$ touch /usr/share/postgresql/8.4/tsearch_data/chinese_utf8.stop
[postgres@localhost tsearch_data]$ pwd
/home/postgres/share/tsearch_data
[postgres@localhost tsearch_data]$ more chinese_utf8.stop
的
我
我们

6. Compile the PostgreSQL extensions
cd /opt/bamboo/exts/postgres/pg_tokenize
make
make install
cd /opt/bamboo/exts/postgres/chinese_parser
make
make install

7. Load the tokenize function and the Chinese parser
[postgres@localhost ~]$ psql
postgres=# \i /home/postgres/share/contrib/pg_tokenize.sql
SET
CREATE FUNCTION
postgres=# \i /home/postgres/share/contrib/chinese_parser.sql
SET
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE TEXT SEARCH PARSER
CREATE TEXT SEARCH CONFIGURATION
CREATE TEXT SEARCH DICTIONARY
ALTER TEXT SEARCH CONFIGURATION

8. Demonstration after installation
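Before moving on, it can help to confirm that the scripts actually registered the new objects. A quick sanity check against the system catalogs (the name chinesecfg is the configuration used in the demos below; your parser name may differ depending on the chinese_parser.sql script):

```sql
-- The new configuration ('chinesecfg') should appear here
SELECT cfgname FROM pg_ts_config;

-- The newly created parser should appear here
SELECT prsname FROM pg_ts_parser;
```

The psql meta-commands \dF and \dFp show the same information in a friendlier layout.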
[postgres@localhost ~]$ psql -p 5432
psql (9.1.2)
Type "help" for help.
postgres=# select tokenize('中文词分浙江人民海量的人');
            tokenize
---------------------------------
 中文词 分 浙江 人民 海量 的 人
(1 row)

postgres=# SELECT to_tsvector('chinesecfg', '我爱北京天安门');
            to_tsvector
-----------------------------------
 '北京':3 '天安门':4 '我':1 '爱':2
(1 row)

postgres=# select tokenize('南京市长江大桥');
     tokenize
-------------------
 南京市 长江 大桥
(1 row)

postgres=# select tokenize('南京市长');
  tokenize
------------
 南京 市长
(1 row)

The segmentation quality is quite good. The clearest case is 南京市长江大桥 (Nanjing Yangtze River Bridge), which is not mis-split into 南京 / 市长 / 江大桥 or the like.
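Once segmentation works, the ordinary tsquery operators apply to Chinese text as well. A small sketch (passing 'chinesecfg' explicitly on both sides, so the result does not depend on default_text_search_config):

```sql
-- Both terms must be present (&): matches, since the tsvector
-- contains both '北京' and '天安门'
SELECT to_tsvector('chinesecfg', '我爱北京天安门')
       @@ to_tsquery('chinesecfg', '北京 & 天安门');

-- Either term suffices (|): also matches, via '北京'
SELECT to_tsvector('chinesecfg', '我爱北京天安门')
       @@ to_tsquery('chinesecfg', '上海 | 北京');
```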
ALTER TABLE t_store_adv ADD COLUMN index_col_ts tsvector;
UPDATE t_store_adv SET index_col_ts =
    to_tsvector('chinesecfg', coalesce(adv_title,'') || ' ' || coalesce(adv_content,''));

3. Create the index
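Rather than re-running the UPDATE by hand whenever rows change, the column can be kept current with PostgreSQL's built-in tsvector_update_trigger. A sketch using the same table and column names as above (it assumes the chinesecfg configuration was created in the public schema; the trigger name is arbitrary):

```sql
-- Recompute index_col_ts from adv_title and adv_content
-- on every INSERT or UPDATE of t_store_adv
CREATE TRIGGER t_store_adv_ts_update
    BEFORE INSERT OR UPDATE ON t_store_adv
    FOR EACH ROW
    EXECUTE PROCEDURE tsvector_update_trigger(
        index_col_ts, 'public.chinesecfg', adv_title, adv_content);
```

Note that tsvector_update_trigger treats NULL column values as empty strings, so the coalesce() calls from the manual UPDATE are not needed here.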
CREATE INDEX t_store_adv_idx ON t_store_adv USING gin(index_col_ts);

4. Query
[postgres@localhost ~]$ psql -p 5432
psql (9.1.2)
Type "help" for help.
postgres=# select count(1) from t_store_adv;
 count
-------
 38803
(1 row)

postgres=# SELECT count(1) FROM t_store_adv WHERE index_col_ts @@ to_tsquery('南京');
 count
-------
    16
(1 row)

postgres=# explain SELECT count(1) FROM t_store_adv WHERE index_col_ts @@ to_tsquery('南京');
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Aggregate  (cost=108.61..108.62 rows=1 width=0)
   ->  Bitmap Heap Scan on t_store_adv  (cost=12.21..108.55 rows=27 width=0)
         Recheck Cond: (index_col_ts @@ to_tsquery('南京'::text))
         ->  Bitmap Index Scan on t_store_adv_idx  (cost=0.00..12.21 rows=27 width=0)
               Index Cond: (index_col_ts @@ to_tsquery('南京'::text))
(5 rows)

--plain pattern-match retrieval
postgres=# select count(1) from t_store_adv where (adv_content like '%南京%' or adv_title like '%南京%');
 count
-------
    17
(1 row)

postgres=# explain select count(1) from t_store_adv where (adv_content like '%南京%' or adv_title like '%南京%');
                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
 Aggregate  (cost=1348.05..1348.06 rows=1 width=0)
   ->  Seq Scan on t_store_adv  (cost=0.00..1348.05 rows=1 width=0)
         Filter: (((adv_content)::text ~~ '%南京%'::text) OR ((adv_title)::text ~~ '%南京%'::text))
(3 rows)

The test data set here is not large, but the execution plans already tell the story: the full-text index consumes far fewer resources, at the cost of somewhat more storage. With larger data volumes the gain in retrieval efficiency becomes even more pronounced; for a concrete example see http://www.oschina.net/question/96003_19020
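Beyond boolean matching, full-text search can also order results by relevance with ts_rank, something a LIKE scan cannot do. A sketch against the same table (the LIMIT is just for illustration):

```sql
-- Top 10 ads mentioning 南京, best matches first
SELECT adv_title,
       ts_rank(index_col_ts, q) AS rank
FROM   t_store_adv,
       to_tsquery('chinesecfg', '南京') AS q
WHERE  index_col_ts @@ q
ORDER  BY rank DESC
LIMIT  10;
```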