上篇 blog 介绍了 PG 中文全文检索(之一): zhparser 安装及使用 , 这篇 blog 准备介绍 nlpbamboo,虽然这个项目已经不更新了,也准备安装测试下。
安装 Cmake
[pg94@db1 soft_bak]# yum install cmake
安装 CRF
下载CRFhttps://code.google.com/p/crfpp/downloads/detail?name=CRF%2B%2B-0.57.tar.gz&can=2&q=
解压安装
1 2 3 4 5
[pg94@db1 CRF++-0.57 ]# tar xvf CRF++-0.57.tar.gz [pg94@db1 CRF++-0.57 ]# cd CRF++-0.57 [pg94@db1 CRF++-0.57 ]# ./configure [pg94@db1 CRF++-0.57 ]# make [pg94@db1 CRF++-0.57 ]# make install
安装 Nlpbamboo
下载 bamboohttps://code.google.com/p/nlpbamboo/downloads/list
解压安装
1 2 3 4 5 6 7
[pg94@db1 soft_bak]# tar xvf nlpbamboo-1.1.2.tar.bz2 [pg94@db1 soft_bak]# cd nlpbamboo-1.1.2 [pg94@db1 nlpbamboo-1.1 .2 ]# mkdir build [pg94@db1 build]# cd build/ [pg94@db1 build]# cmake .. -DCMAKE_BUILD_TYPE=release [pg94@db1 build]# make all [pg94@db1 build]# make install
备注:根据日志, 安装目录为 /opt/bamboo/。
下载分词数据库文件https://code.google.com/p/nlpbamboo/downloads/list
解压 index.tar.bz2 到 /opt/bamboo 目录
1 2 3 4 5 6 7 8 9 10 11 12
[pg94@db1 bamboo] [pg94@db1 bamboo] [pg94@db1 bamboo] total 24 drwxr-xr-x 2 root root 4096 Jan 7 11:20 bin drwxr-xr-x 2 root root 4096 Jan 7 11:20 etc drwxr-xr-x 4 root root 4096 Jan 7 11:20 exts drwxr-sr-x 2 28852 users 4096 Apr 1 2009 index drwxr-xr-x 2 root root 4096 Jan 7 11:20 processor drwxr-xr-x 2 root root 4096 Jan 7 11:20 template [pg94@db1 bamboo]
创建中文索引停止词(干扰词)
1
[pg94 @db1 bamboo]# touch /opt/pgsql_9 .1. 14 /share/tsearch_data/chinese_utf8 .stop
编译安装
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
[pg94@db1 bamboo]# cd /opt /bamboo/exts/postgres/pg_tokenize [pg94@db1 pg_tokenize]# source /home/pg93/.bash_profile pg94@db1-> make pg94@db1-> make install /bin/mkdir -p '/opt/pgsql_9.4beta3/lib' /bin/mkdir -p '/opt/pgsql_9.4beta3/share/contrib' /usr/bin/install -c -m 755 pg_tokenize.so '/opt/pgsql_9.4beta3/lib/pg_tokenize.so' /usr/bin/install -c -m 644 uninstall_pg_tokenize.sql pg_tokenize.sql '/opt/pgsql_9.4beta3/share/contrib/' pg94@db1-> cd /opt /bamboo/exts/postgres/chinese_parser pg94@db1-> make pg94@db1-> make install /bin/mkdir -p '/opt/pgsql_9.4beta3/lib' /bin/mkdir -p '/opt/pgsql_9.4beta3/share/contrib' /usr/bin/install -c -m 755 chinese_parser.so '/opt/pgsql_9.4beta3/lib/chinese_parser.so' /usr/bin/install -c -m 644 uninstall_chinese_parser.sql chinese_parser.sql '/opt/pgsql_9.4beta3/share/contrib/'
导入分词模块
1 2 3 4 5 6 7 8 9 10 11 12 13 14
pg94@db1-> cd /opt/pgsql_9.4beta3/share/contrib pg94@db1-> psql -h 127.0.0.1 -d francs -f pg_tokenize.sql SET CREATE FUNCTION pg94@db1-> psql -h 127.0 .0 .1 -d francs -f chinese_parser.sql SET CREATE FUNCTION CREATE FUNCTION CREATE FUNCTION CREATE FUNCTION CREATE TEXT SEARCH PARSER CREATE TEXT SEARCH CONFIGURATION CREATE TEXT SEARCH DICTIONARY ALTER TEXT SEARCH CONFIGURATION
功能测试
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
[pg94@db1 contrib]$ psql db_test psql (9.3 .3 ) Type "help" for help . francs=> /dF chinesecfg List of text search configurations Schema | Name | Description ------------+------------+------------- pg_catalog | chinesecfg | (1 row) db_test=# select tokenize ('山坑休闲钓鱼山庄' ); tokenize ---------------------- 山坑 休闲 钓鱼 山庄 (1 row) db_test=# SELECT to_tsvector('chinesecfg' , '山坑休闲钓鱼山庄' ); to_tsvector ------------------------------------- '休闲' :2 '山坑' :1 '山庄' :4 '钓鱼' :3 (1 row) francs=> SELECT to_tsvector('chinesecfg' , '咬不得生煎' ); to_tsvector --------------------- '咬不得' :1 '生煎' :2 (1 row)
性能测试
测试环境 IBM BladeCenter HS22 8 核8G CentOS release 6.5 PostgreSQL 9.4beta3 cmake 2.8.12.2 CRF++-0.57 nlpbamboo-1.1.2
测试表: 300 万数据
1 2 3 4 5
francs=> select count(* ) from tbl_store count --------- 2994433 (1 row)
创建 GIN 索引
1 2 3
francs=>create index idx_gin_tbl_store_name on tbl_store using gin(to_tsvector('chinesecfg',name)); CREATE INDEX Time: 147735.647 ms
sql
1 2 3 4 5 6 7 8 9 10 11
francs=> explain analyze select name,type ,city from tbl_store where to_tsvector('chinesecfg' ,name) @@ to_tsquery('chinesecfg' ,'易买得长江店' ); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on tbl_store (cost=68.00 ..72 .02 rows =1 width=44 ) (actual time=0.431 ..0 .434 rows =2 loops=1 ) Recheck Cond: (to_tsvector('chinesecfg' ::regconfig , (name)::text ) @@ '' '易' ' & ' '买' ' & ' '得' ' & ' '长江店' '' ::tsquery ) Heap Blocks: exact=2 -> Bitmap Index Scan on idx_gin_tbl_store_name (cost=0.00 ..68 .00 rows =1 width=0 ) (actual time=0.418 ..0 .418 rows =2 loops=1 ) Index Cond: (to_tsvector('chinesecfg' ::regconfig , (name)::text ) @@ '' '易' ' & ' '买' ' & ' '得' ' & ' '长江店' '' ::tsquery ) Planning time: 0.361 ms Execution time: 0.493 ms (7 rows )
备注:性能非常不错。
常见错误
以下总结了 nlpbamboo 安装过程中的典型错误,如下:
错误一: 导入 pg_tokenize.sql 报错
1 2
[pg94@db1 contrib]$ psql -h 127.0 .0.1 -p 1921 -d francs -U postgres -f pg_tokenize.sql psql: pg_tokenize.sql: 2 : ERROR: could not load library "/opt/pgsql_9.4beta3/lib/pg_tokenize.so" : /opt/pgsql _9.4 beta3/lib /pg_tokenize .so : undefined symbol : DirectFunctionCall1
解决方法: 修改 Makefile 的 SHLIB_LINK 这行
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
cd /opt/bamboo/exts/postgres/pg_tokenize [pg94@db1 pg_tokenize] MODULE_big = pg_tokenize OBJS = pg_tokenize.o DATA_built = pg_tokenize.sql DATA = uninstall_pg_tokenize.sql PG_CONFIG = pg_config PGXS := $(shell $(PG_CONFIG) --pgxs) PG_CPPFLAGS += -DNDEBUG -I/usr/include /bamboo SHLIB_LINK = -L/usr/lib -I /opt/soft_bak/postgresql-9.4beta3/src/include -lbamboo -lstdc++ include $(PGXS)
备注:之后重新编译安装即可。
错误二: 导入 pg_tokenize.sql 报错
1 2 3
pg94@db 1-> psql -h 127.0 .0 .1 -d francs -f pg_tokenize.sql SET psql: pg_tokenize.sql: 2 : ERROR: can not init tokenize
解决方法 解压 index.tar.bz2 到 /opt/bamboo 目录。
[pg94@db1 bamboo]# ll
total 24
drwxr-xr-x 2 root root 4096 Jan 7 12:13 bin
drwxr-xr-x 2 root root 4096 Jan 7 12:13 etc
drwxr-xr-x 4 root root 4096 Jan 7 12:13 exts
drwxr-sr-x 2 28852 users 4096 Apr 1 2009 index
drwxr-xr-x 2 root root 4096 Jan 7 12:13 processor
drwxr-xr-x 2 root root 4096 Jan 7 12:13 template
参考
原创文章,作者:奋斗,如若转载,请注明出处:https://blog.ytso.com/239616.html