PostgreSQL 中文全文检索(之二):Nlpbamboo 安装及使用

上篇 blog 介绍了 PG 中文全文检索(之一): zhparser 安装及使用 , 这篇 blog 准备介绍 nlpbamboo,虽然这个项目已经不更新了,也准备安装测试下。

安装 Cmake

[pg94@db1 soft_bak]# yum install cmake

安装 CRF

下载CRF
https://code.google.com/p/crfpp/downloads/detail?name=CRF%2B%2B-0.57.tar.gz&can=2&q=

解压安装

1
2
3
4
5
[pg94@db1 CRF++-0.57]# tar xvf CRF++-0.57.tar.gz   
[pg94@db1 CRF++-0.57]# cd CRF++-0.57
[pg94@db1 CRF++-0.57]# ./configure
[pg94@db1 CRF++-0.57]# make
[pg94@db1 CRF++-0.57]# make install

安装 Nlpbamboo

下载 bamboo
https://code.google.com/p/nlpbamboo/downloads/list

解压安装

1
2
3
4
5
6
7
[pg94@db1 soft_bak]# tar xvf nlpbamboo-1.1.2.tar.bz2   
[pg94@db1 soft_bak]# cd nlpbamboo-1.1.2
[pg94@db1 nlpbamboo-1.1.2]# mkdir build
[pg94@db1 build]# cd build/
[pg94@db1 build]# cmake .. -DCMAKE_BUILD_TYPE=release
[pg94@db1 build]# make all
[pg94@db1 build]# make install

备注:根据日志, 安装目录为 /opt/bamboo/。

下载分词数据库文件
https://code.google.com/p/nlpbamboo/downloads/list

解压 index.tar.bz2 到 /opt/bamboo 目录

1
2
3
4
5
6
7
8
9
10
11
12
[pg94@db1 bamboo]# cd /opt/bamboo/  
[pg94@db1 bamboo]# tar jxvf index.tar.bz2

[pg94@db1 bamboo]# ll
total 24
drwxr-xr-x 2 root root 4096 Jan 7 11:20 bin
drwxr-xr-x 2 root root 4096 Jan 7 11:20 etc
drwxr-xr-x 4 root root 4096 Jan 7 11:20 exts
drwxr-sr-x 2 28852 users 4096 Apr 1 2009 index
drwxr-xr-x 2 root root 4096 Jan 7 11:20 processor
drwxr-xr-x 2 root root 4096 Jan 7 11:20 template
[pg94@db1 bamboo]#

创建中文索引停止词(干扰词)

1
[pg94@db1 bamboo]# touch /opt/pgsql_9.1.14/share/tsearch_data/chinese_utf8.stop

编译安装

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
[pg94@db1 bamboo]# cd /opt/bamboo/exts/postgres/pg_tokenize  
[pg94@db1 pg_tokenize]# source /home/pg93/.bash_profile
pg94@db1-> make
pg94@db1-> make install
/bin/mkdir -p '/opt/pgsql_9.4beta3/lib'
/bin/mkdir -p '/opt/pgsql_9.4beta3/share/contrib'
/usr/bin/install -c -m 755 pg_tokenize.so '/opt/pgsql_9.4beta3/lib/pg_tokenize.so'
/usr/bin/install -c -m 644 uninstall_pg_tokenize.sql pg_tokenize.sql '/opt/pgsql_9.4beta3/share/contrib/'

pg94@db1-> cd /opt/bamboo/exts/postgres/chinese_parser
pg94@db1-> make
pg94@db1-> make install
/bin/mkdir -p '/opt/pgsql_9.4beta3/lib'
/bin/mkdir -p '/opt/pgsql_9.4beta3/share/contrib'
/usr/bin/install -c -m 755 chinese_parser.so '/opt/pgsql_9.4beta3/lib/chinese_parser.so'
/usr/bin/install -c -m 644 uninstall_chinese_parser.sql chinese_parser.sql '/opt/pgsql_9.4beta3/share/contrib/'

导入分词模块

1
2
3
4
5
6
7
8
9
10
11
12
13
14
pg94@db1-> cd /opt/pgsql_9.4beta3/share/contrib  
pg94@db1-> psql -h 127.0.0.1 -d francs -f pg_tokenize.sql
SET
CREATE FUNCTION
pg94@db1-> psql -h 127.0.0.1 -d francs -f chinese_parser.sql
SET
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE TEXT SEARCH PARSER
CREATE TEXT SEARCH CONFIGURATION
CREATE TEXT SEARCH DICTIONARY
ALTER TEXT SEARCH CONFIGURATION

功能测试

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
[pg94@db1 contrib]$ psql db_test  
psql (9.3.3)
Type "help" for help.

francs=> /dF chinesecfg
List of text search configurations
Schema | Name | Description
------------+------------+-------------
pg_catalog | chinesecfg |
(1 row)


db_test=# select tokenize('山坑休闲钓鱼山庄');
tokenize
----------------------
山坑 休闲 钓鱼 山庄
(1 row)

db_test=# SELECT to_tsvector('chinesecfg', '山坑休闲钓鱼山庄');
to_tsvector
-------------------------------------
'休闲':2 '山坑':1 '山庄':4 '钓鱼':3
(1 row)

francs=> SELECT to_tsvector('chinesecfg', '咬不得生煎');
to_tsvector
---------------------
'咬不得':1 '生煎':2
(1 row)

性能测试

测试环境
IBM BladeCenter HS22
8 核8G
CentOS release 6.5
PostgreSQL 9.4beta3
cmake 2.8.12.2
CRF++-0.57
nlpbamboo-1.1.2

测试表: 300 万数据

1
2
3
4
5
francs=>  select count(*)  from tbl_store ;  
count
---------
2994433
(1 row)

创建 GIN 索引

1
2
3
francs=>create index idx_gin_tbl_store_name on tbl_store using gin(to_tsvector('chinesecfg',name));  
CREATE INDEX
Time: 147735.647 ms

sql

1
2
3
4
5
6
7
8
9
10
11
francs=> explain analyze select name,type,city from tbl_store where to_tsvector('chinesecfg',name)  @@ to_tsquery('chinesecfg','易买得长江店');  
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl_store (cost=68.00..72.02 rows=1 width=44) (actual time=0.431..0.434 rows=2 loops=1)
Recheck Cond: (to_tsvector('chinesecfg'::regconfig, (name)::text) @@ '''易'' & ''买'' & ''得'' & ''长江店'''::tsquery)
Heap Blocks: exact=2
-> Bitmap Index Scan on idx_gin_tbl_store_name (cost=0.00..68.00 rows=1 width=0) (actual time=0.418..0.418 rows=2 loops=1)
Index Cond: (to_tsvector('chinesecfg'::regconfig, (name)::text) @@ '''易'' & ''买'' & ''得'' & ''长江店'''::tsquery)
Planning time: 0.361 ms
Execution time: 0.493 ms
(7 rows)

备注:性能非常不错。

常见错误

以下总结了 nlpbamboo 安装过程中的典型错误,如下:

错误一: 导入 pg_tokenize.sql 报错

1
2
[pg94@db1 contrib]$ psql -h 127.0.0.1  -p 1921  -d francs -U postgres -f pg_tokenize.sql   
psql:pg_tokenize.sql:2: ERROR:could not load library "/opt/pgsql_9.4beta3/lib/pg_tokenize.so": /opt/pgsql_9.4beta3/lib/pg_tokenize.so: undefined symbol: DirectFunctionCall1

解决方法: 修改 Makefile 的 SHLIB_LINK 这行

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
cd /opt/bamboo/exts/postgres/pg_tokenize  

[pg94@db1 pg_tokenize]# cat Makefile
MODULE_big = pg_tokenize
OBJS = pg_tokenize.o

DATA_built = pg_tokenize.sql
DATA = uninstall_pg_tokenize.sql

PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
PG_CPPFLAGS += -DNDEBUG -I/usr/include/bamboo
#SHLIB_LINK = -L/usr/lib -lbamboo -lstdc++
SHLIB_LINK = -L/usr/lib -I /opt/soft_bak/postgresql-9.4beta3/src/include -lbamboo -lstdc++
include $(PGXS)

备注:之后重新编译安装即可。

错误二: 导入 pg_tokenize.sql 报错

1
2
3
pg94@db1-> psql -h 127.0.0.1  -d francs -f pg_tokenize.sql   
SET
psql:pg_tokenize.sql:2: ERROR:can not init tokenize

解决方法
解压 index.tar.bz2 到 /opt/bamboo 目录。

[pg94@db1 bamboo]# ll  
total 24  
drwxr-xr-x 2 root root 4096  Jan 7  12:13 bin  
drwxr-xr-x 2 root root 4096  Jan 7  12:13 etc  
drwxr-xr-x 4 root root 4096  Jan 7  12:13 exts  
drwxr-sr-x 2  28852 users 4096  Apr 1 2009 index  
drwxr-xr-x 2 root root 4096  Jan 7  12:13 processor  
drwxr-xr-x 2 root root 4096  Jan 7  12:13  template

参考

原创文章,作者:奋斗,如若转载,请注明出处:https://blog.ytso.com/239616.html

(0)
上一篇 2022年2月12日
下一篇 2022年2月12日

相关推荐

发表回复

登录后才能评论