PostgreSQL Chinese Full-Text Search (Part 3): Adding New Words to the SCWS Chinese Dictionary

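The previous two posts covered deploying and using zhparser and nlpbamboo for Chinese full-text search, but neither segments text particularly well. For example, segmenting "每家乐" drops the character "每", and "久久超市" is reduced to "超市". Searches on these terms then fail to match.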

Chinese full-text search with zhparser

francs=> select to_tsvector('testzhcfg','每家乐');  
to_tsvector
-------------
'家乐':1
(1 row)

francs=> select to_tsvector('testzhcfg','久久超市');
to_tsvector
-------------
'超市':1
(1 row)

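Note: zhparser is built on SCWS (Simple Chinese Word Segmentation), a very good Chinese word segmentation solution, but some words inevitably segment poorly. The rest of this post shows how to add new words to the SCWS dictionary.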

The XDB dictionary file

[root@db ~]# ll /opt/pgsql_9.4beta3/share/tsearch_data/dict.utf8.xdb   
-rw-r--r-- 1 root root 20720423 Jan 11 17:41 /opt/pgsql_9.4beta3/share/tsearch_data/dict.utf8.xdb

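Note: after the zhparser extension is installed, the dictionary file dict.utf8.xdb is created under $PGHOME/share/tsearch_data. It is not a text file and cannot be edited directly, but SCWS provides tools to export and import the dictionary.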

Download the XDB import/export tools
http://www.ftphp.com/scws/down/phptool_for_scws_xdb.zip
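Note: the archive unpacks to two key tools, dump_xdb_file.php and make_xdb_file.php. The first exports the dictionary file to a text file; the second builds the dictionary file from a text file.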

Export the dictionary

[root@db phptool]# php dump_xdb_file.php   
Usage: dump_xdb_file.php <xdb file> [output file]

[root@db tsearch_data]# php /opt/soft_bak/phptool/dump_xdb_file.php dict.utf8.xdb xdb1.txt

Note: this exports the dictionary file dict.utf8.xdb to the text file xdb1.txt.
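Before adding a word, it can help to check whether the exported text file already contains it. The following is a minimal sketch, assuming the dump uses the whitespace-separated layout shown in this post, with the word in the first field and a header line starting with `#`. The helper name `has_word` is made up for illustration and is not part of the SCWS tools.

```shell
#!/bin/sh
# has_word FILE WORD  (hypothetical helper, not part of SCWS)
# Exit status 0 if WORD is the first field of some data line in FILE,
# 1 otherwise. Lines starting with '#' (the header) are ignored.
has_word() {
    awk -v w="$2" '
        /^#/    { next }              # skip header/comment lines
        $1 == w { found = 1; exit }   # exact match on the WORD field
        END     { exit !found }       # 0 when found, 1 when not
    ' "$1"
}

# Example (run in the directory that holds the dump):
#   has_word xdb1.txt 久久超市 && echo "already in the dictionary"
```

The comparison is an exact match on the whole first field, so "久久" does not match a line whose word is "久久超市".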

Edit xdb1.txt and add the following entries

# WORD TF IDF ATTR  
家乐 13.76 8.79 n
久久 14.47 5.42 n
久久超市 13.65 9.08 n

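Note: each line of the text file has four fields (WORD, TF, IDF, ATTR) separated by spaces or tabs. TF-IDF (term frequency-inverse document frequency) acts roughly as a weight for the word. I am not sure exactly how the values are computed, but the link below can calculate them.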

TF/IDF calculator for new words

http://www.xunsearch.com/scws/demo/get_tfidf.php
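Appending entries by hand works, but a small helper can catch malformed lines before the dictionary is rebuilt. The sketch below assumes the four-field layout described above and writes tab-separated lines. The name `add_word`, its return codes, and the "skip if already present" rule are illustrative choices, not SCWS behavior.

```shell
#!/bin/sh
# add_word FILE WORD TF IDF ATTR  (hypothetical helper, not part of SCWS)
# Appends one tab-separated entry to the exported dictionary text file.
# Returns 2 if TF or IDF is not a plain decimal number, 1 if WORD is
# already present as the first field of a data line, 0 on success.
add_word() {
    file=$1; word=$2; tf=$3; idf=$4; attr=$5

    # TF and IDF must look like 13.76 or 9 (digits, optional fraction).
    for n in "$tf" "$idf"; do
        printf '%s\n' "$n" | grep -Eq '^[0-9]+(\.[0-9]+)?$' || {
            echo "add_word: not a number: $n" >&2
            return 2
        }
    done

    # Skip words that are already listed (header lines start with '#').
    if awk -v w="$word" '
           /^#/    { next }
           $1 == w { found = 1; exit }
           END     { exit !found }
       ' "$file"; then
        echo "add_word: $word already present, skipped" >&2
        return 1
    fi

    printf '%s\t%s\t%s\t%s\n' "$word" "$tf" "$idf" "$attr" >> "$file"
}

# Example, using the values from this post:
#   add_word xdb1.txt 久久超市 13.65 9.08 n
```

The helper only checks the shape of the line. The TF and IDF values still have to come from somewhere, such as the calculator linked above.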

Regenerate the XDB file

[root@db tsearch_data]# php /opt/soft_bak/phptool/make_xdb_file.php dict.utf8.xdb xdb1.txt
INFO: Loading text file data ... OK, Total words=550871
Inserting [00/64] ... 11332 Records saved.
Inserting [01/64] ... 11439 Records saved.
Inserting [02/64] ... 13655 Records saved.
Inserting [03/64] ... 10138 Records saved.
Inserting [04/64] ... 7911 Records saved.
Inserting [05/64] ... 3726 Records saved.
Inserting [06/64] ... 9940 Records saved.
Inserting [07/64] ... 3244 Records saved.
Inserting [08/64] ... 4799 Records saved.
Inserting [09/64] ... 15721 Records saved.
Inserting [10/64] ... 6042 Records saved.
Inserting [11/64] ... 4555 Records saved.
Inserting [12/64] ... 4709 Records saved.
Inserting [13/64] ... 5637 Records saved.
Inserting [14/64] ... 5762 Records saved.
Inserting [15/64] ... 4655 Records saved.
Inserting [16/64] ... 2053 Records saved.
Inserting [17/64] ... 855 Records saved.
Inserting [18/64] ... 9271 Records saved.
Inserting [19/64] ... 12463 Records saved.
Inserting [20/64] ... 6059 Records saved.
Inserting [21/64] ... 8448 Records saved.
Inserting [22/64] ... 9868 Records saved.
Inserting [23/64] ... 7665 Records saved.
Inserting [24/64] ... 7427 Records saved.
Inserting [25/64] ... 6301 Records saved.
Inserting [26/64] ... 3365 Records saved.
Inserting [27/64] ... 6683 Records saved.
Inserting [28/64] ... 40398 Records saved.
Inserting [29/64] ... 13978 Records saved.
Inserting [30/64] ... 17539 Records saved.
Inserting [31/64] ... 8222 Records saved.
Inserting [32/64] ... 6732 Records saved.
Inserting [33/64] ... 14588 Records saved.
Inserting [34/64] ... 9572 Records saved.
Inserting [35/64] ... 8870 Records saved.
Inserting [36/64] ... 8987 Records saved.
Inserting [37/64] ... 7027 Records saved.
Inserting [38/64] ... 6839 Records saved.
Inserting [39/64] ... 7160 Records saved.
Inserting [40/64] ... 5034 Records saved.
Inserting [41/64] ... 6856 Records saved.
Inserting [42/64] ... 15446 Records saved.
Inserting [43/64] ... 9458 Records saved.
Inserting [44/64] ... 6026 Records saved.
Inserting [45/64] ... 7510 Records saved.
Inserting [46/64] ... 6006 Records saved.
Inserting [47/64] ... 10465 Records saved.
Inserting [48/64] ... 12255 Records saved.
Inserting [49/64] ... 10853 Records saved.
Inserting [50/64] ... 15622 Records saved.
Inserting [51/64] ... 5880 Records saved.
Inserting [52/64] ... 10766 Records saved.
Inserting [53/64] ... 14227 Records saved.
Inserting [54/64] ... 4791 Records saved.
Inserting [55/64] ... 3618 Records saved.
Inserting [56/64] ... 4721 Records saved.
Inserting [57/64] ... 2279 Records saved.
Inserting [58/64] ... 5041 Records saved.
Inserting [59/64] ... 12033 Records saved.
Inserting [60/64] ... 11009 Records saved.
Inserting [61/64] ... 7840 Records saved.
Inserting [62/64] ... 6409 Records saved.
Inserting [63/64] ... 3091 Records saved.
INFO: optimizing ...
DONE!
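The command above names dict.utf8.xdb as its output, which is the same file zhparser loads. It may therefore be worth keeping a copy of the original before regenerating. The following is a minimal sketch of such a backup step; the helper name `backup_file` and the `.bak.<timestamp>` suffix are assumptions for illustration.

```shell
#!/bin/sh
# backup_file FILE  (hypothetical helper)
# Copies FILE to FILE.bak.<YYYYmmddHHMMSS>, preserving mode and timestamps,
# and prints the name of the copy on success.
backup_file() {
    copy="$1.bak.$(date +%Y%m%d%H%M%S)"
    cp -p -- "$1" "$copy" && printf '%s\n' "$copy"
}

# Example: back up first, regenerate only if the backup succeeded.
#   saved=$(backup_file dict.utf8.xdb) &&
#       php /opt/soft_bak/phptool/make_xdb_file.php dict.utf8.xdb xdb1.txt
```

If the rebuilt dictionary segments worse than before, copying the saved file back over dict.utf8.xdb restores the previous state.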

Testing Chinese full-text search

pg94@db-> psql francs francs
psql (9.4beta3)
Type "help" for help.

francs=> select to_tsvector('testzhcfg','每家乐');
to_tsvector
-------------
'每家乐':1
(1 row)

francs=> select to_tsvector('testzhcfg','久久超市');
to_tsvector
-------------------
'久久':1 '超市':2
(1 row)

francs=> select to_tsvector('testzhcfg','久久');
to_tsvector
-------------
'久久':1
(1 row)

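Note: "久久超市" now segments correctly. Existing sessions must be closed first, because the change only takes effect after reconnecting to the database.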

Original article by ItWorker. If you repost it, please cite the source: https://blog.ytso.com/239617.html
