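The previous two blog posts covered deploying and using zhparser and nlpbamboo for Chinese full-text search, but neither segments text particularly well. For example, segmenting "每家乐" drops the character "每", and "久久超市" is reduced to "超市", so Chinese searches fail to match.
Chinese full-text search with zhparser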
    francs=> select to_tsvector('testzhcfg','每家乐');
     to_tsvector
    -------------
     '家乐':1
    (1 row)

    francs=> select to_tsvector('testzhcfg','久久超市');
     to_tsvector
    -------------
     '超市':1
    (1 row)
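Note: zhparser uses SCWS (Simple Chinese Word Segmentation), which is a very good Chinese word segmentation solution, but some words inevitably segment poorly. The rest of this post shows how to add new words to SCWS.
The XDB dictionary file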
    [root@db ~]
    -rw-r--r-- 1 root root 20720423 Jan 11 17:41 /opt/pgsql_9.4beta3/share/tsearch_data/dict.utf8.xdb
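Note: after the zhparser extension is installed, the dictionary file dict.utf8.xdb is created under $PGHOME/share/tsearch_data. The file is not plain text and cannot be edited directly; fortunately, SCWS provides tools for exporting and importing the dictionary.
Download the XDB import/export tools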
http://www.ftphp.com/scws/down/phptool_for_scws_xdb.zip
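Note: unpacking the archive yields two important tools, dump_xdb_file.php and make_xdb_file.php, which export the dictionary file to a text file and import a text file into the dictionary, respectively.
Export the dictionary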
    [root@db phptool]
    Usage: dump_xdb_file.php <xdb file> [output file]
    [root@db tsearch_data]
Note: this exports the dictionary file dict.utf8.xdb to the text file xdb1.txt.
Edit xdb1.txt and add the following entries
    # WORD TF IDF ATTR
    家乐 13.76 8.79 n
    久久 14.47 5.42 n
    久久超市 13.65 9.08 n
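Note: the text file consists of four fields separated by spaces or tabs. TF and IDF (term frequency and inverse document frequency) appear to act as weights. I am not sure exactly how they are calculated, but the tool linked below can compute them.
TF/IDF calculator for new words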
http://www.xunsearch.com/scws/demo/get_tfidf.php
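Editing a text file with more than half a million entries by hand is error-prone, so the four-field format described above can also be handled by a short script. The following is only a minimal sketch: it assumes that lines starting with `#` are comments (as the header line suggests) and that fields are separated by spaces or tabs. `DictEntry`, `parse_line`, `format_entry`, and `merge_entries` are hypothetical helper names, not part of SCWS or zhparser.

```python
from typing import Iterable, List, NamedTuple, Optional


class DictEntry(NamedTuple):
    """One dictionary line: WORD, TF, IDF, ATTR (hypothetical helper type)."""
    word: str
    tf: float
    idf: float
    attr: str


def parse_line(line: str) -> Optional[DictEntry]:
    """Parse one text line; return None for blank lines and '#' comments."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    fields = stripped.split()  # splits on runs of spaces or tabs
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    word, tf, idf, attr = fields
    return DictEntry(word, float(tf), float(idf), attr)


def format_entry(entry: DictEntry) -> str:
    """Render an entry as a tab-separated line with two decimal places."""
    return f"{entry.word}\t{entry.tf:.2f}\t{entry.idf:.2f}\t{entry.attr}"


def merge_entries(existing_lines: Iterable[str],
                  new_entries: Iterable[DictEntry]) -> List[str]:
    """Keep existing lines untouched and append only words not yet present."""
    known = set()
    merged = []
    for line in existing_lines:
        text = line.rstrip("\n")
        merged.append(text)
        stripped = text.strip()
        if stripped and not stripped.startswith("#"):
            known.add(stripped.split()[0])  # first field is the word
    for entry in new_entries:
        if entry.word not in known:
            merged.append(format_entry(entry))
            known.add(entry.word)
    return merged


if __name__ == "__main__":
    new_words = [
        DictEntry("家乐", 13.76, 8.79, "n"),
        DictEntry("久久", 14.47, 5.42, "n"),
        DictEntry("久久超市", 13.65, 9.08, "n"),
    ]
    for line in merge_entries(["# WORD TF IDF ATTR"], new_words):
        print(line)
```

Skipping words that already exist is a design choice of this sketch; whether a duplicate line would override the earlier entry when the XDB file is rebuilt is not covered here.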
Regenerate the XDB file
    [root@db tsearch_data]# php /opt/soft_bak/phptool/make_xdb_file.php dict.utf8.xdb xdb1.txt
    INFO: Loading text file data ... OK, Total words=550871
    Inserting [00/64] ... 11332 Records saved.
    Inserting [01/64] ... 11439 Records saved.
    Inserting [02/64] ... 13655 Records saved.
    Inserting [03/64] ... 10138 Records saved.
    Inserting [04/64] ... 7911 Records saved.
    Inserting [05/64] ... 3726 Records saved.
    Inserting [06/64] ... 9940 Records saved.
    Inserting [07/64] ... 3244 Records saved.
    Inserting [08/64] ... 4799 Records saved.
    Inserting [09/64] ... 15721 Records saved.
    Inserting [10/64] ... 6042 Records saved.
    Inserting [11/64] ... 4555 Records saved.
    Inserting [12/64] ... 4709 Records saved.
    Inserting [13/64] ... 5637 Records saved.
    Inserting [14/64] ... 5762 Records saved.
    Inserting [15/64] ... 4655 Records saved.
    Inserting [16/64] ... 2053 Records saved.
    Inserting [17/64] ... 855 Records saved.
    Inserting [18/64] ... 9271 Records saved.
    Inserting [19/64] ... 12463 Records saved.
    Inserting [20/64] ... 6059 Records saved.
    Inserting [21/64] ... 8448 Records saved.
    Inserting [22/64] ... 9868 Records saved.
    Inserting [23/64] ... 7665 Records saved.
    Inserting [24/64] ... 7427 Records saved.
    Inserting [25/64] ... 6301 Records saved.
    Inserting [26/64] ... 3365 Records saved.
    Inserting [27/64] ... 6683 Records saved.
    Inserting [28/64] ... 40398 Records saved.
    Inserting [29/64] ... 13978 Records saved.
    Inserting [30/64] ... 17539 Records saved.
    Inserting [31/64] ... 8222 Records saved.
    Inserting [32/64] ... 6732 Records saved.
    Inserting [33/64] ... 14588 Records saved.
    Inserting [34/64] ... 9572 Records saved.
    Inserting [35/64] ... 8870 Records saved.
    Inserting [36/64] ... 8987 Records saved.
    Inserting [37/64] ... 7027 Records saved.
    Inserting [38/64] ... 6839 Records saved.
    Inserting [39/64] ... 7160 Records saved.
    Inserting [40/64] ... 5034 Records saved.
    Inserting [41/64] ... 6856 Records saved.
    Inserting [42/64] ... 15446 Records saved.
    Inserting [43/64] ... 9458 Records saved.
    Inserting [44/64] ... 6026 Records saved.
    Inserting [45/64] ... 7510 Records saved.
    Inserting [46/64] ... 6006 Records saved.
    Inserting [47/64] ... 10465 Records saved.
    Inserting [48/64] ... 12255 Records saved.
    Inserting [49/64] ... 10853 Records saved.
    Inserting [50/64] ... 15622 Records saved.
    Inserting [51/64] ... 5880 Records saved.
    Inserting [52/64] ... 10766 Records saved.
    Inserting [53/64] ... 14227 Records saved.
    Inserting [54/64] ... 4791 Records saved.
    Inserting [55/64] ... 3618 Records saved.
    Inserting [56/64] ... 4721 Records saved.
    Inserting [57/64] ... 2279 Records saved.
    Inserting [58/64] ... 5041 Records saved.
    Inserting [59/64] ... 12033 Records saved.
    Inserting [60/64] ... 11009 Records saved.
    Inserting [61/64] ... 7840 Records saved.
    Inserting [62/64] ... 6409 Records saved.
    Inserting [63/64] ... 3091 Records saved.
    INFO: optimizing ... DONE!
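The export and rebuild steps can be wrapped in a small script so that they are repeatable. This is only a sketch under stated assumptions: the argument order for dump_xdb_file.php follows its usage message (`<xdb file> [output file]`), the order for make_xdb_file.php follows the invocation shown in this post, and `php` is assumed to be on the PATH. The function names `dump_command`, `make_command`, and `run` are hypothetical. By default the sketch only prints the command instead of executing it.

```python
import shlex
import subprocess
from typing import List


def dump_command(tool_dir: str, xdb_file: str, text_file: str) -> List[str]:
    """Build the export command: XDB dictionary -> text file."""
    return ["php", f"{tool_dir}/dump_xdb_file.php", xdb_file, text_file]


def make_command(tool_dir: str, xdb_file: str, text_file: str) -> List[str]:
    """Build the rebuild command: text file -> XDB dictionary."""
    return ["php", f"{tool_dir}/make_xdb_file.php", xdb_file, text_file]


def run(cmd: List[str], dry_run: bool = True) -> str:
    """Print the command; execute it only when dry_run is False."""
    rendered = shlex.join(cmd)
    print(rendered)
    if not dry_run:
        subprocess.run(cmd, check=True)  # raises if the tool exits non-zero
    return rendered


if __name__ == "__main__":
    tools = "/opt/soft_bak/phptool"  # tool location used in this post
    run(dump_command(tools, "dict.utf8.xdb", "xdb1.txt"))
    # ... edit xdb1.txt here ...
    run(make_command(tools, "dict.utf8.xdb", "xdb1.txt"))
```

Keeping the commands as argument lists rather than shell strings avoids quoting problems if a path ever contains spaces.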
Testing Chinese search
    pg94@db-> psql francs francs
    psql (9.4beta3)
    Type "help" for help.

    francs=> select to_tsvector('testzhcfg','每家乐');
     to_tsvector
     '每家乐':1
    (1 row)

    francs=> select to_tsvector('testzhcfg','久久超市');
     to_tsvector
     '久久':1 '超市':2
    (1 row)

    francs=> select to_tsvector('testzhcfg','久久');
     to_tsvector
     '久久':1
    (1 row)
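Note: "久久超市" is now segmented correctly, but you must exit the existing session and reconnect to the database for the change to take effect.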
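To check segmentation results like these from a script rather than by eye, the text form of a tsvector can be parsed into lexemes and positions. The sketch below assumes the usual tsvector text format: each lexeme in single quotes, an optional colon followed by comma-separated positions, and an optional weight letter A to D after a position. `parse_tsvector` is a hypothetical helper, not part of PostgreSQL or zhparser, and it works on a string you have already fetched from the database.

```python
import re
from typing import Dict, List

# A quoted lexeme (a doubled '' stands for a literal quote), optionally
# followed by ":" and a comma-separated position list such as 1,5A.
_LEXEME = re.compile(r"'((?:[^']|'')*)'(?::([0-9A-D,]+))?")


def parse_tsvector(text: str) -> Dict[str, List[int]]:
    """Map each lexeme in a tsvector string to its list of positions."""
    result: Dict[str, List[int]] = {}
    for match in _LEXEME.finditer(text):
        lexeme = match.group(1).replace("''", "'")
        positions: List[int] = []
        if match.group(2):
            for part in match.group(2).split(","):
                digits = part.rstrip("ABCD")  # drop an optional weight letter
                if digits:
                    positions.append(int(digits))
        result[lexeme] = positions
    return result


if __name__ == "__main__":
    lexemes = parse_tsvector("'久久':1 '超市':2")
    # Both words should survive segmentation once the new entries are loaded.
    print("久久" in lexemes and "超市" in lexemes)
```

A check like this can be run after every dictionary rebuild, from a fresh connection, against a list of words that matter for your data.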
Original article by ItWorker. If you repost it, please cite the source: https://blog.ytso.com/239617.html