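My former colleague Medcl has also published an analysis plugin on GitHub that converts between Simplified and Traditional Chinese. It is very useful in practice, for example when our documents are written in Traditional Chinese but we want to search them in Simplified Chinese.
Installation
We can install the plugin as follows: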
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-stconvert/releases/download/v8.2.3/elasticsearch-analysis-stconvert-8.2.3.zip
Choose the release that matches your own Elasticsearch version.
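As a small convenience, the download URL can be derived from the Elasticsearch version. This is only a sketch: the helper name `stconvert_plugin_url` is made up for illustration, and it assumes every release follows the same naming pattern as the 8.2.3 URL shown above, so check the releases page before relying on it.

```python
# Base of the release download URL, taken from the install command above.
BASE_URL = "https://github.com/medcl/elasticsearch-analysis-stconvert/releases/download"


def stconvert_plugin_url(es_version: str) -> str:
    """Build the plugin zip URL for a given Elasticsearch version.

    Assumption: other releases use the same path layout as v8.2.3.
    """
    return f"{BASE_URL}/v{es_version}/elasticsearch-analysis-stconvert-{es_version}.zip"


print(stconvert_plugin_url("8.2.3"))
```

The printed URL can then be passed to `./bin/elasticsearch-plugin install`.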
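After installing the plugin, remember to restart Elasticsearch on every node where it was installed. We can then check that the plugin is present with the following command: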
$ ./bin/elasticsearch-plugin list
analysis-stconvert
The plugin provides the following components:
- analyzer: stconvert
- tokenizer: stconvert
- token-filter: stconvert
- char-filter: stconvert
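It also supports the following settings:
- convert_type: defaults to s2t; the available options are:
  - s2t: convert from Simplified Chinese to Traditional Chinese
  - t2s: convert from Traditional Chinese to Simplified Chinese
- keep_both: defaults to false
- delimiter: defaults to , (a comma)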
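To make the settings and their defaults concrete, here is a minimal Python sketch that builds the JSON body for an stconvert tokenizer or token filter definition. The helper name `stconvert_definition` is hypothetical; only the keys and default values come from the list above.

```python
from typing import Any, Dict

# The two conversion directions listed above.
VALID_CONVERT_TYPES = {"s2t", "t2s"}


def stconvert_definition(convert_type: str = "s2t",
                         keep_both: bool = False,
                         delimiter: str = ",") -> Dict[str, Any]:
    """Return a settings fragment for an stconvert tokenizer or token filter."""
    if convert_type not in VALID_CONVERT_TYPES:
        raise ValueError(f"convert_type must be one of {sorted(VALID_CONVERT_TYPES)}")
    return {
        "type": "stconvert",
        "convert_type": convert_type,
        "keep_both": keep_both,
        "delimiter": delimiter,
    }


# A Traditional-to-Simplified definition with a custom delimiter.
print(stconvert_definition(convert_type="t2s", delimiter="#"))
```

Writing the defaults out explicitly is a design choice for readability; the plugin applies them anyway when the keys are omitted.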
Example
We use the following example to demonstrate:
PUT /stconvert/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "tsconvert": {
          "tokenizer": "tsconvert"
        }
      },
      "tokenizer": {
        "tsconvert": {
          "type": "stconvert",
          "delimiter": "#",
          "keep_both": false,
          "convert_type": "t2s"
        }
      },
      "filter": {
        "tsconvert": {
          "type": "stconvert",
          "delimiter": "#",
          "keep_both": false,
          "convert_type": "t2s"
        }
      },
      "char_filter": {
        "tsconvert": {
          "type": "stconvert",
          "convert_type": "t2s"
        }
      }
    }
  }
}
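Above, we create an index named stconvert. It defines an analyzer, a tokenizer, a token filter, and a char filter, each named tsconvert and each converting Traditional Chinese to Simplified Chinese (t2s).
We run the following analysis test, which uses only the char filter together with the built-in keyword tokenizer: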
GET stconvert/_analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "char_filter" : ["tsconvert"],
  "text" : "国际國際"
}
The command above returns:
{
  "tokens" : [
    {
      "token" : "国际国际",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    }
  ]
}
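We can also define a custom normalizer like the one below, so that Traditional Chinese values in a keyword field are normalized to Simplified Chinese: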
PUT index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "tsconvert": {
          "type": "stconvert",
          "convert_type": "t2s"
        }
      },
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [
            "tsconvert"
          ],
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "foo": {
        "type": "keyword",
        "normalizer": "my_normalizer"
      }
    }
  }
}
We index a couple of documents with the following commands:
PUT index/_doc/1
{
  "foo": "國際"
}

PUT index/_doc/2
{
  "foo": "国际"
}
Above, we set the normalizer of the foo field to my_normalizer, so the Traditional value "國際" is converted to "国际" by the char_filter. When we search with the following command:
GET index/_search
{
  "query": {
    "term": {
      "foo": "国际"
    }
  }
}
It returns:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.18232156,
    "hits" : [
      {
        "_index" : "index",
        "_id" : "1",
        "_score" : 0.18232156,
        "_source" : {
          "foo" : "國際"
        }
      },
      {
        "_index" : "index",
        "_id" : "2",
        "_score" : 0.18232156,
        "_source" : {
          "foo" : "国际"
        }
      }
    ]
  }
}
If we run the same term query with the Traditional form "國際" instead:
GET index/_search
{
  "query": {
    "term": {
      "foo": "國際"
    }
  }
}
It returns:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.18232156,
    "hits" : [
      {
        "_index" : "index",
        "_id" : "1",
        "_score" : 0.18232156,
        "_source" : {
          "foo" : "國際"
        }
      },
      {
        "_index" : "index",
        "_id" : "2",
        "_score" : 0.18232156,
        "_source" : {
          "foo" : "国际"
        }
      }
    ]
  }
}
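Both term queries return both documents because the normalizer runs on the keyword value at index time and again on the query term at search time. The toy Python sketch below imitates that behaviour. The two-entry `T2S` table and all function names are hypothetical stand-ins, not the plugin's real conversion dictionary or API.

```python
from typing import Dict, List

# Toy Traditional-to-Simplified table (assumption: covers only the two
# characters used in this article).
T2S = {"國": "国", "際": "际"}


def normalize(value: str) -> str:
    """Imitate my_normalizer: t2s char filter, then lowercase."""
    converted = "".join(T2S.get(ch, ch) for ch in value)
    return converted.lower()


def build_index(docs: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each normalized keyword to the ids of the documents holding it."""
    index: Dict[str, List[str]] = {}
    for doc_id, value in docs.items():
        index.setdefault(normalize(value), []).append(doc_id)
    return index


def term_search(index: Dict[str, List[str]], term: str) -> List[str]:
    """Normalize the query term too, as happens for a normalized keyword field."""
    return index.get(normalize(term), [])


index = build_index({"1": "國際", "2": "国际"})
print(term_search(index, "国际"))  # ['1', '2']
print(term_search(index, "國際"))  # ['1', '2']
```

Note that only the indexed term is normalized. The `_source` of document 1 still holds the original "國際", which is why the search results show it unchanged.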
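We can even combine it with the IK analyzer to tokenize Traditional Chinese text. This requires the IK plugin to be installed, and because the index name is reused here, the earlier index must be deleted first (DELETE index):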
PUT index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "tsconvert": {
          "type": "stconvert",
          "convert_type": "t2s"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [
            "tsconvert"
          ],
          "tokenizer": "ik_smart",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "foo": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
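Above, we first convert the text from Traditional to Simplified Chinese, then tokenize it with the IK tokenizer (ik_smart), and finally lowercase the tokens. We test it with the following command: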
GET index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "我愛北京天安門"
}
The command above returns:
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "爱",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "北京",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "天安门",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 3
    }
  ]
}
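The order of the stages matters: the char filter runs before the tokenizer, so the tokenizer's dictionary lookup sees Simplified text. The rough Python sketch below imitates the three stages. It rests on several assumptions: `T2S` and `WORDS` are toy tables, all function names are made up, and the greedy longest-match tokenizer is a simplification standing in for ik_smart, not IK's actual algorithm.

```python
from typing import List

# Toy Traditional-to-Simplified table and toy Simplified-only dictionary.
T2S = {"愛": "爱", "門": "门"}
WORDS = {"北京", "天安门"}


def char_filter(text: str) -> str:
    """Stage 1: convert Traditional characters to Simplified."""
    return "".join(T2S.get(ch, ch) for ch in text)


def tokenize(text: str) -> List[str]:
    """Stage 2: greedy longest-match against WORDS, else single characters."""
    max_len = max(len(word) for word in WORDS)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in WORDS:
                tokens.append(piece)
                i += size
                break
    return tokens


def analyze(text: str) -> List[str]:
    """Run char filter, tokenizer, then the lowercase token filter."""
    return [token.lower() for token in tokenize(char_filter(text))]


print(analyze("我愛北京天安門"))   # ['我', '爱', '北京', '天安门']
print(tokenize("我愛北京天安門"))  # ['我', '愛', '北京', '天', '安', '門']
```

The second call skips the char filter: with a Simplified-only dictionary, "天安門" is no longer recognized as a word and falls apart into single characters.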
We can run one more test:
GET index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "請輸入要轉換簡繁體的中文漢字"
}
The result is:
{
  "tokens" : [
    {
      "token" : "请",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "输入",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "要",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "转换",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "简繁体",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "的",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "CN_CHAR",
      "position" : 5
    },
    {
      "token" : "中文",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "汉字",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "CN_WORD",
      "position" : 7
    }
  ]
}