Elasticsearch: Simplified/Traditional Chinese conversion analyzer – STConvert analysis

Technical article, published 10 months ago by gyx131

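My former colleague Medcl has also published a plugin on GitHub that converts between Simplified and Traditional Chinese. It is useful in many real-world applications, for example when the documents are written in Traditional Chinese but we want to search them using Simplified Chinese.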

Installation

We can install the plugin as follows:

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-stconvert/releases/download/v8.2.3/elasticsearch-analysis-stconvert-8.2.3.zip

Choose the plugin release that matches your own Elasticsearch version.
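As a rough illustration of matching the plugin to the Elasticsearch version, the sketch below builds the install command from a version string. The helper name `stconvert_plugin_url` is hypothetical, and the URL template is only inferred from the single 8.2.3 link above; it assumes a release with the same naming exists for your version, which should be checked on the project's releases page.

```python
# Hypothetical helper: the URL template is inferred from the one 8.2.3 link
# shown in this article and is NOT verified for other versions.
BASE = "https://github.com/medcl/elasticsearch-analysis-stconvert/releases/download"


def stconvert_plugin_url(es_version: str) -> str:
    """Return the assumed download URL for the given Elasticsearch version."""
    parts = es_version.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected a version like '8.2.3', got {es_version!r}")
    return f"{BASE}/v{es_version}/elasticsearch-analysis-stconvert-{es_version}.zip"


if __name__ == "__main__":
    # Prints the same install command as above when given 8.2.3.
    print("./bin/elasticsearch-plugin install " + stconvert_plugin_url("8.2.3"))
```

The version check is deliberately strict so that a typo fails early instead of producing a URL that does not exist.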

After installing the plugin, remember to restart the Elasticsearch cluster. We can then list the installed plugins with the following command:

$ ./bin/elasticsearch-plugin list
analysis-stconvert

The plugin provides the following components:

  • analyzer: stconvert
  • tokenizer: stconvert
  • token-filter: stconvert
  • char-filter: stconvert

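It also supports the following settings:

  • convert_type: defaults to s2t; the available options are:
    • s2t: convert from Simplified Chinese to Traditional Chinese
    • t2s: convert from Traditional Chinese to Simplified Chinese
  • keep_both: defaults to false
  • delimiter: defaults to , (a comma)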
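The settings listed above can be assembled programmatically before being sent to Elasticsearch. The following is a minimal sketch, not part of the plugin: the function names `stconvert_definition` and `token_filter_settings` are hypothetical, and the defaults simply mirror the list above. It only builds the JSON body and does not talk to a cluster.

```python
import json

# The two conversion directions listed above.
VALID_CONVERT_TYPES = {"s2t", "t2s"}


def stconvert_definition(convert_type="s2t", keep_both=False, delimiter=","):
    """Build one stconvert component definition using the documented defaults."""
    if convert_type not in VALID_CONVERT_TYPES:
        raise ValueError(f"convert_type must be one of {sorted(VALID_CONVERT_TYPES)}")
    return {
        "type": "stconvert",
        "convert_type": convert_type,
        "keep_both": keep_both,
        "delimiter": delimiter,
    }


def token_filter_settings(name, **options):
    """Wrap the definition as a named token filter inside index settings."""
    return {"settings": {"analysis": {"filter": {name: stconvert_definition(**options)}}}}


if __name__ == "__main__":
    body = token_filter_settings("tsconvert", convert_type="t2s", delimiter="#")
    print(json.dumps(body, ensure_ascii=False, indent=2))
```

Validating `convert_type` on the client side is a design choice for this sketch; it catches a misspelled direction before the request is sent.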

Example

We use the following example to demonstrate:

PUT /stconvert/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "tsconvert": {
          "tokenizer": "tsconvert"
        }
      },
      "tokenizer": {
        "tsconvert": {
          "type": "stconvert",
          "delimiter": "#",
          "keep_both": false,
          "convert_type": "t2s"
        }
      },
      "filter": {
        "tsconvert": {
          "type": "stconvert",
          "delimiter": "#",
          "keep_both": false,
          "convert_type": "t2s"
        }
      },
      "char_filter": {
        "tsconvert": {
          "type": "stconvert",
          "convert_type": "t2s"
        }
      }
    }
  }
}

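Above, we create an index named stconvert. It defines an analyzer named tsconvert, together with a tokenizer, a token filter, and a char_filter of the same name.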

We run the following analysis test:

GET stconvert/_analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "char_filter" : ["tsconvert"],
  "text" : "国际國際"
}

The command above returns:

{
  "tokens" : [
    {
      "token" : "国际国际",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    }
  ]
}

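We can also define a custom normalizer like the following, so that Traditional Chinese values in a keyword field are normalized to Simplified Chinese: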

PUT index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "tsconvert": {
          "type": "stconvert",
          "convert_type": "t2s"
        }
      },
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [
            "tsconvert"
          ],
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "foo": {
        "type": "keyword",
        "normalizer": "my_normalizer"
      }
    }
  }
}

We write a few documents with the following commands:

PUT index/_doc/1
{
  "foo": "國際"
}
 
PUT index/_doc/2
{
  "foo": "国际"
}

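Above, we set the normalizer of the foo field to my_normalizer, so the Traditional Chinese value "國際" is converted to "国际" by the char_filter before it is indexed. When we search with the following command: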

GET index/_search
{
  "query": {
    "term": {
      "foo": "国际"
    }
  }
}

It returns:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.18232156,
    "hits" : [
      {
        "_index" : "index",
        "_id" : "1",
        "_score" : 0.18232156,
        "_source" : {
          "foo" : "國際"
        }
      },
      {
        "_index" : "index",
        "_id" : "2",
        "_score" : 0.18232156,
        "_source" : {
          "foo" : "国际"
        }
      }
    ]
  }
}

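Because the normalizer is also applied to the search term of a term query, searching with the Traditional form "國際" matches both documents as well: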

GET index/_search
{
  "query": {
    "term": {
      "foo": "國際"
    }
  }
}

It returns:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.18232156,
    "hits" : [
      {
        "_index" : "index",
        "_id" : "1",
        "_score" : 0.18232156,
        "_source" : {
          "foo" : "國際"
        }
      },
      {
        "_index" : "index",
        "_id" : "2",
        "_score" : 0.18232156,
        "_source" : {
          "foo" : "国际"
        }
      }
    ]
  }
}
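Why both term queries return the same two documents can be seen in a toy model. The sketch below is not how Elasticsearch or the plugin is implemented: `KeywordField` is a hypothetical stand-in for a keyword field with a normalizer, and the conversion table covers only the two characters used in this example. The point is only that the same normalizer runs at index time and at search time.

```python
from collections import defaultdict

# Toy Traditional-to-Simplified table covering only this example's characters;
# the real plugin ships a far larger mapping.
T2S = str.maketrans({"國": "国", "際": "际"})


def my_normalizer(value: str) -> str:
    """Mimic the normalizer above: t2s char filter, then lowercase."""
    return value.translate(T2S).lower()


class KeywordField:
    """Hypothetical stand-in for a keyword field backed by an inverted index."""

    def __init__(self):
        self.postings = defaultdict(list)

    def index(self, doc_id, value):
        # The normalized value, not the original, becomes the indexed term.
        self.postings[my_normalizer(value)].append(doc_id)

    def term(self, value):
        # The query term goes through the same normalizer before lookup.
        return sorted(self.postings.get(my_normalizer(value), []))


if __name__ == "__main__":
    foo = KeywordField()
    foo.index(1, "國際")
    foo.index(2, "国际")
    print(foo.term("国际"))  # [1, 2]
    print(foo.term("國際"))  # [1, 2]
```

Both documents end up under the single term "国际", so either spelling of the query finds them, while `_source` still holds the original text.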

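We can even combine it with the IK analyzer to tokenize Traditional Chinese text. This assumes the IK analysis plugin is installed as well. Since an index named index already exists from the previous example, delete it first (DELETE index) or use a different name: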

PUT index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "tsconvert": {
          "type": "stconvert",
          "convert_type": "t2s"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [
            "tsconvert"
          ],
          "tokenizer": "ik_smart",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "foo": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

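Above, the text is first converted from Traditional to Simplified Chinese, then tokenized by the IK tokenizer, and finally lowercased. We test it with the following command: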

GET index/_analyze
{
  "analyzer": "my_analyzer", 
  "text": "我愛北京天安門"
}

The command above returns:

{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "爱",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "北京",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "天安门",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 3
    }
  ]
}
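The order of the stages in my_analyzer (char_filter, then tokenizer, then token filter) matters, because the tokenizer's dictionary contains Simplified words. The toy pipeline below illustrates this under loud assumptions: the greedy longest-match tokenizer and its two-word dictionary are hypothetical stand-ins, not IK's actual algorithm or dictionary, and the conversion table covers only the characters in this sentence.

```python
# Toy Traditional-to-Simplified table for this sentence only (assumption).
T2S = str.maketrans({"愛": "爱", "門": "门"})

# Hypothetical Simplified-only dictionary standing in for IK's dictionary.
DICTIONARY = {"北京", "天安门"}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)


def char_filter(text: str) -> str:
    """Stage 1: convert Traditional characters before tokenization."""
    return text.translate(T2S)


def tokenize(text: str) -> list:
    """Stage 2: greedy longest match; unknown characters become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in DICTIONARY:
                tokens.append(candidate)
                i += length
                break
    return tokens


def token_filter(tokens: list) -> list:
    """Stage 3: lowercase every token."""
    return [token.lower() for token in tokens]


def analyze(text: str) -> list:
    return token_filter(tokenize(char_filter(text)))


if __name__ == "__main__":
    print(analyze("我愛北京天安門"))   # ['我', '爱', '北京', '天安门']
    print(tokenize("我愛北京天安門"))  # ['我', '愛', '北京', '天', '安', '門']
```

Skipping the char filter leaves "天安門" unmatched against the Simplified dictionary in this toy model, so it falls apart into single characters.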

We can run another test:

GET index/_analyze
{
  "analyzer": "my_analyzer", 
  "text": "請輸入要轉換簡繁體的中文漢字"
}

The result is:

{
  "tokens" : [
    {
      "token" : "请",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "输入",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "要",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "转换",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "简繁体",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "的",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "CN_CHAR",
      "position" : 5
    },
    {
      "token" : "中文",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "汉字",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "CN_WORD",
      "position" : 7
    }
  ]
}