随着数字化文本数据的大量产生,对提取文本信息工具的需求与日俱增。TopWORDS (Top-down WORd Discovery and Segmentation) 是由清华大学统计学研究中心邓柯教授实验室研制推出的一套无监督的文本分词方法,能够同时实现高效的文本分词新词发现。特别地,它在领域特定、包含大量未知或不规则的词语、短语、术语的中文文本处理中卓有成效。






Fig: Flowchart of TopWORDS

TopWORDS can achieve efficient “word discovery” and “text segmentation” simultaneously for any text sequence encoded in utf-8. It's particularly useful for processing domain-specific Chinese texts which contains a large number of unknown or irregular words, phrases and technical terms. In case that the target texts are very different from available training corpus, TopWORDS often outperforms popularly used methods based on supervised learning. The method can also be readily applied to English and other alphabet-based languages for discovering regular expressions, idioms, and special names by treating each English word as a character, which can be more informative for downstream analyses.

The figure below illustrates the architecture of the TopWORDS algorithm. More details about the approach can be found in our paper.

Reference: Deng, K., Bol, P. K., Li, K. J., & Liu, J. S. (2016). On the unsupervised analysis of domain-specific Chinese texts. Proceedings of the National Academy of Sciences of the United States of America, 113(22), 6154–6159.