      λ-active learning based microblog-oriented Chinese word segmentation

          Abstract

          Manually segmented microblog-oriented corpora are currently inadequate, so both conventional Chinese word segmentation (CWS) systems and deep learning based CWS systems still perform poorly on microblog text. This paper presents an active learning method that selects samples with high annotation value from unlabelled tweets for microblog-oriented CWS. A parameter λ is introduced to control the number of repeatedly selected samples, which often occur in microblog data. Three strategies (Max, Avg and AvgMax) are used to evaluate the overall annotation value of each sample. The initial segmenter also uses, as a feature, the likelihood that a character is a stop character, which is calculated from character embeddings. Tests demonstrate that this method outperforms the baseline system with F-score gains of 0.84%~1.49%, and that it also outperforms the state-of-the-art active learning method based on word boundary annotation (WBA).

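          Abstract (translated from the Chinese)

          Because little segmented and annotated corpus data exists for Chinese microblogs, Chinese word segmentation systems based on both conventional methods and deep learning methods perform poorly on microblog text. To address this problem, this paper proposes an active learning method that selects the microblog samples most worth annotating from a large unlabelled corpus. To suit the characteristics of microblog text, a parameter λ is introduced into the active learning iterations to control the number of repeated samples that are selected, which ensures the diversity of the selected samples. In addition, three strategies (Max, Avg and AvgMax) measure the overall annotation value of a sample, based on the uncertainty of the character tagging results in the sample and the diversity of the contexts. Furthermore, besides using the context of the current character as features, the initial segmenter used for active learning takes as a model feature the likelihood that the current character is a stop character, computed automatically from character embeddings. Experimental results show that the method raises the F-score by 0.84%~1.49% over the baseline system and yields a larger improvement than the current best active learning method, which is based on word boundary annotation (WBA).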
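The abstract names the three scoring strategies and the parameter λ but does not define them. The following is a minimal Python sketch of how such a selection step could look, under assumptions that are not from the paper: per-character uncertainty is taken as 1 minus the probability of the most likely tag, AvgMax is taken as the mean of the Avg and Max scores, "repeated samples" are taken to be samples with identical text, and λ is the maximum number of copies of one text allowed in a selected batch. All function names (`char_uncertainty`, `select_batch`, and so on) are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

# One sample: its text plus one tag-probability distribution per character.
Sample = Tuple[str, List[List[float]]]


def char_uncertainty(tag_probs: Sequence[float]) -> float:
    """Assumed uncertainty measure: 1 minus the probability of the top tag."""
    return 1.0 - max(tag_probs)


def score_max(u: Sequence[float]) -> float:
    """Max strategy: the most uncertain character decides the sample's value."""
    return max(u)


def score_avg(u: Sequence[float]) -> float:
    """Avg strategy: mean uncertainty over all characters in the sample."""
    return sum(u) / len(u)


def score_avgmax(u: Sequence[float]) -> float:
    """AvgMax strategy. ASSUMPTION: mean of the Avg and Max scores."""
    return 0.5 * (score_avg(u) + score_max(u))


STRATEGIES: Dict[str, Callable[[Sequence[float]], float]] = {
    "max": score_max,
    "avg": score_avg,
    "avgmax": score_avgmax,
}


def select_batch(pool: List[Sample], strategy: str,
                 batch_size: int, lam: int) -> List[str]:
    """Pick the highest-scoring texts, keeping at most `lam` copies of each.

    ASSUMPTION: `lam` stands in for the paper's λ and caps exact duplicates.
    """
    score = STRATEGIES[strategy]
    ranked = sorted(
        pool,
        key=lambda s: score([char_uncertainty(p) for p in s[1]]),
        reverse=True,
    )
    seen: Counter = Counter()
    chosen: List[str] = []
    for text, _ in ranked:
        if seen[text] >= lam:
            continue  # this text already has lam copies in the batch
        seen[text] += 1
        chosen.append(text)
        if len(chosen) == batch_size:
            break
    return chosen
```

With a small λ, a frequently repeated tweet cannot fill the whole batch, so lower-ranked but distinct tweets are annotated instead. Raising λ lets more copies of the same high-scoring text through.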

          Author and article information

          Journal
          J Tsinghua Univ (Sci & Technol)
          Journal of Tsinghua University (Science and Technology)
          Tsinghua University Press
          ISSN: 1000-0054
          15 March 2018
          14 March 2018
          Volume: 58
          Issue: 3
          Pages: 260-265
          Affiliations
          1School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
          Author notes
          *Corresponding author: HUANG Degen, E-mail: huangdg@dlut.edu.cn
          Article
          j.cnki.qhdxxb.2018.26.011
          10.16511/j.cnki.qhdxxb.2018.26.011
          Copyright © Journal of Tsinghua University

          This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0), which permits non-commercial use, distribution, and reproduction in any medium, provided the original author and source are credited. See https://creativecommons.org/licenses/by-nc/4.0/.
