
      Inverted Bilingual Topic Models for Lexicon Extraction from Non-parallel Data

Preprint (Open Access)


          Abstract

A good lexicon is an important resource for cross-lingual tasks such as information retrieval and text mining. In this paper, we focus on extracting translation pairs from non-parallel cross-lingual corpora. Previous lexicon extraction algorithms for non-parallel data generally rely on an accurate seed dictionary and extract translation pairs by context similarity. This approach has two problems. First, much semantic information is lost when only seed dictionary words are used to construct the context vectors from which context similarity is computed. Second, a clean seed dictionary may not be available in practice; a generic dictionary used as the seed dictionary in a specialized domain, for example, can be very noisy. To solve these two problems, we propose two new bilingual topic models that better capture the semantic information of each word while discriminating among the multiple translations of a noisy seed dictionary entry. We then use an effective measure to evaluate the similarity of words in different languages and select the optimal translation pairs. Experiments on real Japanese-English data demonstrate the effectiveness of our models.
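For orientation, the context-similarity baseline that the paper improves on can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions (bag-of-words contexts, exactly one translation per seed entry), and every name in it (context_vector, best_translation, seed_dict, and so on) is invented for the example rather than taken from the paper:

from collections import Counter
import math

def context_vector(word, corpus, seed_words, window=4):
    # Count how often `word` co-occurs with seed-dictionary words
    # inside a fixed window; `corpus` is a list of token lists.
    vec = Counter()
    for sent in corpus:
        for i, w in enumerate(sent):
            if w != word:
                continue
            lo, hi = max(0, i - window), i + window + 1
            for ctx in sent[lo:i] + sent[i + 1:hi]:
                if ctx in seed_words:
                    vec[ctx] += 1
    return vec

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) \
         * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def best_translation(src_word, src_corpus, tgt_corpus, seed_dict, tgt_vocab):
    # Build the source context vector over seed entries, project it
    # into the target language through the seed dictionary, then rank
    # target-language candidates by cosine similarity.
    src_vec = context_vector(src_word, src_corpus, set(seed_dict))
    projected = Counter()
    for w, count in src_vec.items():
        projected[seed_dict[w]] += count
    tgt_seeds = set(seed_dict.values())
    scored = [(cosine(projected, context_vector(t, tgt_corpus, tgt_seeds)), t)
              for t in tgt_vocab]
    return max(scored)  # (similarity, best candidate)

The two weaknesses the abstract names are visible here: only seed-dictionary words contribute to the context vectors, so the rest of the context is discarded, and the projection blindly trusts a single translation per seed entry, which breaks down when the dictionary is noisy or ambiguous.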


Most cited references (5)


          Unsupervised prediction of citation influences


            Learning Crosslingual Word Embeddings without Bilingual Corpora

Crosslingual word embeddings represent lexical items from different languages in the same vector space, enabling the transfer of NLP tools across languages. However, previous approaches had expensive resource requirements, had difficulty incorporating monolingual data, or were unable to handle polysemy. We address these drawbacks with a method that exploits a high-coverage dictionary in an EM-style training algorithm over monolingual corpora in two languages. Our model achieves state-of-the-art performance on the bilingual lexicon induction task, exceeding models that use large bilingual corpora, and competitive results on monolingual word similarity and cross-lingual document classification tasks.
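The EM-style idea can be caricatured as follows: while training embeddings over monolingual text in one shared space, each occurrence of a word with several dictionary translations is assigned the translation that best fits its current context (E-step), and the embeddings are then nudged toward that context (M-step). The sketch below is a hypothetical toy with invented names (dictionary maps a word to its candidate translations), and its update is a crude stand-in for a real skip-gram gradient, not the authors' algorithm:

import numpy as np

rng = np.random.default_rng(0)

def init_embeddings(vocab, dim=50):
    # One shared vector space for the words of both languages.
    return {w: rng.normal(scale=0.1, size=dim) for w in vocab}

def e_step(word, context_words, emb, dictionary):
    # Pick the dictionary translation most compatible with the
    # current context, handling polysemous entries.
    candidates = dictionary.get(word, [word])
    ctx = np.mean([emb[c] for c in context_words], axis=0)
    return max(candidates, key=lambda t: float(emb[t] @ ctx))

def m_step(choice, context_words, emb, lr=0.05):
    # Pull the chosen translation's vector toward its context,
    # a crude stand-in for a skip-gram gradient update.
    ctx = np.mean([emb[c] for c in context_words], axis=0)
    emb[choice] += lr * (ctx - emb[choice])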

Mining multilingual topics from Wikipedia


                Author and article information

Date: 2016-12-21
Type: Article (preprint)
arXiv ID: 1612.07215
Record ID: 00f29621-c74b-4478-ae89-9ff40b4d86b0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/

Subject: cs.CL (Computation and Language)
Category: Theoretical computer science
