
      Inverted Bilingual Topic Models for Lexicon Extraction from Non-parallel Data

Preprint (Open Access)


          Abstract

A good lexicon is an important resource for cross-lingual tasks such as information retrieval and text mining. In this paper, we focus on extracting translation pairs from non-parallel cross-lingual corpora. Previous lexicon extraction algorithms for non-parallel data generally rely on an accurate seed dictionary and extract translation pairs by context similarity. This approach has two problems. First, much semantic information is lost when only seed dictionary words are used to construct the context vectors from which context similarity is computed. Second, a clean seed dictionary may not be available in practice; a generic dictionary used as the seed dictionary in a specialized domain, for example, can be very noisy. To solve these two problems, we propose two new bilingual topic models that better capture the semantic information of each word while discriminating among the multiple translations of a noisy seed dictionary entry. We then use an effective measure to evaluate the similarity of words in different languages and select the optimal translation pairs. Experiments on real Japanese-English data demonstrate the effectiveness of our models.
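For orientation, the context-similarity baseline that the paper improves on can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions (bag-of-words contexts, exactly one translation per seed entry), and every name in it (context_vector, best_translation, seed_dict, and so on) is invented for the example rather than taken from the paper:

from collections import Counter
import math

def context_vector(word, corpus, seed_words, window=4):
    # Count how often `word` co-occurs with seed-dictionary words
    # inside a fixed window; `corpus` is a list of token lists.
    vec = Counter()
    for sent in corpus:
        for i, w in enumerate(sent):
            if w != word:
                continue
            lo, hi = max(0, i - window), i + window + 1
            for ctx in sent[lo:i] + sent[i + 1:hi]:
                if ctx in seed_words:
                    vec[ctx] += 1
    return vec

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) \
         * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def best_translation(src_word, src_corpus, tgt_corpus, seed_dict, tgt_vocab):
    # Build the source context vector over seed entries, project it
    # into the target language through the seed dictionary, then rank
    # target-language candidates by cosine similarity.
    src_vec = context_vector(src_word, src_corpus, set(seed_dict))
    projected = Counter()
    for w, count in src_vec.items():
        projected[seed_dict[w]] += count
    tgt_seeds = set(seed_dict.values())
    scored = [(cosine(projected, context_vector(t, tgt_corpus, tgt_seeds)), t)
              for t in tgt_vocab]
    return max(scored)  # (similarity, best candidate)

The two weaknesses the abstract names are visible here: only seed-dictionary words contribute to the context vectors, so the rest of the context is discarded, and the projection blindly trusts a single translation per seed entry, which breaks down when the dictionary is noisy or ambiguous.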


Most cited references (5)


          Unsupervised prediction of citation influences


            Learning Crosslingual Word Embeddings without Bilingual Corpora

Crosslingual word embeddings represent lexical items from different languages in the same vector space, enabling the transfer of NLP tools across languages. However, previous approaches had expensive resource requirements, had difficulty incorporating monolingual data, or were unable to handle polysemy. We address these drawbacks with a method that exploits a high-coverage dictionary in an EM-style training algorithm over monolingual corpora in two languages. Our model achieves state-of-the-art performance on the bilingual lexicon induction task, exceeding models that use large bilingual corpora, and competitive results on monolingual word similarity and cross-lingual document classification tasks.
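The EM-style idea can be caricatured as follows: while training embeddings over monolingual text in one shared space, each occurrence of a word with several dictionary translations is assigned the translation that best fits its current context (E-step), and the embeddings are then nudged toward that context (M-step). The sketch below is a hypothetical toy with invented names (dictionary maps a word to its candidate translations), and its update is a crude stand-in for a real skip-gram gradient, not the authors' algorithm:

import numpy as np

rng = np.random.default_rng(0)

def init_embeddings(vocab, dim=50):
    # One shared vector space for the words of both languages.
    return {w: rng.normal(scale=0.1, size=dim) for w in vocab}

def e_step(word, context_words, emb, dictionary):
    # Pick the dictionary translation most compatible with the
    # current context, handling polysemous entries.
    candidates = dictionary.get(word, [word])
    ctx = np.mean([emb[c] for c in context_words], axis=0)
    return max(candidates, key=lambda t: float(emb[t] @ ctx))

def m_step(choice, context_words, emb, lr=0.05):
    # Pull the chosen translation's vector toward its context,
    # a crude stand-in for a skip-gram gradient update.
    ctx = np.mean([emb[c] for c in context_words], axis=0)
    emb[choice] += lr * (ctx - emb[choice])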

Mining multilingual topics from Wikipedia


                Author and article information

Date: 2016-12-21
Type: Article (preprint)
arXiv ID: 1612.07215
Record ID: 00f29621-c74b-4478-ae89-9ff40b4d86b0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/

Subject: cs.CL (Computation and Language)
Category: Theoretical computer science
