      Is Open Access

      WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset

      Preprint


          Abstract

          Over the past years, deep learning methods have allowed for new state-of-the-art results in ad-hoc information retrieval. However, such methods usually require large amounts of annotated data to be effective. Since most standard ad-hoc information retrieval datasets publicly available for academic research (e.g. Robust04, ClueWeb09) have at most 250 annotated queries, the recent deep learning models for information retrieval perform poorly on these datasets. These models (e.g. DUET, Conv-KNRM) are trained and evaluated on data collected from commercial search engines that is not publicly available for academic research, which is a problem for reproducibility and the advancement of research. In this paper, we propose WIKIR: an open-source toolkit to automatically build large-scale English information retrieval datasets based on Wikipedia. WIKIR is publicly available on GitHub. We also provide wikIR59k: a large-scale publicly available dataset that contains 59,252 queries and 2,617,003 (query, relevant documents)

          Author and article information

          Journal
          04 December 2019
          Article
          1912.01901
          9c6ddfad-27f2-49b7-bbc5-4ea19ba3ba64

          http://creativecommons.org/licenses/by-sa/4.0/

          Custom metadata
          Being reviewed for the LREC 2020 conference
          cs.IR

          Information & Library science
