
      A fast and flexible architecture for very large word n-gram datasets

      Natural Language Engineering
      Cambridge University Press (CUP)


          Abstract

          This paper presents TrendStream, a versatile architecture for very large word n-gram datasets. Designed for speed, flexibility, and portability, TrendStream uses a novel trie-based architecture, features lossless compression, and provides optimization for both speed and memory use. In addition to literal queries, it also supports fast pattern matching searches (with wildcards or regular expressions), on the same data structure, without any additional indexing. Language models are updateable directly in the compiled binary format, allowing rapid encoding of existing tabulated collections, incremental generation of n-gram models from streaming text, and merging of encoded compiled files. This architecture offers flexible choices for loading and memory utilization: fast memory-mapping of a multi-gigabyte model, or on-demand partial data loading with very modest memory requirements. The implemented system runs successfully on several different platforms, under different operating systems, even when the n-gram model file is much larger than available memory. Experimental evaluation results are presented with the Google Web1T collection and the Gigaword corpus.
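          The idea of answering literal and wildcard queries from one trie, and of updating or merging models in place, can be illustrated with a minimal sketch. This is not TrendStream's implementation: the abstract does not describe its binary format, compression, or memory-mapping scheme, so none of those are modelled here. The names `NGramTrie`, `Node`, `WILDCARD`, `match`, `update_from_text`, and `merge` are hypothetical and chosen only for illustration, and the wildcard is assumed to stand for exactly one word.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator

WILDCARD = "*"  # hypothetical token: matches exactly one word


@dataclass
class Node:
    count: int = 0  # frequency of the n-gram that ends at this node
    children: dict[str, "Node"] = field(default_factory=dict)


class NGramTrie:
    """Toy in-memory word-level trie mapping n-grams to counts."""

    def __init__(self) -> None:
        self.root = Node()

    def add(self, ngram: tuple[str, ...], count: int = 1) -> None:
        # Walk (and create) one node per word, then bump the final count.
        node = self.root
        for word in ngram:
            node = node.children.setdefault(word, Node())
        node.count += count

    def lookup(self, ngram: tuple[str, ...]) -> int:
        # Literal query: follow the exact path, 0 if it does not exist.
        node = self.root
        for word in ngram:
            child = node.children.get(word)
            if child is None:
                return 0
            node = child
        return node.count

    def match(self, pattern: tuple[str, ...]) -> Iterator[tuple[tuple[str, ...], int]]:
        # Pattern query on the same structure: a wildcard fans out over
        # all children instead of following a single edge.
        def walk(node: Node, depth: int, prefix: tuple[str, ...]):
            if depth == len(pattern):
                if node.count:
                    yield prefix, node.count
                return
            word = pattern[depth]
            if word == WILDCARD:
                for w, child in sorted(node.children.items()):
                    yield from walk(child, depth + 1, prefix + (w,))
            elif word in node.children:
                yield from walk(node.children[word], depth + 1, prefix + (word,))

        yield from walk(self.root, 0, ())

    def update_from_text(self, tokens: list[str], n: int) -> None:
        # Incremental update: slide an n-word window over incoming tokens.
        for i in range(len(tokens) - n + 1):
            self.add(tuple(tokens[i:i + n]))

    def merge(self, other: "NGramTrie") -> None:
        # Merge another trie into this one by summing counts node by node.
        def walk(dst: Node, src: Node) -> None:
            dst.count += src.count
            for w, child in src.children.items():
                walk(dst.children.setdefault(w, Node()), child)

        walk(self.root, other.root)
```

          With this sketch, feeding the token lists "the cat sat on the mat" and "the cat ate" through `update_from_text(tokens, 2)` gives `lookup(("the", "cat"))` a count of 2, and `match(("the", "*"))` yields the bigrams beginning with "the". A production system at the scale of Web1T would need a compact on-disk layout rather than Python dictionaries.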


                Author and article information

                Journal: Natural Language Engineering (Nat. Lang. Eng.)
                Publisher: Cambridge University Press (CUP)
                ISSN: 1351-3249, 1469-8110
                Dates: January 2013; January 10 2012
                Volume: 19
                Issue: 1
                Pages: 61-93
                Type: Article
                DOI: 10.1017/S1351324911000349
                © 2013

                https://www.cambridge.org/core/terms
