      Is Open Access

      Distributional Lexical Semantics for Stop Lists

      BCS-IRSG Workshop on Corpus Profiling (IRSG)

      Workshop on Corpus Profiling

      18 October 2008

      Keywords: stop words, lexical distributional semantics, information retrieval

          Abstract

          In this paper, we consider techniques that lead naturally towards using distributional lexical semantics for the automatic construction of corpus-specific stop word lists. We propose and evaluate a method for calculating stop words based on collocation, frequency information, and comparisons of distributions within and across samples. The method is tested on the Enron email corpus and the MuchMore Springer Bilingual Corpus of medical abstracts. We identify some of the data cleansing challenges posed by the Enron corpus, and in particular how these relate to the profile of a corpus. We further consider how the behaviour of subsamples of such a corpus should be investigated to ascertain whether the lexical semantic techniques employed might identify and classify variations in the contextual use of keywords, which may help towards content separation in “unclean” collections. The challenge here is separating keywords that occur in the same or very similar contexts, a distinction that may be conceived as a “pragmatic difference”. Such work may also apply to initiatives that focus on constructing (clean) corpora from the web, deriving knowledge resources from wikis, and finding key information within other textual social media.
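
          The abstract does not spell out the method, so the following is only a minimal sketch of one way frequency information and cross-sample distribution comparison could be combined to propose stop words. It is not the authors' algorithm: the function names, the thresholds `min_rel_freq` and `max_cv`, and the use of the coefficient of variation as the comparison measure are all assumptions made for illustration, and the collocation component is left out entirely.

```python
from collections import Counter
import math


def relative_frequencies(tokens):
    """Map each word to its share of the tokens in one sample."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def candidate_stop_words(samples, min_rel_freq=0.01, max_cv=0.5):
    """Propose stop words from a list of tokenised samples.

    A word is kept when (a) its mean relative frequency across samples
    is at least `min_rel_freq`, and (b) its relative frequency varies
    little between samples, measured by the coefficient of variation
    (standard deviation divided by mean). Both thresholds are
    illustrative assumptions, not values taken from the paper.
    """
    freqs = [relative_frequencies(sample) for sample in samples]
    vocabulary = set().union(*freqs)

    kept = []
    for word in vocabulary:
        values = [f.get(word, 0.0) for f in freqs]
        mean = sum(values) / len(values)
        if mean < min_rel_freq:
            continue  # too rare to be a useful stop word candidate
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        cv = math.sqrt(variance) / mean
        if cv <= max_cv:
            kept.append((word, mean))

    # Most frequent first; ties broken alphabetically for stable output.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in kept]


# Two toy samples standing in for subsamples of different corpora.
email_sample = "the meeting is on the agenda for the board".split()
medical_sample = "the patient is in the ward and the doctor is here".split()
stop_words = candidate_stop_words([email_sample, medical_sample])
```

          In this toy run only words that are both frequent and evenly spread across the two samples survive, while words confined to a single sample (such as "meeting" or "patient") are rejected because their relative frequency differs too much between samples. A corpus-specific list would compare subsamples drawn from within one corpus instead.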

                Author and article information

                Conference
                Published: October 2008
                Pages: 1-11
                Affiliations
                University of Surrey
                Article
                DOI: 10.14236/ewic/IRSG2008.5
                © Mr Neil Cooke et al. Published by BCS Learning and Development Ltd. BCS-IRSG Workshop on Corpus Profiling

                This work is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

                Conference: BCS-IRSG Workshop on Corpus Profiling (IRSG), London, 18 October 2008
                Series: Electronic Workshops in Computing (eWiC)
                Product Information: 1477-9358, BCS Learning & Development
                Self URI (journal page): https://ewic.bcs.org/
                Categories
                Electronic Workshops in Computing
