      Is Open Access

      Distributional Lexical Semantics for Stop Lists

      Proceedings article
      BCS-IRSG Workshop on Corpus Profiling (IRSG)
      18 October 2008
      Keywords: stop words, lexical distributional semantics, information retrieval

            Abstract

            In this paper, we consider techniques that lead naturally towards using distributional lexical semantics for the automatic construction of corpus-specific stop word lists. We propose and evaluate a method for calculating stop words based on collocation, frequency information and comparisons of distributions within and across samples. This method is tested against the Enron email corpus and the MuchMore Springer Bilingual Corpus of medical abstracts. We identify some of the data cleansing challenges related to the Enron corpus, and particularly how these necessarily relate to the profile of a corpus. We further consider how subsamples of such a corpus should be investigated to ascertain whether the lexical semantic techniques employed might be used to identify and classify variations in the contextual use of keywords, which may help towards content separation in “unclean” collections. The challenge here is the separation of keywords in the same or very similar contexts, which may be conceived as a “pragmatic difference”. Such work may also be applicable to initiatives in which the focus is on constructing (clean) corpora from the web, deriving knowledge resources from wikis, and finding key information within other textual social media.


            Author and article information

            Contributors
            Conference
            October 2008
            Pages: 1-11
            Affiliations
            [0001] University of Surrey
            Article
            DOI: 10.14236/ewic/IRSG2008.5
            © Mr Neil Cooke et al. Published by BCS Learning and Development Ltd. BCS-IRSG Workshop on Corpus Profiling

            This work is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

            BCS-IRSG Workshop on Corpus Profiling (IRSG), London, 18 October 2008
            Electronic Workshops in Computing (eWiC)

            ISSN 1477-9358, BCS Learning & Development

            Self URI (article page): https://www.scienceopen.com/hosted-document?doi=10.14236/ewic/IRSG2008.5
            Self URI (journal page): https://ewic.bcs.org/
            Categories
            Electronic Workshops in Computing

            Applied computer science, Computer science, Security & Cryptology, Graphics & Multimedia design, General computer science, Human-computer-interaction
