Distributional Lexical Semantics for Stop Lists

In this paper, we consider techniques that lead naturally towards the use of distributional lexical semantics for the automatic construction of corpus-specific stop word lists. We propose and evaluate a method for calculating stop words based on collocation, frequency information, and comparisons of distributions within and across samples. This method is tested against the Enron email corpus and the MuchMore Springer Bilingual Corpus of medical abstracts. We identify some of the data cleansing challenges posed by the Enron corpus, and in particular how these relate to the profile of a corpus. We further consider how subsamples of such a corpus might be investigated to ascertain whether the lexical semantic techniques employed could identify and classify variations in the contextual use of keywords, which may help towards content separation in “unclean” collections. The challenge here is separating keywords that occur in the same or very similar contexts, a distinction that may be conceived of as a “pragmatic difference”. Such work may also be applicable to initiatives that focus on constructing (clean) corpora from the web, deriving knowledge resources from wikis, and finding key information within other textual social media.
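As a rough illustration of the frequency and cross-sample comparison idea mentioned in the abstract, the sketch below flags words that are frequent in every sample and whose relative frequency varies little between samples. It is not the authors' method: the abstract gives no formulas, collocation is not modelled here, and the function names and both thresholds (`min_mean_freq`, `max_spread`) are hypothetical choices made only for this example.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase and keep runs of letters/apostrophes (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def relative_frequencies(tokens):
    """Map each word to its share of the sample's tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def candidate_stop_words(samples, min_mean_freq=0.02, max_spread=0.5):
    """Return words that are frequent in every sample and evenly distributed.

    min_mean_freq and max_spread are hypothetical thresholds chosen for
    illustration; they are not taken from the paper. Expects at least one sample.
    """
    profiles = [relative_frequencies(tokenize(s)) for s in samples]
    # Only words present in all samples are considered as candidates.
    shared = set(profiles[0]).intersection(*profiles[1:])
    candidates = []
    for word in shared:
        freqs = [p[word] for p in profiles]
        mean = sum(freqs) / len(freqs)
        # Relative range of the word's frequency across samples.
        spread = (max(freqs) - min(freqs)) / mean
        if mean >= min_mean_freq and spread <= max_spread:
            candidates.append((word, mean))
    # Most frequent first; ties broken alphabetically for a stable order.
    return [w for w, _ in sorted(candidates, key=lambda x: (-x[1], x[0]))]
```

Calling `candidate_stop_words` with a list of sample texts (for example, one string per subsample of an email corpus) returns the candidate words ordered by mean relative frequency. Words that are frequent in only one sample are excluded, which is the property that makes such a list corpus-specific rather than generic.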

Author and article information

Contributors

Mr Neil Cooke

Dr Lee Gillam

Conference

Publication date: October 2008

Publication date (Print): October 2008

Pages: 1-11

Affiliations

University of Surrey

Article

DOI: 10.14236/ewic/IRSG2008.5

SO-VID: 3e6754f4-e938-4734-a21f-201bf8fa0a67

License:

This work is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

Conference name: BCS-IRSG Workshop on Corpus Profiling

Conference acronym: IRSG

Conference location: London

Conference date: 18 October 2008

Conference sponsor: Electronic Workshops in Computing (eWiC)

Conference theme: Workshop on Corpus Profiling

Product

1477-9358 BCS Learning & Development

Self URI (article page): https://www.scienceopen.com/hosted-document?doi=10.14236/ewic/IRSG2008.5

Self URI (journal page): https://ewic.bcs.org/
