
      Text Preprocessing for Text Mining in Organizational Research: Review and Recommendations


          Abstract

          Recent advances in text mining have provided new methods for capitalizing on the voluminous natural language text data created by organizations, their employees, and their customers. Although often overlooked, decisions made during text preprocessing affect whether the content and/or style of language are captured, the statistical power of subsequent analyses, and the validity of insights derived from text mining. Past methodological articles have described the general process of obtaining and analyzing text data, but recommendations for preprocessing text data were inconsistent. Furthermore, primary studies use and report different preprocessing techniques. To address this, we conduct two complementary reviews of computational linguistics and organizational text mining research to provide empirically grounded text preprocessing decision-making recommendations that account for the type of text mining conducted (i.e., open or closed vocabulary), the research question under investigation, and the data set’s characteristics (i.e., corpus size and average document length). Notably, deviations from these recommendations will be appropriate and, at times, necessary due to the unique characteristics of one’s text data. We also provide recommendations for reporting text mining to promote transparency and reproducibility.
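To make the abstract's point concrete, the following is a minimal sketch of how two common preprocessing choices (lowercasing and stop word removal) change the data set characteristics the abstract mentions, namely corpus size and average document length, along with vocabulary size. The stop word list, the tokenization rule, the function names `preprocess` and `corpus_stats`, and the example documents are all illustrative assumptions; none of them come from the article, and the sketch does not reproduce the authors' recommendations.

```python
import re
from statistics import mean

# Hypothetical, deliberately tiny stop word list (an assumption for illustration).
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "is", "in"})


def preprocess(text, lowercase=True, remove_stop_words=True):
    """Tokenize one document, applying two optional preprocessing steps."""
    if lowercase:
        text = text.lower()
    # Simple assumed tokenizer: runs of letters and apostrophes count as tokens.
    tokens = re.findall(r"[A-Za-z']+", text)
    if remove_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return tokens


def corpus_stats(docs, **options):
    """Summarize a corpus after preprocessing each document with the given options."""
    token_lists = [preprocess(d, **options) for d in docs]
    return {
        "corpus_size": len(token_lists),  # number of documents
        "avg_doc_length": mean(len(t) for t in token_lists),  # tokens per document
        "vocabulary_size": len({t for tokens in token_lists for t in tokens}),
    }


# Made-up example documents.
docs = [
    "The manager praised the team.",
    "Customers LOVE the new product and the service.",
]

raw = corpus_stats(docs, lowercase=False, remove_stop_words=False)
cleaned = corpus_stats(docs)
print("raw:    ", raw)
print("cleaned:", cleaned)
```

Comparing the two summaries shows that the same corpus yields a different average document length and vocabulary size depending on the preprocessing applied, which is one reason reporting these decisions matters for reproducibility.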


          Most cited references (184)


          False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant.

          In this article, we accomplish two things. First, we show that despite empirical psychologists' nominal endorsement of a low rate of false-positive findings (≤ .05), flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates. In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not. We present computer simulations and a pair of actual experiments that demonstrate how unacceptably easy it is to accumulate (and report) statistically significant evidence for a false hypothesis. Second, we suggest a simple, low-cost, and straightforwardly effective disclosure-based solution to this problem. The solution involves six concrete requirements for authors and four guidelines for reviewers, all of which impose a minimal burden on the publication process.

            The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods


              Identifying careless responses in survey data.

              When data are collected via anonymous Internet surveys, particularly under conditions of obligatory participation (such as with student samples), data quality can be a concern. However, little guidance exists in the published literature regarding techniques for detecting careless responses. Previously several potential approaches have been suggested for identifying careless respondents via indices computed from the data, yet almost no prior work has examined the relationships among these indicators or the types of data patterns identified by each. In 2 studies, we examined several methods for identifying careless responses, including (a) special items designed to detect careless response, (b) response consistency indices formed from responses to typical survey items, (c) multivariate outlier analysis, (d) response time, and (e) self-reported diligence. Results indicated that there are two distinct patterns of careless response (random and nonrandom) and that different indices are needed to identify these different response patterns. We also found that approximately 10%-12% of undergraduates completing a lengthy survey for course credit were identified as careless responders. In Study 2, we simulated data with known random response patterns to determine the efficacy of several indicators of careless response. We found that the nature of the data strongly influenced the efficacy of the indices to identify careless responses. Recommendations include using identified rather than anonymous responses, incorporating instructed response items before data collection, as well as computing consistency indices and multivariate outlier analysis to ensure high-quality data.

                Author and article information

                Contributors
                Journal
                Organizational Research Methods
                SAGE Publications
                1094-4281
                1552-7425
                January 2022
                November 23 2020
                Volume 25, Issue 1, pages 114-146
                Affiliations
                [1] Purdue University College of Health and Human Sciences, West Lafayette, IN, USA
                [2] Independent
                [3] University of Iowa, Computer Science, Iowa City, IA, USA
                Article
                10.1177/1094428120971683
                a3cbb243-3f20-42e5-8ada-a9446b25523f
                © 2022

                http://journals.sagepub.com/page/policies/text-and-data-mining-license
