From categories to gradience: Auto-coding sociophonetic variation with random forests


Abstract

The time-consuming nature of coding sociophonetic variables that are typically treated as categorical represents an impediment to addressing research questions around these variables that require large volumes of data. In this paper, we apply a machine learning method, random forest classification (Breiman, 2001), to automate coding (categorical prediction) of two English sociophonetic variables traditionally treated as categorical, non-prevocalic /r/ and word-medial intervocalic /t/, based on tokens’ acoustic signatures. We found good performance for binary classifiers of non-prevocalic /r/ (Absent versus Present) and medial /t/ (Voiced versus Voiceless), but not for medial /t/ with a six-way coding distinction (largely due to some codes being sparsely represented in the training data). This method also yields rankings of acoustic measures in terms of importance in classification. Beyond any individual measures, this method generates probabilistic predictions of variation (classifier probabilities) that represent a composite of the acoustic cues fed into the model. In a listening experiment, we found that not only did classifier probabilities significantly capture gradience in trained listeners’ perceptions of rhoticity, they better predicted listeners’ perceptions than individual acoustic measures. This method thus represents a new approach to reconciling the categorical and continuous dimensions of sociophonetic variation.
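The ideas in the abstract (a forest of trees voting on a binary code, a ranking of acoustic measures by importance, and a classifier probability per token) can be illustrated with a toy sketch. This is not the authors' pipeline: the data are synthetic, the measure names `F3`, `F2`, and `noise` are hypothetical, each "tree" is reduced to a single-split stump, and importance is estimated by a simple permutation check on held-out tokens. It uses only the Python standard library.

```python
import random

FEATURES = ["F3", "F2", "noise"]  # hypothetical acoustic measures (assumption)


def make_tokens(n, rng):
    """Generate synthetic (measures, code) pairs; all values are invented."""
    tokens = []
    for _ in range(n):
        present = rng.random() < 0.5
        f3 = rng.gauss(2000 if present else 2600, 200)  # strongly informative
        f2 = rng.gauss(1400 if present else 1500, 200)  # weakly informative
        noise = rng.gauss(0, 1)                         # uninformative
        tokens.append(([f3, f2, noise], "Present" if present else "Absent"))
    return tokens


def fit_stump(sample, feature):
    """Find the single threshold on one feature with the fewest errors."""
    values = sorted({x[feature] for x, _ in sample})
    best = None
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        for below, above in (("Present", "Absent"), ("Absent", "Present")):
            errors = sum((below if x[feature] <= thr else above) != y
                         for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, feature, thr, below, above)
    return best


def fit_forest(train, n_trees, rng):
    """Each stump sees a bootstrap sample and a random subset of features."""
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(train) for _ in train]
        candidates = rng.sample(range(len(FEATURES)), 2)
        stumps = [s for s in (fit_stump(sample, f) for f in candidates) if s]
        forest.append(min(stumps, key=lambda s: s[0]))
    return forest


def prob_present(forest, x):
    """Classifier probability: share of stumps voting 'Present'."""
    votes = sum((below if x[f] <= thr else above) == "Present"
                for _, f, thr, below, above in forest)
    return votes / len(forest)


def accuracy(forest, tokens):
    hits = sum(("Present" if prob_present(forest, x) >= 0.5 else "Absent") == y
               for x, y in tokens)
    return hits / len(tokens)


def permutation_importance(forest, tokens, feature, rng):
    """Accuracy lost when one feature's values are shuffled across tokens."""
    column = [x[feature] for x, _ in tokens]
    rng.shuffle(column)
    shuffled = [(x[:feature] + [v] + x[feature + 1:], y)
                for (x, y), v in zip(tokens, column)]
    return accuracy(forest, tokens) - accuracy(forest, shuffled)


rng = random.Random(0)
train, test = make_tokens(100, rng), make_tokens(60, rng)
forest = fit_forest(train, 100, rng)
test_accuracy = accuracy(forest, test)
importances = {name: permutation_importance(forest, test, i, rng)
               for i, name in enumerate(FEATURES)}
ranking = sorted(importances, key=importances.get, reverse=True)
probs = [prob_present(forest, x) for x, _ in test]
print("held-out accuracy:", test_accuracy)
print("importance ranking:", ranking)
print("first five classifier probabilities:", probs[:5])
```

The vote share returned by `prob_present` is the gradient quantity: a token near 0.5 is one the ensemble finds ambiguous, while the thresholded label is the categorical code. A full random forest would grow deeper trees and resample features at every split.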

Most cited references 73

Random Forests

Leo Breiman (2001)

Fitting Linear Mixed-Effects Models Using lme4

Douglas M Bates, Martin Mächler, Ben Bolker … (2015)

Maximum likelihood or restricted maximum likelihood (REML) estimates of the parameters in linear mixed-effects models can be determined using the lmer function in the lme4 package for R. As for most model-fitting functions in R, the model is described in an lmer call by a formula, in this case including both fixed- and random-effects terms. The formula and data together determine a numerical representation of the model from which the profiled deviance or the profiled REML criterion can be evaluated as a function of some of the model parameters. The appropriate criterion is optimized, using one of the constrained optimization functions in R, to provide the parameter estimates. We describe the structure of the model, the steps in evaluating the profiled deviance or REML criterion, and the structure of classes or types that represents such a model. Sufficient detail is included to allow specialization of these structures by users who wish to write functions to fit specialized linear mixed models, such as models incorporating pedigrees or smoothing splines, that are not easily expressible in the formula language used by lmer.

Journal of Statistical Software, 67(1). ISSN: 1548-7660

Collinearity: a review of methods to deal with it and a simulation study evaluating their performance

Carsten F Dormann, Jane Elith, Sven Bacher … (2013)

Author and article information

Contributors

Dan Villarreal:

ORCID: http://orcid.org/0000-0002-6070-1138

Journal

Journal ID (issn): 1868-6354

Title: Laboratory Phonology: Journal of the Association for Laboratory Phonology

Publisher: Ubiquity Press

ISSN (Electronic): 1868-6354

Publication date (Electronic, pub): 10 June 2020

Publication date Collection: 2020

Volume: 11

Issue: 1

Electronic Location Identifier: 6

Affiliations

[1] Department of Linguistics, University of Pittsburgh, Pittsburgh, PA, US

[2] New Zealand Institute of Language, Brain and Behaviour, University of Canterbury, Christchurch, NZ

[3] Department of Linguistics, University of Canterbury, Christchurch, NZ

Author information

Dan Villarreal http://orcid.org/0000-0002-6070-1138

Lynn Clark http://orcid.org/0000-0003-3282-6555

Jennifer Hay http://orcid.org/0000-0001-8127-0413

Kevin Watson http://orcid.org/0000-0002-2341-0921

Article

DOI: 10.5334/labphon.216

SO-VID: ed429372-1ab7-474e-bc0c-00f6d9bd1993

License:

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. See http://creativecommons.org/licenses/by/4.0/.

History

Date received : 24 July 2019

Date accepted : 11 April 2020

Cited by 1

Considering Performance in the Automated and Manual Coding of Sociolinguistic Variables: Lessons From Variable (ING)
Authors: Tyler Kendall, Charlotte Vaughn, Charlie Farrington …

