
      Unifying Human and Statistical Evaluation for Natural Language Generation

      Preprint


          Abstract

          How can we measure whether a natural language generation system produces both high quality and diverse outputs? Human evaluation captures quality but not diversity, as it does not catch models that simply plagiarize from the training set. On the other hand, statistical evaluation (i.e., perplexity) captures diversity but not quality, as models that occasionally emit low-quality samples would be insufficiently penalized. In this paper, we propose a unified framework which evaluates both diversity and quality, based on the optimal error rate of predicting whether a sentence is human- or machine-generated. We demonstrate that this error rate can be efficiently estimated by combining human and statistical evaluation, using an evaluation metric which we call HUSE. On summarization and chit-chat dialogue, we show that (i) HUSE detects diversity defects which fool pure human evaluation and that (ii) techniques such as annealing for improving quality actually decrease HUSE due to decreased diversity.
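          As a rough, non-authoritative sketch of the idea in the abstract: assuming each evaluated sentence comes with two features, a human quality judgment and the model's length-normalized log-probability, the optimal error of telling human text from machine text can be approximated with a simple leave-one-out nearest-neighbor classifier, and a HUSE-style score reported as twice that error. The feature names, the choice of k, and the toy numbers below are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only (not the authors' code): estimate a HUSE-style
# score as twice the leave-one-out error of a k-NN classifier that tries
# to separate human-written from machine-generated sentences using two
# features per sentence: a human quality judgment and the model's
# length-normalized log-probability.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

def huse_style_score(human_judgments, model_logprobs, labels, k=3):
    """labels: 1 for human-written sentences, 0 for machine samples."""
    X = np.column_stack([human_judgments, model_logprobs])
    # Standardize features so neither dominates the nearest-neighbor distance.
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
    clf = KNeighborsClassifier(n_neighbors=k)
    # Leave-one-out accuracy approximates how well human and machine text
    # can be told apart; its complement approximates the optimal error rate.
    accuracy = cross_val_score(clf, X, labels, cv=LeaveOneOut()).mean()
    return 2.0 * (1.0 - accuracy)  # ~1.0: indistinguishable, ~0.0: easily separable

# Toy usage with made-up numbers:
hj = np.array([4.5, 4.0, 3.0, 2.5, 4.2, 1.8])        # crowdworker quality ratings
lp = np.array([-1.1, -1.3, -2.0, -2.4, -1.2, -3.0])  # per-token log-probabilities
y  = np.array([1, 1, 0, 0, 1, 0])                    # 1 = human, 0 = machine
print(huse_style_score(hj, lp, y))
```

          Under this framing, low diversity surfaces through the statistical (probability) feature while low quality surfaces through the human judgments, so neither signal alone suffices, which matches the abstract's argument for combining the two.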


                Author and article information

                Journal: arXiv preprint
                Date: 04 April 2019
                Article ID: arXiv:1904.02792

                License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/

                History
                Custom metadata: NAACL Camera Ready Submission
                arXiv categories: cs.CL cs.AI stat.ML

                Subjects: Theoretical computer science, Machine learning, Artificial intelligence
