Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science
Emily M. Bender and Batya Friedman. 2018. Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science. Transactions of the Association for Computational Linguistics 6: 587–604. https://doi.org/10.1162/tacl_a_00041
Summary
In an effort to mitigate the lack of transparent documentation in natural language processing (NLP), Bender and Friedman
propose "data statements," "a characterization of a dataset that provides context to allow developers and users to better understand how
experimental results might generalize, how software might be appropriately deployed, and what biases might be reflected
in systems built on the software" (pg. 587). They recommend using statements in papers presenting new datasets and reporting experimental
work with datasets, and in documentation for NLP systems.
Key Concepts
Reasoning for Data Statements
As of 2018, dataset documentation in NLP typically covers the annotation process and gives only a brief description of the underlying data source; papers that use existing datasets document them even less thoroughly. The authors argue that better documentation of speakers, annotators, curators, and intended stakeholders would increase transparency and make it possible to audit unfair systems. Data statements should thus accompany all new datasets and every experimental paper.
Data Statement Schema
- Curation Rationale: A statement of the rationale behind the selection of segments included in the dataset.
- Language Variety: Describe the language with (1) a BCP-47 tag (e.g., en-US) and (2) a prose description of the language variety (e.g., "English as spoken in Palo Alto, California" (pg. 590)).
- Speaker Demographics: Since linguistic variation depends on speaker demographics, the authors suggest including: "age, gender, race/ethnicity, native language, socioeconomic status, number of different speakers represented, presence of disordered speech" (pg. 590).
- Annotator Demographics: Since annotator background influences their experience with language, the authors suggest including: "age, gender, race/ethnicity, native language, socioeconomic status, training in linguistics/other relevant discipline" (pg. 591).
- Speech Situation: "Time and place, modality (spoken/signed, written), scripted/edited vs. spontaneous, synchronous vs. asynchronous interaction, intended audience" (pg. 591).
- Text Characteristics: Genre and topic.
- Recording Quality
- Other Relevant Information
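As a rough illustration, the schema above could be captured as a structured record to be filled in alongside a dataset release. This is a minimal sketch; the class and field names are hypothetical choices for this note, not from the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataStatement:
    """Hypothetical container mirroring the schema's eight items;
    field names are illustrative, not drawn from the paper."""
    curation_rationale: str
    language_variety: str  # BCP-47 tag plus prose description
    speaker_demographics: str
    annotator_demographics: str
    speech_situation: str
    text_characteristics: str
    recording_quality: Optional[str] = None  # only relevant for audio/video data
    other_relevant_information: Optional[str] = None

# Example with invented, placeholder content:
statement = DataStatement(
    curation_rationale="Posts sampled to study informal online English.",
    language_variety="en-US; English as used on U.S. social media",
    speaker_demographics="Adults; mixed gender; native language unknown.",
    annotator_demographics="Three annotators with linguistics training.",
    speech_situation="Written, asynchronous, public intended audience.",
    text_characteristics="Social media; informal register; varied topics.",
)
print(statement.language_variety)
```

Keeping the last two items optional reflects that they apply only to some datasets (e.g., recording quality for speech corpora).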
Definitions
- Dataset: A collection of speech or writing (may include annotations).
- Annotations: "Indications of linguistic structure like part of speech tags or syntactic parse trees, as well as labels classifying aspects of what the speakers were attempting to accomplish with their utterances ... Labels can be naturally occurring, such as star ratings in reviews..." (pg. 588).
- Speaker: The individual who produced "some segment of linguistic behavior included in the dataset" (pg. 588).
- Annotator: A person who labels the raw data.
- Curator: Someone involved in selecting which data to include in the dataset, e.g., by "creating search terms that generate sets of documents, by selecting speakers to interview and designing interview questions and so forth" (pg. 588).
- Stakeholders: As defined in Value Sensitive Design, those impacted directly or indirectly by the system.
- Algorithm: The authors use "algorithm" to cover both rule-based and machine learning approaches in NLP.
- System: A piece of software that does NLP, including underlying models and user-facing products.
- Bias: The authors define biased computer systems as those that "systematically and unfairly discriminate against certain individuals or groups of individuals in favor of others" (pg. 589). They distinguish three types of bias: (1) pre-existing, rooted in social institutions and practices; (2) technical, arising from technical constraints and decisions; and (3) emergent, arising when a system designed for one context is deployed in another.
Further Reading
- Batya Friedman, Peter H. Kahn, Alan Borning, and Alina Huldtgren. 2013. Value Sensitive Design and Information Systems. Springer, Dordrecht, 55–95. https://doi.org/10.1007/978-94-007-7844-3_4
- Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018. Datasheets for Datasets. http://arxiv.org/abs/1803.09010