Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science

Emily M. Bender and Batya Friedman. 2018. Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science. Transactions of the Association for Computational Linguistics 6: 587–604.


In an effort to mitigate the lack of transparent documentation in natural language processing (NLP), Bender and Friedman propose "data statements," "a characterization of a dataset that provides context to allow developers and users to better understand how experimental results might generalize, how software might be appropriately deployed, and what biases might be reflected in systems built on the software" (pg. 587). They recommend including data statements in papers presenting new datasets, in papers reporting experimental work with existing datasets, and in documentation for NLP systems.

Key Concepts

Reasoning for Data Statements

As of 2018, dataset documentation in NLP typically covers the annotation process and gives a brief description of the underlying data source; papers that merely use existing datasets document even less. The authors argue that richer documentation of speakers, annotators, curators, and intended stakeholders would increase transparency and make it possible to audit systems for unfairness. Data statements should therefore accompany every new dataset and every experimental paper.

Data Statement Schema

  • Curation Rationale: A statement of the rationale for selecting and including the speech or text segments that make up the dataset.
  • Language Variety: Describe the language with (1) a BCP-47 tag (e.g., en-US), (2) a prose description of the language variety (e.g., "English as spoken in Palo Alto, California" (pg. 590)).
  • Speaker Demographics: Since linguistic variation is found to depend on speaker demographics, the authors suggest including: "age, gender, race/ethnicity, native language, socioeconomic status, number of different speakers represented, presence of disordered speech" (pg. 590).
  • Annotator Demographics: Since annotator background influences their experience with language, the authors suggest including: "age, gender, race/ethnicity, native language, socioeconomic status, training in linguistics/other relevant discipline" (pg. 591).
  • Speech Situation: "Time and place, modality (spoken/signed, written), scripted/edited vs. spontaneous, synchronous vs. asynchronous interaction, intended audience" (pg. 591).
  • Text Characteristics: Genre and topic.
  • Recording Quality
  • Other Relevant Information
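The schema above can be represented as a structured record. The sketch below is not from the paper; the class and field names are my own illustration, following the headings of Bender and Friedman's schema, and the filled-in example describes an imagined review corpus.

```python
from dataclasses import dataclass
from typing import Optional

# A minimal sketch (not from the paper) of the data statement schema as a
# structured record; field names mirror Bender & Friedman's headings.
@dataclass
class DataStatement:
    curation_rationale: str
    language_variety: str          # BCP-47 tag plus prose, e.g. "en-US; English as spoken in Palo Alto"
    speaker_demographics: str
    annotator_demographics: str
    speech_situation: str
    text_characteristics: str
    recording_quality: Optional[str] = None  # may not apply to text-only corpora
    other: Optional[str] = None

# Hypothetical example for an imagined corpus of product reviews:
stmt = DataStatement(
    curation_rationale="Product reviews sampled to study sentiment.",
    language_variety="en-US; English as written by U.S. reviewers",
    speaker_demographics="Self-reported age and gender where available.",
    annotator_demographics="Three annotators; native English speakers.",
    speech_situation="Asynchronous written reviews; public intended audience.",
    text_characteristics="Review genre; consumer electronics topics.",
)
print(stmt.language_variety.split(";")[0])  # the BCP-47 tag: "en-US"
```

Making the two optional fields default to None reflects the paper's note that some elements (e.g., recording quality) only apply to certain modalities.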


Definitions

Dataset: A collection of speech or writing (may have annotations).

Annotations: "Indications of linguistic structure like part of speech tags or syntactic parse trees, as well as labels classifying aspects of what the speakers were attempting to accomplish with their utterances ... Labels can be naturally occurring, such as star ratings in reviews..." (pg. 588).

Speaker: The individual who produced "some segment of linguistic behavior included in the dataset" (pg. 588).

Annotator: A person who labels the raw data.

Curator: A person involved in selecting which data to include in the dataset, including "creating search terms that generate sets of documents, by selecting speakers to interview and designing interview questions and so forth" (pg. 588).

Stakeholder: As defined in Value Sensitive Design, stakeholders are those impacted directly or indirectly by the system.

Algorithm: The authors use "algorithm" to cover both rule-based and machine-learned approaches in NLP.

System: A piece of software that does NLP, including underlying models and user-facing products.

Bias: The authors define bias as computer systems that "systematically and unfairly discriminate against certain individuals or groups of individuals in favor of others" (pg. 589). They distinguish three types of bias: (1) pre-existing, arising from social institutions and practices; (2) technical, arising from technical constraints and decisions; and (3) emergent, arising when a system designed in one context is deployed in another.

Further Reading

  • Batya Friedman, Peter H. Kahn, Alan Borning, and Alina Huldtgren. 2013. Value Sensitive Design and Information Systems. Springer, Dordrecht, 55–95.
  • Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé, and Kate Crawford. 2018. Datasheets for Datasets. arXiv preprint arXiv:1803.09010.