Datasheets for Datasets
Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018. Datasheets for Datasets. http://arxiv.org/abs/1803.09010
Summary
To address the lack of dataset documentation, Gebru et al. propose "datasheets," inspired by the more robust documentation
standards of the electronics industry. Datasheets are meant to improve transparency and accountability and to be useful
to both dataset creators and dataset consumers.
Key Concepts
Proposed Benefits
- Dataset Creators: To encourage reflection on the process of creating, distributing, and maintaining a dataset (including benefits and harms).
- Dataset Consumers: To have the information about a dataset needed to make more informed decisions about its use.
Questions
The authors offer guiding questions for creating a datasheet, grouped into the following sections.
- Motivations: Describe the motivations for creating the dataset, including funding, any specific tasks
  the authors had in mind, and who the authors are.
- Composition: Describe the composition of the dataset, such as what kinds of data it contains, whether labels are
  associated with the data, and whether the dataset contains sensitive information.
- Collection Process: Describe the data collection process, such as how the data was collected, where or from whom it
  was collected, who was involved in the collection process, and, if the data relates to people, whether consent was
  given for the data to be collected.
- Processing: Whether the data was processed or labelled, and how.
- Uses: The tasks the dataset is intended to be used for, how it has already been used, and limitations of use.
- Distribution: How the dataset will be distributed and to whom, and any restrictions on distribution.
- Maintenance: Who will maintain the dataset and how, and whether and how others will be able to build on it.
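
The seven sections above map naturally onto a structured record. The following is a minimal sketch, not something the paper provides: the `Datasheet` class, its field names, and the `missing_sections` and `render` helpers are all hypothetical, and the example answers are made up for illustration. It assumes each section is answered in free text.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Datasheet:
    """Hypothetical container: one free-text field per section listed above."""
    motivations: str = ""
    composition: str = ""
    collection_process: str = ""
    processing: str = ""
    uses: str = ""
    distribution: str = ""
    maintenance: str = ""

    def missing_sections(self) -> List[str]:
        """Return the names of sections that have not been answered yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self) -> str:
        """Render the datasheet as plain text, one heading per section."""
        parts = []
        for f in fields(self):
            title = f.name.replace("_", " ").title()
            body = getattr(self, f.name).strip() or "(not yet answered)"
            parts.append(f"{title}\n{body}")
        return "\n\n".join(parts)


# Made-up example answers, for illustration only.
sheet = Datasheet(
    motivations="Created to benchmark sentiment classifiers.",
    composition="English product reviews with binary sentiment labels.",
)
print(sheet.missing_sections())
print(sheet.render())
```

Keeping unanswered sections visible, rather than omitting them, makes gaps in the documentation obvious to a dataset consumer.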