Datasheets for Datasets

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018. Datasheets for Datasets. http://arxiv.org/abs/1803.09010

Summary

In an effort to mitigate the lack of dataset documentation, Gebru et al. propose "datasheets" inspired by more robust documentation standards in the electronics industry. Datasheets are meant to improve transparency and accountability and be useful to both dataset creators and dataset consumers.

Key Concepts

Proposed Benefits

Dataset Creators
To encourage reflection on the process of creating, distributing, and maintaining a dataset (including benefits and harms).

Dataset Consumers
To have the information about the dataset needed to make more informed decisions about its use.

Questions

The authors offer guiding questions for creating a datasheet, grouped into seven categories that follow the dataset lifecycle.

  • Motivations: Describe the motivations for creating the dataset, including funding, any specific tasks the creators had in mind, and who the creators are.
  • Composition: Describe the composition of the dataset, such as what kinds of data are in it, whether labels are associated with the data, and whether the dataset contains sensitive information.
  • Collection Process: Describe the data collection process, such as how the data was collected, where or from whom it was collected, who was involved in the collection process, and, if the data relates to people, whether consent was given for it to be collected.
  • Processing: Whether the data was preprocessed, cleaned, or labelled, and how it was done.
  • Uses: The tasks the dataset is intended to be used for, how it has already been used, and limitations of use.
  • Distribution: How the dataset will be distributed and to whom, and any restrictions on distribution.
  • Maintenance: Who will maintain the dataset and how, and whether and how others will be able to build on it.
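
As a rough illustration of how the seven categories above could be turned into a fill-in template, here is a minimal Python sketch. The paper presents datasheets as prose questionnaires, not code, so the `Datasheet` class, its field names, and the `missing_sections` and `render` helpers are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Datasheet:
    """Hypothetical container: one free-text answer per question category."""

    motivation: str = ""
    composition: str = ""
    collection_process: str = ""
    processing: str = ""
    uses: str = ""
    distribution: str = ""
    maintenance: str = ""

    def missing_sections(self) -> List[str]:
        """Return the names of categories that have no answer yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self) -> str:
        """Render the datasheet as plain text, one titled block per category."""
        blocks = []
        for f in fields(self):
            title = f.name.replace("_", " ").title()
            body = getattr(self, f.name).strip() or "(not yet documented)"
            blocks.append(f"{title}\n{body}")
        return "\n\n".join(blocks)


# Placeholder answers for illustration only; they describe no real dataset.
sheet = Datasheet(
    motivation="Placeholder: why the dataset was created and who funded it.",
    uses="Placeholder: intended tasks and known limitations.",
)
```

Calling `sheet.missing_sections()` lists the categories a creator still has to fill in, which mirrors the reflective purpose of the questions; `sheet.render()` produces a plain-text document a consumer could read.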