Measurement and Fairness

Jacobs, A.Z. and Wallach, H. 2019. Measurement and Fairness.

This paper "introduce[s] the language of measurement modeling from the quantitative social sciences as a framework for understanding fairness in computational systems" (pg. 1). A measurement model is defined as "a statistical model that links unobservable theoretical constructs, operationalized as latent variables, and data about the world" (pg. 2). They argue that computer systems are designed to measure unmeasurable attributes (e.g., credit worthiness, risk to society, work quality) through inferred observable properties, which may introduce mismatches between the construct itself (quality) and its operationalization and output. They argue that the harms in fair ML often stem from such mismatches. They further posit that disagreements around operationalizing fairness definitions are more often disagreements about the theoretical construct of fairness itself. Collapsing the differences between a theoretical construct and measurement can mask historical injustice. Measurement modeling is put forth as a tool for "testing assumptions about unobservable theoretical constructs, thereby making it easier to identify, characterize, and even mitigate fairness-related harms" (pg. 1).

Measurement Modeling

The authors present examples of measurement modeling, often used in social science fields like psychology and education to measure otherwise abstract and unobservable constructs. They begin with more "simple" constructs (height) and move to increasingly abstract ones to demonstrate how to use measurement modeling.

Representational Measurements

Representational measurements are "representing physical objects and their relationships by numbers" (e.g., a ruler being used to measure a unit of height, like the height of a human) (pg. 4). The authors point out that even something as seemingly straightforward as measuring height is confounded by a number of definitional constraints. If height is defined as the length of a person from the bottom of the feet to the top of the head, a number of questions arise. Does one include hair in height? What about those without legs, or in a wheelchair? What if the person has a slouch? Further, tools present confounding variables. For example, the angle of the ruler, the granularity of measurement marks on that ruler, and human errors in measurement all add some level of noise to each measurement. Errors are often accounted for in equations by assuming they are statistically unbiased, have small variance, and are normally distributed (measurement error models). However, measurement errors are not necessarily "well-behaved" and may "be correlated with sensitive attributes, such as race or gender" (pg. 4). For example, studies have shown that self-reported heights on dating apps are more erroneous for men, who tend to overestimate their height.
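The classical "well-behaved" error assumptions can be made concrete with a small simulation. The group labels and bias size below are hypothetical, chosen only to illustrate an error that is correlated with a sensitive attribute, as in the paper's self-reported-height example:

```python
import random
import statistics

random.seed(0)

# Classical measurement error model: observed = true + error, where the
# error is assumed unbiased, small-variance, and normally distributed.
true_heights = [random.gauss(170, 10) for _ in range(10_000)]
observed = [h + random.gauss(0, 1) for h in true_heights]

# Ill-behaved errors correlated with a (hypothetical) sensitive
# attribute: one group systematically over-reports by ~2 units.
groups = [random.choice(("a", "b")) for _ in true_heights]
biased = [h + random.gauss(2 if g == "a" else 0, 1)
          for h, g in zip(true_heights, groups)]

mean_err = statistics.mean(o - t for o, t in zip(observed, true_heights))
err_a = statistics.mean(o - t for o, t, g in zip(biased, true_heights, groups) if g == "a")
err_b = statistics.mean(o - t for o, t, g in zip(biased, true_heights, groups) if g == "b")
print(round(mean_err, 2), round(err_a, 2), round(err_b, 2))
```

The first mean error is near zero (the standard assumption holds), while the per-group errors diverge, which is exactly the situation where a measurement error model's assumptions break down.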

Pragmatic Measurements

Pragmatic measurements are used for constructs that are inherently unobservable (e.g., socioeconomic status). They are designed to capture data about aspects of the underlying unobservable phenomenon. For example, an observable property like "income" may be used to infer socioeconomic status. In operationalizing socioeconomic status via income, together with a measurement error model, "we are making our assumptions about the relationships between the unobservable theoretical constructs of interest and the observed data explicit" (pg. 5). Further, there are other ways of measuring socioeconomic status beyond income, including: "years of education, location of residence, wealth, or occupation ... [and] other indicators drawn from observed properties, such as online purchasing behavior or group affiliations" (pg. 5).
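A minimal sketch of this kind of pragmatic measurement, assuming a hypothetical latent SES variable and two noisy indicators (income and education) with made-up error variances. Making the assumed relationships explicit also lets us compare operationalizations:

```python
import random
import statistics

random.seed(1)

# Hypothetical latent construct (SES) driving two observable indicators.
n = 5_000
ses = [random.gauss(0, 1) for _ in range(n)]             # unobservable construct
income = [s + random.gauss(0, 0.8) for s in ses]         # one operationalization
education = [s + random.gauss(0, 0.8) for s in ses]      # another indicator
composite = [(i + e) / 2 for i, e in zip(income, education)]

def corr(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = statistics.mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (statistics.pstdev(x) * statistics.pstdev(y))

# Under these assumed error variances, a composite of indicators tracks
# the latent construct more closely than income alone.
print(round(corr(ses, income), 2), round(corr(ses, composite), 2))
```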

Topic Modeling

Topic modeling is an interesting case because topics are "unobservable theoretical constructs that are indirectly evidenced," inferred from observable data like words (pg. 5). With topic modeling, there is an implicit assumption that there is no measurement error.

Evaluating Measurement Models

The authors argue that the assumptions "about the relationships between unobservable theoretical constructs of interest, their operationalizations, and the observed data" must be evaluated before relying on the measurements (pg. 6). Social scientists employ two forms of evaluation: construct validity (is it the right construct?) and construct reliability (is it able to be repeated?). These are further bolstered by interpretation (what does it mean?) and application (does it work as expected/intended?). The authors argue that "the language of measurement, with the tools of construct validity and reliability, provide a concrete framework to assess whether, and how, operationalizations are useful matches for the construct they try to measure" (pg. 6).

Construct Validity

Definition: "the process of showing that an operationalization of an unobservable theoretical construct is meaningful and useful" (pg. 7).

To examine the quality of a measurement construct, one must ask the following: Is the measurement centered around the construct of interest in a systematic manner? Do the measurements capture every relevant facet of the construct? Do measurements behave as expected, and do we know why or why not? Do measurements vary in ways that might suggest we have captured unintended variables? Do the measurements help us answer meaningful questions? What are the social consequences of using these measurements?

Measuring validity is not a simple binary, but a matter of critical reasoning and interrogating assumptions. The authors present a framework for assessing construct validity that involves seven components, synthesized from a variety of social science approaches: (1) face validity; (2) content validity; (3) convergent validity; (4) discriminant validity; (5) predictive validity; (6) hypothesis validity; and (7) consequential validity.

Face Validity
Whether the measurements produced by a construct appear plausible given the researcher's or practitioner's expertise. Face validity is inherently subjective and serves as a first step in assessing construct validity.

Content Validity
Whether the measurement model captures everything we believe relevant. This involves two types of agreement: (1) an agreed-upon definition of the theoretical construct itself and (2) agreement between that definition and its operationalization. If there is a lack of agreement, we can still make explicit our assumptions about the definition we settle on. "Establishing content validity ensures that there is a substantive match between the observed world being measured and all relevant aspects of the construct" (pg. 8).

Convergent Validity
Whether measurement outputs are closely related to any existing measurements of the same construct where validity has already been established. "This type of validity is explored more quantitatively, but can reveal qualitative differences between operationalizations" (pg. 9).

Discriminant Validity
If our measurements capture other constructs, we must assess the degree to which those constructs are related to the construct of interest; the measurements should be related only to that same degree. Whereas the previous three assessments "confirm that our measurements wholly capture the intended construct" (pg. 9), discriminant validity checks that they do not also capture unrelated constructs. If two constructs are entirely unrelated, their measurements should have zero correlation.
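The zero-correlation claim can be checked with a toy simulation, using hypothetical measurements of two unrelated latent constructs:

```python
import random
import statistics

random.seed(2)

# Two unrelated latent constructs, each measured with some noise.
n = 20_000
construct_a = [random.gauss(0, 1) for _ in range(n)]
construct_b = [random.gauss(0, 1) for _ in range(n)]   # independent of A

meas_a = [x + random.gauss(0, 0.5) for x in construct_a]
meas_b = [x + random.gauss(0, 0.5) for x in construct_b]

def corr(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = statistics.mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (statistics.pstdev(x) * statistics.pstdev(y))

# Unrelated constructs -> measurements correlate near zero; a sizable
# correlation here would signal a discriminant-validity problem.
print(round(corr(meas_a, meas_b), 3))
```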

Predictive Validity
Whether measurements are related to other external properties that were expected to influence our measurement. There are quantitative approaches ("do our measurements correlate with related properties?") and qualitative approaches ("do our measurements vary with properties we expect them to?") (pg. 10). The focus is examining properties more broadly to see how they may or may not be related to our measurement construct. It is mostly "about showing that our constructs follow expected relationships with properties that were not explicitly included in the model" (pg. 10).

Hypothesis Validity
Whether or not the construct has been operationalized in a theoretically meaningful and useful way. This is partly established through convergent validity, which lets us show that our operationalization is useful in comparison to others, particularly for testing new hypotheses and asking new questions. Hypothesis validity can be established by replicating past construct measurements and by showing that measurements of our construct relate to measurements of other constructs in expected ways.

Consequential Validity
Whether the construct should ever be used, regardless of how well it is operationalized. It is focused on downstream social impacts. In some cases, the measurement may be misleading due to social context (e.g., students from lower-income households may need to take paid summer jobs while students from higher-income households can afford unpaid internships, so a measure built on income or internship experience would misrepresent those students).

Reliability

While a measure must be valid, it must also be reliable; measures confounded by imprecise measurement, meaningless scales, instability, or unstable inference are not useful. Rerunning a model should yield similar results. In computational modeling, "a lack of reliability could emerge from numerical instability; a failure of the model to converge; strong dependence on particular physical processes, random seeds, or implementation" (pg. 11-12). Sensitivity of the model to outliers, small amounts of noise, and implementation details may impact reliability, and these sensitivities often go unreported in publications.
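A sketch of a seed-based reliability check; the estimator here (a bootstrap mean over a fixed dataset) is a stand-in chosen only for illustration:

```python
import random
import statistics

# Fixed dataset; the estimator's only randomness comes from its seed.
rng = random.Random(42)
data = [rng.gauss(10, 2) for _ in range(1_000)]

def noisy_estimate(data, seed, n_samples=2_000):
    """A stochastic estimate of the mean via bootstrap resampling."""
    sampler = random.Random(seed)
    sample = [sampler.choice(data) for _ in range(n_samples)]
    return statistics.mean(sample)

# Rerun under different seeds; a reliable procedure should produce
# nearly identical results each time. A large spread would indicate
# the kind of seed-dependence the authors flag.
estimates = [noisy_estimate(data, seed) for seed in range(10)]
spread = max(estimates) - min(estimates)
print(round(spread, 3))
```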

Disagreements in Fairness Constructs

The authors review how fairness, as an inherently contested construct, has led to disagreements on recent fairness issues in ML.

Parity- and Calibration-Based Fairness

A major disagreement has been around the COMPAS risk assessment tool. ProPublica argued that the tool was unfair in that it lacked parity: it falsely labelled Black defendants as high risk at a greater rate than it did white defendants. Parity holds that "rates of error and potential consequences must be the same across groups" (pg. 16). Northpointe, which developed the tool, argued its algorithm had calibration-based fairness: "for the same risk score, outcomes should be the same across groups" (pg. 16). The authors state "the disagreement between these two operationalizations yields challenges to convergent validity (measures are misaligned) and content validity (each capturing different theoretical understandings of fairness)" (pg. 16). Parity and calibration are theoretically in contradiction with one another.
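The tension can be demonstrated on synthetic data (not COMPAS data): scores that are calibrated by construction still yield unequal false positive rates whenever the groups' score distributions differ. The group names, score ranges, and threshold below are all illustrative assumptions:

```python
import random

random.seed(3)

def make_group(lo, hi, n=50_000):
    """Scores uniform on [lo, hi]; outcome is Bernoulli(score),
    so the scores are calibrated by construction."""
    rows = []
    for _ in range(n):
        score = random.uniform(lo, hi)
        outcome = random.random() < score
        rows.append((score, outcome))
    return rows

g1 = make_group(0.1, 0.9)   # hypothetical higher-scoring group
g2 = make_group(0.0, 0.6)   # hypothetical lower-scoring group

def fpr(rows, threshold=0.5):
    """False positive rate: share of true negatives scored at/above threshold."""
    negatives = [s >= threshold for s, y in rows if not y]
    return sum(negatives) / len(negatives)

def outcome_rate(rows, lo=0.5, hi=0.55):
    """Observed outcome rate within one score band (calibration check)."""
    in_band = [y for s, y in rows if lo <= s < hi]
    return sum(in_band) / len(in_band)

# Calibration agrees across groups at the same score band...
print(round(outcome_rate(g1), 2), round(outcome_rate(g2), 2))
# ...but false positive rates diverge, so parity fails.
print(round(fpr(g1), 2), round(fpr(g2), 2))
```

This mirrors the structure of the COMPAS dispute: each side's metric can be satisfied or violated on the same scores, because each operationalizes a different construct of fairness.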

Individual and Group Fairness

Individual fairness approaches hold that similar individuals should be treated similarly. Group fairness approaches hold that groups should be classified similarly: individuals within one group are treated similarly to others in the same group, but may be treated differently than those in another group. These operationalizations are often incompatible with one another, as they embody different theoretical views of fairness.

Outcomes of Fairness: Justice, Due Process, Distribution, Equal Opportunity, Etc.

"Measurements of fairness that do not account for many of the different philosophical, legal, economic and practical notions of the theoretical construct of “fairness” necessarily lack content validity. Operationalizations of fairness that fail to account for due process and concepts of justice, distributive or otherwise, represent deep threats to consequential validity" (pg. 17). The construct of an outcome like "equal opportunity" is difficult to operationalize, and conflating operationalization with construct makes it more difficult to assess harms stemming from issues of validity.

Sensitive Attributes

Sensitive attributes like race and gender are essential to fairness and are themselves contested constructs, across culture, geography, and time. Inferring attributes like race and gender comes with underlying assumptions about agreed-upon definitions of what they are and how they can be inferred.