Is My Prediction Arbitrary? Measuring Self-Consistency in Fair Classification

tags
Machine Learning

Notes

interpret this disagreement as a lack of confidence in the classification decision.

NOTER_PAGE: (1 0.4603896103896104 . 0.6890756302521008)

even if a model’s classification decisions satisfy a particular fairness metric, it is not necessarily the case that the model is equally confident in each prediction.

NOTER_PAGE: (1 0.7441558441558441 . 0.22605042016806723)

The classification of Individual 2 is arbitrary.

NOTER_PAGE: (1 0.7688311688311689 . 0.5621848739495798)

the learning process that produced these predictions is not sufficiently confident to justify assigning Individual 2 either decision outcome.

NOTER_PAGE: (1 0.7857142857142857 . 0.7302521008403362)
NOTER_PAGE: (2 0.10909090909090909 . 0.12100840336134455)

to reveal arbitrariness, we must examine distributions of possible models for a given learning process,

NOTER_PAGE: (2 0.18636363636363637 . 0.13277310924369748)

shocking insights concerning empirical reproducibility

NOTER_PAGE: (2 0.4974025974025974 . 0.34957983193277314)

Nearly one-quarter of test instances are effectively 50% self-consistent; they resemble Individual 2 in Figure 1, meaning that their predictions are essentially arbitrary.

NOTER_PAGE: (4 0.4103896103896104 . 0.638655462184874)
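A minimal sketch of how per-instance self-consistency can be estimated, assuming the paper's setup of training many models on bootstrap replicates and measuring how often two independently drawn models agree on each test instance. The variable names (`B`, `self_consistency`) and the 0.55 cutoff for flagging near-arbitrary instances are illustrative, not from the paper's code:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# Synthetic binary task standing in for a fair-classification dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, y_train, X_test = X[:400], y[:400], X[400:]

B = 50  # number of bootstrap-trained models
votes = np.zeros((len(X_test), 2))  # per-instance vote counts per class
for b in range(B):
    Xb, yb = resample(X_train, y_train, random_state=b)
    model = LogisticRegression(max_iter=1000).fit(Xb, yb)
    preds = model.predict(X_test)
    votes[np.arange(len(X_test)), preds] += 1

# Self-consistency: probability that two models drawn (without
# replacement) from the ensemble agree on the instance's label.
# For binary classification it is near 0.5 when votes split evenly.
c0, c1 = votes[:, 0], votes[:, 1]
self_consistency = (c0 * (c0 - 1) + c1 * (c1 - 1)) / (B * (B - 1))

# Instances near 0.5 resemble "Individual 2": the learning process
# assigns their label essentially at random across retrainings.
arbitrary = self_consistency < 0.55  # illustrative cutoff
```

Instances flagged `arbitrary` are exactly the ones a single trained model hides: any one model gives them a definite label, but the distribution over possible models does not.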

even though the 101 models exhibit relatively small average disparities

NOTER_PAGE: (4 0.4642857142857143 . 0.6361344537815127)

motivating claim: It is possible to come close to satisfying fairness metrics, while the learning process exhibits very different levels of confidence for the underlying classifications

NOTER_PAGE: (4 0.5162337662337663 . 0.5050420168067227)

while at first glance it may seem odd that our solution for arbitrariness is to not predict, it is worth noting that we often would have predicted incorrectly on a large portion of the abstention set

NOTER_PAGE: (5 0.6714285714285714 . 0.7176470588235294)

Self-consistent ensembling with abstention

NOTER_PAGE: (5 0.7675324675324675 . 0.12184873949579833)
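A hedged sketch of what self-consistent ensembling with abstention could look like: predict the ensemble's majority class only when agreement clears a threshold, and abstain otherwise. The function name, the `-1` abstention marker, and the 0.75 threshold are all assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

def ensemble_with_abstention(votes, threshold=0.75):
    """votes: (n, 2) array of per-class vote counts from B models.
    Returns majority-vote predictions, with -1 marking abstentions
    on instances whose agreement falls below the threshold."""
    B = votes.sum(axis=1)                 # models voting per instance
    majority = votes.argmax(axis=1)       # majority class
    agreement = votes.max(axis=1) / B     # fraction agreeing with it
    return np.where(agreement >= threshold, majority, -1)

votes = np.array([[48, 2],    # high agreement -> predict class 0
                  [26, 24],   # near-even split -> abstain
                  [5, 45]])   # high agreement -> predict class 1
print(ensemble_with_abstention(votes))  # -> [ 0 -1  1]
```

Abstaining on the near-even-split instance is the point of the method: as the excerpt above notes, a forced prediction there would often have been wrong anyway.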

we typically do not find evidence of significant fairness metric violations. By accounting for arbitrariness, we observe close-to-fairness in nearly every task, without applying common fairness-improving interventions

NOTER_PAGE: (6 0.7909090909090909 . 0.37142857142857144)

might not generalize to larger datasets.

NOTER_PAGE: (6 0.8987012987012987 . 0.21596638655462186)

without using fairness-focused interventions.

NOTER_PAGE: (8 0.08961038961038961 . 0.5647058823529412)

alarming results: Almost all tasks and settings demonstrate close-to or complete statistical equality in fairness metrics, after accounting for arbitrariness

NOTER_PAGE: (8 0.41103896103896104 . 0.7201680672268908)

we do not find many tasks that exhibit high systematic arbitrariness and, even when we do, we can substantially improve it.

NOTER_PAGE: (8 0.4850649350649351 . 0.5033613445378151)

variance is undermining the reliability of conclusions in fair classification experiments.

NOTER_PAGE: (8 0.5461038961038961 . 0.6184873949579832)

we advocate for a shift in thinking about individual models to the distribution over possible models

NOTER_PAGE: (8 0.6435064935064935 . 0.1865546218487395)

By examining individual models, arbitrariness remains latent; when we account for arbitrariness in practice, most problems of empirical unfairness go away.

NOTER_PAGE: (8 0.7045454545454546 . 0.553781512605042)

findings contradict accepted truths in algorithmic fairness.

NOTER_PAGE: (8 0.7896103896103897 . 0.1277310924369748)

common formalisms for measuring fairness can lead to false conclusions about the degree to which such violations are happening in practice. Worse, they can conceal the tremendous amount of arbitrariness that should really be the issue of concern.

NOTER_PAGE: (8 0.8175324675324676 . 0.6302521008403362)

inherent analytical trade-off between fairness and accuracy

NOTER_PAGE: (8 0.8201298701298702 . 0.10756302521008404)

disputes the practical relevance of this formulation

NOTER_PAGE: (8 0.8694805194805195 . 0.09327731092436975)

typically possible to achieve higher accuracy while retaining close-to-fairness

NOTER_PAGE: (8 0.8824675324675325 . 0.28403361344537814)

altogether, our results signal severe limits to prediction in social settings

NOTER_PAGE: (9 0.09480519480519481 . 0.11092436974789917)

our method performs reasonably well with respect to both fairness and accuracy metrics; however, arbitrariness is such a rampant problem, it is arguably unreasonable to assign these metrics much value in practice.

NOTER_PAGE: (9 0.1396103896103896 . 0.09243697478991597)