Risk Scores, Fairness, and Impossibility
(via https://fairmlclass.github.io/1.html#/4)
Over the next few years, Deep Learning is going to get embedded in our daily lives in matters great and small, from self-driving cars (it’ll happen?!) to medical diagnosis, to, well, all sorts of stuff. As we go down this road though, we really need to think through what exactly the algorithms are telling us. The reason is that the information we get back will, by definition, be biased in one form or another.
Ok, “biased” is a bit of a loaded term, but I use it advisedly. I could mean that these algorithms get used for fraud, but I don’t. What I do mean (and it is actually worse) is that with risk scoring, it is mathematically impossible to be “fair” across multiple groups!
(Before we go too much further: a risk score is the likelihood that you possess some trait. For example, the odds that you are a football fan, have glaucoma, or might rob a bank would each be a risk score for that trait.)
So, what does fair mean in this context, and why do we care? Well, let’s say you’re selling football jerseys. You’d want viewers to see ads for your jerseys as long as they are interested in football, regardless of whether they are men or women, right? If the ad platform said something like “meh, it’s a woman, we won’t show her an ad for your jersey”, then that is potential revenue down the tubes. And that is not fair to you!
The underlying point here is that if we want our algorithms to be fair, then we should make sure that they are Calibrated, and have Balance for the Positives and Negatives.
- Calibration: If the score says that X% of a group has the trait, then X% of that group should actually have the trait. Fairness comes in when you make sure this holds for each group separately (e.g. “Regardless of gender, as long as the interest level is the same, we show the person the ad”).
- Balance for the Positives: The average risk score for the positives (people with the trait) in each group should be the same. Fairness comes in because if the average positive risk scores differ between groups (e.g., men vs. women), then the algorithm is more likely to pick people from one of the groups (“we’ll only show ads for your football jerseys to men”).
- Balance for the Negatives: The average risk score for the negatives (people without the trait) in each group should be the same. Fairness comes in here exactly as with Balance for the Positives above. (A sketch of how you might check all three criteria follows this list.)
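To make the three criteria concrete, here is a minimal sketch in Python of how you might measure them for a scored population split into groups. The data, the group labels, and the `fairness_metrics` helper are all hypothetical; this just illustrates the definitions, it is not a fairness library.

```python
import numpy as np

def fairness_metrics(scores, labels, groups):
    """Per-group calibration and balance checks (illustrative only).

    scores : predicted risk scores in [0, 1]
    labels : 1 if the person actually has the trait, 0 otherwise
    groups : group label (e.g. "men" / "women") for each person
    """
    results = {}
    for g in np.unique(groups):
        s, y = scores[groups == g], labels[groups == g]
        results[g] = {
            # Calibration (coarse check): mean predicted score vs. observed rate
            "mean_score": s.mean(),
            "observed_rate": y.mean(),
            # Balance for the positives: average score of people WITH the trait
            "avg_score_positives": s[y == 1].mean(),
            # Balance for the negatives: average score of people WITHOUT the trait
            "avg_score_negatives": s[y == 0].mean(),
        }
    return results

# Hypothetical data: 1,000 people, two groups with different base rates of interest
rng = np.random.default_rng(0)
groups = np.array(["men"] * 500 + ["women"] * 500)
labels = rng.binomial(1, np.where(groups == "men", 0.30, 0.15))
scores = np.clip(0.2 + 0.5 * labels + rng.normal(0, 0.1, size=1000), 0, 1)

for g, m in fairness_metrics(scores, labels, groups).items():
    print(g, {k: round(float(v), 3) for k, v in m.items()})
```

If the printed `avg_score_positives` values differ noticeably between the two groups, the score violates balance for the positives, even when the calibration numbers look fine for each group on its own.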
Easy, right? Just make sure we meet the three criteria above, and we’re good to go.
Well, wrong.
In a fascinating paper (•) Kleinberg et al. show that it is mathematically impossible to satisfy all three of the above constraints, and that whatever you do, you will have tradeoffs.
It’s a lot of dense math, so I’ll leave it out of this post, but the result is deeply illuminating (and painful): perfect fairness, as defined above, is impossible. As the authors put it:
Suppose we want to determine the risk that a person is a carrier for a disease X, and suppose that a higher fraction of women than men are carriers. Then our results imply that in any test designed to estimate the probability that someone is a carrier of X, at least one of the following undesirable properties must hold:
(a) the test’s probability estimates are systematically skewed upward or downward for at least one gender; or
(b) the test assigns a higher average risk estimate to healthy people (non-carriers) in one gender than the other; or
(c) the test assigns a higher average risk estimate to carriers of the disease in one gender than the other.
The point is that this trade-off among (a), (b), and (c) is not a fact about medicine; it is simply a fact about risk estimates when the base rates differ between two groups.
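A toy example shows why the base rates are the culprit. Take a score that is perfectly calibrated but maximally lazy: it assigns every member of a group that group’s base rate. The base rates below are invented purely for illustration.

```python
# Made-up base rates: the fraction of each group who are carriers of disease X
base_rates = {"women": 0.6, "men": 0.2}

for group, p in base_rates.items():
    # This score is perfectly calibrated: of everyone scored p, a fraction p are carriers.
    avg_score_carriers = p      # every carrier in the group received the score p
    avg_score_non_carriers = p  # every non-carrier received the same score p
    print(f"{group}: carriers avg {avg_score_carriers}, non-carriers avg {avg_score_non_carriers}")

# Carriers among women average 0.6 while carriers among men average 0.2, so
# "balance for the positives" fails (and balance for the negatives fails the
# same way), even though the score is calibrated for both groups.
```

This particular score is trivially bad, of course, but the point of the paper is that the tension it exposes cannot be fully resolved by any realistic score once the base rates differ.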
To bring this back to where we started, the algorithms, by definition, are not going to be perfectly fair. Simply implementing them with the assumption that “the chips will fall where they may” is naive at best, and ill-intentioned at worst. We must take context into account when designing these algorithms!
Equally importantly, we must disclose the tradeoffs that have been made in the implementation. The disclosures will ensure that, at some level, there is transparency. Transparency alone is not enough, but at the very least it’s a start…
(Mind you, all of the above is before we get into things like biases in data, cultural artifacts, sampling errors, and whatnot. For much more on this, read this excellent summary at Nature)
(•) “Inherent Trade-Offs in the Fair Determination of Risk Scores”, by Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan.