I’ll never forget my “aha” moment with bias in AI. I was working at IBM as the product owner for Watson Visual Recognition, and we were planning to launch an improvement to the API. It was the fall of 2016. We knew the API wasn’t best in class at returning “accurate” tags for images and that we needed to improve it, but it wasn’t clear how to measure “better” accuracy for general image tagging. The most obvious dataset to test on was ImageNet, an academic dataset, but business use was different. I was incredibly focused on answering two questions: “How would we know the new system was better than the old system? What did ‘better’ mean?”
I was also nervous. Specifically, I was nervous about the possibility of bias creeping into our models, the exact sort of problem the machine learning community has seen time and time again, from poor facial recognition of diverse individuals to an AI beauty pageant that went awry, along with countless other instances of data-driven bias that result in both bad press and bad functionality. So we decided to look into the labels and make sure there wasn’t anything objectionable in there. At first blush, everything seemed fine, with tens of thousands of labels for dogs and cats and plants and coffee mugs.
A couple of weeks prior to launch, one of the researchers on our team brought something to my attention. One of the image classes used to train our model was “loser,” and a lot of those images depicted people with disabilities.
I was horrified. Obviously, that data wouldn’t do. We couldn’t have that in our model. And then we started wondering, “what else have we overlooked?” Who knows what seemingly innocuous label might train our model to exhibit inherent or latent bias? We gathered everyone we could–from engineers to data scientists to marketers to, really, anyone who was willing to help–to comb through the tens of thousands of labels and millions of associated images and pull out everything we found objectionable according to IBM’s code of conduct. It was well worth doing – we pulled out more than a handful of other classes which didn’t reflect our values.
My “aha” moment helped avert a crisis. But I also realized that we had some advantages in doing so. For one, we had a diverse team (different ages, races, ethnicities, geographies, experience, etc.) and a shared understanding of what was and wasn’t objectionable. We also had the time, support, and the resources to both look for objectionable labels and to fix them.
That said, even with those advantages, there are some things we could have done better. To start, we had used an academic dataset as one of our training sets and tried to apply it to a business process. That’s a big rookie mistake. We also didn’t have a concrete plan to leverage human-in-the-loop processes to fix bad labels or tell our visual recognition API, “no, this is wrong.” And we were targeting “enterprises” generally. We didn’t have a narrow enough audience and market, so we tried to build something that would work for everyone instead of homing in on something that would excel for a smaller use case.
There are plenty of blogs and news stories out there about instances of algorithmic bias. But far fewer articles teach you how to keep bias from creeping into your machine learning projects. I’d like to share a few tips I’ve learned not just from my “aha” moment at Watson, but from the time I’ve spent at Figure Eight seeing the raw ingredients that train, test, and tune the machine learning models of tomorrow.
It’s best to start by defining the decision you’re asking your model to solve for. As I mentioned above, with Watson, because our problem was very broad, we used very broad training data, and that led directly to some problems. But that’s just the start. Ask yourself:
- Is your AI making decisions that are usually made by humans? If so, there could be biases in that set of decisions you’ll need to carefully look at.
- Is the decision simple, with black and white answers? Because often, when you peel back the onion and look at real-world examples, there is context you might’ve missed. Think about the “loser” label I referenced above. We couldn’t have predicted that, but it was in our data regardless.
- Is there a protected class or status involved in this decision? Some cases are easy (is this a specific crop or a weed?). Some are harder (is this clothing intended for a man or a woman?). Some can be hidden (is the bias in the data directly? Think here about sentencing algorithms that may include demographic data that’s actually masked as geography data).
Those questions can help you deduce what might go wrong before you start building your model. Next, you’ll want to define the attributes upon which you’d like decisions to be made.
For example, if you’re creating a computer vision model that’s answering a fairly straightforward question like “is this a human?” you need to actually define what you mean by “human.” Do cartoons count? What about court sketches? What if the person is partially occluded? Should a torso count as “human” for your model? What about just a hand? This all matters. You need clarity on what “human” means for this model. If you’re unsure, just ask several people the same question about your data and compare their answers. You might be surprised by the ambiguities present and the assumptions you made going in.
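One way to act on that advice is to collect answers from several people per image and look at where they disagree. The sketch below is only an illustration: the item names, the five-answers-per-item setup, and the 80% agreement cutoff are all assumptions made for the example, not anything we used at Watson or Figure Eight.

```python
from collections import Counter


def agreement_report(judgments, min_agreement=0.8):
    """Return (item_id, majority_answer, agreement) for contested items.

    An item is contested when the share of people giving the most common
    answer is below min_agreement (an arbitrary cutoff for this sketch).
    """
    ambiguous = []
    for item_id, answers in judgments.items():
        if not answers:
            continue  # nothing to measure for items nobody judged
        top_answer, top_count = Counter(answers).most_common(1)[0]
        agreement = top_count / len(answers)
        if agreement < min_agreement:
            ambiguous.append((item_id, top_answer, agreement))
    # Lowest agreement first, so the murkiest items are reviewed first
    return sorted(ambiguous, key=lambda row: row[2])


# Hypothetical answers from five people to "is this a human?"
judgments = {
    "photo_portrait": ["yes", "yes", "yes", "yes", "yes"],
    "cartoon_character": ["yes", "no", "no", "yes", "no"],
    "court_sketch": ["yes", "yes", "no", "yes", "yes"],
    "hand_only": ["no", "yes", "no", "yes", "yes"],
}

ambiguous_items = agreement_report(judgments)
for item_id, answer, agreement in ambiguous_items:
    print(f"{item_id}: majority={answer!r}, agreement={agreement:.0%}")
```

Items that land in this list are usually a sign that the definition needs work, not that the people answering were careless. Each one is a prompt to write down a rule (“cartoons do not count,” “a hand alone does count”) before labeling at scale.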
At this point, you should know both what you’re solving for and what could go wrong. In essence, you should know “what is this thing we’re building?” and “what are some things that might go wrong for our end user?” Once you have a framework here, it’s of paramount importance to deeply review your data.
After all, this is where bias is often hidden. A few years back, researchers at the University of Washington & the University of Maryland found that doing an image search for certain jobs revealed serious underrepresentation and bias in results. Search “nurse,” for example, and you’d see only women. Search “CEO” and it’s all men. The search results were accurate in certain ways–the pictures were indeed of nurses and CEOs–but they painted a world in which those jobs were uniformly held by women or men, respectively. This is just one example, but it shows how bias can lurk in data without you being able to readily identify it.
You need to think about these issues when you’re reviewing your data. It’s one of the reasons why having a diverse team involved is crucial. Diverse backgrounds help ensure that your team will be asking different questions, thinking about different end users, and, hopefully, creating a technology with some empathy in mind.
A few things that are must-asks:
- Where did you get your data from? Could there be sourcing bias? A facial recognition model built in a college lab with collegiate faces might have issues with children or the elderly, for example.
- Do you have enough examples of edge cases? Without them, your model will have trouble identifying unlabeled examples that are outside the norm. Showing a model more male nurses and more female CEOs will help it to spin out less biased results.
- Are you thinking about your end user enough? While I realize I mentioned this above, it’s really crucial. Don’t just think of yourself as the end user; think of as many end users as possible for your project. Test with those end users. Doing so will help you find the solvable problems now, before your model is in production.
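The edge-case question above can be partly checked with a simple count, provided your data carries the relevant metadata. Here is a rough sketch, assuming each training example is a dictionary with a hypothetical "label" field and a hypothetical "gender" field; the 20% minimum share is an arbitrary threshold chosen for illustration, and real datasets often lack this kind of annotation entirely.

```python
from collections import Counter, defaultdict


def underrepresented_groups(records, label_key, group_key, min_share=0.2):
    """For each label, report groups whose share of examples is below min_share.

    Only groups that appear somewhere in the data are checked, so a group
    that is missing from the whole dataset will not be reported.
    """
    counts = defaultdict(Counter)
    all_groups = set()
    for record in records:
        counts[record[label_key]][record[group_key]] += 1
        all_groups.add(record[group_key])

    gaps = {}
    for label, group_counts in counts.items():
        total = sum(group_counts.values())
        low = {
            group: group_counts[group] / total
            for group in sorted(all_groups)
            if group_counts[group] / total < min_share
        }
        if low:
            gaps[label] = low
    return gaps


# Hypothetical metadata for an image dataset
records = (
    [{"label": "nurse", "gender": "female"}] * 18
    + [{"label": "nurse", "gender": "male"}] * 2
    + [{"label": "ceo", "gender": "male"}] * 10
    + [{"label": "teacher", "gender": "female"}] * 6
    + [{"label": "teacher", "gender": "male"}] * 4
)

gaps = underrepresented_groups(records, "label", "gender")
for label, low in gaps.items():
    print(label, low)
```

A count like this only tells you where to go looking for more examples. It says nothing about sourcing bias or about groups you never thought to annotate, which is where a diverse team and real end users still matter.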
At this point, hopefully, two themes are crystallizing. One is that you need to define your problem and your end users carefully and plan for your outcomes. The second is that a lot of potential issues can be solved with careful attention to your training data. And it’s that second point that Figure Eight is uniquely qualified to talk about and solve for.
After all, machine learning models learn from data. Good data makes good models, bad data makes bad models, and biased data makes biased models. At Figure Eight, we’re in a position where we get to actually see how smart, innovative companies are solving for bias or unfairness in their models. In fact, the steps customers take to tune models to remove bias are directly analogous to how a customer tunes a model to account for changing business conditions or algorithmic uncertainty, generally. It all boils down to getting better data.
Take a content moderation project we saw recently on the platform, expressly about annotating hate speech terms. Labeling these terms gives an NLP model the data it needs to more easily identify uncivil language, and a model that understands that can clean up almost any written communication. Essentially though, you’re teaching a model to understand new things, in this case hate speech. You’d run through a very similar workflow if you were teaching a model to comprehend a new pop culture reference or product. Obviously, identifying and combating hate speech is a more important problem, but the underlying practices are virtually indistinguishable: labeling data to improve the performance of a model.
Knowing what I know now, I’d argue it’s both negligent and reckless to launch an AI system into production without accounting for bias with some basic best practices. Essentially, they boil down to:
- Be transparent and open regarding what data trained the system, where it was collected, how it was labeled, what the benchmark for accuracy was, and how that’s measured.
- Declare the purpose of the decision making and the criteria through which that decision is made.
- Be empathetic. Understand that you will have different end users and they’ll all use your system differently. Imagine what their experiences might be and build for those, in addition to the ones you inherently expect.
- Take feedback! Ensure there is a mechanism to question an answer, get a human judgment, or gracefully fall back in low-confidence situations so that you are not overly reliant on an AI system. Just like humans, it’s ok for the robot to say “I’m not sure.”
- Learn! When an outcome is questioned, ensure there is a way to give feedback, retrain, and ensure that the model is actively learning from new examples and real-world data.
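To make the “I’m not sure” idea concrete, here is a minimal sketch of a low-confidence fallback. Everything in it is hypothetical: the `ReviewQueue` class, the `decide` function, the item IDs, and the 0.75 threshold are made up for illustration, and it assumes your model returns a dictionary of label scores. A production system would tune the threshold per use case and persist the queue somewhere durable.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewQueue:
    """Holds predictions that a person should look at (hypothetical helper)."""

    items: list = field(default_factory=list)

    def add(self, item_id, label, confidence):
        self.items.append(
            {"item_id": item_id, "model_label": label, "confidence": confidence}
        )


def decide(item_id, scores, queue, threshold=0.75):
    """Return the top label if its score meets the threshold.

    Otherwise queue the item for human review and return None, which the
    caller should treat as "I'm not sure".
    """
    label, confidence = max(scores.items(), key=lambda pair: pair[1])
    if confidence >= threshold:
        return label
    queue.add(item_id, label, confidence)
    return None


queue = ReviewQueue()
confident = decide("img_001", {"dog": 0.93, "cat": 0.07}, queue)
unsure = decide("img_002", {"dog": 0.48, "cat": 0.52}, queue)

print(confident)    # the model's answer is used directly
print(unsure)       # None: deferred to a person
print(queue.items)  # what the reviewer will see
```

The answers people give for queued items are exactly the kind of new, real-world examples the “Learn!” point calls for: once reviewed, they can go back into the training set for the next retraining run.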
Remember: it’s not impossible to reduce unwanted bias in your models. It takes some grit and hard work, sure, but it comes down to being empathetic, iterating throughout the model building and tuning processes, and taking great care with your data.
As a cheat sheet, here are 8 ways to prevent bias in your data. Keep in mind that your data can be biased even if you aren’t; however, that bias can be avoided and remedied.
Acknowledgments: I’d like to thank the many esteemed colleagues I’ve been privileged to work with who have taught me so much over the years. Specifically: John Smith, Matthew Hill, Rama Akkiraju, Ruchir Puri, John Schumacher, Rob High, Jeff Jablonowski, Jennifer Prendki, Robert Munro, Robin Bordoli, Vibha Sinha, Tara Lemmy, Shay Strong, Lukas Biewald, and Barney Pell.