The General Data Protection Regulation (GDPR) goes into effect today, May 25th. Chances are you’ve heard a lot about it, but if you’re not familiar: essentially, GDPR is aimed at strengthening EU citizens’ privacy rights and giving them more control over their personal data. It applies to any company doing business in the EU, as well as any company outside the EU that holds data on EU citizens.
What that means is that right now, technology companies around the world are undertaking the biggest cross-industry technology upgrade I can remember. Engineers are rushing to ensure that data is encrypted at rest, that personally identifiable information (PII) can be tracked and deleted, and to make many other system-wide upgrades. We haven’t seen anything like this since the Y2K bug, 18 years ago. But Y2K fizzled: few people saw any glitches or penalties. In fact, my personal computer thought it was 1980, and that’s about the worst I remember. GDPR is different. The issues around data privacy are real, and so are the penalties: 4% of annual revenue or €20 million, whichever is higher.
You may have already noticed different behavior from the tech companies you deal with. For example, organizations are now making greater efforts to delete all your personal data when you ask. Even if you’re not an EU citizen, you’ve likely noticed the change: in some cases, it is simply easier or safer to implement these changes for everyone, not just EU citizens.
One article of GDPR has sparked particular debate in the machine learning community: Article 22, “Automated Individual Decision Making, Including Profiling.” Roughly paraphrased, Article 22 gives people the right to understand why a machine learning algorithm has made a given choice.
(For an overview of the argument within the machine learning community, including the two points of view from the legal perspective, I recommend starting with an article by Gregory Piatetsky, the president of KDnuggets: Will GDPR Make Machine Learning Illegal?)
One thing is certain, though. Article 22 means that, for the first time, machine learning is the target of widespread legislation to ensure that it is equitable. This is a good thing.
Why? Take this classic example: a machine learning model is used to determine whether a person gets a home loan. It denies a loan to an immigrant based on their name, because that name is more common among poorer people who defaulted on previous loans. This is essentially a form of racial profiling that should be illegal (and already is illegal in some countries for specific use cases). But this law may now apply broadly to any use case involving a machine learning decision, including medical diagnosis, self-driving cars, or even how your music application personalizes to your tastes.
Ensuring that AI can be used by everyone equally has always been the greatest passion in my work. When working in refugee camps in West Africa in the 2000s, I saw that almost everyone had access to a cell phone, but that search engines and spam detection didn’t work in the local languages. AI technologies that we took for granted 20 years ago still don’t work in those languages today. That’s what motivated me to come to Silicon Valley and complete a PhD at Stanford on adapting natural language processing to less widely spoken languages in the context of disaster response and health.
As the EU comprises some of the world’s wealthiest countries, it has the ability to focus resources on problems like data privacy and protection. I believe it is the obligation of wealthier nations to tackle hard new problems so that less wealthy nations can enjoy the same standards. The poorest people in the world are the most likely to be victims of crime and discrimination, so any protections will help them the most.
And while I would love to see the United Nations ratify much of GDPR as fundamental human rights, for now, I will share three important things about GDPR that have much deeper implications for the machine learning community than many people have realized.
1. Your data is as important as your algorithm.
It’s not enough to be able to say “my algorithm made a decision based on these features and weights from this labeled data.” You have to be able to say where the labeled data came from too.
Both the data and the algorithm contribute to how a machine learning system behaves. In brief, algorithms learn from labeled examples: a self-driving car can identify a pedestrian because it has learned what a pedestrian looks like from thousands of human-labeled examples of pedestrians.
You can change the algorithm drastically, and the new algorithm will identify pedestrians almost equally well. (A difference of a few percentage points in accuracy is often considered a lot in algorithm research.)
In contrast, if you change the data in even small ways (say, labeling pedestrians only on bike paths, not on roads), the performance of almost every algorithm will change massively.
Despite this real-world experience, an academic bias has dominated the debate: people have worried only about the algorithms.
It started with Pedro Domingos, a professor at the University of Washington and author of ‘The Master Algorithm’, who tweeted:
Starting May 25, the European Union will require algorithms to explain their output, making deep learning illegal.
— Pedro Domingos (@pmddomingos) January 29, 2018
Firstly, I’m not sure the problem is limited to deep learning algorithms. Even a linear model (more or less a one-layer neural network, and therefore not ‘deep’) will come up with some pretty hard-to-interpret weights for each feature. It is trying to find weights that discriminate between outcomes, not weights that accurately represent the contribution of a given feature to a decision.
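To make this concrete, here is a toy numpy sketch (all numbers invented) of why even linear-model weights fail to represent feature contributions. When two features are highly correlated, such as two near-redundant credit features, the fitted weights for each one are essentially arbitrary, even though the model's predictions are fine:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical toy data: feature A drives the outcome; feature B is a
# near-duplicate of A (think two almost-redundant credit features).
a = rng.normal(size=n)
b = a + rng.normal(scale=0.001, size=n)       # ~99.99% correlated with A
y = 2.0 * a + rng.normal(scale=0.1, size=n)   # only A actually matters

X = np.column_stack([a, b])
weights, *_ = np.linalg.lstsq(X, y, rcond=None)

# The COMBINED weight recovers the true effect (~2.0), but how it is
# split between A and B is unstable and can even flip sign, so reading
# either weight as "feature B's contribution" would be misleading.
print(weights, weights.sum())
```

The predictions are accurate either way; it is only the per-feature explanation that falls apart, which is exactly the interpretability problem, and it arises with no deep learning in sight.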
However, the real problem is the bias towards algorithms in academia and how this has dominated the debate. Put simply: the most interpretable algorithm in the world is useless if you can’t explain your training data.
- What data did you choose to label, and what did you omit?
- Who annotated the data, and, if there are subjective judgments that are subject to bias in those labels, what are the relevant demographics?
- How did you implement quality control and aggregate conflicting annotations?
99.9% of datasets contain errors or inconsistencies that will influence decisions. Without answers to these questions, you can’t explain your machine learning model’s decisions.
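The audit questions above can be baked into the data itself. Below is a minimal sketch (all names and labels hypothetical) of a labeled-example record that keeps its provenance and per-annotator votes, so that the label aggregation step remains auditable after the fact:

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class LabeledExample:
    """One training example with enough provenance to audit later."""
    item_id: str
    source: str  # where the raw item came from (and what was omitted)
    annotations: dict = field(default_factory=dict)  # annotator_id -> label

    def aggregate(self):
        """Majority-vote label plus an agreement score; low agreement
        flags examples whose labels may reflect subjective bias."""
        counts = Counter(self.annotations.values())
        label, votes = counts.most_common(1)[0]
        return label, votes / len(self.annotations)


# Hypothetical example: three annotators disagree about an image.
ex = LabeledExample(
    item_id="img_00123",
    source="street-camera batch 7 (hypothetical)",
    annotations={"ann_a": "pedestrian", "ann_b": "pedestrian", "ann_c": "cyclist"},
)
label, agreement = ex.aggregate()
print(label, round(agreement, 2))
```

Keeping the raw votes rather than only the aggregated label is the design choice that matters here: it lets you answer "who labeled this, and did they agree?" long after training.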
2. Your source for transfer learning is as important as your algorithm
The majority of companies we work with in computer vision are using some kind of transfer learning: taking a model trained on one dataset and adapting it to another. The most popular approach is to take a model trained on millions of images, such as an ImageNet model, and retrain only its final layer for a new image problem. This lets the model reuse the edges, textures, and other basic features that its early layers discovered across ImageNet’s millions of images, and combine them for new kinds of image classification with far fewer labeled data points.
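The retrain-only-the-final-layer recipe can be sketched in a few lines of numpy. This is a toy stand-in, not a real pipeline: in practice the frozen weights come from a network pretrained on something like ImageNet, whereas here a random projection plays that role, and the downstream task is invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained backbone. In real transfer learning these
# weights come from a model trained on e.g. ImageNet, and they stay
# FROZEN while only the final layer is retrained.
W_frozen = rng.normal(size=(8, 32))


def backbone(x):
    """Frozen 'early layers': a random projection + ReLU as a stand-in."""
    return np.maximum(x @ W_frozen, 0.0)


# Hypothetical downstream task: the label depends on the first input dim.
X = rng.normal(size=(200, 8))
y = (X[:, 0] > 0).astype(float)

# "Retrain only the final layer": fit a ridge-regression head on the
# frozen features, touching none of the backbone weights.
F = backbone(X)
head = np.linalg.solve(F.T @ F + 1e-2 * np.eye(32), F.T @ y)

accuracy = ((F @ head > 0.5) == (y > 0.5)).mean()
print(accuracy)
```

The point for auditability: the new head is trained on your own labeled data, but every prediction also flows through `W_frozen`, so explaining the model means explaining the provenance of those inherited weights too.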
If you want to explain an algorithm that uses transfer learning, you therefore have to explain the components taken from the earlier models.
Unfortunately, much of the data collection strategy behind ImageNet is lost. ImageNet was created with anonymous workers on Mechanical Turk, so we know little about the workers’ demographics or how those might have affected label choices. And much of the post-processing and aggregation of the data took place on servers or personal computers that no longer exist.
I remember being asked about annotation strategies when ImageNet was being created, and I’m sure I recommended Mechanical Turk. I more clearly remember Jia Deng, now a professor at the University of Michigan, first presenting it to an internal AI group at Stanford in 2009, when we were both PhD students there, and being really impressed. So I’m not criticizing the decisions made at the time; at the very least, I would have made the same decisions and not considered the future implications of a dataset that couldn’t be audited. I also take no credit for ImageNet’s huge success, which goes solely to its creators. At the time, I was focused on NLP for classifying health messages sent in the Chichewa language, and briefly lived in the Amazon to study the Matses language, about as far as you can get from computer vision. I have used ImageNet a lot since, and I incorporate it into classes I teach and even into a coding task I give to potential employees.
The fact remains: a dataset created for research purposes might not need to be auditable if its only purpose is to benchmark different algorithms, but that dataset is probably not GDPR-compliant when used in the real world.
3. Your language bias is also a gender & racial bias
The final potential area for non-compliance is the inherent bias when AI serves only a small number of languages, or perhaps only one.
The reality is that algorithms that are gender-biased or racially biased are not GDPR-compliant, and language plays a big part in this. Most language technologies work only in English, or work well only in English, even though English accounts for only about 5% of the world’s daily conversations. 99% of natural language processing research focuses on a handful of European languages predominantly spoken in the wealthiest countries, plus Chinese and Japanese. This bias carries over into industry, too.
The spectrum of wealthy to less wealthy countries and languages also follows ethnic lines, so there is bias here even within the EU.
If you are raised female in much of the world, you are less likely to be taught English or another prestige language. And in parts of the world, like the Amazon, your perceived race is determined more by the language you speak than by your ethnicity.
I’d especially like to see more attention paid to these language biases in our machine learning systems as we solve real-world problems.
Why GDPR is a good thing for machine learning
I’m seeing every side of GDPR right now. My engineering team is ensuring that all our data and processes are compliant. My machine learning team is ensuring that we can produce auditable models for our customers. Our Product team is ensuring that training data created with our products can be fully auditable for quality control. If anyone has the right to complain in AI, it should be me. I have no complaints.
There will be some issues with GDPR compliance up front; that’s to be expected when new, widely applicable legislation goes into effect. (In fact, Google and Facebook are already facing lawsuits today, the first day the law is in force.) For the machine learning community, though, I’d argue that GDPR will be greatly beneficial. It will increase accountability and fairness. It will help solve certain instances of algorithmic bias. And it will force the community to better understand our data: how it’s collected, how it’s labeled, and how it’s used. All of that is a huge positive. Because if you believe in the promise of machine learning, these are the sorts of guardrails that will help make it more equitable, universally. That’s undeniably a good thing.