Organizations build machine learning systems so that they can predict and categorize data. But to get a system to do anything, you have to train it. This post is meant to help you figure out a budget for training data based on best practices.
I’ll share specific numbers, but at the outset, I have to say something unsatisfying: the amount of training data you need depends heavily on how complicated the problem you’re trying to solve is. If humans can do the task fairly quickly and unanimously, that’s a pretty easy task. Is your task like that? The best way to answer this question is not by introspection but by running a pilot. More on that later.
In my work helping companies build roadmaps and hiring plans for data science, I’ve been able to look across a number of industries. I can’t name names, but I’m summarizing from about 20 companies, ranging from two tiny startups to four very big multinationals with more than 100,000 employees. I’m a linguist by training so only a few of these are about images or music—most are about classifying text in some way that helps with search or routing. (You might think that a lot of them are sentiment projects, but only five of them are.)
- Training budgets are growing: Organizations are using machine learning/AI methods in more and more places, which means more training data. For companies with under 5,000 employees, the amount of training data more than doubled from 2015 to 2016. For companies with more than 5,000 employees, it grew fivefold.
- Changes in your business usually require getting new training data: A machine learning system only knows what it’s trained on. So if you are launching new products/services or entering new markets, you’ll want to plan for more training data in the two months after launch. If you can figure out a way to get relevant data before you launch, that’s even better.
- Plan for 63,000 training items per month: Remember how I started with a bunch of caveats? This is the main one. Five of the companies I’m reporting on get more than 121,000 training items per month. The lower bound is more like 14,000 items per month.
- Get a commitment to have in-house experts review categories and examples once a quarter: Businesses change over time and you want to make sure that stakeholders continue to agree on which categories are important to track and to make sure that you’re defining them consistently. This is also a good chance to show them both exemplars of the categories and some of the most difficult items.
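To turn the monthly volumes in the list above into a dollar figure, here is a minimal sketch. The item counts come from this post; the price per judgment and the number of judgments per item are hypothetical placeholders (this post does not report costs), so swap in your own quotes before using the output.

```python
from dataclasses import dataclass

# Monthly item volumes quoted in this post: the lower bound, the planning
# figure, and the level that five of the companies exceed.
MONTHLY_ITEMS = {"low": 14_000, "plan": 63_000, "high": 121_000}


@dataclass(frozen=True)
class BudgetAssumptions:
    # ASSUMPTION: three judgments are collected per item.
    judgments_per_item: int = 3
    # ASSUMPTION: hypothetical placeholder price in cents, not from the post.
    cents_per_judgment: int = 5


def monthly_cost_cents(items_per_month: int, a: BudgetAssumptions) -> int:
    """Cost of one month of training data, in cents (integers avoid float drift)."""
    return items_per_month * a.judgments_per_item * a.cents_per_judgment


def budget_table(a: BudgetAssumptions) -> dict:
    """Monthly and annual cost in dollars for each volume scenario."""
    table = {}
    for name, items in MONTHLY_ITEMS.items():
        monthly = monthly_cost_cents(items, a)
        table[name] = {"monthly": monthly / 100, "annual": monthly * 12 / 100}
    return table


for scenario, cost in budget_table(BudgetAssumptions()).items():
    print(f"{scenario:>5}: ${cost['monthly']:,.2f}/month, ${cost['annual']:,.2f}/year")
```

Running the three scenarios side by side is the point: the spread between the low and high rows is a more honest planning range than any single number.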
How to pilot
Brand new machine learning projects usually create about 131,000 training items in the first quarter when they’re launched (top quartile: 309,000, bottom quartile: 12,000). Those are the numbers, but what matters more is how you get meaningful results.
The three things to keep in mind are:
- Plan for pilots to be iterative—you almost certainly won’t get things right in your first go-around. Plan to launch a small subset and analyze what you get back. You’ll probably need to adjust the instructions or other parts of the experimental design. It’s worth planning for a couple of iterations of this.
- Make sure the data fits the problem: there’s a business problem to be solved, and the data has to be appropriate for it. I know one company that wanted to mine YouTube comments for sales leads for their very high-tech equipment. There are some interesting techniques for finding needles in haystacks, but there still have to be needles there to find.
- As soon as you can, schedule annotation lunches—once you understand what the project, data, and categories are, grab a conference room and in-house experts to annotate the data. Get three people to judge each item, so you can report on their inter-annotator agreement. If your experts can’t do a task, how can machines or other people do it?
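One way to report inter-annotator agreement when three people judge each item is Fleiss’ kappa, which corrects raw agreement for the agreement you would expect by chance. The sketch below is one common choice, not necessarily the measure used at the companies described here, and the ticket labels are made-up illustration data.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels.

    Every item must carry the same number of labels (one per annotator slot).
    """
    if not ratings:
        raise ValueError("need at least one rated item")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(labels) != n_raters for labels in ratings):
        raise ValueError("every item needs the same number (>= 2) of labels")

    category_totals = Counter()
    observed_sum = 0.0
    for labels in ratings:
        counts = Counter(labels)
        category_totals.update(counts)
        # Share of annotator pairs that agree on this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        observed_sum += agreeing_pairs / (n_raters * (n_raters - 1))

    observed = observed_sum / len(ratings)
    total_labels = len(ratings) * n_raters
    # Chance agreement from the overall share of each category.
    expected = sum((c / total_labels) ** 2 for c in category_totals.values())
    if expected == 1.0:
        return 1.0  # only one category was ever used, so agreement is trivial
    return (observed - expected) / (1 - expected)


# Hypothetical labels from three in-house experts on four support tickets.
example = [
    ["billing", "billing", "billing"],
    ["billing", "tech", "tech"],
    ["tech", "tech", "tech"],
    ["other", "billing", "tech"],
]
print(round(fleiss_kappa(example), 3))
```

A value near 1 means your experts largely agree; a value near 0 means they agree about as often as chance would predict, which usually points to category definitions that need work.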
You probably only need a couple of “annotation hours” with your in-house experts. Of course, if they can’t agree on how to categorize your data, you may have to go through more rounds.
In the first annotation hour, it’s fine for everyone to chat about what they’re seeing and how they’re judging it. You have three goals in that first annotation hour: (1) to make sure experts (who are probably stakeholders) get their hands dirty with the data, (2) to diagnose any problems with category definitions so you can fix them, and (3) to get “gold” data for evaluating the crowd and your ML models. A stranger—and a system—should be especially confident and accurate about items that your experts were unanimous about.
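The third goal can be sketched in a few lines: keep only the items your experts labeled unanimously as “gold”, then score anyone else (a crowd worker or a model) against that set. The function names, item IDs, and labels below are hypothetical, chosen only for illustration.

```python
def build_gold(expert_labels: dict[str, list[str]]) -> dict[str, str]:
    """Keep only items where every expert chose the same label."""
    return {
        item_id: labels[0]
        for item_id, labels in expert_labels.items()
        if labels and len(set(labels)) == 1
    }


def accuracy_on_gold(gold: dict[str, str], predictions: dict[str, str]) -> float:
    """Share of gold items that a judge (person or model) labeled correctly.

    Gold items the judge did not label are skipped rather than counted as wrong.
    """
    scored = [item_id for item_id in gold if item_id in predictions]
    if not scored:
        raise ValueError("no overlap between gold items and predictions")
    correct = sum(predictions[item_id] == gold[item_id] for item_id in scored)
    return correct / len(scored)


# Hypothetical expert judgments from an annotation hour.
expert_labels = {
    "t1": ["billing", "billing", "billing"],
    "t2": ["billing", "tech", "tech"],      # experts disagree: not gold
    "t3": ["tech", "tech", "tech"],
}
gold = build_gold(expert_labels)

# Hypothetical output from a model or crowd worker.
model_predictions = {"t1": "billing", "t2": "tech", "t3": "billing"}
print(accuracy_on_gold(gold, model_predictions))
```

Items the experts split on are left out of the gold set on purpose. They are still worth keeping in a separate pile, since they are the ones that reveal fuzzy category definitions.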
It’s budget season!
The success of machine learning efforts depends on getting the right data, getting it cleaned up, picking good algorithms, and being clever about ML feature engineering. But it fundamentally comes down to asking meaningful questions and coming up with ways to get consistent answers.
The advantages of creating training data end up being twofold: you need good training data to get good ML models, and you probably want training data to keep flowing into your system so that it picks up changes as they occur. Perhaps less heralded is this aspect: in developing training data, you also learn a great deal about your domain and how the categories that matter to your business match the data your users are producing and interacting with. Planning is central to success. And training data budgets are central to planning.