
Making a computer into a super-recognizer

In London, there is a team of half a dozen police officers known as “super-recognizers.” They have an uncanny ability to recognize faces, allowing them to confidently identify criminals even when their faces are obscured or to recall suspects decades after a single encounter. This superhuman skill makes them invaluable for picking culprits in disguise or questionable characters out of CCTV footage, even when only part of a face is visible.

So how does someone become a super-recognizer? Mostly, they’re just born that way. But it’s important to zero in on the “recognizer” part of the name here. No matter how good these officers are at pattern matching and facial recall, they do need to have seen the person first to actually recognize them. The more people they see, the more people they can recognize.

Which brings us to computer vision.

Annotating the face of a man who really should be buttoning his shirt a bit more

Training a CV model follows a similar trajectory. The more data a model sees, the more precise it becomes. Facebook’s DeepFace model, famously, identifies people with human-level accuracy. Why is that? Well, there are a lot of smart people massaging the models over at Facebook, but the biggest difference between what they can do and what anybody else can really rests in their training data. They have hundreds of billions of tagged user photos, with hundreds of millions more uploaded daily. That’s a staggering amount of training data.

And it goes beyond quantity. Facebook has some data quality advantages over, say, the FBI’s database. That’s because their users tend to post pictures of people they know, usually in good lighting, so the tags are almost always accurate. In fact, the quality and quantity of the data were so good that automatically generated tag suggestions started just six months after tagging became a feature in 2010.

Generally, this is the process for making your model a super-recognizer: great algorithms and tons of high-quality training data. But, of course, it’s more than that. Your model needs to know what to pay attention to, not just see lots of data.

Human super-recognizers seem to pay attention to features that normal people miss, the characteristics that make a face unique, whether those characteristics are obvious (a scar above the eye) or subtle (the slope of a nose or the distance between the corners of a mouth and a chin). In machine learning and AI, these characteristics are called features, and deciding which ones matter is feature selection. It answers the question: what should your models pay attention to in order to be accurate at a given task?
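
To make that concrete, here’s a minimal sketch of how a handful of hand-picked facial measurements (the kind a super-recognizer might key in on) could become a numeric feature vector. The landmark names and coordinates are invented for illustration; in a real system they would come from a landmark detector, and modern models like DeepFace learn their features rather than relying on hand-picked ones.

```python
import math

# Hypothetical facial landmarks as (x, y) pixel coordinates for one face.
# In practice these would come from a landmark detector, not be hard-coded.
landmarks = {
    "left_mouth_corner": (120, 210),
    "right_mouth_corner": (180, 212),
    "chin": (150, 260),
    "nose_tip": (150, 170),
    "left_eye": (125, 130),
    "right_eye": (175, 128),
}

def distance(a, b):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Hand-picked measurements: the kinds of characteristics a recognizer might attend to.
features = {
    "mouth_width": distance(landmarks["left_mouth_corner"],
                            landmarks["right_mouth_corner"]),
    "mouth_to_chin": distance(landmarks["left_mouth_corner"], landmarks["chin"]),
    "eye_spacing": distance(landmarks["left_eye"], landmarks["right_eye"]),
    "nose_to_chin": distance(landmarks["nose_tip"], landmarks["chin"]),
}

# Normalize by eye spacing so the features are roughly scale-invariant.
feature_vector = [v / features["eye_spacing"] for v in features.values()]
print(feature_vector)
```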

So if we’re looking to create a super-recognizer, we’ve highlighted so far that you’ll need to look at the right features in a vast amount of high-quality training data. Now let’s look at how you process your images in the first place.

The kinds of image processing tasks

What does it mean to do image processing? How do you turn raw images into training data?

The simplest kind of image processing task is categorization. For example, you just have a person look at an image and answer a question about it, like “Is there a boat in this picture? Yes/No” or:

Check the box for all of the things that are in this picture:

  • Water
  • Boat
  • Person
  • Clouds
  • Shoreline
  • Sea monster

It’s really easy to get fast, accurate results for these kinds of tasks. If you don’t need to know exactly where in an image something is, this is a great way of getting data. If you’re asking about common, concrete nouns, people can get you 100,000 annotated images a day.
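
As a concrete (and purely hypothetical) illustration, a categorization task like the one above boils down to an image reference, a question, and a fixed set of checkboxes, and a contributor’s answer is just the subset they checked. The field names below are invented for the sketch; real tools will have their own formats.

```python
import json

# A hypothetical checkbox-style categorization unit; field names are illustrative only.
task = {
    "image_url": "https://example.com/images/harbor_001.jpg",
    "question": "Check the box for all of the things that are in this picture:",
    "options": ["Water", "Boat", "Person", "Clouds", "Shoreline", "Sea monster"],
}

# One contributor's answer is simply the subset of options they checked.
answer = {"task_id": "harbor_001", "checked": ["Water", "Boat", "Clouds"]}

print(json.dumps(task, indent=2))
print(json.dumps(answer, indent=2))
```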

In many cases, you care about where different things are within a picture. There are three basic ways to do this:

  • Get every pixel annotated
  • Have people draw bounding boxes around the objects
  • Have people put a dot in the middle of the thing you care about

Bounding boxes are a lot easier for humans to do than pixel-by-pixel annotation, since you only have to outline each object rather than touch every single pixel, which also makes them less expensive. CrowdFlower supports all of these kinds of image annotation, but having seen a bunch of use cases, the standard recommendation is to use bounding boxes.
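
To see why the effort differs so much, here’s a rough sketch of what the three annotation styles amount to as data. The exact formats vary by tool; these class and field names are just for illustration.

```python
from dataclasses import dataclass

@dataclass
class BoundingBox:
    """An axis-aligned box: far cheaper to draw than a per-pixel mask."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

@dataclass
class PointAnnotation:
    """A single dot placed roughly at the center of the object."""
    label: str
    x: float
    y: float

# A per-pixel annotation is a full mask: one class label for every single pixel.
# For a 4x4 toy image, 0 = background and 1 = "boat".
pixel_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

box = BoundingBox(label="boat", x_min=12, y_min=40, x_max=118, y_max=96)
dot = PointAnnotation(label="boat", x=65, y=68)
print(box, dot, sep="\n")
```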

Most companies want to get between 30 and 40 different kinds of things labeled. It’s almost impossible to buy already-annotated datasets that have the kinds of objects you need (it’s very-super-impossible to get such a dataset on the pixel-by-pixel level).

How much data do I need?

A given image could have zero, one, or a whole lot of a certain object. That doesn’t matter for image categorization (which is cheap), but when it comes to drawing outlines or placing centered pinpoints, it can mean that people need to spend a lot more time annotating any given photo. That said, you may need fewer annotations overall if your pictures are chock full of the objects you’re trying to detect.

If you go the pixel-by-pixel route, you’ll get a lot of training data from any given picture, though you still need to make sure you have enough of the different things you’re trying to detect. Whether by-pixel or bounding box, you may want to consider doing an image categorization task as a first step. If you have a lot of images and you really want to focus on some subset, categorization tasks could help you narrow your scope.

Okay, okay, but what about the amounts of data? There’s really no hard-and-fast rule here. Generally, think of this trade-off: the more accurate you need to be, the more data you’ll need. Past that, you’ll likely need a ton.

For example, a couple of years ago, a small startup began getting pictures of clothing annotated; less than a year later, they were selected as the machine learning partner for a very large clothing retailer. To get best-in-class performance that kept up with current styles, they ran about 200,000 photos per month.

Finding clothing is not mission critical in the way that “not hitting people with your self-driving car” is. Facebook failing to identify the Beatles crossing Abbey Road isn’t as big of a deal as Tesla thinking they are just part of the crosswalk.

A self-driving Tesla needs to see something different in this photo than Facebook does: pedestrians versus celebrities

For bulletproof classification, where there’s the threat of injury, you probably need between 100,000 and 1,000,000 labeled images per category. But the trick here is that you have to have separate labels for different kinds of people. An autonomous vehicle needs to understand how the things it’s seeing are going to behave, and children don’t just look different from elderly people; their attention, motivation, and actions are different, too. There’s also a difference between adults crossing a street by themselves and adults pushing a stroller.

Making sure quality is high

The most popular image processing task in the history of CrowdFlower was called “CATS CATS CATS” and it asked something very simple: “Is there a cat in this picture?” This is something pretty much anyone can do and something people love doing. The task went exceedingly quickly because it was both fun and easy.

One of the Hermitage museum’s fanciest inhabitants

That’s one end of the spectrum. What about things that require experts? It turns out that the crowd does very well with all kinds of things traditionally reserved for people with special training. For example, GIS professionals routinely help label photographs of disaster areas, but the crowd can help focus professional efforts on the most difficult cases. Another CrowdFlower client was categorizing pictures of ear canals to see if there was infection; the accuracy from the crowd matched that of Ear, Nose, and Throat doctors. And finally, even something people aren’t used to seeing, cellular mitosis, is something that lay people can be trained to help with.

We have two kinds of crowds: the general population, which you can segment in a number of ways (for example, maybe you only want people who live in the United Kingdom), and special, curated groups of individuals who can work more consistently, under non-disclosure agreements, and who either have special skills or can learn them.

In both cases, we calculate a per-person trust score that follows them from project to project. Much of this comes from the idea of having the crowd see “gold” data that have known answers.
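
The exact scoring is CrowdFlower’s own, but the basic idea can be sketched as nothing more than a contributor’s running accuracy on the hidden gold questions. Everything below, including the small prior, is an illustrative simplification rather than the real formula.

```python
# Illustrative only: a contributor's trust score as their accuracy on "gold" questions.

def trust_score(gold_results, prior_correct=1, prior_total=2):
    """gold_results is a list of booleans: did they answer each gold question correctly?
    A small prior keeps brand-new contributors from starting out at 0% or 100%."""
    correct = sum(gold_results) + prior_correct
    total = len(gold_results) + prior_total
    return correct / total

history = [True, True, False, True, True, True]   # one contributor's gold answers so far
print(f"trust: {trust_score(history):.2f}")       # 6 / 8 = 0.75 with the prior
```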

But let’s say that I have a high trust score—how do you know if my annotation for the next image is right? What if I were sleepy or accidentally clicked the wrong button?

For simple tasks, we can rely on getting multiple judgments per item. If three trustworthy people in the crowd all select the same thing, that’s a strong vote of confidence. By contrast, if there are three different answers, well, that may suggest a problem with the crowd’s attention, but it could also mean that the categories or their definitions need fixing up.
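
For simple categorical questions, that vote of confidence can be made concrete by weighting each judgment by the contributor’s trust score. This is a simplified sketch of the idea, not the exact aggregation any particular platform uses.

```python
from collections import defaultdict

def aggregate(judgments):
    """judgments: list of (answer, trust_score) pairs for a single image.
    Returns the winning answer and its share of the trust-weighted vote."""
    weights = defaultdict(float)
    for answer, trust in judgments:
        weights[answer] += trust
    best = max(weights, key=weights.get)
    confidence = weights[best] / sum(weights.values())
    return best, confidence

# Three trustworthy people agree: high confidence in the label.
print(aggregate([("boat", 0.9), ("boat", 0.85), ("boat", 0.8)]))
# Three different answers: low confidence, and maybe a sign the categories need work.
print(aggregate([("boat", 0.9), ("ship", 0.85), ("canoe", 0.8)]))
```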

Voting works best for classification tasks that involve answering yes/no questions or marking check boxes. But many image processing needs require showing where the particular content is, either by labeling the picture pixel-by-pixel or by drawing outlines. These are more time-intensive tasks for people, so even the affordable crowd will get really expensive if you try triple annotation. And especially in the case of drawing bounding boxes, it is a computationally complex task to decide that three people’s placement of lines is “close enough” to be right.
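
Deciding whether two people’s boxes are “close enough” usually comes down to an overlap measure; intersection-over-union (IoU) is the standard one, sketched here with an arbitrary example threshold.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max).
    1.0 means the boxes are identical; 0.0 means they don't overlap at all."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0

# Two annotators' boxes around the same pedestrian: an IoU above some threshold
# (0.5 is a common, if arbitrary, choice) could count as agreement.
print(iou((10, 10, 110, 210), (15, 12, 112, 205)))  # roughly 0.90
```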

So when it comes to annotating within images, the best thing to do is to assign an image to be marked up by one person but then to show the output to another person for a quick review. Computers need to be told what really counts as a good outline or the “center of an object”—but humans who review humans are really good at that.

Rather than having two totally separate populations, it can be useful to have individuals move back and forth between annotating and reviewing. Consider the human cost—if you ask people to outline apples and there’s only one apple per photo, that’s pretty easy, but often photos of apples show bushels of the fruit, so working through one image can take a long time. After getting through a couple of those, it can be useful to shift into review mode, which is a lot easier. This switching back-and-forth improves overall quality, as long as you balance the need for breaks with the need for keeping annotators in a state of flow.

The basic phases of a project

The basic phases of a project are going to be:

  • Define the project: You can ask a computer or a human what’s in a photo, but ultimately either of them needs to know why you’re asking. What do you want to do? What’s the business case? What’s the user flow? And most importantly: what are the things you need to be able to detect? Those are the categories you need to get training data around. It is also useful at the outset to specify trade-offs you’re willing to make around performance, accuracy, and simplicity. Is a highly accurate black box method okay or do you have audit needs that require you to know why an image was classified the way it was?
  • Get the data: The highest level of accuracy comes from creating training data that’s like the data you actually need to handle. If you want to detect apples, you wouldn’t want to show it a bunch of medical images. Do you need to put in both Granny Smiths and iPhones? That also depends on what you’re trying to do.
  • Run a pilot: You want to see if your categories are really discernible in the data and to give your machine learning people some small amount of data to try different algorithms on.
  • Iterate: The pilot will refine what you use as training data: perhaps one category is rare, so you need to go off and find more examples of it. It may also lead you to tweak your categories, or at least how you define them.
  • Go bigger: You may go through a couple of small pilots of a few hundred images, but you will need a lot more—as an upper bound, consider self-driving cars. There are huge, awful consequences for a car not recognizing a pedestrian, so you’ll need something like 100,000 to 1,000,000 images per important category if you want to be bulletproof. Another word to the wise: children and elderly people have different behaviors in the world, so you probably need to consider them as separate categories.
  • Get smarter: Engineer the system to take advantage of end-user actions; those are also kinds of annotations. The kinds of images your system is going to categorize will probably change over time, so also plan on constant, or at least regular, refreshes of training data (one way to wire up that feedback loop is sketched just after this list).
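
One common way to wire up that feedback loop is to auto-accept the model’s confident predictions and route everything else back to human annotators, so the next round of training data targets exactly what the system finds hard. This is a generic sketch, not a description of any particular product.

```python
def route_predictions(predictions, threshold=0.8):
    """predictions: dicts like {"image": ..., "label": ..., "confidence": ...}.
    Confident predictions are accepted automatically; the rest are queued for
    human annotation and become the next batch of training data."""
    auto_accepted, needs_review = [], []
    for p in predictions:
        (auto_accepted if p["confidence"] >= threshold else needs_review).append(p)
    return auto_accepted, needs_review

batch = [
    {"image": "frame_0001.jpg", "label": "pedestrian", "confidence": 0.97},
    {"image": "frame_0002.jpg", "label": "stroller", "confidence": 0.55},
]
accepted, review_queue = route_predictions(batch)
print(len(accepted), "auto-accepted;", len(review_queue), "sent to annotators")
```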

The team that executes

The earliest steps of a project are going to involve thinking about business cases and sorting out product requirements.

If you’re starting entirely from scratch, you’re going to need people with back-end engineering, big data and database skills to work up a system architecture that can take in content, index, log, and store it. You probably need some kind of analytics capability to let data scientists and/or users understand what’s going through the system and of course, there may well be a front-end UI that needs to be built for users to interact with.

In the middle of all of that is a machine learning/artificial intelligence project that constitutes the core. There are three basic roles we see across companies:

  • Machine learning engineer: There are a variety of algorithms used to automatically identify the content of images—a machine learning engineer is the person who figures out which ones work best and thinks about the overall pipeline of inputs and outputs. They also think about how the system can get smarter, building feedback loops so that the system keeps getting training data over time that makes it better and better at processing the images being thrown at it.
  • Image scientist/imaging scientist: These are people who know the ins-and-outs of visual information. They have expertise in the project area, whether that’s image retrieval, face recognition, pattern recognition, image understanding, relevance feedback, or video processing. Sometimes the machine learning engineer and the imaging scientist are the same person, but it can be useful to have someone think about the pipeline and another person thinking about the special cases and qualities of image processing.
  • Program manager/project manager: Beyond their normal job of keeping people focused and projects on-time, project managers are often the ones who learn how to use a system like CrowdFlower to get annotations that meet the business needs. As communicators and coordinators, project managers are good bridges between their project teams and the CrowdFlower customer success team.

This isn’t a comprehensive list; we also see project teams include software engineers or people with other kinds of expertise (e.g., roboticists), but it’s a fairly normal breakdown for most projects. And since the goal is to find content inside unknown images, all of these people play an active role in defining which data should get annotated and which labels humans should use (those are the categories that computers will detect).

And, on a pretty high level, that’s how you scope out an image annotation job. With enough high-quality training data, the right group of labelers, and a well-run team, you’ll be on your way to creating a super-recognizer. We’ll be back over the next couple of weeks with more on image labeling. Till then, thanks for reading.

Tyler Schnoebelen

Tyler Schnoebelen is the former Founder and Chief Analyst at Idibon, a company specializing in cloud-based natural language processing. Tyler has ten years of experience in UX design and research in Silicon Valley and holds a Ph.D. from Stanford, where he studied endangered languages and emoticons. He’s been featured in The New York Times Magazine, The Boston Globe, The Atlantic, and NPR.