Nattering Nabobs of Negativity: Bigrams, “Nots,” and Text Classification

You can get pretty far in text classification just by treating documents as bags of words where word order doesn’t matter. So you’d treat “It’s not reliable and it’s not cheap” the same as “It’s cheap and it’s not not reliable”, even though the first is a strong indictment and the second is a qualified recommendation. Surely it’s dangerous to ignore the ways words come together to make meaning, right?

Include bigrams as features to better parse negation.

The bulk of this post is going to focus on parsing negation, but let’s start with something simpler. How about just including bigrams and trigrams in your text classifier?


Including ngrams seems like a no-brainer. And you should almost certainly include bigrams as features. But don’t be surprised if you don’t see huge increases in accuracy.
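Adding bigrams usually takes only a line or two of configuration. Here is a minimal sketch using scikit-learn, where `ngram_range=(1, 2)` keeps unigrams and adds bigrams; the toy texts and labels are made up for illustration:

```python
# Sketch: adding bigram features to a bag-of-words classifier.
# The texts and labels below are illustrative, not real customer data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "it's not reliable and it's not cheap",
    "it's cheap and reliable",
    "not worth the money",
    "great value, works well",
]
labels = [0, 1, 0, 1]  # 0 = negative, 1 = positive

# ngram_range=(1, 2) means features include both single words and
# two-word sequences such as "not reliable"
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), binary=True),
    LogisticRegression(),
)
model.fit(texts, labels)
```

With unigrams alone, “not” and “reliable” are independent features; with bigrams, “not reliable” becomes its own feature the model can weight directly.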

Sida Wang and Chris Manning explore what happens when you add bigrams for various text classification problems. The details are in their Tables 3 and 4, but the quick summary is that across 18 experiments spanning different classification tasks and different algorithms, bigrams improve accuracy over unigrams by an average of 1%. The maximum improvement is 3.04%; at the other extreme, bigrams actually reduce accuracy by 0.6%.

There are lots of other features to think about, like in Quoc Le and Tomas Mikolov’s work. But as I’ll conclude at the end, the best approach is almost always to get more training data. This is probably not a surprising thing to read in CrowdFlower’s blog since their business is helping data science teams generate high quality training data at scale using their human-in-the-loop platform.

It’s important to understand ML errors and hypothesize new features and try out new algorithms. But often, your energy is better spent building a pipeline where low confidence predictions automatically get a human judgment. Since we’ll never get perfect models, it’s useful to have failsafes and ways of making the system smarter by continually incorporating new training data as things like your company’s products and your users’ slang shift.
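A low-confidence routing step like the one described above can be sketched in a few lines. The threshold, the stub model, and the routing function below are all illustrative placeholders, not a real production pipeline; any classifier exposing `predict_proba` would slot in:

```python
# Sketch of human-in-the-loop routing: predictions below a confidence
# threshold are queued for human judgment. StubModel stands in for any
# classifier with predict_proba; the 0.75 threshold is an assumption.
import numpy as np

class StubModel:
    """Illustrative stand-in for a trained classifier."""
    def predict_proba(self, texts):
        # Pretend longer texts get confident predictions, short ones don't
        return np.array([[0.9, 0.1] if len(t) > 20 else [0.55, 0.45]
                         for t in texts])

def route(model, texts, threshold=0.75):
    """Split texts into (auto_labeled, needs_human) index lists."""
    probs = model.predict_proba(texts)     # shape (n_texts, n_classes)
    confidence = probs.max(axis=1)         # probability of the top class
    auto = [i for i, c in enumerate(confidence) if c >= threshold]
    human = [i for i, c in enumerate(confidence) if c < threshold]
    return auto, human

texts = ["this product is absolutely wonderful", "not sure"]
auto, human = route(StubModel(), texts)
```

The human judgments collected this way can then flow back into the training set, which is exactly the feedback loop that keeps a model current as products and slang shift.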


Every night around the globe, syntacticians cry themselves to sleep. Clearly word order matters…it just doesn’t tend to matter all that much. For example, I have some customer data in front of me that routes data based on qualities like reliability and appearance, separating them by severity of sentiment as well as issue type. The word not occurs in under 5% of the data. So even if everything that has not in it is misclassified, there’s a ceiling for how much you can improve accuracy. The maximum benefit should be weighed against the amount of time you’re going to spend futzing with the machine learning to earn your percentages. In some cases, a 1% increase in accuracy is worth millions of dollars. But in plenty of cases it is not.

In this customer data, the most common use of not is in the context of something like I’m not sure. Humans are full of uncertainty. And while people are regularly livid when they have customer support issues, politeness is a really dominant cultural rule. Expressing that they are not sure is a way of preserving their own face and that of the human on the other end of the line.

If you’re interested in approaches to negation, you may want to check out Councill, McDonald, & Velikovich (2010). They have 1,135 sentences with positive/negative/mixed/neutral ratings; 187 of these involve negation (114 were negative, 73 were positive). If all 187 of those sentences had been misclassified, correctly handling negation would have produced a huge boost in accuracy…but in fact most of them were already accurately classified without the negation system.
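One simple, widely used approach to negation (distinct from the dependency-based system in Councill et al.) is negation marking: append a tag like `_NEG` to every token between a negation cue and the next clause boundary, so that “good” and “good” under negation become different features. A minimal sketch, with an assumed cue list and punctuation-based scope:

```python
# Sketch of negation marking for bag-of-words features. The cue list
# and clause boundaries here are illustrative choices, not a standard.
NEGATION_CUES = {"not", "no", "never", "n't", "cannot"}
CLAUSE_END = {".", ",", ";", "!", "?"}

def mark_negation(tokens):
    """Append _NEG to tokens in the scope of a negation cue."""
    out, negating = [], False
    for tok in tokens:
        if tok.lower() in NEGATION_CUES:
            negating = True          # open a negation scope
            out.append(tok)
        elif tok in CLAUSE_END:
            negating = False         # punctuation closes the scope
            out.append(tok)
        else:
            out.append(tok + "_NEG" if negating else tok)
    return out

mark_negation("it is not good , but cheap".split())
# ["it", "is", "not", "good_NEG", ",", "but", "cheap"]
```

After marking, a downstream classifier sees “good_NEG” as a feature entirely separate from “good”, which is exactly the distinction a plain bag of words throws away.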

Part of this has to do with how people use negation. Generally negation is negative. But it’s not really a simple reverser. As Richard Socher and colleagues report, if you have a negative adjective and negate it, you’ll get something slightly positive. But not great really isn’t the same as terrible, it’s more like pretty good–so it’s attenuated, not reversed.

The numbers above suggest that negation isn’t very frequent. That’s a bit domain specific, though. For example, if you are looking at doctors’ notes, you’ll find that about 20% of them involve negation: “In no acute distress”, “The heart size is normal without pericardial effusion”–there’s less agreement on what to do with something like afebrile, which semantically means ‘without fever’ as opposed to febrile, ‘with fever’. (If you’re curious about when you’ll get your artificial intelligence doctor, here’s a post on that.)

But for most data, negation isn’t all that common. For example, in spoken English, good is about 18 times more frequent than terrible. But terrible is still about 4.8 times more frequent than not good.

Word          Frequency per million spoken words (COCA)
good          1,468.36
terrible      81.90
not good      16.85
not bad       7.52
not terrible  0.20
Likewise, in IMDB reviews, both bad and not good indicate low star ratings, but there are also 18 times more bad’s than not good’s (or any kind of negation of good). These IMDB numbers are from Chris Potts. He also shows that even in speech, if you are accepting or agreeing with someone you rarely use any kind of negation–only 1.8% of the time. But if you’re rejecting something, 59.4% of your rejections involve negation–and that’s not even counting the word no. If you include no, then 83.3% of rejections have negation. These general findings seem to hold not just in English but in German and Japanese as well.

How do you get a smarter system?

To get a good model, you have three basic inputs:

  • Training data
  • Algorithms
  • Features

Far and away, the big thing you want is good training data. By good, I mean that humans can distinguish your categories reliably and the training data is representative of the data you’re going to be classifying.

Data scientists spend a lot of time trying out different algorithms and doing feature engineering. Those are fun and interesting but it’s rare for the latest algorithm or the most clever features to make as much of a difference as just getting more training data.

In a small project, you can usually get by with just a round or two of training data. But for more substantial machine learning initiatives, put in time and resources so that generating training data happens at a regular cadence–or continually. This serves not just as a way to make your models more accurate but to keep them accurate over time and to help make sure that you actually know how the system is performing. A system that verifies itself and improves itself over time: that’s not nothing.

Tyler Schnoebelen

Tyler Schnoebelen is the former Founder and Chief Analyst at Idibon, a company specializing in cloud-based natural language processing. Tyler has ten years of experience in UX design and research in Silicon Valley and holds a Ph.D. from Stanford, where he studied endangered languages and emoticons. He’s been featured in The New York Times Magazine, The Boston Globe, The Atlantic, and NPR.