September 23, 2016

Most academic papers and blogs about machine learning focus on improvements to algorithms and features. At the same time, the widely acknowledged truth is that throwing more training data into the mix beats work on algorithms and features. This post will get down and dirty with algorithms and features vs. training data by looking at a 12-way classification problem: people accusing banks of unfair, deceptive, or abusive practices.

**The data**

The Consumer Financial Protection Bureau has been in the news this month because they fined Wells Fargo $100 million for secretly opening accounts for people who never consented.

In addition to giant fines that make the news, the CFPB requires that banks respond to consumer complaints. The CFPB gets about 277,700 complaints per year and sends about 65% of them to the banks, who have 15 days to reply. As you can imagine, it’s expensive for financial service companies to figure out what’s going on with each customer, but they have to reply, so only 3% of responses to the CFPB are untimely. Banks make monetary or non-monetary relief available to the consumer or explain to the CFPB why that’s not appropriate.

In its last semi-annual report, the CFPB says that $44 million of relief was given to 177,000 consumers. In that time, it also enforced fines of about $270 million (that doesn’t include Wells Fargo, which just happened). Depending on whether you use the relief number alone or combine the two, you’ll end up calculating that the average consumer complaint is worth $248 or $1,773.
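The two per-complaint figures come from simple division. Here is a quick sketch of that arithmetic using the rounded numbers quoted above; the variable names are mine, and because the inputs are rounded the results land within a dollar or two of the figures in the text rather than matching them exactly.

```python
# Rounded inputs quoted from the CFPB semi-annual report, as cited in this post.
relief_usd = 44_000_000    # relief given to consumers
fines_usd = 270_000_000    # enforced fines (excluding Wells Fargo)
consumers = 177_000        # consumers who received relief

# Value per complaint using relief alone.
relief_only = relief_usd / consumers

# Value per complaint when fines are added to relief.
relief_plus_fines = (relief_usd + fines_usd) / consumers

print(f"relief only:    ${relief_only:,.2f}")
print(f"relief + fines: ${relief_plus_fines:,.2f}")
```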

But not all categories are equal. For example, the biggest median relief amount is for complaints about mortgages ($500), while the most common relief is for complaints about “bank accounts or services” ($110). I love this data and there’s lots more to say about it, but let’s get to the experiment: where do you get the most bang for your buck in working on a model to classify consumer complaints?

**The three inputs**

There are three basic things to play with: algorithms, features, and training data. There are lots of possibilities for all of these, but we’ll keep things fairly straightforward.

- Amount of training: What happens when you use 20% vs. 40% vs. 60% vs. 80% vs. 100% of the 84,853 pre-June complaints
- Features: What happens when you have unigrams vs. bigrams vs. trigrams
- Algorithms: We’ll focus on multinomial Naïve Bayes versus an SVM with stochastic gradient descent (with some brief notes below on how others perform)
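To make the size of the experiment concrete, here is a minimal sketch that enumerates the grid described in the list above. The dictionary keys and algorithm labels are placeholder names I chose for illustration, not identifiers from any actual experiment code.

```python
from itertools import product

# The three inputs, as listed above.
TRAIN_FRACTIONS = [0.2, 0.4, 0.6, 0.8, 1.0]
FEATURE_SETS = ["unigrams", "bigrams", "trigrams"]
ALGORITHMS = ["multinomial_nb", "svm_sgd"]  # hypothetical labels

# One dictionary per experimental condition.
grid = [
    {"fraction": fraction, "features": features, "algorithm": algorithm}
    for fraction, features, algorithm in product(
        TRAIN_FRACTIONS, FEATURE_SETS, ALGORITHMS
    )
]

print(len(grid), "conditions to train and evaluate")
```

Five training sizes, three feature sets, and two algorithms give 30 models in total, each scored against the same test set.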

**Selecting the test set**

Instead of using cross-validation, I’d like to evaluate our models more realistically. In business, models are built and launched and then have to deal with brand new data as language, products, services and the rest of the world changes. Great systems keep training and updating but that’s more of an exception than the norm.

So for this project, I’ll make the test data all of the most recent narratives—12,436 consumer responses since June 1st of this year. Here’s the breakdown of the test set, compared to the percentage of all the data available before June. Basically, there’s going to be more “Bank account or service” in the test data than there is in the training data and less “Debt collection”. There are also a lot more complaints about Citibank in the test set.
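A time-based split like this is easy to sketch. The version below assumes each complaint is a dictionary with a `date_received` field holding a `datetime.date`; that field name and the record layout are assumptions for illustration, not the actual schema of the CFPB export.

```python
from datetime import date

# Everything received on or after this date becomes test data.
CUTOFF = date(2016, 6, 1)


def temporal_split(complaints, cutoff=CUTOFF):
    """Split complaints into (train, test) by received date.

    Older complaints train the model; newer ones are held out, which
    mimics a model that is launched and then meets brand new data.
    """
    train = [c for c in complaints if c["date_received"] < cutoff]
    test = [c for c in complaints if c["date_received"] >= cutoff]
    return train, test
```

Unlike cross-validation, no test complaint is ever older than a training complaint, so shifts in products or language show up in the score.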

Since there’s a lot going on in this post, I’m going to keep my evaluation statistic very simple. I’ll just be reporting accuracy: how many of the predictions for a model match the true classification? As you can imagine, regardless of the percentage of this data used, it’s going to get tiny categories like “Virtual currency” and “Other” wrong.

**Where to begin**

Imagine you only have 20% of the training data—16,931 complaints sent before June. How many of the June-Sept test items can you correctly predict? The baseline for this 12-way classification is described by doing the dumbest thing possible: if you guess the majority class every time, you’ll always guess “Debt collection” and be correct 22.60% of the time.
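The majority-class baseline and the accuracy measure are both only a few lines. This is a sketch with function names of my own choosing; it assumes labels are plain strings and that the majority class is picked from the training labels.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


def majority_baseline(train_labels, test_labels):
    """Always guess the most frequent training label; report test accuracy."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    predictions = [majority_label] * len(test_labels)
    return majority_label, accuracy(test_labels, predictions)
```

Any real model has to beat this number to be worth the trouble.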

Happily, nearly-out-of-the-box scikit-learn models are going to do a lot better than that. For unigrams, bigrams and trigrams, I’ll ignore case and punctuation (I’ll interpret ngrams as basically white-space delimited).

I’ll also remove stop words. I usually go out of my way to avoid removing stop words because I believe they often encode important information, but in this case, we do 1.17% better by excluding them. We also do about 0.48% better if we drop all the ngrams that have fewer than 5 occurrences.

I tend to begin with multinomial Naïve Bayes, and that gets 82.17% of the test set correct when the model is built from the smallest chunk of training data with unigrams. Bigrams get up to 82.37% (+0.20%). Trigrams don’t help, taking accuracy to 82.00%. (This is a pretty common finding. You might also want to check out the post on Nattering Nabobs of Negativity: Bigrams, “Nots,” and Text Classification.)

Another good algorithm to explore is SVM, specifically with stochastic gradient descent learning. This will ultimately get us the highest level of accuracy BUT with only 20% of the data it’s a bit of a toss-up with the Naïve Bayes model. With SVM, the initial unigram model is going to be 80.76% correct. Going to bigrams increases accuracy to 81.41% (+0.65%) and trigrams gets you to 82.37% (+1.62%).
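For readers who want to try something similar, here is one way the preprocessing and the two models described above might be wired up in scikit-learn. It is a sketch under several assumptions and not the exact configuration behind the numbers in this post: `build_model` is a helper name I made up, "bigrams" is taken to mean unigrams plus bigrams, the built-in English stop word list is used, and `min_df` (a document-frequency cutoff) stands in for "fewer than 5 occurrences".

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def build_model(algorithm, max_n, min_df=5):
    """Return an untrained text classifier for one experimental condition."""
    vectorizer = CountVectorizer(
        lowercase=True,          # ignore case
        stop_words="english",    # assumption: scikit-learn's built-in list
        ngram_range=(1, max_n),  # assumption: n-grams up to and including max_n
        min_df=min_df,           # assumption: document frequency as the cutoff
    )
    if algorithm == "nb":
        classifier = MultinomialNB()
    elif algorithm == "svm":
        # Hinge loss makes SGDClassifier a linear SVM trained with SGD.
        classifier = SGDClassifier(loss="hinge", random_state=0)
    else:
        raise ValueError(f"unknown algorithm: {algorithm}")
    return make_pipeline(vectorizer, classifier)
```

Calling `build_model("svm", 2).fit(train_texts, train_labels)` and then `.predict(test_texts)` would give predictions to score with accuracy, where `train_texts`, `train_labels`, and `test_texts` are whatever lists of narratives and product labels you have loaded.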

We could also play with logistic regression, random forests, and an SVM without gradient descent. But if we start with just 20% of the training data and bigrams, these all do worse than the algorithms above, so for the sake of sanity I won’t be talking about them any more.

**What if we just add more training data?**

Even though I’ve kept defaults for algorithms, at this point, I’ve tried five different algorithmic approaches, several vectorization strategies, and three sets of features. What happens if we just double the amount of training data?

We can jump up to 83.37% accuracy if we give a bigram multinomial Naïve Bayes model 40% of the training data (+1.00%). The SVM will jump up to 82.24% accuracy (+0.83%).

Overall, if we increase our training data to be five times what it was, we can get to an accuracy of 85.32% with an SVM model using bigrams.

**Training data > algorithms and features**

If we hold the training amount constant and just tweak which algorithm we use and whether we use unigrams, bigrams, or trigrams, there are 75 different comparisons we can make. What is the biggest jump in accuracy? That would be when you have 40% of the training data and go from an SVM with trigrams (79.79%) to a Naïve Bayes model with bigrams (83.37%). A +3.58% increase in accuracy.
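One way to arrive at 75 is to count every pair of algorithm-plus-feature combinations at each training size: six combinations give 15 pairs, times five training sizes. The sketch below does that count and, using only the 20% accuracies reported earlier in this post, finds the largest gap at that training size. The dictionary layout is my own.

```python
from itertools import combinations

# Test accuracies (%) at 20% of the training data, as reported above.
acc_at_20 = {
    ("nb", "unigrams"): 82.17,
    ("nb", "bigrams"): 82.37,
    ("nb", "trigrams"): 82.00,
    ("svm", "unigrams"): 80.76,
    ("svm", "bigrams"): 81.41,
    ("svm", "trigrams"): 82.37,
}

# Every unordered pair of conditions at a fixed training size.
pairs = list(combinations(acc_at_20, 2))
comparisons_per_size = len(pairs)
total_comparisons = comparisons_per_size * 5  # five training sizes

# Absolute accuracy gap for each pair at the 20% level.
gaps = {pair: abs(acc_at_20[pair[0]] - acc_at_20[pair[1]]) for pair in pairs}
largest_gap = max(gaps.values())

print(comparisons_per_size, "pairs per training size")
print(total_comparisons, "comparisons in total")
print(f"largest gap at 20%: {largest_gap:.2f} points")
```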

That said, the median increase from all this playing around with algorithms and features is +0.96% within SVM experiments (mean: +0.32%) and +0.27% within the Naïve Bayes experiments (mean: +0.34%). Whichever algorithm you consider to be your base, going between them gets a median change of +0.08% (mean: +0.02%).

On the other hand, if you hold algorithm and features constant and just alter how much training data you use (54 different comparisons total), you can get +5.47% more accuracy by going from an SVM trigram model with 40% of the training data to an SVM trigram model with 100% of the training data. The maximum benefit for the Naïve Bayes models is smaller: the biggest improvement is when you have 20% of the training data with trigrams and jump to 100% of the training data, which improves accuracy by +1.87%.

But in terms of what’s likely, the median improvement to SVM models by adding more data is +1.98% (mean: +2.11%). The median improvement to Naïve Bayes models by adding more data is +0.39% (mean: +0.56%).

There are definitely more ways of playing around with the algorithms and the features. And some of these experiments could have great benefits. But we’re talking about tinkering, hypothesis generation and testing. And what we want is to get a sense of how big a difference a change is going to make to our accuracy.

**Going even bigger**

In this data, we varied the training set between 3 million and 16 million words. The foundational work on the effectiveness of training data size is by Michele Banko and Eric Brill. In their example, the question is how good a model can be at recognizing which member of a confusion set is appropriate. So if your model saw *I have {CONFUSION SET} pears*, can it predict that it’s *two* and not *to* or *too*? The learning curves here show what happens when you train on vastly bigger data sets.
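To give a feel for the confusion-set task, here is a deliberately tiny sketch. It is not Banko and Brill's method: it just counts which words appear immediately to the left and right of each confusion-set member in some training sentences, then picks the member whose counts best match a new context. The function names are illustrative.

```python
from collections import Counter, defaultdict

CONFUSION_SET = {"to", "too", "two"}


def train(sentences):
    """Count the immediate left and right neighbors of each confusion word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token in CONFUSION_SET:
                left = tokens[i - 1] if i > 0 else "<s>"
                right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
                counts[token][("L", left)] += 1
                counts[token][("R", right)] += 1
    return counts


def predict(counts, left, right):
    """Pick the confusion word seen most often with these neighbors."""
    def score(word):
        return counts[word][("L", left)] + counts[word][("R", right)]
    # Sorting makes tie-breaking deterministic.
    return max(sorted(CONFUSION_SET), key=score)
```

With so little context a model like this needs a lot of text before it has seen the neighbors it will meet later, which is exactly why the task is a convenient way to study the effect of training set size.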

For an overview of these issues, you should also check out the “unreasonable effectiveness of data” by Alon Halevy, Peter Norvig and Fernando Pereira. And if you’d like to play around with a model yourself, you can use this one over on the Cortana Intelligence Gallery, which is a place Microsoft built for data scientists to share models.

When you try out new features and algorithms, things can move backwards. That’s rare when you add training data, where you almost always get improvements and the improvements themselves are usually bigger. Obviously, exploring features and algorithms helps get a handle on the data and that can pay dividends beyond accuracy metrics. But in terms of benefits, more data beats better algorithms.