
The Top 11 Mistakes to Avoid When Shipping Machine Learning Products

In spite of the tremendous advances in artificial intelligence in recent years, companies that successfully capitalize on machine learning remain rare. Many still struggle to see a return on the large investments they have made to integrate artificial intelligence into their strategy. This is concerning for machine learning researchers and engineers, whose future depends on their employers’ continued interest in the output of their work.

So let’s step back for a moment and try to understand why the success of theoretical AI is so hard to translate into functional AI products. There are several bottlenecks to consider when developing machine learning models, not the least of which is obtaining high-quality data and structuring it appropriately for the job. Fortunately, this is precisely where Figure Eight can help. We’ve been providing a best-in-class platform for testing, training, and tuning machine learning models for a decade now, and we deeply understand the training data that powers the most innovative, ambitious projects in the world.

However, once you pass this first obstacle, there remains another important challenge: deploying those models to production and managing them in the longer term. This process is usually referred to as Machine Learning Lifecycle Management, or Machine Learning DevOps. Machine Learning Lifecycle Management is critically important to the success of any ML project, yet extremely error-prone. The good news is that avoiding the mistakes that can put the project at risk is relatively simple once one knows what not to do. Below, we’ll cover eleven different pain points and how to solve for each:

1. Lack of flexibility

When it is time to deploy a system to production after countless hours of development, it is only natural to start out with a set of assumptions regarding the DevOps process. This may be a way for us to feel we have some control over the complicated systems we create. But the reality couldn’t be more different: once models are converted into real-life data products, things are never as simple and straightforward as we believed they would be. Unexpected market changes and seasonal patterns depend on many parameters over which we have no control whatsoever. In eCommerce, for example, the frequency with which a model is retrained has to adapt to how quickly the market moves.

Failure to remain flexible and to embrace the dynamism of the machine learning solutions they create is one of the top reasons why data scientists don’t see their models live up to their own expectations once in production. Working under the assumption that things will change and “stuff will happen” is an absolute must, which is why it is critical for companies to keep data scientists involved throughout the entire process as opposed to transitioning the models to the engineering department.

2. Not monitoring continuously

Our last section hopefully hammered home the fact that a data scientist’s work is not done the moment their model is in production. The model’s accuracy as computed during the testing phase of the development process is certainly an important indication of how the model will eventually perform in reality, but it remains mostly theoretical. Accuracy also typically degrades over time due to data drift, and there is no good way to know in advance how quickly it will drop or when the time is right to retrain.

When working with a model predicting an observable value, such as a stock price or the number of units of a specific item sold in a store, keeping track of the accuracy in near real-time is straightforward. If the predicted outcome isn’t directly measurable (as in the case of a recommender system or a search engine), it is a little more difficult. However, it can be managed with a human-in-the-loop approach: periodically send a sample of the data to human contributors for evaluation and use the result as a benchmark. Continuously monitoring a launched model is key to the continued performance of an AI-powered product.
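As a rough illustration of the monitoring idea above, here is a minimal sketch of a rolling accuracy tracker. The class name, the window size, and the alert threshold are all assumptions made for this example; the `actual` values could come either from observed outcomes or from labels supplied by human reviewers.

```python
from collections import deque


class RollingAccuracy:
    """Track accuracy over the most recent `window` labeled predictions.

    Hypothetical helper for illustration; window and threshold are arbitrary.
    """

    def __init__(self, window=500, alert_below=0.90):
        # Each entry is True when the prediction matched the observed outcome.
        self.outcomes = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, predicted, actual):
        """Store whether one production prediction turned out to be correct."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Return accuracy over the current window, or None if it is empty."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self):
        """Return True when windowed accuracy falls below the alert threshold."""
        acc = self.accuracy()
        return acc is not None and acc < self.alert_below


monitor = RollingAccuracy(window=4, alert_below=0.75)
for predicted, actual in [(1, 1), (1, 1), (0, 1), (0, 1)]:
    monitor.record(predicted, actual)
print(monitor.accuracy(), monitor.needs_attention())
```

Because the window has a fixed length, old outcomes fall out automatically, so the number reflects recent behavior rather than the lifetime average.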

3. Monitoring the outputs only

People wrongly assume that when a model’s performance starts dropping, the model is necessarily to blame. But DevOps for machine learning systems is challenging precisely because ML products combine not only the model and the system on which the model is deployed, but also the data it uses for inference.

When an engineering system with no ML component starts seeing consistent failures, we don’t usually assume that the code is the problem; we might however start to wonder if some of the underlying systems might be the cause. Similarly, when a feature powered by machine learning suddenly starts going rogue, it is reasonable to question either the system or the data rather than the model itself. In other words, the model typically doesn’t “spoil away” overnight.

However, because data drift is a real issue in the real world, the parameters obtained the last time that the model was trained might not represent the current situation very well anymore due to changes in the customers’ usage patterns, or the appearance of a new competitor on the market, for example. To avoid unfortunate conclusions, it is always good practice to keep an eye on the input data that is fed into the running models. More often than not, keeping track of the statistical signature of those inputs over time to identify any drift is sufficient to catch problems early and predict when a model might start failing.
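One possible way to track the "statistical signature" of an input is the Population Stability Index (PSI), which compares the binned distribution of a feature at training time with its distribution in production. This is only one of several drift measures and is not necessarily what any particular team uses; the function name, the bin count, and the floor value below are assumptions for this sketch.

```python
import math


def population_stability_index(baseline, current, bins=10):
    """Compare two samples of one numeric feature; larger values mean more drift.

    Bins are equal-width and derived from the baseline (training-time) sample.
    """
    lo, hi = min(baseline), max(baseline)
    # Fall back to a width of 1.0 if the baseline has a single distinct value.
    width = (hi - lo) / bins or 1.0

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the baseline range go into the first or last bin.
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids taking the log of zero for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    p = proportions(baseline)
    q = proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


training_values = [i / 100 for i in range(100)]          # roughly uniform on [0, 1)
production_values = [0.5 + i / 200 for i in range(100)]  # shifted toward the top
print(population_stability_index(training_values, training_values))
print(population_stability_index(training_values, production_values))
```

Computing this per feature on a schedule gives a time series that can be alerted on before the model's outputs visibly degrade. The alert threshold is something each team would need to calibrate on its own data.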

4. Not defining good business metrics

An understanding of the actual business applications of AI is critically missing from most major data science curricula today. Simply put, the majority of data scientists on the job market have not been trained to think in terms of return on investment, risk, or customer impact.

Why is this? Perhaps it’s because most machine learning experts currently employed by AI companies were trained as computer scientists or applied mathematicians, so machine learning is still practiced as a scientific rather than a business discipline. Good data scientists, though, know that measuring their models’ performance in terms of the value they bring to users, rather than in terms of accuracy or F-score, is the way to go. Ultimately, it is through business metrics that the quality of a model needs to be evaluated.
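To make the contrast between accuracy and business value concrete, here is a small sketch that scores two hypothetical fraud-detection models both ways. Every number in it (the confusion counts and the dollar value attached to each outcome) is invented for illustration.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def business_value(tp, fp, fn, tn, value):
    """Total value of a model's decisions, given a value per outcome type."""
    return (tp * value["tp"] + fp * value["fp"]
            + fn * value["fn"] + tn * value["tn"])


# Hypothetical values: catching fraud saves 100, a false alarm costs 5,
# and a missed fraud costs 100. Correctly ignoring a good order is neutral.
outcome_value = {"tp": 100, "fp": -5, "fn": -100, "tn": 0}

model_a = dict(tp=50, fp=10, fn=50, tn=890)  # cautious: few false alarms
model_b = dict(tp=90, fp=60, fn=10, tn=840)  # aggressive: catches more fraud

for name, counts in [("A", model_a), ("B", model_b)]:
    print(name, accuracy(**counts), business_value(value=outcome_value, **counts))
```

Under these assumed values, model A has the higher accuracy while model B delivers far more value, which is the kind of disagreement an accuracy-only evaluation would hide.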

5. Using too much data

Similarly, because data scientists are taught to optimize for model accuracy, they tend to use as much data as possible when training their models. Yet for most models, accuracy isn’t proportional to the amount of data used: it usually plateaus, approaching a ceiling asymptotically.

What this means is that it is often possible to reach 99% of the maximal accuracy using significantly less data. In fact, at my last job, I found that our team was using on average about four times as much data as it needed. This is particularly important for companies where compute power is limited, or that use cloud compute services. Besides, if a model needs to be retrained regularly, not knowing the optimal amount of data can become a significant recurring cost.

For example, there is no point in retraining a recommendation system if only a fraction of the users are actually using it. To address this, data scientists can build a learning curve to identify the amount of data required to train their model without breaking the bank or monopolizing their company’s servers for longer than necessary. Additionally, if the amount of data necessary to reach the desired accuracy is still high, the best option is active learning, a process in which the learning agent intelligently decides which rows to learn from, and which has been shown to often reduce the volume of data required by more than half.
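The two ideas above can be sketched in a few lines. The first function reads a learning curve and returns the smallest training size that reaches a chosen fraction of the best observed accuracy. The second is least-confidence sampling, one simple active learning strategy among many, shown here for a binary classifier. The function names and the curve values are made up for illustration.

```python
def smallest_sufficient_size(curve, fraction=0.99):
    """Return the smallest training size reaching `fraction` of the best accuracy.

    `curve` is a list of (n_training_rows, validation_accuracy) pairs measured
    by retraining the model on progressively larger samples.
    """
    best = max(acc for _, acc in curve)
    target = fraction * best
    for n, acc in sorted(curve):
        if acc >= target:
            return n


def most_uncertain(probabilities, k):
    """Return indices of the k rows whose predicted probability is nearest 0.5.

    These are the rows a binary classifier is least sure about, and so the
    ones most worth sending for labeling next.
    """
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]


# Hypothetical learning curve: accuracy flattens well before the full data set.
curve = [(1000, 0.80), (5000, 0.90), (20000, 0.935),
         (80000, 0.94), (320000, 0.941)]
print(smallest_sufficient_size(curve))

# Hypothetical predicted probabilities for five unlabeled rows.
print(most_uncertain([0.95, 0.52, 0.10, 0.45, 0.70], k=2))
```

With this invented curve, a sixteenth of the full data set already gets within 1% of the best accuracy, which is the kind of saving worth checking for before every scheduled retrain.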

6. Automating too early

Data scientists aren’t usually very excited about the topic of Machine Learning Lifecycle Management because, to be honest, they’d much rather work on new exciting problems than on improving old models. As a consequence, there is a temptation for many to automate their ML product from the get-go.

But as we have seen, it is often dangerous to push a model to production on the assumption that it will perform as expected. Even the best models have limitations. Before attempting to automate a model’s lifecycle, it is necessary to spend time understanding its specific behavior, so that corner cases are accounted for and its weak spots are known. Automating a model that isn’t well understood and tested in real-life conditions is a common mistake that is easily avoided.

7. Not keeping it simple and being unrealistic

Most readers will know Occam’s razor: the simplest answer is usually correct. For hundreds of years, the principle of parsimony has been very popular among scientists who believe in simple, powerful solutions. Yet, quite paradoxically, the data science community today tends to enjoy experimenting with the newest methods, sacrificing simplicity to the thrill of trying out something new and unusual, a bias fairly common within the tech industry.

But successful data scientists know better. They realize that starting with a simple model and gradually adding complexity is the way to go. They are also fully aware that a model with an accuracy in the high 90s is much easier to reach by delivering in increments. Starting with a minimum viable product is just as good a practice for machine learning systems as it is for any other software development project.

8. Not using problems as opportunities to learn the system

If a model is underperforming, it is often tempting to go back to the drawing board and start all over. After all, that very first model is only one of the many options that were available at the time it was developed, and being scientists, data scientists often crave to experiment with a different approach.

The problem, though, is that this is basically throwing out the baby with the bathwater. Failing models are a mine of information regarding corner cases, the way people use the data product, and technical limitations that need to be addressed during the next iteration. Whenever we decide to create something new, we lose an opportunity to understand both the business problem at hand and the system in depth, and we are likely to repeat some of the same mistakes over time.

9. Not testing thoroughly prior to shipping the model

No matter how hard we try to obtain representative data sets to build our models, the samples we get remain a simplified representation of what can be observed in real-life conditions. For example, when training a speech recognition model, we might only have access to audio files that contain utterances from native English speakers, in which case the model would generalize poorly to people with foreign accents, creating a disappointing experience for many real-life users.

As data scientists, we need to ensure that corner cases and exceptions are accounted for as much as possible given the quality of the data that is available for the task at hand. And because adequate testing prior to feature launch can be difficult in practice, for example for models that are difficult to validate offline, online testing such as A/B testing should also be considered. As a general rule, thorough testing should happen sooner rather than later, and definitely before the engineering QA process.
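For readers who have not run an online test before, here is a minimal sketch of how the result of an A/B test on a conversion-style metric might be checked, using a standard two-proportion z-test. The function name and the traffic numbers are assumptions for this example, and a real experiment would also need a sample size and stopping rule fixed in advance.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference between two rates.

    A is the control (current model) and B is the candidate model.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis that both variants perform equally.
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical traffic: 1,000 users per variant, 100 vs. 150 conversions.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(z, p_value)
```

A positive `z` with a small `p_value` suggests the candidate's higher rate is unlikely to be noise, while a result near zero means the test has not yet separated the two models.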

10. Failing to build with explainability in mind

Debugging any issue in production is always painful, no matter the system. But consider the number of moving parts involved in the creation of a model, starting with the many different features that can come into play as inputs. If the model isn’t explainable, finding and fixing problems might turn out to be almost impossible in a reasonable amount of time.

Explainability isn’t only a way to provide some transparency into the decisions that an algorithm makes; it is also the best way to ensure that the person who built the model, as well as others, can trace problems back to their origin. Building models that have as many interpretable elements as possible makes a huge difference when it comes to maintaining them in the longer term.
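As one very simple illustration of an "interpretable element", the score of a linear model can be broken down into per-feature contributions, which makes it easy to see which input pushed a suspicious prediction in a given direction. The function name, feature names, and weights below are hypothetical, and more complex models would need a dedicated attribution method instead.

```python
def explain_linear(weights, bias, features):
    """Return (score, contributions) for one prediction of a linear model.

    Contributions are (feature_name, weight * value) pairs, sorted from the
    largest to the smallest absolute effect on the score.
    """
    contributions = {name: weights[name] * value
                     for name, value in features.items()}
    score = bias + sum(contributions.values())
    ranked = sorted(contributions.items(),
                    key=lambda item: abs(item[1]), reverse=True)
    return score, ranked


# Hypothetical model and input row.
weights = {"age": 0.5, "income": -2.0, "visits": 0.1}
row = {"age": 2.0, "income": 1.5, "visits": 10.0}

score, ranked = explain_linear(weights, bias=0.5, features=row)
print(score)
for name, contribution in ranked:
    print(name, contribution)
```

Logging this breakdown alongside each prediction gives whoever debugs the system later a direct trail from an odd output back to the input responsible for it.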

11. Not using Human-in-the-Loop

For the many reasons we have stated so far, and for many more, full automation remains a myth for now. However, the Pareto principle tells us, rightfully so, that 80% of the effects originate from 20% of the causes. By that logic, automating 80% of the system is fairly easy, while reaching a completely autonomous system would take a tremendous amount of effort.

In other words, it is wise to take an approach similar to building an autopilot for an aircraft, which automates the most straightforward parts of the system while leaving a human in control whenever a difficult or sensitive task needs to be completed. Human-assisted automation has proven over and over again to be better than full automation.

So when building ML products, you can’t go wrong by requesting your users’ explicit feedback, asking them to rectify what they would consider to be mistakes, or even building a crowdsourced QA strategy to validate the quality of your solution.
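A common way to put this into practice is to let the model handle the predictions it is confident about and send the rest to a person. The sketch below assumes the model exposes a confidence score between 0 and 1; the function names, the threshold, and the sample predictions are hypothetical.

```python
def route(prediction, confidence, threshold=0.8):
    """Decide whether a prediction is applied automatically or reviewed."""
    if confidence >= threshold:
        return "auto"
    return "human_review"


def split_batch(predictions, threshold=0.8):
    """Split (item_id, prediction, confidence) tuples into two queues."""
    queues = {"auto": [], "human_review": []}
    for item_id, prediction, confidence in predictions:
        queues[route(prediction, confidence, threshold)].append(item_id)
    return queues


# Hypothetical model outputs: (item id, predicted label, confidence).
batch = [
    ("order-1", "approve", 0.97),
    ("order-2", "reject", 0.55),
    ("order-3", "approve", 0.81),
    ("order-4", "approve", 0.62),
]
print(split_batch(batch))
```

The labels that reviewers assign to the `human_review` queue can then be fed back as training data, so the share of items needing a human should shrink as the model improves.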


Hopefully, these tips and best practices for Machine Learning Lifecycle Management will help you push your models to production while keeping in check most of the risks associated with deploying ML solutions. There are of course many different ways to avoid the mistakes covered above, and only you and your team can identify the best process for your specific use case. However, no matter how you proceed, make sure you have a well-defined long-term strategy for the maintenance of your model and that you leave as little as possible to chance.

Happy modeling!

Jennifer Prendki


Jennifer is Figure Eight's VP of Machine Learning. She's spent most of her career creating a data-driven culture wherever she went, succeeding in sometimes highly skeptical environments. Jennifer is particularly skilled at building and scaling high-performance Machine Learning teams, and is known for enjoying a good challenge.