The Perilous World of Machine Learning for Fun and Profit: Pipeline Jungles and Hidden Feedback Loops

1/5/2015

I haven't written a blog post in ages. And while I don't want to give anything away, the main reason I haven't been writing is that I've been too busy doing my day job at MailChimp. The data science team has been working closely with others at the company to do some fun things in the coming year.

That said, I got inspired to write a quick post by this excellent short paper out of Google, "Machine Learning: The High Interest Credit Card of Technical Debt."

Anyone who plans on building production mathematical modeling systems for a living needs to keep a copy of that paper close.

And while I don't want to recap the whole paper here, I want to highlight some pieces of it that hit close to home.

Pipeline Jungles

George prototyping a machine learning model.

There was a time as a boy when my favorite book was George's Marvelous Medicine by Roald Dahl. The book is full of all that mischief and malice that makes Dahl books so much fun.

In the book, George wanders around his house finding chemicals to mix up into a brown soup to give to his grandmother in place of her normal medicine. And reading this bit of felony grand-matricide as a child always made me smile.

Prototyping a new machine learning model is like George's quest for toxic chemicals. It's a chance for the data scientist to root around their company looking for data sources and engineering features that help predict an outcome.

A little bit of these log files. A dash of Google Analytics data. Some of Marge-from-Accounting's spreadsheet.

POOF! We have a marvelous model.

How fun it is to regale others with tales of how you found that a combination of reddit upvotes, the lunar calendar, and the number of times your yoga instructor says FODMAPs is actually somewhat predictive!

But now it's the job of some poor sucker dev to take your prototype model, which pulls from innumerable sources (hell, you probably scraped Twitter too just for good measure), and turn it into a production system.

All of a sudden there's a "pipeline jungle": a jumbled-up stream of data sources and glue code for feature engineering and combination, all so that production can build programmatically and reliably what you only had to build once, by hand, in your George's-Marvelous-Medicine revelry.

It's easy in the research and design phase of a machine learning project to over-engineer the product. Too many data sources, too many exotic and brittle features, and as a corollary, too complex a model. One trap the paper points out is leaving low-powered features in your prototype model because, well, they help a little, and they're not hurting anyone, right?

What's the value of those features versus the cost of leaving them in? That's extra code to maintain, maybe an extra source to pull from. And as the Google paper notes, the world changes, data changes, and every model feature is a potential risk for breaking everything.
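One rough way to put a number on "the value of those features" is a leave-one-out ablation: score the model with everything, then again with each feature dropped, and see how much accuracy you actually lose. Here's a minimal sketch of that idea. The data is synthetic, the feature names (strong, weak, noise) are made up for illustration, and the nearest-centroid classifier is just a stand-in for whatever model you really prototyped.

```python
import random

def make_data(n, rng):
    """Synthetic rows: one strong feature, one weak feature, one pure-noise feature."""
    rows, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        rows.append([
            rng.gauss(2.0 * y, 1.0),  # strong signal (hypothetical)
            rng.gauss(0.3 * y, 1.0),  # weak signal (hypothetical)
            rng.gauss(0.0, 1.0),      # no signal at all
        ])
        labels.append(y)
    return rows, labels

def centroids(rows, labels, cols):
    """Per-class mean of each selected column."""
    out = {}
    for cls in (0, 1):
        members = [r for r, y in zip(rows, labels) if y == cls]
        out[cls] = [sum(r[c] for r in members) / len(members) for c in cols]
    return out

def accuracy(train, test, cols):
    """Nearest-centroid accuracy on the test split, using only the columns in `cols`."""
    (train_x, train_y), (test_x, test_y) = train, test
    cents = centroids(train_x, train_y, cols)
    hits = 0
    for row, y in zip(test_x, test_y):
        dist = {cls: sum((row[c] - m) ** 2 for c, m in zip(cols, cent))
                for cls, cent in cents.items()}
        hits += min(dist, key=dist.get) == y
    return hits / len(test_y)

def ablation_lifts(train, test, names):
    """Accuracy lost when each feature is dropped, one at a time."""
    all_cols = list(range(len(names)))
    full = accuracy(train, test, all_cols)
    return {name: full - accuracy(train, test, [c for c in all_cols if c != i])
            for i, name in enumerate(names)}

rng = random.Random(0)
train, test = make_data(2000, rng), make_data(2000, rng)
lifts = ablation_lifts(train, test, ["strong", "weak", "noise"])
for name, lift in lifts.items():
    print(f"{name:>6}: {lift:+.3f}")
```

If a feature's lift is within the noise of your test set, that's the one to weigh against the extra code and the extra data source it drags into production.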

Remember, the tech press (and vendors) would have you build a deep learning model that's fed scraped data from the internet's butthole, but it's important to exercise a little self-control. As the authors of the technical debt paper put it, "Research solutions that provide a tiny accuracy benefit at the cost of massive increases in system complexity are rarely wise practice." Preach.

Who's going to own this model and care for it and love it and feed it and put a band-aid on its cuts when its decision thresholds start to drift? Since it's going to cost money in terms of manpower and tied-up resources to maintain this model, what is the worth of this model to the business? If it's not that important a model (and by important, I'm usually talking top-line revenue), then maybe a logistic regression with a few interactions will do you nicely.

Humans are Feedback Loop Machines

The graveyard at Haworth

In Haworth, England, they used to bury bodies at the top of a hill above the town. When someone died, they got carted up to the overcrowded graveyard and then their choleraic juices would seep into the water supply and infect those down below, creating more bodies for the graveyard.

Haworth had a particularly nasty feedback loop.

Machine learning models suck up all sorts of nasty dead body water too.

At MailChimp, if I know that a user is going to be a spammer in the future, I can shut them down now using a machine learning model and a swift kick to the user's derriere.

But that future I'm changing will someday, maybe next week, maybe next year, be the machine learning system's present day.

And any attempt to train on present day data, data which has now been polluted by the business's model-driven actions (dead spammers buried at the top of the hill), is fraught with peril. It's a feedback loop. All of a sudden, maybe I don't have any spammers to train my ML model on, because I've shut them all down. And now my newly trained model thinks spamming is more unlikely than I know it to be.
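To see how fast the well gets poisoned, here's a back-of-the-envelope sketch. It assumes (my simplification, not anything measured) that a shut-down spammer simply vanishes from the next round of training data, and the 5% true spam rate is a made-up number.

```python
def observed_spam_rate(true_rate: float, catch_rate: float) -> float:
    """Spam rate left in the data once caught spammers have been removed from it."""
    surviving_spammers = true_rate * (1.0 - catch_rate)
    legitimate = 1.0 - true_rate
    return surviving_spammers / (surviving_spammers + legitimate)

true_rate = 0.05  # hypothetical: 5% of new signups are spammers
for catch_rate in (0.0, 0.5, 0.8, 0.95):
    seen = observed_spam_rate(true_rate, catch_rate)
    print(f"catch {catch_rate:.0%} -> next model sees {seen:.2%} spam")
```

Under those assumptions, a model that catches 80% of spammers leaves roughly 1% spam in the data its successor trains on, even though the true rate never moved from 5%. The better the current model does its job, the more wrong the next model's sense of the base rate.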

Of course, such feedback loops can be mitigated in many ways. Holdout sets, for example.
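One way a holdout set might look in practice: exempt a small, stable slice of users from the model's actions so that their outcomes stay unpolluted and can be used for training and evaluation later. This is only a sketch of the idea. The 2% fraction, the salt string, the 0.9 threshold, and the action names are all hypothetical, and whether you can afford to let a few predicted spammers run free is a business call, not a modeling one.

```python
import hashlib

HOLDOUT_FRACTION = 0.02  # hypothetical: share of users exempt from model-driven action

def in_holdout(user_id: str, salt: str = "spam-model-holdout",
               fraction: float = HOLDOUT_FRACTION) -> bool:
    """Deterministically place a stable slice of roughly `fraction` of users in the holdout."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform between 0 and 1
    return bucket < fraction

def decide(user_id: str, spam_score: float, threshold: float = 0.9) -> str:
    """Pick an action; holdout users are observed but never auto-suspended."""
    if in_holdout(user_id):
        return "observe"
    return "suspend" if spam_score >= threshold else "allow"

users = [f"user-{i}" for i in range(10_000)]
holdout = [u for u in users if in_holdout(u)]
print(f"{len(holdout)} of {len(users)} users held out")
```

Hashing the user ID (rather than flipping a coin on every request) keeps each user in the same group for their whole lifetime, so the holdout's labels reflect what happens when the model keeps its hands off.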

But we can only mitigate a feedback loop if we know about its existence, and we as humans are awesome at generating feedback loops and terrible at recognizing them.

Think about time-travel in fiction. Once you have a time machine (and make no mistake, a well-suited ML model is pretty close to a forward-leaping time machine when it comes to things like sales and marketing), it's easy to jump through time and monkey with events, but it's hard to anticipate all the consequences of those changes and how they might alter your future training data.

And yet when the outputs of ML models are put in the hands of others to act on, you can bet that the future (and the future pool of training data with it) will be altered. That's the point! I don't predict spammers to do nothing about them! Predictions are meant to be acted upon.

And so, when the police predict that a community is full of criminals and then they start harassing that community, what do you think is going to happen? The future training data gets affected by the police's "special attention." Predictive modeling feeds back into systematic discrimination.

But we shouldn't expect cops to understand that they're burying their dead at the top of the hill.

This is one of my fears with the pedestrianization of data science techniques. As we put predictive models more and more in the hands of the layperson, have we considered that we might cut anyone out of the loop who even understands or cares about their misuse?

Get Integrated, Stay Alert

The technical debt paper makes this astute observation: "It’s worth noting that glue code and pipeline jungles are symptomatic of integration issues that may have a root cause in overly separated 'research' and 'engineering' roles."

This is absolutely true. When data scientists treat production implementation as a black box they shove their prototypes through, and when engineers treat ML packages as black boxes they shove data pipelines through, problems abound.

Mathematical modelers need to stay close to engineers when building production data systems. Both need to keep each other in mind and keep the business in mind. The goal is not to use deep learning. The goal is not to program in Go. The goal is to create a system for the business that lives on. And in that context, accuracy, maintainability, sturdiness...they all hold equal weight.

So as a data scientist, keep your stats buds close and your colleagues from other teams (engineers, MBAs, legal, ...) closer, with the goal of getting work done together. It's the only way your models will survive past prototype.
