John Foreman, Data Scientist
  • Home
  • Data Smart book
  • Speaking & Events
  • Featured Talks
  • Blog
  • MailChimp
Contact

Surviving Data Science "at the Speed of Hype"

1/30/2015

14 Comments

 
Being in the world of big data and data science, I see a lot of stuff like this: 

Analytics at the speed of big data:

#Infographic: 82x faster insights. That's analytics at the speed of big data. http://t.co/fCnPshuhVC

— Erika Riehle (@erikariehle) October 29, 2014
Computing at the speed of innovation:

[VIDEO] Computing at the speed of innovation: http://t.co/gqzCwsFqVm #BigData

— HP Discover (@HPDiscover) January 27, 2015
Big Data at the speed of light?

To Infinity and Beyond – #BigData at the speed of light with @SkytreeHQ and SPARK: http://t.co/90ahLUxcx7 #DataScience #MachineLearning

— Kirk Borne (@KirkDBorne) December 27, 2014
Big data at the speed of thought! Now that's more like it...

Analyse big data at the speed of thought and drive rapid innovation. Choose #SAP #BUSINESS - http://t.co/DpcguWbFca

— AlphaSirius (@AlphaSirius_SAP) December 17, 2014
And my personal favorite, big data....at the speed of big data.

"The @sgi_corp UV 2 really lets you explore #BigData at the speed of Big Data."- Kalev H. Leetaru, …http://t.co/ZNF7CBYWAD

— E4 (@e4company) November 21, 2014
There is this idea endemic to the marketing of data science that big data analysis can happen quickly, supporting an innovative and rapidly changing company. But in my experience and in the experience of many of the analysts I know, this marketing idea bears little resemblance to reality.

Over the course of my career, I've built optimization models for a number of businesses, some large, like Royal Caribbean or Coke, some smaller, like MailChimp circa 2012. And the one thing I've learned about optimization models, for example, is that as soon as you've "finished" coding and deploying your model the business changes right under your nose, rendering your model fundamentally useless. And you have to change the optimization model to address the new process. 

Once upon a time, I built a model for Dell that optimized the distribution of their chassis and monitors from China to their fulfillment centers in the U.S. Over and over again, my team worked on customizing our model to Dell's supply chain. The moment the project was over...Dell closed down a factory and jacked the formulation. Now, we had done some things to make the model robust in such scenarios (made factories a flexible set in the ILOG OPL code for example). But nonetheless, the model was messed up, and someone needed to fix it.

And this example was for a relatively large and stable company. Dell sure moves slower than, say, a tech startup. But with each passing year, the young, turbulent company seems more the norm than the old rigid enterprise. The speed at which businesses are changing is accelerating.

And most data science models that are of any degree of sophistication, require stability.

A good demand forecast might need several seasonal cycles of historical data.

A good optimization model requires an ironed out process (manufacturing, logistics, customer support, etc.).

A good predictive model requires a stable set of inputs with a predictable range of values that won't drift away from the training set. And the response variable needs to remain of organizational interest.

Process stability and "speed of BLAH" are not awesome bedfellows. Supervised AI models hate pivoting. When a business is changing a lot, that means processes get monkeyed with. Maybe customer support starts working in different shifts, maybe a new product gets released or prices are changed and that shifts demand from historical levels, or maybe your customer base changes to a younger demographic than your ML models have training data for targeting. 



Whatever the change may be, younger, smaller companies mean more turbulence and less opportunity for monolithic analytics projects.

And that is not primarily a tool problem.

A lot of vendors want to cast the problem as a technological one. That if only you had the right tools then your analytics could stay ahead of the changing business in time for your data to inform the change rather than lag behind it.

This is bullshit. As Kevin Hillstrom put it recently:

If IBM Watson can find hidden correlations that help your business, then why can't IBM Watson stem a 3 year sales drop at IBM?

— Kevin Hillstrom (@minethatdata) January 25, 2015
In other words, it's very hard for sophisticated analytics software and techniques running on "big data" to run out in front of your changing business and radically benefit it.

The most sophisticated analytics systems we have examples of run on stable problems. For example, ad targeting at Facebook and Google. This business model isn't changing much, and when it does, it's financially worth it to modify the model.

Airline scheduling. Oil exploration. High frequency trading.

For a model operating on these problems, the rules of the game are fairly established and the potential revenue gains/losses are substantial.

But what about forecasting demand for your new bedazzled chip clip on Etsy? What about predicting who's a fraudster lurking within your online marketplace? Is your business stable enough and the revenue potential high enough to keep someone constantly working on "analytics at the speed of big data" to use a model in this context? 

Analytics at the speed of meat and potatoes

You know what can keep up with a rapidly changing business?

Solid summary analysis of data. Especially when conducted by an analyst who's paying attention, can identify what's happening in the business, and can communicate their analysis in that chaotic context.

Boring, I know. But if you're a nomad living out of a yurt, you dig a hole, not a sewer system.

Simple analyses don't require huge models that get blown away when the business changes. Just yesterday I pulled a bunch of medians out of a system here at MailChimp. What is the median time it takes for 50% of a user's clicks to come in after they've sent an email campaign? I can pull that, I can communicate it. And I can even give some color commentary on why that value is important to our business. (It lends perspective to our default A/B test length for example.)

If you want to move at the speed of "now, light, big data, thought, stuff," pick your big data analytics battles. If your business is currently too chaotic to support a complex model, don't build one. Focus on providing solid, simple analysis until an opportunity arises that is revenue-important enough and stable enough to merit the type of investment a full-fledged data science modeling effort requires.


But how do I feel good about my graduate degree if all I'm doing is pulling a median?

If your goal is to positively impact the business, not to build a clustering algorithm that leverages storm and the Twitter API, you'll be OK.
14 Comments

The Perilous World of Machine Learning for Fun and Profit: Pipeline Jungles and Hidden Feedback Loops

1/5/2015

1 Comment

 
I haven't written a blog post in ages. And while I don't want to give anything away, the main reason I haven't been writing is that I've been too busy doing my day job at MailChimp. The data science team has been working closely with others at the company to do some fun things in the coming year.

That said, I got inspired to write a quick post by this excellent short paper out of Google,  "Machine Learning: The High Interest Credit Card of Technical Debt."

Anyone who plans on building production mathematical modeling systems for a living needs to keep a copy of that paper close.

And while I don't want to recap the whole paper here, I want to highlight some pieces of it that hit close to home.


Read More
1 Comment

    Author

    Hey, I'm John, the data scientist at MailChimp.com.

    This blog is where I put thoughts about doing data science as a profession and the state of the "analytics industry" in general.

    Want to get even dirtier with data? Check out my blog "Analytics Made Skeezy", where math meets meth as fictional drug dealers get schooled in data science.

    Reach out to me on Twitter at @John4man

    Picture
    Click here to buy the most amazing spreadsheet book you've ever read (probably because you've never read one).

    Archives

    January 2015
    July 2014
    June 2014
    May 2014
    March 2014
    February 2014
    January 2014
    November 2013
    October 2013
    September 2013
    August 2013
    July 2013
    May 2013
    February 2013

    Categories

    All
    Advertising
    Big Data
    Data Science
    Machine Learning
    Shamelessly Plugging My Book
    Talent
    Talks

    RSS Feed


✕