John Foreman, Data Scientist

Surviving Data Science "at the Speed of Hype"

1/30/2015

Being in the world of big data and data science, I see a lot of stuff like this: 

Analytics at the speed of big data:

#Infographic: 82x faster insights. That's analytics at the speed of big data. http://t.co/fCnPshuhVC

— Erika Riehle (@erikariehle) October 29, 2014

Computing at the speed of innovation:

[VIDEO] Computing at the speed of innovation: http://t.co/gqzCwsFqVm #BigData

— HP Discover (@HPDiscover) January 27, 2015

Big Data at the speed of light?

To Infinity and Beyond – #BigData at the speed of light with @SkytreeHQ and SPARK: http://t.co/90ahLUxcx7 #DataScience #MachineLearning

— Kirk Borne (@KirkDBorne) December 27, 2014

Big data at the speed of thought! Now that's more like it...

Analyse big data at the speed of thought and drive rapid innovation. Choose #SAP #BUSINESS - http://t.co/DpcguWbFca

— AlphaSirius (@AlphaSirius_SAP) December 17, 2014

And my personal favorite, big data... at the speed of big data.

"The @sgi_corp UV 2 really lets you explore #BigData at the speed of Big Data."- Kalev H. Leetaru, …http://t.co/ZNF7CBYWAD

— E4 (@e4company) November 21, 2014

There is this idea endemic to the marketing of data science that big data analysis can happen quickly, supporting an innovative and rapidly changing company. But in my experience and in the experience of many of the analysts I know, this marketing idea bears little resemblance to reality.

Over the course of my career, I've built optimization models for a number of businesses, some large, like Royal Caribbean or Coke, and some smaller, like MailChimp circa 2012. And the one thing I've learned about optimization models is that as soon as you've "finished" coding and deploying your model, the business changes right under your nose, rendering the model fundamentally useless. Then you have to change the optimization model to address the new process.

Once upon a time, I built a model for Dell that optimized the distribution of their chassis and monitors from China to their fulfillment centers in the U.S. Over and over again, my team worked on customizing our model to Dell's supply chain. The moment the project was over... Dell closed down a factory and jacked the formulation. Now, we had done some things to make the model robust in such scenarios (we made factories a flexible set in the ILOG OPL code, for example). But nonetheless, the model was messed up, and someone needed to fix it.
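
For the curious, here's what "factories as a flexible set" looks like in miniature. This is a hedged sketch in Python with the open-source PuLP library, not the actual ILOG OPL formulation, and every name and number in it is invented:

```python
# A minimal transportation LP where the set of factories is data, not
# hard-coded, so a closed factory is a data change, not a code change.
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, value

# Hypothetical data: supply by factory, demand by fulfillment center,
# and unit shipping costs. Dropping a factory means deleting one key.
supply = {"shenzhen": 900, "xiamen": 600}
demand = {"nashville": 700, "austin": 500}
cost = {("shenzhen", "nashville"): 4.0, ("shenzhen", "austin"): 5.5,
        ("xiamen", "nashville"): 6.0, ("xiamen", "austin"): 3.5}

prob = LpProblem("distribution", LpMinimize)
ship = {(f, d): LpVariable(f"ship_{f}_{d}", lowBound=0)
        for f in supply for d in demand}

# Objective: total shipping cost.
prob += lpSum(cost[f, d] * ship[f, d] for f in supply for d in demand)
# Each factory ships no more than it can supply.
for f in supply:
    prob += lpSum(ship[f, d] for d in demand) <= supply[f]
# Each fulfillment center gets what it needs.
for d in demand:
    prob += lpSum(ship[f, d] for f in supply) >= demand[d]

prob.solve()
for (f, d), v in ship.items():
    print(f, "->", d, value(v))
```

Closing a factory is then a one-line data change (delete its entry from `supply`), though as the Dell story shows, someone still has to notice, update the data, and rerun the thing.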

And this example was for a relatively large and stable company. Dell sure moves slower than, say, a tech startup. But with each passing year, the young, turbulent company seems more the norm than the old, rigid enterprise. The speed at which businesses change is accelerating.

And most data science models of any degree of sophistication require stability.

A good demand forecast might need several seasonal cycles of historical data.

A good optimization model requires an ironed out process (manufacturing, logistics, customer support, etc.).

A good predictive model requires a stable set of inputs with a predictable range of values that won't drift away from the training set. And the response variable needs to remain of organizational interest.
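
As a small illustration of that last requirement, here's one way an analyst might check whether live inputs have wandered from the training set. A minimal sketch, with made-up feature names and data; scipy's two-sample Kolmogorov-Smirnov test is just one simple drift check among many:

```python
# Compare the inputs a model sees in production against the data it
# was trained on, and flag features whose distributions have shifted.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Invented example: the audience was middle-aged at training time,
# but the live user base has skewed younger since then.
training = {"subscriber_age": rng.normal(45, 10, 5000)}
live = {"subscriber_age": rng.normal(32, 9, 5000)}

for feature in training:
    stat, p_value = ks_2samp(training[feature], live[feature])
    if p_value < 0.01:  # arbitrary alert threshold
        print(f"{feature}: live inputs have drifted from the training set "
              f"(KS statistic {stat:.2f}); time to revisit the model.")
```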

Process stability and "speed of BLAH" are not awesome bedfellows. Supervised AI models hate pivoting. When a business is changing a lot, that means processes get monkeyed with. Maybe customer support starts working different shifts, maybe a new product gets released or prices change and demand shifts from historical levels, or maybe your customer base shifts to a younger demographic that your ML targeting models have no training data for.

Whatever the change may be, younger, smaller companies mean more turbulence and less opportunity for monolithic analytics projects.

And that is not primarily a tool problem.

A lot of vendors want to cast the problem as a technological one: if only you had the right tools, your analytics could stay ahead of the changing business, in time for your data to inform the change rather than lag behind it.

This is bullshit. As Kevin Hillstrom put it recently:

If IBM Watson can find hidden correlations that help your business, then why can't IBM Watson stem a 3 year sales drop at IBM?

— Kevin Hillstrom (@minethatdata) January 25, 2015

In other words, it's very hard for sophisticated analytics software and techniques running on "big data" to run out in front of your changing business and radically benefit it.

The most sophisticated analytics systems we can point to run on stable problems. For example, ad targeting at Facebook and Google. That business model isn't changing much, and when it does, it's financially worth it to modify the model.

Airline scheduling. Oil exploration. High-frequency trading.

For a model operating on these problems, the rules of the game are fairly established and the potential revenue gains/losses are substantial.

But what about forecasting demand for your new bedazzled chip clip on Etsy? What about predicting who's a fraudster lurking within your online marketplace? Is your business stable enough, and the revenue potential high enough, to keep someone constantly working on "analytics at the speed of big data" just to use a model in this context?

Analytics at the speed of meat and potatoes

You know what can keep up with a rapidly changing business?

Solid summary analysis of data. Especially when conducted by an analyst who's paying attention, can identify what's happening in the business, and can communicate their analysis in that chaotic context.

Boring, I know. But if you're a nomad living out of a yurt, you dig a hole, not a sewer system.

Simple analyses don't require huge models that get blown away when the business changes. Just yesterday I pulled a bunch of medians out of a system here at MailChimp: what is the median time it takes for 50% of a user's clicks to come in after they've sent an email campaign? I can pull that, I can communicate it, and I can even give some color commentary on why that value is important to our business. (It lends perspective to our default A/B test length, for example.)
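
If you're curious what a pull like that might look like, here's a hedged sketch in pandas. The table and column names are stand-ins, not MailChimp's actual schema, and the rows are toy data:

```python
import pandas as pd

# Hypothetical schema: one row per click, with the campaign it belongs
# to, when the campaign was sent, and when the click happened.
clicks = pd.DataFrame({
    "campaign_id": [1, 1, 1, 2, 2],
    "sent_at": pd.to_datetime(["2015-01-28 09:00"] * 3 +
                              ["2015-01-29 14:00"] * 2),
    "clicked_at": pd.to_datetime(["2015-01-28 09:40", "2015-01-28 11:00",
                                  "2015-01-29 08:00", "2015-01-29 14:20",
                                  "2015-01-29 18:00"]),
})

clicks["hours_after_send"] = (
    (clicks["clicked_at"] - clicks["sent_at"]).dt.total_seconds() / 3600
)

# For each campaign, the time by which half of its clicks had arrived...
half_click_time = clicks.groupby("campaign_id")["hours_after_send"].median()

# ...and the median of that across campaigns: one number, easy to explain.
print(f"Median time to 50% of clicks: {half_click_time.median():.1f} hours")
```

No training set, no retraining, nothing to break when the business pivots; just a number and the story around it.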

If you want to move at the speed of "now, light, big data, thought, stuff," pick your big data analytics battles. If your business is currently too chaotic to support a complex model, don't build one. Focus on providing solid, simple analysis until an opportunity arises that is revenue-important enough and stable enough to merit the type of investment a full-fledged data science modeling effort requires.


But how do I feel good about my graduate degree if all I'm doing is pulling a median?

If your goal is to positively impact the business, not to build a clustering algorithm that leverages Storm and the Twitter API, you'll be OK.
14 Comments
Charles H Martin, PhD
1/30/2015 02:33:44

Dead On

Vijay
1/30/2015 02:45:44

Makes sense. Well put.

Paul
1/30/2015 08:37:52

John, I work as a data scientist at a manufacturing company, and I cannot help but agree with you. There is a place for complex models, of course. But the numbers that actually translate into business impact often do not come from complex models but from summary statistics. It's more important to know how to group, pivot, and summarize data and show simple trends over time than it is to build a KNN model.

Josh Andrews
1/30/2015 09:36:17

I couldn't agree more. I also think that a lot of Big Data is about people a) being lazy and not modeling their data ahead of time and b) just avoiding process and procedure that is there for a reason.

Everyone is crazy about "schema-on-read" but the problem with that is nobody is going to go back and clean up the schema in the data itself afterwards. The schema is going to be scattered through a bunch of application code. Nobody is going to be able to use the data after the folks who wrote the app code up and leave or die or whatever. It will not be maintainable and you will have 100 TB of messy text files in your "data lake" that nobody understands.

J. Cleaver
1/30/2015 14:22:34

Big Data and statistical analysis are not a substitute for clear thinking, let alone critical thinking.

TBH, it really feels like just the Developers/Operations (or Programmers/SysAdmins) split, merely dressed up in different clothing...

Peter
1/30/2015 17:23:56

This is the first time I've heard of optimization models. Where can I go to understand them (preferably with applied examples)?

The top Google results give me the theoretical what, not the practical why (e.g. why would MailChimp need optimization models? Is it to streamline sending email, or handling support, or..?)

Sankar
1/30/2015 23:22:45

In our experience, we could get interesting and meaningful signals/insights from relatively small amounts of public data. It depends on the available data and its quality. There are cases where we have to get rid of a huge amount of data (big data noise) to derive meaningful data that is typically smaller in size. Analytics at the speed of thought, in our case, depends on data sources and quality.

Paul B. Felix
1/31/2015 00:44:32

Nice post. Totally agree.

Vaclav
2/1/2015 21:26:17

Hi John. I totally agree, great post!

Roy Bellman
2/2/2015 22:50:17

Long overdue discussion. Thank you for your "let's get real" observations.

Carla Gentry
2/4/2015 20:27:25

Finally! An honest blog about big data, hats off!!!!! Love it :o)

clay ashley
2/5/2015 21:11:02

AI will solve this! Just wait and see!!

Rafael Barbolo
2/9/2015 01:17:37

Best quote ever: "Big Data at the speed of Big Data".

Patrick
2/10/2015 14:37:34

In a crowded room, the loudest person yelling is the only one heard. Same with Big Data marketing: it's just that, marketing. Each company is trying to push its value prop forward. I've never been in an actual vendor-to-customer discussion (and I've been on both sides of that discussion in the last 7 years) in which either side fully stood behind the full marketing hype. Anyone worth their salt knows data projects have three components: technology, people, and process, and technology is usually the easiest to deal with.

