John Foreman, Data Scientist

Data is an Asset

9/14/2015

Last week, I read the article "Data is not an asset, it's a liability." Flip over and read it. It's short. 

Naturally, there are things in the post I don't agree with. Data writes my paychecks. And then I cash those checks. And spend that cash. My expenditures are paired with ad impression data to create a formula that feeds little baby data scientists. It's a circle of data. And it moves us all.
Originally, I was going to respond to this article in general, hand-wavy philosophical terms. But not one to pass up a juicy logical fallacy, I decided maybe it'd be fun to take on some of its points using a bunch of anecdotal evidence from my own job at MailChimp. 

I'm going to jump around in Karppinen's post, responding to the points I care about.




Surviving Data Science "at the Speed of Hype"

1/30/2015

Being in the world of big data and data science, I see a lot of stuff like this: 

Analytics at the speed of big data:

#Infographic: 82x faster insights. That's analytics at the speed of big data. http://t.co/fCnPshuhVC

— Erika Riehle (@erikariehle) October 29, 2014
Computing at the speed of innovation:

[VIDEO] Computing at the speed of innovation: http://t.co/gqzCwsFqVm #BigData

— HP Discover (@HPDiscover) January 27, 2015
Big Data at the speed of light?

To Infinity and Beyond – #BigData at the speed of light with @SkytreeHQ and SPARK: http://t.co/90ahLUxcx7 #DataScience #MachineLearning

— Kirk Borne (@KirkDBorne) December 27, 2014
Big data at the speed of thought! Now that's more like it...

Analyse big data at the speed of thought and drive rapid innovation. Choose #SAP #BUSINESS - http://t.co/DpcguWbFca

— AlphaSirius (@AlphaSirius_SAP) December 17, 2014
And my personal favorite, big data....at the speed of big data.

"The @sgi_corp UV 2 really lets you explore #BigData at the speed of Big Data."- Kalev H. Leetaru, …http://t.co/ZNF7CBYWAD

— E4 (@e4company) November 21, 2014
There is this idea endemic to the marketing of data science that big data analysis can happen quickly, supporting an innovative and rapidly changing company. But in my experience and in the experience of many of the analysts I know, this marketing idea bears little resemblance to reality.

Over the course of my career, I've built optimization models for a number of businesses, some large, like Royal Caribbean or Coke, some smaller, like MailChimp circa 2012. And the one thing I've learned about optimization models is that as soon as you've "finished" coding and deploying your model, the business changes right under your nose, rendering the model fundamentally useless. Then you have to change the model to address the new process. 

Once upon a time, I built a model for Dell that optimized the distribution of their chassis and monitors from China to their fulfillment centers in the U.S. Over and over again, my team worked on customizing our model to Dell's supply chain. The moment the project was over...Dell closed down a factory and jacked the formulation. Now, we had done some things to make the model robust in such scenarios (made factories a flexible set in the ILOG OPL code for example). But nonetheless, the model was messed up, and someone needed to fix it.

And this example was for a relatively large and stable company. Dell sure moves slower than, say, a tech startup. But with each passing year, the young, turbulent company seems more the norm than the old rigid enterprise. The speed at which businesses are changing is accelerating.

And most data science models of any real sophistication require stability.

A good demand forecast might need several seasonal cycles of historical data.

A good optimization model requires an ironed out process (manufacturing, logistics, customer support, etc.).

A good predictive model requires a stable set of inputs with a predictable range of values that won't drift away from the training set. And the response variable needs to remain of organizational interest.
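To make that drift point concrete: a common way to check whether live inputs still resemble the training set is the Population Stability Index. The sketch below is my own illustration, not anything from MailChimp; the equal-width binning and the usual rule-of-thumb thresholds are assumptions you'd tune in practice.

```python
import math

def psi(train, live, n_bins=10):
    """Population Stability Index between a feature's training and live values.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 major drift."""
    lo, hi = min(train), max(train)
    # Equal-width bin edges fit on the training data.
    edges = [lo + (hi - lo) * i / n_bins for i in range(1, n_bins)]

    def fractions(values):
        counts = [0] * n_bins
        for x in values:
            counts[sum(x > e for e in edges)] += 1  # bin index for x
        # Light smoothing so empty bins don't blow up the logarithm.
        return [(c + 0.5) / (len(values) + 0.5 * n_bins) for c in counts]

    p, q = fractions(train), fractions(live)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

Run something like this on each model input periodically; when a feature's PSI creeps past the "major drift" line, the model is scoring a world its training data never saw, and it's time to retrain or retire it.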

Process stability and "speed of BLAH" are not awesome bedfellows. Supervised AI models hate pivoting. When a business is changing a lot, its processes get monkeyed with. Maybe customer support starts working different shifts, maybe a new product release or a price change shifts demand away from historical levels, or maybe your customer base skews toward a younger demographic than your ML models have training data for. 



Whatever the change may be, younger, smaller companies mean more turbulence and less opportunity for monolithic analytics projects.

And that is not primarily a tool problem.

A lot of vendors want to cast the problem as a technological one: if only you had the right tools, your analytics could stay ahead of the changing business, in time for your data to inform the change rather than lag behind it.

This is bullshit. As Kevin Hillstrom put it recently:

If IBM Watson can find hidden correlations that help your business, then why can't IBM Watson stem a 3 year sales drop at IBM?

— Kevin Hillstrom (@minethatdata) January 25, 2015
In other words, it's very hard for sophisticated analytics software and techniques running on "big data" to run out in front of your changing business and radically benefit it.

The most sophisticated analytics systems we can point to run on stable problems. For example, ad targeting at Facebook and Google: that business model isn't changing much, and when it does, it's financially worth it to modify the model.

Airline scheduling. Oil exploration. High frequency trading.

For a model operating on these problems, the rules of the game are fairly established and the potential revenue gains/losses are substantial.

But what about forecasting demand for your new bedazzled chip clip on Etsy? What about predicting who's a fraudster lurking within your online marketplace? Is your business stable enough, and the revenue potential high enough, to keep someone constantly working on "analytics at the speed of big data" for a model in this context? 

Analytics at the speed of meat and potatoes

You know what can keep up with a rapidly changing business?

Solid summary analysis of data. Especially when conducted by an analyst who's paying attention, can identify what's happening in the business, and can communicate their analysis in that chaotic context.

Boring, I know. But if you're a nomad living out of a yurt, you dig a hole, not a sewer system.

Simple analyses don't require huge models that get blown away when the business changes. Just yesterday I pulled a bunch of medians out of a system here at MailChimp. What is the median time it takes for 50% of a user's clicks to come in after they've sent an email campaign? I can pull that, I can communicate it. And I can even give some color commentary on why that value is important to our business. (It lends perspective to our default A/B test length for example.)
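For the curious, that kind of pull is only a few lines of code. The function names and click data below are made up for illustration; they're not MailChimp's actual schema or numbers.

```python
from statistics import median

def half_click_time(click_offsets):
    """Hours after send by which 50% of one campaign's clicks had arrived."""
    ordered = sorted(click_offsets)
    return ordered[(len(ordered) - 1) // 2]  # the click that crosses 50%

def median_half_click_time(campaigns):
    """Median, across campaigns, of the time to half of all clicks."""
    return median(half_click_time(c) for c in campaigns)

# Three hypothetical campaigns, click times in hours after send:
campaigns = [[1, 2, 3, 9, 30], [2, 4, 6], [1, 1, 10]]
print(median_half_click_time(campaigns))  # → 3
```

There's no model to retrain when the business changes; you just rerun the query and re-tell the story.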

If you want to move at the speed of "now, light, big data, thought, stuff," pick your big data analytics battles. If your business is currently too chaotic to support a complex model, don't build one. Focus on providing solid, simple analysis until an opportunity arises that is revenue-important enough and stable enough to merit the type of investment a full-fledged data science modeling effort requires.


But how do I feel good about my graduate degree if all I'm doing is pulling a median?

If your goal is to positively impact the business, not to build a clustering algorithm that leverages Storm and the Twitter API, you'll be OK.

The Perilous World of Machine Learning for Fun and Profit: Pipeline Jungles and Hidden Feedback Loops

1/5/2015

I haven't written a blog post in ages. And while I don't want to give anything away, the main reason I haven't been writing is that I've been too busy doing my day job at MailChimp. The data science team has been working closely with others at the company to do some fun things in the coming year.

That said, I got inspired to write a quick post by this excellent short paper out of Google,  "Machine Learning: The High Interest Credit Card of Technical Debt."

Anyone who plans on building production mathematical modeling systems for a living needs to keep a copy of that paper close.

And while I don't want to recap the whole paper here, I want to highlight some pieces of it that hit close to home.



Why OkCupid Isn't Facebook. Not All Experiments Feel the Same

7/29/2014

When the news broke that Facebook experimented on users' emotions by manipulating their feeds, one of the most common defenses of Facebook that I heard was, "But all websites experiment on their users!"

Yesterday, we saw Christian Rudder at OkCupid, rather colorfully, make this facile argument. OkCupid has a history of experimenting with its matching algorithm and selections (and hence with users).

"But guess what, everybody: if you use the internet, you're the subject of hundreds of experiments at any given time, on every site. That's how websites work," says Rudder. 

And despite his finger-wagging, dick-like phrasing that only entrepreneurs can achieve, Rudder is correct to a point. 


Why is Big Data Creepy? And What Can We Do?

7/24/2014

I'm going to start off this post with a film clip. It's 4 minutes long, but I hope you'll watch it. The scene is from my favorite film ever, Sneakers, and in this clip, Robert Redford and his crack team of penetration testers root through a guy's trash.

Whose trash are they rooting through? A character named Werner Brandes.
Why are they rooting through his trash? They want to learn about Werner, his personality, his routine, his weaknesses. Because Robert Redford wants to find a way to get close to Werner and exploit him to break into his workplace.

Let's watch.


Facebook's solution to big data's "content problem": dumber users

6/29/2014

In the early days of cinema, Soviet filmmakers were fascinated with film editing, i.e. placing shots in an arranged order. One of these filmmakers (and possibly the first film theorist), Lev Kuleshov, proposed that the emotional power of cinema lay not in the images themselves but in the way they were edited together.

For Kuleshov, the sequential juxtaposition of content lends meaning to images that may have nothing to do with each other. 

And he conducted an experiment, the so-called Kuleshov effect, to highlight this principle. Kuleshov took a clip of Ivan Ilyich Mozzhukhin (the Ryan Gosling of Tsarist Russia) staring at the camera and intercut it with some other images: a bowl of soup, a girl in a coffin, an attractive woman. When he showed this sequence to an audience, the viewers noted how the emotional state of the actor changed from cut to cut. He's hungry. He's sad. He's got the hots for that lady.

The audience praised Mozzhukhin's emotive performance. But the actor's stare that Kuleshov used was the same in each cut. Here's an example of the effect:



Data Science Hearts User Experience

5/23/2014

I feel bad for data scientists that have gotten stuck on ad targeting. 

When your job is to increase engagement on something you've shoved in someone's stream or in someone's search results, then you're effectively pissing against the UX wind. That can't feel good. Your job is antagonistic to your users' goals (because your users aren't your customers).



The Forgotten Job of a Data Scientist: Editing

5/8/2014

I have made this [letter] longer than usual, because I have not had time to make it shorter. –Blaise Pascal

Within the arts, there has always been a tension between ornamentation and simplicity. The good artist is one who can execute a technique exceptionally, but the great artist is one who knows when to hold back.

I myself love a good grilled London Broil. But whenever I make it myself, it tastes like pencil erasers. I learned that I was adding way too much oregano to the marinade, and it was overpowering everything. If a little is good, how can more not be good-er?



A Live Introduction to Data Science: Naive Bayes and Holt-Winters Forecasting

3/12/2014

A month ago I spoke at Strata Conf in Santa Clara. Unlike my usual talks, this one was a 3 hour tutorial taken from my book, Data Smart. Specifically, I spent 3 hours doing chapter 3 and an abbreviated chapter 8. On the whole, I think it was a useful session, so I've posted it online so anybody can follow along.

Chapter 3 is an introduction to supervised machine learning via naive Bayes. And Chapter 8 is demand forecasting using triple exponential smoothing (Holt-Winters).
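For a taste of what chapter 8 covers, here's a minimal additive Holt-Winters forecaster. It's a Python sketch of the technique, not the spreadsheet implementation from the book, and the initialization scheme and smoothing parameters are illustrative choices.

```python
def holt_winters_additive(series, m, alpha, beta, gamma, h):
    """Additive triple exponential smoothing.
    series: observations, m: season length, h: periods to forecast."""
    # Initialize level to the first season's mean.
    level = sum(series[:m]) / m
    # Initialize trend to the average per-step change between seasons 1 and 2.
    trend = sum((series[m + i] - series[i]) / m for i in range(m)) / m
    # Initialize seasonal components as deviations from the first season's mean.
    seasonals = [x - level for x in series[:m]]

    for t in range(m, len(series)):
        prev_level = level
        s = seasonals[t % m]
        level = alpha * (series[t] - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonals[t % m] = gamma * (series[t] - level) + (1 - gamma) * s

    # Project level and trend forward, reusing the seasonal pattern.
    n = len(series)
    return [level + (k + 1) * trend + seasonals[(n + k) % m] for k in range(h)]
```

Feed it a few seasonal cycles of demand history and it returns the next h periods; in the talk, the same arithmetic happens cell by cell in a spreadsheet.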

The first 5 minutes of the first video have garbled audio, but the tech fixes it before too long. My apologies.

To follow along, download the spreadsheets for chapters 3 and 8 from the downloads section of the book's website:
http://www.wiley.com/WileyCDA/WileyTitle/productCd-111866146X.html

Direct links:
http://media.wiley.com/product_ancillary/6X/11186614/DOWNLOAD/ch03.zip
http://media.wiley.com/product_ancillary/6X/11186614/DOWNLOAD/ch08.zip

You'll need to unzip these spreadsheets and clear some data out of them per my instructions in the talk. Also, you'll need access to spreadsheet software.

PART 1: [video]
You're doing great, hang in there!

PART 2: [video]
Can you feel the knowledge washing over you?

PART 3: [video]
One lap to go!!!

PART 4: [video]

The $30/hr Data Scientist

3/6/2014

Yesterday a journalist asked me to comment on Vincent Granville's post about the $30/hr data scientist for hire on Elance. What started as a quick reply in an email spiraled a bit, so I figured I'd post the entire reply here to get your thoughts in the comments.

When we ask the question, "Can someone do what a data scientist does for $30/hr?" we first need to answer the question, "What does a data scientist do?" And there are a multitude of answers to that question. 



    Author

    Hey, I'm John, the data scientist at MailChimp.com.

    This blog is where I put thoughts about doing data science as a profession and the state of the "analytics industry" in general.

    Want to get even dirtier with data? Check out my blog "Analytics Made Skeezy", where math meets meth as fictional drug dealers get schooled in data science.

    Reach out to me on Twitter at @John4man

    Click here to buy the most amazing spreadsheet book you've ever read (probably because you've never read one).
