John Foreman, Data Scientist

The Forgotten Job of a Data Scientist: Editing

5/8/2014

I have made this [letter] longer than usual, because I have not had time to make it shorter. –Blaise Pascal

Within the arts, there has always been a tension between ornamentation and simplicity. The good artist is one who can execute a technique exceptionally, but the great artist is one who knows when to hold back.

I myself love a good grilled London Broil. But whenever I made it myself, it tasted like pencil erasers. Eventually I learned that I was adding way too much oregano to the marinade, and it was overpowering everything. If a little is good, how can more not be good-er?

It’s like the painter Ad Reinhardt said: “The more stuff in it, the busier the work of art, the worse it is. More is less. Less is more.”

Data science is a young occupation that could stand to learn from these older pursuits. Whether it’s writing, cooking, or painting, editing is a core component of becoming a master of the discipline. Knowing when to hold back.


The same is true in analytics. Oftentimes, a data scientist can build a better model, a more complex model, a more accurate model. But that doesn’t mean they should.
[Image: Reinhardt's minimalist painting consisting of black squares.]
An article this week proclaimed, much to the data science community’s chagrin, that “most of a data scientist’s time is spent creating predictive models.” Forget about cleaning data, doing the historical analyses that go into basic reports, and so on. Apparently, the core job is predictive modeling. I fear for the company that hires a data scientist who believes that, not only because they won’t get anything of practical worth done, but because a data scientist with that mindset would never stop to ask one of the most important predictive modeling questions of all:

Do I really need to build this model? Can I do something simpler?

If your job is building models, all you do is try to build models. A data scientist’s job should be to assist the business using data regardless of whether that’s through predictive modeling, simulation, optimization modeling, data mining, visualization, or summary reporting.

In a business, predictive modeling and accuracy are a means, not an end. What’s better: a simple model that’s used, updated, and kept running? Or a complex model that works when you babysit it, but the moment you move on to another problem, no one knows what the hell it’s doing?

Robert Holte weighed simplicity against accuracy in his rather interesting and thoroughly titled paper “Very Simple Classification Rules Perform Well on Most Commonly Used Datasets.” In that paper, the *air quotes* AI model he used was a single decision stump: one rule. In code, it’d basically be an IF statement. He simply looked through a training dataset, found the single best feature for splitting the data with respect to the response variable, and used that feature to form his rule. The scandalous part of Holte’s approach: his results were pretty good!
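
To make that concrete, here’s a minimal sketch of a 1R-style learner in Python. The toy data and feature names are mine, not Holte’s; the point is just the mechanics: for each feature, predict the majority class per feature value, and keep the single feature with the fewest training errors.

```python
from collections import Counter, defaultdict

def one_r(rows, features, target):
    """Return (best_feature, rule), where rule maps a feature value to a predicted class."""
    best_feature, best_rule, best_errors = None, None, float("inf")
    for f in features:
        # Tally class counts for each value this feature takes.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[f]][row[target]] += 1
        # The candidate rule predicts the majority class for each feature value.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in counts.values())
        if errors < best_errors:
            best_feature, best_rule, best_errors = f, rule, errors
    return best_feature, best_rule

# Toy training data: does a trial user end up paying?
rows = [
    {"generated_api_key": "yes", "imported_list": "yes", "outcome": "paid"},
    {"generated_api_key": "yes", "imported_list": "no",  "outcome": "paid"},
    {"generated_api_key": "no",  "imported_list": "yes", "outcome": "free"},
    {"generated_api_key": "no",  "imported_list": "no",  "outcome": "free"},
]

feature, rule = one_r(rows, ["generated_api_key", "imported_list"], "outcome")
print(feature, rule)  # -> generated_api_key {'yes': 'paid', 'no': 'free'}
```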

And so Holte raises this point: “simple-rule learning systems are often a viable alternative to systems that learn more complex rules. If a complex rule is induced, its additional complexity must be justified…”

Complexity must be justified. Stop and think about that a moment.

If I’m working on a high-frequency trading model, additional complexity might mean millions of dollars in revenue. And I’ve likely been hired to gold-plate that model, with no end user beyond myself and a few other PhDs.

But I don’t work in high-frequency trading. I work at a much more fun company. That means that while some of my models are mission-critical (revenue is on the line and every bit of accuracy helps), other models are less important.

Recently I needed to build a model that classified users into three groups: paid, likely-to-pay-in-the-future, and will-never-pay-us-a-dime. I could have built a highly complex model. There were many weak predictors at my disposal (“have you generated an API key in the first 30 days?”) that would have worked great in an ensemble model. But I stopped myself.

This was a model that no one would have the time to babysit. And our business wasn’t on the line. It just needed to be something that worked better than checking “free/paid” within an account.

Holte argues in his paper that the datasets used in predictive modeling often have relatively few features that really stand out as predictors. Investigating my data, I found this was the case. There was one feature that was highly predictive of likely-to-pay-in-the-future, and a second feature that was almost as powerful. Beyond those two, the remaining features boosted accuracy only marginally.

So rather than build a full-fledged AI model, I handed an engineer an IF statement: “Check this and this in a user’s account. If those two things are true, they’re likely to pay us in the future.”
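
In code, that handoff amounts to something like the sketch below. The two account checks are hypothetical stand-ins; beyond the API-key example above, the actual features aren’t named here.

```python
def classify_account(account):
    """The whole 'model': one IF statement over a couple of account attributes."""
    if account["is_paid"]:
        return "paid"
    # Hypothetical stand-ins for the two highly predictive checks.
    if account["generated_api_key_in_first_30_days"] and account["imported_a_list"]:
        return "likely-to-pay-in-the-future"
    return "will-never-pay-us-a-dime"
```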

Now, I still let the business know the false positive rate (FPR) and true positive rate (TPR) of this simple model. You’ve gotta let people know what they’re sacrificing by going with a simple approach. Because whether or not you increase complexity for additional accuracy is not a data science decision. It’s a business decision.
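
One way to report that sacrifice: score the rule against historical accounts with known outcomes and hand the business the two rates. A rough sketch, reusing the hypothetical classify_account above:

```python
def tpr_fpr(accounts, actual_labels, positive="likely-to-pay-in-the-future"):
    """True positive rate and false positive rate of the simple rule on labeled history."""
    tp = fp = fn = tn = 0
    for account, actual in zip(accounts, actual_labels):
        predicted_positive = classify_account(account) == positive
        actually_positive = actual == positive
        if predicted_positive and actually_positive:
            tp += 1
        elif predicted_positive and not actually_positive:
            fp += 1
        elif not predicted_positive and actually_positive:
            fn += 1
        else:
            tn += 1
    return tp / (tp + fn), fp / (fp + tn)
```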

So keep in mind that as a data scientist you’re a technical person, but there’s also a bit of artistry in your job. You are an editor subject to the needs of the business.
4 Comments
rhuben
5/8/2014 15:27:14

even 1R would still require updating after some time, no? so it is mostly a tradeoff of how much time to develop vs. maintain vs. time to edit.

Also, how hard is it to create a feedback loop that'll update a complex model automatically?

John
5/9/2014 00:37:52

With 1R, I'd suspect that as new records came in, you'd have to retrain by evaluating whether these records shifted your rule to some other feature or breakpoint. Not altogether dissimilar from having to keep a whole CART or Random Forest up to date.

I kinda like naive bayes in this respect because as new records come in, you're shoving counts in a database and there's no model per se. Same with kNN. Makes "retraining" easy.
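
For the curious, a minimal sketch of that count-based naive Bayes idea (illustrative code, not anything running in production): new records just increment counts, so "retraining" is folded into the update itself.

```python
import math
from collections import defaultdict

class CountingNaiveBayes:
    """'Training' is just incrementing counts, so new records fold in as they arrive."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        self.value_counts = defaultdict(int)  # (class, feature, value) -> count

    def update(self, features, label):
        # No batch retrain: shove the new record's counts into the store.
        self.class_counts[label] += 1
        for f, v in features.items():
            self.value_counts[(label, f, v)] += 1

    def predict(self, features):
        best_label, best_score = None, float("-inf")
        total = sum(self.class_counts.values())
        for label, n in self.class_counts.items():
            # Log prior plus log likelihoods, with add-one smoothing
            # (the denominator assumes roughly binary feature values).
            score = math.log(n / total)
            for f, v in features.items():
                score += math.log((self.value_counts[(label, f, v)] + 1) / (n + 2))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```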

Louis Dorard
5/23/2014 00:07:38

I think my point about creating predictive models was misinterpreted. I agree with you that it is a means, not an end, but it is at the center of what a data scientist does. Here are some more thoughts on this, hopefully it helps clarify my original message: www.louisdorard.com/blog/automating-the-data-scientist

You mention the trade-off between simplicity and accuracy, and show that accuracy is not the ultimate objective, which is an excellent point. Sometimes, accuracy aligns perfectly with revenue. However, when things are not that simple, how do you measure the performance and impact of what you do with the model? Again, this is not a data science question but a business question.

This shows that a data scientist must have business acumen and that he should not just be a statistician who’s good at software engineering (or the other way around), as some may say.

Harry
6/12/2014 23:39:37

It was pretty interesting to read this blog. Good work. I liked the way you have narrated it.



    Author

    Hey, I'm John, the data scientist at MailChimp.com.

    This blog is where I put thoughts about doing data science as a profession and the state of the "analytics industry" in general.

    Want to get even dirtier with data? Check out my blog "Analytics Made Skeezy", where math meets meth as fictional drug dealers get schooled in data science.

    Reach out to me on Twitter at @John4man
