John Foreman, Data Scientist - Blog

Data is an Asset
Mon, 14 Sep 2015 14:52:37 GMT
http://www.john-foreman.com/blog/data-is-an-asset

Last week, I read the article "Data is not an asset, it's a liability." Flip over and read it. It's short.

Naturally, there are things in the post I don't agree with. Data writes my paychecks. And then I cash those checks. And spend that cash. My expenditures are paired with ad impression data to create a formula that feeds little baby data scientists. It's a circle of data. And it moves us all.
Originally, I was going to respond to this article in general, hand-wavy philosophical terms. But not one to pass up a juicy logical fallacy, I decided maybe it'd be fun to take on some of its points using a bunch of anecdotal evidence from my own job at MailChimp. 

I'm going to jump around in Karppinen's post, responding to the points I care about.

"Old data usually isn’t very interesting"

In the blog post, the author seems to qualify "old data" as something that's not happening right now; for example, data that's a year old.

This doesn't match my experience. Many of the products I've created at MailChimp are powered by ostensibly "old" data. For example, our antiabuse model, Omnivore, runs on years' worth of data to evaluate list uploads from users prior to content creation or sending. And while current data is quite valuable for these models, old data is equally valuable. If a user uploads a list of email addresses that we haven't sent to in a year (and we send over 15 billion emails each month), there's a staleness there that's alarming.

Let me give a more recent example than Omnivore: MailChimp Pro. We're releasing it this week. Pro is a collection of data-centric features; things like multivariate testing, user credit scoring, and post-send data mining.

MailChimp Pro was conceived of and designed by analyzing years of account data and survey data. Hell, we even used this historical data (things like exit surveys and on-boarding data) to understand price sensitivity in order to price MailChimp Pro correctly (friends don't let friends cost-based-price).

The default timing and sample size values placed in MailChimp Pro's Multivariate Testing feature came from studying years of A/B tests sent through our system. It's a good thing we didn't throw that data away or start collecting it only when we decided to build multivariate. Likewise, our Pro Compliance Insight tool, which works much like a credit score, is powered by a huge amount of historical marketing performance data.

"Sure, spotting trends in historical data might be cool, but in all likelihood it isn’t actionable"

I would contend the opposite is true. Spotting trends in historical data is often actionable even though it is not cool. What am I thinking about? 


At MailChimp, my team does a lot of forecasting using historical data. And a good forecast that's going to separate out things like seasonality and trend often demands years of historical data. This process of producing demand forecasts rarely feels cool, but the business certainly takes action on it. Investments in people and infrastructure are often driven by forecasts. 

And if you get these decisions wrong because you don't collect data and instead operate on your gut, you might end up hiring a bunch of people you don't need.
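To make the seasonality-and-trend point a little more concrete, here's a minimal sketch of a classical additive decomposition on a monthly series. This is not our forecasting code at MailChimp; the function name `decompose_monthly` and the synthetic numbers are made up for illustration. The thing to notice is that the method flat out refuses to run on less than two full years of history.

```python
from statistics import mean

def decompose_monthly(series, period=12):
    """Classical additive decomposition: series = trend + seasonal + remainder.

    Returns (trend, seasonal). trend has None at the edges where the centered
    moving average is undefined; seasonal has one value per position in the cycle.
    Assumes an even period (12 for monthly data).
    """
    n = len(series)
    if period % 2 != 0:
        raise ValueError("this sketch only handles an even period")
    if n < 2 * period:
        raise ValueError("need at least two full seasonal cycles of history")
    half = period // 2
    trend = [None] * n
    # Centered moving average: half weight on the two end points of the window.
    for t in range(half, n - half):
        window = series[t - half : t + half + 1]
        trend[t] = (0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]) / period
    # Seasonal effect: average detrended value at each position in the cycle.
    seasonal = []
    for m in range(period):
        detrended = [series[t] - trend[t] for t in range(m, n, period) if trend[t] is not None]
        seasonal.append(mean(detrended))
    # Center the seasonal effects so they sum to zero.
    offset = mean(seasonal)
    seasonal = [s - offset for s in seasonal]
    return trend, seasonal

# Synthetic example: three years of data with a linear trend plus a fixed monthly pattern.
pattern = [-10, -8, -5, 0, 4, 6, 3, 0, -2, 2, 12, -2]
series = [100 + 2 * t + pattern[t % 12] for t in range(36)]
trend, seasonal = decompose_monthly(series)
print(seasonal)
```

With only 18 months of data the call raises a `ValueError`, which is the whole argument in miniature: you can't separate season from trend on history you never collected.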

"You start with the questions you want answered. Then you collect the data you need (and just the data you need) to answer those questions."

There's some truth here. It is always important to lead data science and data product efforts with specific questions and goals. Too many data projects have failed because the only goal was "if you get it in Hadoop, they will come."

At the same time though, you can't wait until you know what you're going to do when it comes to data collection. What if the problem you want to solve requires lots of historical data (many ML models do, including the examples I gave above)? 

My team has a data product in alpha that's powered by years and years of list subscription data. And all of it is important for our ML models. We started building this particular model around the new year. What if we'd waited until then to collect list subscription data? We'd have had to wait for years to produce the accuracy we got immediately, because we'd already been collecting data.

"When the amount of data gets truly big, so do the problems in managing it all."

Sure, if by "managing," the author means "using in production" where things like speed and availability matter. But collection isn't terribly hard (yay log files!). So go ahead and start collecting those large datasets in all of their terribly-formatted glory, but don't start using that data until you know what you want to do. That way you can be choosy and pull a smaller, easier to use set if you need to. 

"You can’t expect the value of data to just appear out of thin air. Data isn’t fissile material. It doesn’t spontaneously reach critical mass and start producing insights."

The first part of this point is right. The value of data doesn't just appear out of thin air. It's given value via its usage. And that requires intentionality and planning and good ideas (shouldn't stop you from collection before that point!).

That said, in my experience, there does exist a point in certain scenarios where data reaches a "critical mass." And that critical mass often has to do with coverage. A gas station may be hard-pressed to predict a whole lot about you based solely on the snacks you buy. They only see a small fraction of your purchases. Especially compared to something like Facebook + Acxiom. Their data set covers so many purchases and ad impressions that they can engage in modeling efforts your local gas station can't touch.

I've seen this at MailChimp. Three years ago, we didn't have enough eCommerce360 data to conduct a whole lot of ecomm modeling. But with the success of our many shopping cart integrations and the growth of our company in general, our accumulated ecommerce data set has reached a critical mass. All of a sudden, value has appeared out of thin air; in other words, I can entertain modeling and product ideas that use the data that I never would have entertained three years ago when user coverage and historical depth hadn't built up yet.

Data is an asset

Like I said, I'm biased, because data is my life and livelihood. But I've seen company after company, my own, other tech companies, Fortune 500s, hell even the government, benefit from collecting and using large, oldish datasets.

If you're on the fence, I'd recommend finding a cheap, lazy way to store data that you suspect might be interesting to your company later. It doesn't have to be stored all structured and beautiful. It doesn't need to be accessed quickly. Hell, pull out that old Jaz drive and shove text files onto it.

But don't let down future-you once your business has changed or grown when, all of a sudden, that dusty data serves a purpose. If you don't start collecting until you know precisely what you want, I hope you're in the time machine business. 

Surviving Data Science "at the Speed of Hype"
Fri, 30 Jan 2015 17:19:52 GMT
http://www.john-foreman.com/blog/surviving-data-science-at-the-speed-of-hype

Being in the world of big data and data science, I see a lot of stuff like this:

Analytics at the speed of big data:
Computing at the speed of innovation:
Big Data at the speed of light?
Big data at the speed of thought! Now that's more like it...
And my personal favorite, big data....at the speed of big data.
There is this idea endemic to the marketing of data science that big data analysis can happen quickly, supporting an innovative and rapidly changing company. But in my experience and in the experience of many of the analysts I know, this marketing idea bears little resemblance to reality.

Over the course of my career, I've built optimization models for a number of businesses, some large, like Royal Caribbean or Coke, some smaller, like MailChimp circa 2012. And the one thing I've learned about optimization models, for example, is that as soon as you've "finished" coding and deploying your model, the business changes right under your nose, rendering your model fundamentally useless. And you have to change the optimization model to address the new process.

Once upon a time, I built a model for Dell that optimized the distribution of their chassis and monitors from China to their fulfillment centers in the U.S. Over and over again, my team worked on customizing our model to Dell's supply chain. The moment the project was over...Dell closed down a factory and jacked the formulation. Now, we had done some things to make the model robust in such scenarios (made factories a flexible set in the ILOG OPL code for example). But nonetheless, the model was messed up, and someone needed to fix it.

And this example was for a relatively large and stable company. Dell sure moves slower than, say, a tech startup. But with each passing year, the young, turbulent company seems more the norm than the old rigid enterprise. The speed at which businesses are changing is accelerating.

And most data science models of any degree of sophistication require stability.

A good demand forecast might need several seasonal cycles of historical data.

A good optimization model requires an ironed out process (manufacturing, logistics, customer support, etc.).

A good predictive model requires a stable set of inputs with a predictable range of values that won't drift away from the training set. And the response variable needs to remain of organizational interest.

Process stability and "speed of BLAH" are not awesome bedfellows. Supervised AI models hate pivoting. When a business is changing a lot, that means processes get monkeyed with. Maybe customer support starts working in different shifts, maybe a new product gets released or prices are changed and that shifts demand from historical levels, or maybe your customer base changes to a younger demographic than your ML models have training data for targeting. 
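One cheap way to notice the kind of input drift described above is to check how much of the incoming data falls outside the range the model was trained on. The sketch below is only an illustration; `out_of_range_share` is a hypothetical helper, and the ages are invented.

```python
def out_of_range_share(training_values, new_values):
    """Fraction of new values that fall outside the range seen in training."""
    lo, hi = min(training_values), max(training_values)
    outside = sum(1 for v in new_values if v < lo or v > hi)
    return outside / len(new_values)

# Invented example: the model was trained on customers aged 25 to 60,
# and the customer base has since shifted younger.
training_ages = list(range(25, 61))
new_ages = [19, 22, 24, 31, 45]
print(out_of_range_share(training_ages, new_ages))
```

A range check like this won't catch every shift (the mix inside the range can change too), but it's simple enough to survive a pivot, which is more than can be said for most models.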

Whatever the change may be, younger, smaller companies mean more turbulence and less opportunity for monolithic analytics projects.

And that is not primarily a tool problem.

A lot of vendors want to cast the problem as a technological one. That if only you had the right tools then your analytics could stay ahead of the changing business in time for your data to inform the change rather than lag behind it.

This is bullshit. As Kevin Hillstrom put it recently:
In other words, it's very hard for sophisticated analytics software and techniques running on "big data" to run out in front of your changing business and radically benefit it.

The most sophisticated analytics systems we have examples of run on stable problems. For example, ad targeting at Facebook and Google. This business model isn't changing much, and when it does, it's financially worth it to modify the model.

Airline scheduling. Oil exploration. High frequency trading.

For a model operating on these problems, the rules of the game are fairly established and the potential revenue gains/losses are substantial.

But what about forecasting demand for your new bedazzled chip clip on Etsy? What about predicting who's a fraudster lurking within your online marketplace? Is your business stable enough and the revenue potential high enough to keep someone constantly working on "analytics at the speed of big data" to use a model in this context? 

Analytics at the speed of meat and potatoes

You know what can keep up with a rapidly changing business?

Solid summary analysis of data. Especially when conducted by an analyst who's paying attention, can identify what's happening in the business, and can communicate their analysis in that chaotic context.

Boring, I know. But if you're a nomad living out of a yurt, you dig a hole, not a sewer system.

Simple analyses don't require huge models that get blown away when the business changes. Just yesterday I pulled a bunch of medians out of a system here at MailChimp. What is the median time it takes for 50% of a user's clicks to come in after they've sent an email campaign? I can pull that, I can communicate it. And I can even give some color commentary on why that value is important to our business. (It lends perspective to our default A/B test length for example.)
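For the curious, here's roughly what pulling that kind of median might look like. This is a sketch, not the query I actually ran: the campaign names, the click delays, and both function names are hypothetical, and I'm assuming click times have already been converted to hours since send.

```python
import math
from statistics import median

def hours_to_half_clicks(click_delays_hours):
    """Hours after send by which at least half of a campaign's clicks had arrived."""
    delays = sorted(click_delays_hours)
    if not delays:
        return None  # a campaign with no clicks tells us nothing here
    k = math.ceil(len(delays) / 2)  # smallest click count that reaches 50%
    return delays[k - 1]

def median_half_click_time(campaigns):
    """Median, across campaigns, of the time it takes to collect half the clicks."""
    times = [hours_to_half_clicks(delays) for delays in campaigns.values()]
    return median(t for t in times if t is not None)

# Invented data: hours between send and each click, per campaign.
campaigns = {
    "campaign_a": [0.5, 1, 2, 30],
    "campaign_b": [1, 3, 4, 5, 48],
    "campaign_c": [0.2, 0.4, 10],
}
print(median_half_click_time(campaigns))
```

No training set, no feature pipeline, nothing to retrain when the business pivots next quarter.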

If you want to move at the speed of "now, light, big data, thought, stuff," pick your big data analytics battles. If your business is currently too chaotic to support a complex model, don't build one. Focus on providing solid, simple analysis until an opportunity arises that is revenue-important enough and stable enough to merit the type of investment a full-fledged data science modeling effort requires.

But how do I feel good about my graduate degree if all I'm doing is pulling a median?

If your goal is to positively impact the business, not to build a clustering algorithm that leverages Storm and the Twitter API, you'll be OK.

The Perilous World of Machine Learning for Fun and Profit: Pipeline Jungles and Hidden Feedback Loops
Mon, 05 Jan 2015 16:14:40 GMT
http://www.john-foreman.com/blog/the-perilous-world-of-machine-learning-for-fun-and-profit-pipeline-jungles-and-hidden-feedback-loops

I haven't written a blog post in ages. And while I don't want to give anything away, the main reason I haven't been writing is that I've been too busy doing my day job at MailChimp. The data science team has been working closely with others at the company to do some fun things in the coming year.

That said, I got inspired to write a quick post by this excellent short paper out of Google,  "Machine Learning: The High Interest Credit Card of Technical Debt."

Anyone who plans on building production mathematical modeling systems for a living needs to keep a copy of that paper close.

And while I don't want to recap the whole paper here, I want to highlight some pieces of it that hit close to home.

Pipeline Jungles

Picture: George prototyping a machine learning model.
There was a time as a boy when my favorite book was George's Marvelous Medicine by Roald Dahl. The book is full of all that mischief and malice that makes Dahl books so much fun. 

In the book, George wanders around his house finding chemicals to mix up into a brown soup to give to his grandmother in place of her normal medicine. And reading this bit of felony grand-matricide as a child always made me smile.

Prototyping a new machine learning model is like George's quest for toxic chemicals. It's a chance for the data scientist to root around their company looking for data sources and engineering features that help predict an outcome.

A little bit of these log files. A dash of Google Analytics data. Some of Marge-from-Accounting's spreadsheet.

POOF! We have a marvelous model.

How fun it is to regale others with tales of how you found that a combination of reddit upvotes, the lunar calendar, and the number of times your yoga instructor says FODMAPs is actually somewhat predictive!

But now it's the job of some poor sucker dev to take your prototype model, which pulls from innumerable sources (hell, you probably scraped Twitter too just for good measure), and turn it into a production system.

All of a sudden there's a "pipeline jungle," a jumbled up stream of data sources and glue code for feature engineering and combination, to create something programmatically and reliably in production that you only had to create once manually in your George's-Marvelous-Medicine-revelry.

It's easy in the research and design phase of a machine learning project to over-engineer the product. Too many data sources, too many exotic and brittle features, and as a corollary, too complex a model. One trap the paper points out is leaving low-powered features in your prototype model, because well, they help a little, and they're not hurting anyone, right?

What's the value of those features versus the cost of leaving them in? That's extra code to maintain, maybe an extra source to pull from. And as the Google paper notes, the world changes, data changes, and every model feature is a potential risk for breaking everything.

Remember, the tech press (and vendors) would have you build a deep learning model that's fed scraped data from the internet's butthole, but it's important to exercise a little self-control. As the authors of the technical debt paper put it, "Research solutions that provide a tiny accuracy benefit at the cost of massive increases in system complexity are rarely wise practice." Preach.

Who's going to own this model and care for it and love it and feed it and put a band-aid on its cuts when its decision thresholds start to drift? Since it's going to cost money in terms of manpower and tied up resources to maintain this model, what is the worth of this model to the business? If it's not that important of a model (and by important, I'm usually talking top line revenue), then maybe a logistic regression with a few interactions will do you nicely.

Humans are Feedback Loop Machines

Picture: The graveyard at Haworth
In Haworth, England, they used to bury bodies at the top of a hill above the town. When someone died, they got carted up to the overcrowded graveyard and then their choleraic juices would seep into the water supply and infect those down below, creating more bodies for the graveyard.

Haworth had a particularly nasty feedback loop.

Machine learning models suck up all sorts of nasty dead body water too.

At MailChimp, if I know that a user is going to be a spammer in the future, I can shut them down now using a machine learning model and a swift kick to the user's derriere.

But that future I'm changing will someday, maybe next week, maybe next year, be the machine learning system's present day.

And any attempt to train on present day data, data which has now been polluted by the business's model-driven actions (dead spammers buried at the top of the hill), is fraught with peril. It's a feedback loop. All of a sudden, maybe I don't have any spammers to train my ML model on, because I've shut them all down. And now my newly trained model thinks spamming is more unlikely than I know it to be.

Of course, such feedback loops can be mitigated in many ways. Holdout sets for example.
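Here's a rough sketch of what a holdout might look like in an enforcement setting. To be clear, this is an illustration and not how our production system works; the function name, the threshold, and the 5% holdout rate are all assumptions. The idea is that a small random slice of flagged accounts is deliberately left alone, so their real outcomes show up in future training data untouched by the model's own actions.

```python
import random

def assign_enforcement(account_ids, scores, threshold=0.9, holdout_rate=0.05, seed=0):
    """Decide what to do with each account given its predicted abuse score.

    Accounts under the threshold are allowed. Accounts at or over it are
    suspended, except for a random holdout that is left untouched so its
    true outcome can still be observed and used for retraining.
    """
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    decisions = {}
    for account_id, score in zip(account_ids, scores):
        if score < threshold:
            decisions[account_id] = "allow"
        elif rng.random() < holdout_rate:
            decisions[account_id] = "holdout"
        else:
            decisions[account_id] = "suspend"
    return decisions

# Invented scores for four hypothetical accounts.
decisions = assign_enforcement(["u1", "u2", "u3", "u4"], [0.10, 0.95, 0.99, 0.50])
print(decisions)
```

The holdout costs you something (a few bad actors get through), which is exactly why it tends to get skipped, and exactly why the loop goes unnoticed.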

But we can only mitigate a feedback loop if we know about its existence, and we as humans are awesome at generating feedback loops and terrible at recognizing them. 

Think about time-travel in fiction. Once you have a time machine (and make no mistake, a well-suited ML model is pretty close to a forward-leaping time machine when it comes to things like sales and marketing), it's easy to jump through time and monkey with events, but it's hard to anticipate all the consequences of those changes and how they might alter your future training data.

And yet when the outputs of ML models are put in the hands of others to act on, you can bet that the future (and the future pool of training data with it) will be altered. That's the point! I don't predict spammers to do nothing about them! Predictions are meant to be acted upon.

And so, when the police predict that a community is full of criminals and then they start harassing that community, what do you think is going to happen? The future training data gets affected by the police's "special attention." Predictive modeling feeds back into systematic discrimination.

But we shouldn't expect cops to understand that they're burying their dead at the top of the hill.

This is one of my fears with the pedestrianization of data science techniques. As we put predictive models more and more in the hands of the layperson, have we considered that we might cut anyone out of the loop who even understands or cares about their misuse?

Get Integrated, Stay Alert

The technical debt paper makes this astute observation, "It’s worth noting that glue code and pipeline jungles are symptomatic of integration issues that may have a root cause in overly separated 'research' and 'engineering' roles."

This is absolutely true. When data scientists treat production implementation as a black box they shove their prototypes through, and when engineers treat ML packages as black boxes they shove data pipelines through, problems abound.

Mathematical modelers need to stay close to engineers when building production data systems. Both need to keep each other in mind and keep the business in mind. The goal is not to use deep learning. The goal is not to program in Go. The goal is to create a system for the business that lives on. And in that context, accuracy, maintainability, sturdiness...they all hold equal weight.

So as a data scientist keep your stats buds close and your colleagues from other teams (engineers, MBAs, legal, ....) closer with the goal of getting work done together. It's the only way your models will survive past prototype.

Why OkCupid Isn't FaceBook. Not All Experiments Feel the Same
Tue, 29 Jul 2014 09:40:53 GMT
http://www.john-foreman.com/blog/why-okcupid-isnt-facebook-not-all-experiments-feel-the-same

When the news broke that Facebook experimented on users' emotions by manipulating their feeds, one of the most common defenses of Facebook that I heard was, "But all websites experiment on their users!"

Yesterday, we saw Christian Rudder at OkCupid, rather colorfully, make this facile argument. OkCupid has a history of experimenting with its matching algorithm and selections (and hence with users).

"But guess what, everybody: if you use the internet, you're the subject of hundreds of experiments at any given time, on every site. That's how websites work," says Rudder. 

And despite his finger-wagging, dick-like phrasing that only entrepreneurs can achieve, Rudder is correct to a point. 
Any website run by people who care even a little will conduct experiments on its users. The most common form of experiment would be the A/B test. Indeed, whole companies have cropped up to help websites create multiple versions of their product, see which version evokes the best response from users (measured in online engagement, user sign-ups, e-commerce revenue, etc.), and choose the winning design as the final version of the product before repeating the cycle all over again. 
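For anyone who hasn't run one, the arithmetic behind picking a "winning" version can be sketched in a few lines. This is a plain two-proportion z-test, which is one common way to compare two versions and not necessarily what any particular testing vendor uses; the function name and the visitor counts are made up.

```python
from math import sqrt
from statistics import NormalDist

def ab_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-proportion z-test comparing the conversion rates of versions A and B.

    Returns (lift, z, p_value), where lift is rate_b - rate_a and p_value is
    two-sided. Assumes at least one conversion and one non-conversion overall.
    """
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled rate under the assumption that the two versions perform the same.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return rate_b - rate_a, z, p_value

# Invented numbers: version B converts 150 of 1,000 visitors, version A 100 of 1,000.
lift, z, p_value = ab_test(100, 1000, 150, 1000)
print(lift, z, p_value)
```

A small p-value suggests the difference between the two versions is unlikely to be noise, at which point the site ships the winner and starts the next test.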

People A/B test all the time. For example, the Obama campaign streamlined the donation process via a series of experiments on their donors. And that was for the 2008 election! If this kind of experiment were such a scandal, surely political opponents would have jumped on it.

But people didn't get upset about the Obama campaign's test, or indeed, the vast majority of online experiments. Rudder is right.

That, however, doesn't mean that everyone upset by Facebook is wrong. Because not all experiments are the same. And it comes down to user expectations and incentives.

Picture: The Love Fairy
In the case of a dating site like OkCupid, users sign up for the product because of the matching algorithm. They understand that a company, not the love fairy, is assigning matches for users. And it's reasonable to assume that the dating site would continue to play with and tweak its matching algorithm, ostensibly for the purpose of improving it. After all, what's a dating site but a user interface slapped on a matching algorithm with some critical mass of users?

So, while Rudder is being intentionally flippant in his blog post, what OkCupid has done is rather in line with user expectations.

What Facebook did however is not. Why? Because users expect that their feeds are more or less natural representations of their friends' worlds. That's what Facebook pushes. When Facebook released their 10th anniversary "look back" videos, the point was, "Hey, your life and your friends' lives are documented here in a beautiful, pure, holistic, almost documentary way." But then when users learned that their emotions had been toyed with, suddenly, their expectations were defied. 

Sure, it was an experiment, but the experiment ran contrary to what people assumed was the purpose of the product.

And this leads us to incentives. Facebook's experiment, i.e. to prove that people's emotions can be compromised (to tailor your emotional state for an ad rather than to target you with an ad),  in actuality did not run contrary to the purpose of the product. Because the purpose of the product is not to allow people to experience life online or whatever vague, bubbly BS the look back videos would have users believe. The purpose of Facebook is to display ads to its users, using its vast dataset to improve the reach and effectiveness of those ads.

Through Facebook's experiment, people realized that they were using a product owned by a company whose incentives were vastly different from their own. And this discordance was uncomfortable. The experiment signaled the direction in which Facebook might permanently move -- a direction in which the already artificial experience of life via an EdgeRank-directed feed becomes more artificial as Facebook plays with that feed for the purpose of aligning your emotional state to its ad content.

On the flip side, whether OkCupid admits it or not, their incentives (and hence their experiments) are aligned with user goals. A dating site that advertises itself as one that maintains a subpar matching algorithm is a site that'll likely go out of business quickly. The company is incentivized ultimately to improve their matching algorithm, even if that means making it worse for a few users in the short term in order to run an experiment. Sure, some of OkCupid's experiments seem audacious, but the purpose was to better understand the factors that lead to successful matches on the dating website. And users get that.

I don't think most folks would have all websites cease testing their products on their users. That iterative improvement on user feedback is invaluable. Instead, what companies should consider is whether their tests are in line with user expectations and whether those tests serve to improve the user experience of the product. If there's a disconnect, then that should give a company pause.

This article is an experiment. Leave your comments in the comment section, and if you're not a bot trying to sell weight loss pills, who knows, maybe I'll change what I say based on your feedback.

Why is Big Data Creepy? And What Can We Do?
Thu, 24 Jul 2014 18:50:10 GMT
http://www.john-foreman.com/blog/why-is-big-data-creepy-and-what-can-we-do

I'm going to start off this post with a film clip. It's 4 minutes long, but I hope you'll watch it. The scene is from my favorite film ever, Sneakers, and in this clip, Robert Redford and his crack team of penetration testers root through a guy's trash.

Whose trash are they rooting through? A character named Werner Brandes.
Why are they rooting through his trash? They want to learn about Werner, his personality, his routine, his weaknesses. Because Robert Redford wants to find a way to get close to Werner and exploit him to break into his workplace.

Let's watch.

Toward a creepier definition of big data

Let's keep that clip in our back pocket for a few minutes, so that I can ask a question. 

What exactly is it about big data that makes it so creepy?

Here's the typical definition you'll see for "big data:" 

Big data is the collection of data sets too large and complex to be stored and processed using conventional approaches.

There's nothing creepy about large amounts of storage, though, is there? So then what is it? Well, let's take a look at some "big data" illustrations from vendors' literature to get a better definition.

We'll have to start with the most popular of big data images: the glowing binary tunnel of big data.

The blue binary data tunnel.
Obviously, there's a notion of data collection here (in binary of course...and blue, always blue), but the light around the corner seems to betray something more than collection. It's like the briefcase in Pulp Fiction. We don't know what's in the briefcase, but we know it's important. It's glowing. It's special.

And below we have three images from big data technology vendor sites. Eyes, eyes, and more eyes. Once again, we've got binary (the last image has some 2s in it as well for some unknown reason). There's an element, then, of seeing something in this blue binary data. Discovery.
Next, we've got images of keys and locks. In one image a binary key unlocks a hex lock (?), and in another image, a key unlocks a lock in an eyeball made from both circuit boards and binary. Get an editor, people!
So what do these images tell us? That big data allows us to access something or do something that previously had been unavailable to us.

Here then is a new definition of big data based on this imagery:

Big data is the collection of data sets too large to be stored using conventional approaches for the purpose of enabling organizations to discover and act on insights in the data.

Is that creepy? I'm not so sure we're there yet.

After all, since the early nineties the airline industry has been collecting large data sets, discovering patterns in demand and elasticity in the data, and using those discoveries to lock and unlock fare classes programmatically in order to maximize revenue on flights. This practice is called revenue management or yield management, and it predates the creepiness of big data by decades.

And what about FICO's Falcon program for realtime fraud detection in streams of credit card data? That thing has been around since the mid nineties. But I think all of us appreciate that programs like Falcon lock our stolen cards down pretty quick.

So what is it then? Well, let's take a peek at some of the recent examples of big data creepiness. 

There's the case of the Android app "Brightest Flashlight Free," which tracked users' GPS locations for the purpose of passing that data on later at a profit.

We've all heard the Eric Schmidt quote, "We know where you are. We know where you've been. We can more or less know what you're thinking about."

Little did we know the "we" in his sentence was any company able to code up an app that turned on a cameraphone's flash.

And how about the case of Shutterfly predicting (albeit poorly) which users likely just had a kid, so that they could market to them about preserving these precious moments? We don't really know what data Shutterfly used here. We don't even know if it was internal data or if they bought it from a broker. But it smelled funny to a lot of customers.
Nope, no predicted bundle of joy here.
For a little public sector flavor, we have the Chicago PD predicting and targeting possible future criminals for a bit of harassment, which they have called (in what is perhaps the best bit of pre-crime euphemistic language ever) a "custom notification."
A custom notification from the fuzz.
The Chicago PD had access to things like address/neighborhood, high school graduation, gender, and age, and all they did was bring this data together in an arguably discriminatory modeling soup and target the likely offenders without a whole lot of justification that harassment prior to a crime would reduce the incidence of criminal behavior.

So what's the difference between Delta performing revenue management or FICO running antifraud models and what we see today? 

In the present world of big data, there are more companies, combining more data sources, to produce more targeted actions.

More Companies

Let's recall the Sneakers clip we just watched. Robert Redford's team was no Fortune 500 company. They're just a small band of misfits, completely unknown as an organization to their target, to any regulating body, or to anyone, really. Yet they had the ability to pull data from the DMV, credit reports, organizational memberships, the household trash, etc. and use it to predict things about Werner Brandes (he's anal retentive, for one). That's the world we now live in, except teams of sneakers are everywhere, your household trash data can be pulled via API, and we're all poor Werner.

The Chicago PD and Brightest Flashlight Free are a far cry in size, budget, and sophistication from Delta or FICO. More companies can collect and use large amounts of data. And that's worrisome in the same way that giving everyone a gun is worrisome.

Sure, if everyone has a gun, then we're all armed, and that's supposedly safe in a mutually assured destruction kind of way. But on the flip side...now everyone has a gun including people who garner less public scrutiny and who don't know how to use their guns.

Does the Chicago PD understand that by delivering "custom notifications" to potential criminals, their potentially discriminatory actions are affecting, perhaps negatively, those same neighborhoods that will generate future streams of data for use in their models? Probably not.

So what we have then in the big data world, through more accessible software (open source data analysis and ML tools), accessible data (APIs, datasets for purchase, etc.), and cheaper compute power, is a situation in which we're arming organizations large and small to use data to make potentially harmful decisions (e.g. precrime decisions, credit approval, job hiring decisions, admissions decisions, etc.).

And what creeps you out more? 10 elephants with rocket launchers on their backs or millions of mice with lasers?

More Data...Sources

When we look at airline revenue management or FICO's Falcon program, we see massive amounts of data lying under complex models, but these efforts used one source of data. In the airline industry they used demand data to predict demand. In the credit card industry, transaction data was used to predict transaction fraud.

But this isn't the case today. Just as we saw in the Sneakers clip, where the team combined public data with location data and transactional data (Werner's garbage), companies today combine data sources to give themselves more predictive power than any one source alone could offer. When it comes to combining data streams, 1+1=3. That's why Facebook recently acquired credit card purchase data from Acxiom to combine with its ad impression data -- those two sources in isolation are somewhat interesting, but when combined, they create a far more powerful picture of what it takes to market successfully to someone.

In the case of the Chicago PD, we see the combination of multiple data sources (crime data, demographic data, education data, etc.) to support a modeling effort. This combination of data, including data that a company itself may not have even generated, supports the ability for just about any organization to get into the big data game. 

Obviously, the internet has created more and more streams of our data to be accessed by the originating companies as well as other companies to whom they might pass it. That's why the Brightest Flashlight Free was free in the first place. So that a stream of GPS data might be resold. And the proliferation of mobile devices, connected appliances (the Tesla I wish I owned, my Samsung washing machine, Nest), and wearables ensures that these data sources will only increase in number and in the scope of our lives that they can document.

In other words, it's not the number of records that adds a creep factor to big data. Lots of large data sets aren't creepy. The "big" in big data also has to do with the number of data sources a company can access, the variety in those streams' origins, and the scope of our lives that those streams cover.

More Targeted

In the case of revenue management in the mid-nineties, sure, closing a fare class and raising the price of an airline ticket was, personally, a pain in the ass. But that revenue management wasn't personal per se. The fare class that gets closed likely gets closed for everyone.

Contrast that with Shutterfly predicting who's pregnant and who's not. That's personal. The Chicago PD knocking on your door before you've done anything wrong is personal. The data that was tracked by your flashlight app wasn't sold in aggregate. No, individual location data was sold.

There's an upside to personal. Like when Disney talks about tracking everyone inside their parks to provide a "personalized experience." Creepy, yes. But when the park's attractions are customized to your interests, that's kinda nice.

But there's a risky side to personal, isn't there? Just as Robert Redford's team of sneakers use Werner's personal information to target him with the perfect computer date, so now can companies target us. What's a "personalized notification" from the cops? That might just be profiling or discrimination. And what's another word for highly personalized advertising that leverages data on your insecurities? We might just as well call that manipulation. And when organizations discriminate against us and manipulate us, that all sums up to a loss of control (agency, humanity).

It used to be that the ad industry had only advertising atom bombs at their disposal. Tactics like "sex sells." Now they've got data-powered advertising drone strikes. Things like, "Diet pills sell...not to everyone, but certainly to people with status posts and photos like you."

So let's take another swing at defining big data:

Big data is the collection and combination of diverse data sets too large to be stored using conventional approaches to enable organizations large and small to discover and act on insights in the data for the purpose of controlling individuals and targeted groups of people.

Ooooo, that gave me chills.

We Can't Keep Track of our Data

Agatha Christie couldn't help but leak health data
Just like Werner Brandes couldn't keep track of what went into his household trash, we can't keep track of the data we generate anymore. Ever run Little Snitch on your computer to see where your data is going? The number of ad networks, dynamic content providers, and online analytics platforms that tag along on a single connection to cnn.com will blow your mind out your eyeballs.

And even if we could keep track of who's rooting through our trash, how can we possibly keep track of its potential uses? 

When Agatha Christie pushed data out into the world in the form of her novels, did she have any idea those sentences would be used to diagnose her with Alzheimer's?

Just like the "folded tube of Crest" in the Sneakers clip betrayed Werner Brandes' personality, we have no idea what the data-cum-bathroom-trash of our lives is betraying about us. Which then raises the question: while in the U.S. (and in most of the developed world) Protected Health Information is a category of data strictly guarded by law, do these protections matter anymore?

If a model can look at my posts on social media and predict my health status with some level of certainty, does PHI cease to exist as a category? Is all information Protected Health Information?

People are not made up of silos. There is not a health category to John Foreman and a data scientist category to John Foreman and a good-times-on-a-saturday-night category to John Foreman. We are always our whole selves, and the data we generate will betray us holistically and in unexpected ways. Just because I'm at work doesn't mean that I haven't brought coffee breath from home in with me.

Maybe Companies Will Run Out of Things To Say

We've just about reached "peak creepiness" in this blog post. So let's try to find our way back down to a more level-headed view of all this.

Here's one counterpoint to this unnerving development I've raised in the past: sure these companies know everything about us, but do they really have all the content they need to target us?

I had a friend recently who said something like, "I dare all these websites to try to get me to engage with their ads. I don't buy anything, listen to anything, or wear anything, unless I'm getting the recommendation from a friend. No ad could possibly fathom my tastes." 

I don't doubt this friend. But perhaps companies don't need to find the perfect ad for him. Maybe they just need to change his personality temporarily until one of their current ads is a match.

As I pointed out in my previous post, Facebook's recent emotional manipulation experiment demonstrates that while companies can target us with content using big data, they can also target content with us. In other words, they can draw users emotionally nearer to their ads' targeted emotional states (insecure, feeling fat, whatever) by using tangential content (our friends' posts on Facebook) to make us feel a certain way.

So if these companies aren't going to run out of data-guided ad strikes, what can we do?

Maybe We Can Own Our Own Data

Let's talk about being invited to a backyard party at Ryan Gosling's house.

If Ryan (we're on a first name basis) invites me over to his party, do I have to go?

No.


But what if all my friends are going? I don't want to be the odd man out!

Do I have to go to Ryan Gosling's party?

The answer is still no.

So, then, I decide to go of my own accord. And I play croquet. 

A lovely time is had by all.

When the party is done, can I go to Ryan Gosling and say, "Hey Ryan. Great party. Thanks for having me over to your backyard. But those memories you have of us playing croquet together...I need you to give me those memories. I own them."

Ryan Gosling would probably look at me like I'm crazy.

So how is using Facebook any different? Do you own the data you give to Facebook any more than you own Ryan Gosling's memories?

This is an unfair comparison in one way: when you use Facebook you actually have to contract with them through accepting their terms of use. The law, especially in the US, allows folks great freedom to contract. So you've explicitly, legally permitted Facebook to remember you playing croquet at Ryan Gosling's house when you posted that status about it.

The trade-off here is not between Facebook owning your Facebook data and you owning it. Either you agree to be on Facebook, in which case they have data on you, or you don't agree, in which case you do not own your Facebook data. Why don't you own it? Because it doesn't exist.

So the opposite of other companies owning and managing our data is oftentimes no data existing at all, so the "we own our data," "we are the data," "we the data," "we = data" argument falls apart a little bit.

And who's going to pay for all this infrastructure on which we will potentially own and manage our own data even if we did have a claim over it? I doubt most consumers would be willing to pay. We're not even willing to pay for Facebook.

All of that said, I believe that we can stanch the flow of data to companies. Sure, data must be passed to LinkedIn when I use the mobile app, but it often seems they're overasking. Systems can be developed that limit what apps ask for to what they might reasonably need.

Ryan Gosling can remember my croquet game, but he doesn't need to follow me into the shower.

At Least I Can Own My Identity, Right?

William Weld's Viagra prescription can't hide
This is the European approach. Sure, you can contract with Facebook and give them your data, but if it's got a personal identifier on it, like your name or email address, then you get some control over what's done with that identified data.

So companies are left with the option of anonymizing your data by stripping it of personal identifiers before passing it on.

That sounds good at first, and indeed, I think it's a regulatory step worth taking. But just like Agatha Christie's Alzheimer's, our identity is embedded in everything we do. Our identity is not separable from the rest of us.

In the case of Werner Brandes in the Sneakers clip, it took two data points of timestamped location data (when he left work on two different days) to personally identify him. No name or email address needed to single him out personally. And Werner's problem is our problem -- it turns out that anonymizing your personal data is very, very hard. Maybe even impossible. William Weld, the onetime governor of Massachusetts, learned this when his personal medical records were de-anonymized using basic public information (the ZIP code where he lived, his age, gender, ethnicity).
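To make the mechanics concrete, here's a minimal sketch of how that kind of re-identification can work: join an "anonymized" table to a public one on the fields they share. Every record, name, and field below is invented for illustration, and the choice of ZIP code, age, and gender as the join key is an assumption drawn from the description above, not a reconstruction of the actual Weld case.

```python
from collections import defaultdict

# Fields that are not names or emails but still narrow down who someone is.
# (Assumed for illustration; the real case may have used different fields.)
QUASI_IDENTIFIERS = ("zip", "age", "gender")

def reidentify(anonymized, public):
    """Map index of each anonymized record to a name when exactly one
    person in the public data shares its quasi-identifiers."""
    by_key = defaultdict(list)
    for person in public:
        key = tuple(person[k] for k in QUASI_IDENTIFIERS)
        by_key[key].append(person["name"])

    matches = {}
    for i, record in enumerate(anonymized):
        key = tuple(record[k] for k in QUASI_IDENTIFIERS)
        names = by_key.get(key, [])
        if len(names) == 1:  # a unique match means the record is no longer anonymous
            matches[i] = names[0]
    return matches

# Toy, made-up data: "anonymized" medical records with names stripped.
medical = [
    {"zip": "00001", "age": 51, "gender": "M", "diagnosis": "condition X"},
    {"zip": "00001", "age": 34, "gender": "F", "diagnosis": "condition Y"},
    {"zip": "00002", "age": 34, "gender": "F", "diagnosis": "condition Z"},
]

# Toy, made-up public records (think of something like a voter roll).
public_roll = [
    {"name": "Resident A", "zip": "00001", "age": 51, "gender": "M"},
    {"name": "Resident B", "zip": "00001", "age": 34, "gender": "F"},
    {"name": "Resident C", "zip": "00001", "age": 34, "gender": "F"},
    {"name": "Resident D", "zip": "00002", "age": 34, "gender": "F"},
]

matches = reidentify(medical, public_roll)
print(matches)
```

In this toy data, the second medical record stays hidden only because two residents happen to share its ZIP code, age, and gender. Add one more field to the key and that cover likely disappears too.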

So, sure, you can own your personal identifiers. And that'd work if your personal identity weren't stamped over everything else you did. Brightest Flashlight Free could watch my GPS signal travel from my address in Avondale, Georgia to my office next to Georgia Tech once or twice, cross-reference that with some property records, and they'd have my location data labeled.

Privacy Pragmatism: Regulate Sausage, not Cats

Let's say that I want to prevent cats from being put into sausages. 
Help me.
How do I go about that? Do I regulate cats at birth? If I know the cat isn't born in a sausage factory, then it's less likely to end up in sausage, maybe? Do I govern how cats can move from place to place or who can own a cat? That way I'll know if a cat moves near a sausage factory.

It seems easiest to just perform inspections of sausage factories themselves. It's not that cats exist, that cats roam, that cats can be herded or collected. If I care about how cats are used, then I should inspect possibly dangerous usage points.

This same argument, called "privacy pragmatism," can be made for data.

There's no way we can keep tabs on all data collection and retention. It's hard to know where all the data is going and indeed how it might be used to predict things we never intended.

So what if we, instead, focus in on the things we don't want certain types of data to be used for and keep an eye on them?

For example, if loan approvals are a concern, then perhaps we inspect those institutions that make loans to make sure the inputs into their models are on the up-and-up. If one of the predictors is discriminatory, that's where it'll rear its head -- right before it enters the risk model (like a sausage casing). If an institution has collected some data, but they're not using it to make their decisions, then it's like a cat that hangs out in the employee break room.

Is There Another Way?

Let's take a look at some companies currently using big data and data science to do things that aren't terribly creepy.
MailChimp uses its massive Email Genome Project database to allow certain users it deems "not a bot" to bypass the CAPTCHA entirely. eHarmony uses data on things as basic as food preference to create matches (couples where both partners enjoy Hardee's are doomed to failure). Uber uses data to power things like congestion pricing in order to reduce wait times. And Spotify powers its music discovery service with data.

Do these uses feel creepy? Certainly not Facebook levels of creepiness.

What's going on?

When we look at the creepiest of big data companies, what we see is that their customers (those who give them money) are not their data-generators. In the case of Facebook, its customers are advertisers and its users (us) give them data. In the case of Brightest Flashlight Free, the app developer acted as a data broker, selling location data on, so the users of its free app provided the data, while other companies who purchased that data were the actual customers.

Contrast that with MailChimp, Spotify, and eHarmony. All of these companies use subscriptions to receive money directly from their users. And Uber receives money directly from its riders. 

The incentives are completely different, and those incentives are directly tied to how creepy a company will act. If my goal is to make money off of customer subscriptions then I am disincentivized from creeping those same customers out.

These not-so-creepy companies are using data to improve their user experience. In my opinion, this is the future for analytics; a future that's no longer at odds with those providing the data.

But until we get to that future, if you cut into your sausage and you hear a meow, you might want to talk to a health inspector.
<![CDATA[Facebook's solution to big data's "content problem:" dumber users]]>
Mon, 30 Jun 2014 02:06:08 GMT
http://www.john-foreman.com/blog/facebooks-solution-to-big-datas-content-problem-dumber-users

In the early days of cinema, Soviet filmmakers were fascinated with film editing, i.e. placing shots in an arranged order. One of these filmmakers (and possibly the first film theorist), Lev Kuleshov, proposed that the emotional power of cinema lay not in the images themselves but in the way they were edited together.

For Kuleshov, the sequential juxtaposition of content lends meaning to images that may have nothing to do with each other. 

And he conducted an experiment, the so-called Kuleshov effect, to highlight this principle. Kuleshov took a clip of Ivan Ilyich Mozzhukhin (the Ryan Gosling of Tsarist Russia) staring at the camera and intercut it with some other images: a bowl of soup, a girl in a coffin, an attractive woman. When he showed this sequence to an audience, the viewers noted how the emotional state of the actor changed from cut to cut. He's hungry. He's sad. He's got the hots for that lady.

The audience praised Mozzhukhin's emotive performance. But the actor's stare that Kuleshov used was the same in each cut. Here's an example of the effect:

The Kuleshov effect tells us that a viewer's understanding of what they're seeing can be affected not just by the content on the screen but how it's edited together.

Editing manipulates and creates meaning by connecting (potentially unrelated) content together. It provides emphasis and perspective.

EdgeRank is a frenemy

Think of the content going through your Facebook stream as shots in a film. These shots are just pure life, right? Just an unedited stream of your friends' cats, kids, and poorly lit photos of quinoa entrees.

We know this isn't true. After all, Facebook uses EdgeRank to select what is shown to each user out of all the content available. The algorithm's whole purpose is to maximize an outcome by editing content together. And until recently, we've assumed that the objective of EdgeRank was more or less to maximize the engagement and relevance of posts in each user's stream. Ain't nothing wrong with increasing relevance! Sure, EdgeRank is an editor, but it's on your side. 

You're Scorsese, and EdgeRank is a Thelma Schoonmaker. Or maybe not.

Facebook's PNAS disfunction

In the current issue of the Proceedings of the National Academy of Sciences (PNAS), data scientists from Facebook show how they can use editing within feeds to affect the emotions of Facebook's users. Facebook data scientists demonstrated that when they prioritized happy content on a user's feed, that user was more likely to post happy content back to the network. The same went for disgruntled content.

These posts aren't necessarily related to each other. They're just content generated by a user's friends, liked pages, etc. But just like Kuleshov did for film, Facebook can do for their network -- they can stitch images and text together into a stream that meets their needs, one that conveys a concept perhaps not present in the sum total of all the social content lying on the cutting room floor.

Now, people have a problem with this. The whole experiment feels like a violation. Facebook emotionally manipulated people. And to add insult to injury, Facebook used user-generated, supposedly perspective-free content (at least free of Facebook's perspective).

The counterarguments I keep hearing are these:
  1. Facebook's TOS covers such experiments
  2. Facebook was already using and continues to use algorithms to edit your stream. This is nothing new
  3. Facebook didn't create content to manipulate people. They used existing content
Points (1) and (2) demonstrate extreme naiveté on the part of Facebook. 

What is allowable in data science is only partially governed by TOSs and precedent. There's an inherent creepiness to data science, so it's important that a company always ask itself, "What do our users expect from us?" Not "what is legal?" or "can we point to what we've already been doing that's similar?"

Actually, Facebook may not be naive. They may just not care. After all, their customers whom they're trying to impress and engage with using data science are their advertisers, not their users.

Counterargument (3) is where the Kuleshov Effect comes into play. Editing is powerful. If you're stewing up a pot of social slop, then you have power over the final product. A stream is nothing more than a montage of the social content that constitutes its ingredients. And in the creation of that stream, Facebook wields immense power even though they create none of the stream's content.

Regardless of where you fall in the debate over whether this was an appropriate experiment, its results lead to a more haunting realization. Before we get to it, let's talk about the "content problem" present in data-driven targeted marketing.

Big data has a content problem

A lot of digital marketing tools are coming out these days that promise hyper-specific tracking and data collection of leads, customers, users, etc. for the purposes of surgically targeted marketing.

There's only one problem: the only reason to target someone at a personal level is if you've got personalized marketing content to show that person. 

Understanding a person intimately and being able to target them is nothing without something to say.

And most companies don't have anything to say. Getting a marketer to finish one monolithic piece of creative is hard enough. Imagine needing personalized content for everybody! 

So shortcuts are taken ("just write 'customers like you also bought this' and then use data science to pull some product suggestions") to produce "relevantized" generic content.

No matter how sophisticated data-driven targeting products get, there will always be a content gap.

But Facebook may have found a shortcut. And this is where things get depressing.

Data science: a sheep dog, corralling people toward content

If I have a bunch of unique people, and I need to target them, I need a bunch of unique content to make that effective. Is there another way?

Rather than tailor marketing content to a user's unique emotional make-up, Facebook has shown that they can use tangentially related (and free!) user-generated content to push a user toward marketing content generated for a more general emotional state: insecure, hungry, lonely, etc. They can edit together photos and posts in a stream to skew a user's view of reality and shift them into one of these compromised emotional states.

In other words, if they can't use data to generate enough personalized content to target people, maybe they can use data to generate vanilla people within a smaller set of emotional states. Once you have a set of vanilla people, then your American Apparel ads will work on them without customization.

As Greg McNeal put it:

"What harm might flow from manipulating user timelines to create emotions?  Well, consider the controversial study published last year (not by Facebook researchers) that said companies should tailor their marketing to women based on how they felt about their appearance.  That marketing study began by examining the days and times when women felt the worst about themselves, finding that women felt most vulnerable on Mondays and felt the best about themselves on Thursdays.

The marketing study suggested companies should “[c]oncentrate  media during prime vulnerability moments, aligning with content involving tips and tricks, instant beauty rescues, dressing for the success, getting organized for the week and empowering stories… Concentrate media during her most beautiful moments, aligning with content involving weekend guides, weekend style, beauty tips for social activities and positive stories.”  The Facebook study, combined with last year’s marketing study suggests that marketers may not need to wait until Mondays or Thursdays to have an emotional impact, instead  social media companies may be able to manipulate timelines and news feeds to create emotionally fueled marketing opportunities."

This is part of the dehumanizing effect of AI and big data I wrote about a while ago.  Rather than data being used to make computers more intelligent, data is being used to make humans more predictable (read: more stupid, unhappy, and willing to buy something to palliate their discontent).

Yann LeCun, who runs Facebook's AI lab, said I'm utterly wrong on this point. In his response to my last post, he contends:

"The promise of ML (and AI) is, on the contrary, to let humans be more human, to free their minds from having to reason like machines, and to let them concentrate on things that are uniquely human, like communicating with other people."

In this particular study in PNAS, we can see that the promise of data modeling at Facebook is not to "let humans be more human." It's not to "free their minds."

All of that machine reasoning isn't trying to make us more human so much as it is trying to make us more sad and predictable. And just wait until deep learning applied to image recognition can recognize and stream my selfie at Krispy Kreme next to a tagged photo of me and my love handles at the beach. Data-driven inferiority complexes for all!

The promise of data modeling at Facebook is to place us in chains made from the juxtaposition of our own content. We'll be driven into pens made of a few profitable emotional states where marketing content waits for us like a cattle gun to the skull.

That said, where else am I going to share photos of my kids with old friends? Can't do that on Twitter...I only use Twitter to express faux indignation and convenient morality concerning trending causes. Looks like I'm stuck with Zuck.
<![CDATA[Data Science Hearts User Experience]]>
Fri, 23 May 2014 15:00:27 GMT
http://www.john-foreman.com/blog/data-science-hearts-user-experience

I feel bad for data scientists that have gotten stuck on ad targeting.

When your job is to increase engagement on something you've shoved in someone's stream or in someone's search results, then you're effectively pissing against the UX wind. That can't feel good. Your job is antagonistic to your users' goals (because your users aren't your customers).

Furthermore, what happens when a graphic designer comes in, removes the beige background from an ad, and produces more ad revenue than your deep learning models ever could? 
What's the end game for better ad targeting? Is it the deepest of all deep learning? Hmmm. Maybe it's just making ads blend in visually with real content and duping your users. That's why I feel bad. Who wants to learn a bunch of math only to realize your job is done better by simply removing a background color?

And this is why I love my current job. My team's goals are lined up directly with the user's. The company I'm at makes its money off of monthly subscriptions, and my job is to make the product better using data so that folks want to join and stay. That's much more rewarding.

Now, am I building high end AI models? Occasionally I get to build an AI model I'm proud of. But predictive modeling for its own sake is not the goal. 

No, the goal is a better UX. And that means that I can use data in small ways too. I'll give an example. 

Users entering customer support would constantly complain to us about reCAPTCHA. Just look at this screencap. Yikes.
I fail these humanity tests every other time. And we tried subbing out reCAPTCHA for other more game-ified humanity tests. But those didn't work out.

We realized that our own internal anti-abuse models could be turned on this problem. Now we're able to validate the humanity of a vast number of customers who come to us -- and we're able to just hide reCAPTCHA altogether. Better living through data science!

This project didn't take long, and it's certainly not going to grab headlines. But we didn't do it to increase a KPI, impress the tech blogs, or justify our graduate degrees. The team did it for our users, and that feels good.

<![CDATA[The Forgotten Job of a Data Scientist: Editing]]>
Thu, 08 May 2014 17:46:08 GMT
http://www.john-foreman.com/blog/the-forgotten-job-of-a-data-scientist-editing

I have made this [letter] longer than usual, because I have not had time to make it shorter. –Blaise Pascal

Within the arts, there has always been a tension between ornamentation and simplicity. The good artist is one who can execute a technique exceptionally, but the great artist is one who knows when to hold back.

I myself love a good grilled London Broil. But whenever I make it myself, it tastes like pencil erasers. I learned that I was adding way too much oregano to the marinade, and it was overpowering everything. If a little is good, how can more not be good-er?

It’s like Reinhardt said, “The more stuff in it, the busier the work of art, the worse it is. More is less. Less is more.”

Data science is a young occupation that could stand to learn from these older pursuits. Whether it’s writing, cooking, or painting, editing is a core component of becoming a master of the discipline. Knowing when to hold back.

The same is true in analytics. Oftentimes, a data scientist can build a better model, a more complex model, a more accurate model. But that doesn’t mean they should.
Reinhardt's crazy minimalist painting consisting of black squares.
An article this week proclaimed, much to the data science community’s chagrin, that “most of a data scientist’s time is spent creating predictive models.” Forget about cleaning data, doing historical analyses that go into basic reports, etc. Apparently, the core job is predictive modeling. I fear for the company that hires any data scientist who believes that. Not only because they’re not going to get anything of practical worth done, but also because a data scientist with that type of mindset would never ask one of the most important predictive modeling questions of all:

Do I really need to build this model? Can I do something simpler?

If your job is building models, all you do is try to build models. A data scientist’s job should be to assist the business using data regardless of whether that’s through predictive modeling, simulation, optimization modeling, data mining, visualization, or summary reporting.

In a business, predictive modeling and accuracy are means, not ends. What’s better: A simple model that’s used, updated, and kept running? Or a complex model that works when you babysit it but the moment you move on to another problem no one knows what the hell it’s doing?

Robert Holte argued simplicity versus accuracy in his rather interesting and thoroughly-titled paper “Very Simple Classification Rules Perform Well on Most Commonly Used Datasets.” In that paper, the *air quotes* AI model he used was a single decision stump – one rule. In code, it’d basically be an IF statement. He simply looked through a training dataset and found the single best feature for splitting the data in reference to the response variable and used that to form his rule. The scandalous part of Holte’s approach: his results were pretty good!
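For the curious, here's roughly what a one-rule learner of this kind looks like in Python. Treat it as a minimal sketch of the idea for categorical features only, not a faithful reproduction of Holte's procedure (his paper also deals with numeric features and missing values). The function names and the toy weather data are made up for illustration.

```python
from collections import Counter, defaultdict

def one_rule(rows, labels):
    """Pick the single feature whose value -> majority-label rule makes the
    fewest mistakes on the training data.

    Returns (feature_index, rule, default_label, training_errors)."""
    default = Counter(labels).most_common(1)[0][0]  # fallback for unseen values
    best = None
    for j in range(len(rows[0])):
        # Count how often each label occurs for each value of feature j.
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[j]][label] += 1
        # The rule predicts the most common label for each value.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rule[row[j]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[2]:
            best = (j, rule, errors)
    feature, rule, errors = best
    return feature, rule, default, errors

def predict(model, row):
    """Apply the learned rule, which amounts to one IF statement (a lookup)."""
    feature, rule, default, _ = model
    return rule.get(row[feature], default)

# Toy, made-up training data: (outlook, temperature) -> play outside?
rows = [
    ("sunny", "hot"), ("sunny", "mild"), ("rainy", "hot"),
    ("rainy", "mild"), ("sunny", "hot"), ("rainy", "mild"),
]
labels = ["no", "no", "yes", "yes", "no", "yes"]

model = one_rule(rows, labels)
print(model)
print(predict(model, ("rainy", "hot")))
```

On this toy data the first feature separates the labels perfectly, so the learned "model" is nothing more than a lookup on one column.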

And so, Holte raises this point, “simple-rule learning systems are often a viable alternative to systems that learn more complex rules. If a complex rule is induced, its additional complexity must be justified…”

Complexity must be justified. Stop and think about that a moment.

If I’m working on a high frequency trading model, additional complexity in an AI model might mean millions of dollars in revenue. And I’ve likely been hired to gold-plate an AI model where there is no other end user beyond myself and a few other PhDs.

But I don’t work in high frequency trading. I work at a much more fun company. That means that while some of my models are mission-critical, i.e. revenue is on the line and every bit of accuracy helps, other models are less important. 

Recently I needed to build a model that classified users into three groups: paid, likely-to-pay-in-the-future, and will-never-pay-us-a-dime. I could have built a highly complex model. There were many weak predictors at my disposal (“have you generated an API key in the first 30 days?”) that would have worked great in an ensemble model. But I stopped myself.

This was a model that no one would have the time to babysit. And our business wasn’t on the line. It just needed to be something that worked better than checking “free/paid” within an account.

Holte argues in his paper that datasets used in predictive modeling often have relatively few features that really stand out as predictors. When I investigated my data, this was the case. There was one feature that was highly predictive of likely-to-pay-in-the-future. And there was a second feature that was almost as powerful. Beyond those two, the other features boosted accuracy only marginally.

So rather than build a full-fledged AI model, I handed an engineer an IF statement: “Check this and this in a user’s account. If those two things are true, they’re likely to pay us in the future.”
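
In code, a rule like that might look something like the sketch below. The field names is_paid, signal_a, and signal_b are placeholders I've made up here; they stand in for the paid flag and the two strong predictors, not the actual account attributes.

```python
def classify(account):
    """Three-way classification from one paid flag and a two-condition rule.

    All field names are hypothetical placeholders.
    """
    if account.get("is_paid"):
        return "paid"
    # The entire "model": both strong signals must be true.
    if account.get("signal_a") and account.get("signal_b"):
        return "likely-to-pay-in-the-future"
    return "will-never-pay-us-a-dime"


# Made-up accounts for illustration.
accounts = [
    {"id": 1, "is_paid": True, "signal_a": False, "signal_b": False},
    {"id": 2, "is_paid": False, "signal_a": True, "signal_b": True},
    {"id": 3, "is_paid": False, "signal_a": True, "signal_b": False},
]
results = {a["id"]: classify(a) for a in accounts}
print(results)
```

Anyone on the engineering team can read this, change it, and keep it running without a data scientist in the room.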

Now, I still let the business know the false positive rate (FPR) and true positive rate (TPR) of this simple model. You’ve gotta let people know what they’re sacrificing by going with a simple approach. Because whether or not you increase complexity for additional accuracy is not a data science decision. It’s a business decision.
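
For reference, the two rates are TPR = TP / (TP + FN) and FPR = FP / (FP + TN). Here's a small Python sketch of computing them from labeled outcomes. The function name and the toy labels are made up for illustration and are not real results from the model described above.

```python
def tpr_fpr(actual, predicted):
    """Compute (TPR, FPR) from parallel lists of booleans.

    Assumes `actual` contains at least one True and one False;
    otherwise one of the denominators would be zero.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # true positives
    fn = sum(1 for a, p in pairs if a and not p)      # false negatives
    fp = sum(1 for a, p in pairs if not a and p)      # false positives
    tn = sum(1 for a, p in pairs if not a and not p)  # true negatives
    return tp / (tp + fn), fp / (fp + tn)


# Toy data: did the user eventually pay, and did the rule say they would?
actual = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False]

tpr, fpr = tpr_fpr(actual, predicted)
print(f"TPR={tpr:.2f} FPR={fpr:.2f}")
```

Two numbers like these are usually enough for the business to weigh a simple rule against a more complex model.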

So keep in mind that as a data scientist you’re a technical person, but there’s also a bit of artistry in your job. You are an editor subject to the needs of the business.
<![CDATA[A Live Introduction to Data Science: Naive Bayes and Holt-Winters Forecasting]]>Wed, 12 Mar 2014 19:52:35 GMThttp://www.john-foreman.com/blog/a-live-introduction-to-data-science-naive-bayes-and-holt-winters-forecastingA month ago I spoke at Strata Conf in Santa Clara. Unlike my usual talks, this one was a 3 hour tutorial taken from my book, Data Smart. Specifically, I spent 3 hours doing chapter 3 and an abbreviated chapter 8. On the whole, I think it was a useful session, so I've posted it online so anybody can follow along.

Chapter 3 is an introduction to supervised machine learning via naive Bayes. And Chapter 8 is demand forecasting using triple exponential smoothing (Holt-Winters).

The first 5 minutes of the first video have garbled audio, but the tech fixes it before too long. My apologies.

To follow along, download the spreadsheets for chapters 3 and 8 from the downloads section of the book's website:

You'll need to unzip these spreadsheets and clear some data out of them per my instructions in the talk. Also, you'll need access to spreadsheet software.

You're doing great, hang in there! Part 2:
Can you feel the knowledge washing over you? Part 3:
One lap to go!!! Part 4:
<![CDATA[The $30/hr Data Scientist]]>Thu, 06 Mar 2014 21:45:47 GMThttp://www.john-foreman.com/blog/the-30hr-data-scientistYesterday a journalist asked me to comment on Vincent Granville's post about the $30/hr data scientist for hire on Elance. What started as a quick reply in an email, spiraled a bit, so I figured I'd post the entire reply here to get your thoughts in the comments.

When we ask the question, "Can someone do what a data scientist does for $30/hr?" we first need to answer the question, "What does a data scientist do?" And there are a multitude of answers to that question. 

If by data scientist, we mean "a person who can perform a data summary, aggregation, or modeling task that has been well-defined for them in advance," then it is by no means a surprise that there are folks who can do this at a $30/hr price point. Indeed, there'll probably come a day when that task can be completed for free by software without the freelancer. This is similar to the evolution of web development freelancing.

The key phrase though is "task that has been well-defined."

The types of data scientists who command large salaries seem to fit one of two definitions, both very different from what a freelancer at $30/hr can meet:

1) There's the highly-technical engineer. Someone who is knowledgeable and skilled enough to select the correct tools and infrastructure in the polluted big-data landscape to solve a specific, highly-technical data problem. Often these folks are working on problems that haven't been solved before or, if they have, there are only a few poorly documented examples. Because these tasks might not even be solvable, they're certainly not "well-defined." A business wouldn't trust important bits of infrastructure to $30/hr.

2) There's the data scientist as communicator/translator. This person is someone who knows data science techniques intimately but whose strength is actually in the nontechnical -- this person thrives on taking an ambiguous business situation and distilling it into a data science solution. Often managers and executives don't know what's possible. They know what problems they have, but they don't know how or even if data science can solve those problems. These folks can't hire someone halfway across the globe at $30/hr to figure that out for them. No, they need someone who's deeply technical but also deeply personable in the office to talk things through with them and guide them.

All of the hype around data science is generating a lot of these articles about automating or replacing the role. But I think it's important to realize that just like "doctor," "lawyer," "consultant," "developer," etc., the "data scientist" is more of a spectrum or category than a single role.

A data scientist is not someone putting doors on an automobile in a factory. Some of them might be doing just that, i.e. rote modeling tasks. But not all of them. I believe that MOOCs will excel at training up an army of these lower-paid data scientists. And that's great. They'll fill a need. Kinda like the need in the 90s for people with basic CompTIA certifications and the most basic of Cisco certs.

However, there will always be a place for those who excel at solving ambiguous technological & business problems. And they'll cost more than $30/hr.