Naturally, there are things in the post I don't agree with. Data writes my paychecks. And then I cash those checks. And spend that cash. My expenditures are paired with ad impression data to create a formula that feeds little baby data scientists. It's a circle of data. And it moves us all.
I'm going to jump around in Karpinnen's post, responding to the points I care about.
In the blog post, the author seems to qualify "old data" as something that's not happening right now; for example, data that's a year old.
This doesn't match my experience. Many of the products I've created at MailChimp are powered by ostensibly "old" data. For example, our antiabuse model, Omnivore, runs on years' worth of data to evaluate list uploads from users prior to content creation or sending. And while current data is quite valuable for these models, old data is equally valuable. If a user uploads a list of email addresses that we haven't sent to in a year (and we send over 15 billion emails each month), there's a staleness there that's alarming.
Let me give a more recent example than Omnivore: MailChimp Pro. We're releasing it this week. Pro is a collection of data-centric features: things like multivariate testing, user credit scoring, and post-send data mining.
MailChimp Pro was conceived of and designed by analyzing years of account data and survey data. Hell, we even used this historical data (things like exit surveys and on-boarding data) to understand price sensitivity in order to price MailChimp Pro correctly (friends don't let friends cost-based-price).
The default timing and sample size values placed in MailChimp Pro's Multivariate Testing feature came from studying years of A/B tests sent through our system. It's a good thing we didn't throw that data away or start collecting it only when we decided to build multivariate. Likewise, our Pro Compliance Insight tool, which functions much like a credit score, is powered by a huge amount of historical marketing performance data.
"Sure, spotting trends in historical data might be cool, but in all likelihood it isn’t actionable"
I would contend the opposite is true. Spotting trends in historical data is often actionable even though it is not cool. What am I thinking about?
At MailChimp, my team does a lot of forecasting using historical data. And a good forecast that's going to separate out things like seasonality and trend often demands years of historical data. This process of producing demand forecasts rarely feels cool, but the business certainly takes action on it. Investments in people and infrastructure are often driven by forecasts.
And if you get these decisions wrong because you don't collect data and instead operate on your gut, you might end up hiring a bunch of people you don't need.
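To make "separate out things like seasonality and trend" a bit more concrete, here's a minimal sketch of a classical additive decomposition. It's a toy, not any production forecasting code: the function name `decompose_additive`, the quarterly period, and the synthetic numbers are all assumptions made up for illustration. What it does show is why history matters: the function refuses to run on fewer than two full seasonal cycles, because one cycle can't tell you what's seasonal and what's trend.

```python
from statistics import mean


def decompose_additive(series, period):
    """Split a series into trend, seasonal, and residual parts (additive model).

    Hypothetical helper for illustration. Trend values near the edges are
    None because a centered moving average is undefined there.
    """
    n = len(series)
    if n < 2 * period:
        raise ValueError("need at least two full seasonal cycles of history")

    half = period // 2
    trend = [None] * n
    for i in range(half, n - half):
        window = series[i - half : i + half + 1]
        if period % 2 == 0:
            # Even period: centered average with half weight on both ends,
            # so every season in the cycle gets equal total weight.
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
            trend[i] = total / period
        else:
            trend[i] = mean(window)

    # Average the detrended values for each position in the cycle.
    buckets = [[] for _ in range(period)]
    for i, t in enumerate(trend):
        if t is not None:
            buckets[i % period].append(series[i] - t)
    raw = [mean(b) for b in buckets]

    # Center the seasonal effects so they sum to zero over one cycle.
    offset = mean(raw)
    cycle = [r - offset for r in raw]

    seasonal = [cycle[i % period] for i in range(n)]
    residual = [
        None if t is None else series[i] - t - seasonal[i]
        for i, t in enumerate(trend)
    ]
    return trend, seasonal, residual


# Synthetic quarterly data (made up): steady growth plus a repeating pattern.
pattern = [3.0, -1.0, -2.0, 0.0]
history = [10.0 + 2.0 * i + pattern[i % 4] for i in range(12)]
trend, seasonal, residual = decompose_additive(history, period=4)
print("seasonal effects for one cycle:", seasonal[:4])
```

Hand it only a few months of data and there's nothing to decompose. Hand it years and the seasonal pattern falls out from under the growth.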
"You start with the questions you want answered. Then you collect the data you need (and just the data you need) to answer those questions."
There's some truth here. It is always important to lead data science and data product efforts with specific questions and goals. Too many data projects have failed because the only goal was "if you get it in Hadoop, they will come."
At the same time though, you can't wait until you know what you're going to do when it comes to data collection. What if the problem you want to solve requires lots of historical data (many ML models do, including the examples I gave above)?
My team has a data product in alpha that's powered by years and years of list subscription data. And all of it is important for our ML models. We started building this particular model around the new year. What if we'd waited until then to collect list subscription data? We'd have had to wait years to reach the accuracy that we got immediately because we'd already been collecting data.
"When the amount of data gets truly big, so do the problems in managing it all."
Sure, if by "managing," the author means "using in production" where things like speed and availability matter. But collection isn't terribly hard (yay log files!). So go ahead and start collecting those large datasets in all of their terribly-formatted glory, but don't start using that data until you know what you want to do. That way you can be choosy and pull a smaller, easier to use set if you need to.
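How low can the bar for collection be? Here's a minimal sketch of the log-file approach, assuming one JSON object per line appended to a plain text file. The names `log_event` and `read_events`, the `events.log` path, and the record fields are all hypothetical, picked for illustration. Writing is just an append; being choosy happens later, at read time, once you know what question you're asking.

```python
import json
import time


def log_event(path, event_type, payload):
    """Append one event as a single JSON line. No schema, no database."""
    record = {"ts": time.time(), "type": event_type, "payload": payload}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")


def read_events(path, event_type=None):
    """Yield logged events, optionally keeping only one event type."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in a messy log
            record = json.loads(line)
            if event_type is None or record["type"] == event_type:
                yield record


if __name__ == "__main__":
    # Collect now, in whatever shape the data arrives (example values are made up).
    log_event("events.log", "subscribe", {"list_id": 42})
    log_event("events.log", "purchase", {"amount": 19.99})

    # Later, pull only the smaller set you actually need.
    for event in read_events("events.log", event_type="subscribe"):
        print(event["payload"])
```

Nothing here is fast or pretty, and it doesn't have to be. The expensive decisions (indexes, schemas, serving infrastructure) can wait until the data has a job to do.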
"You can’t expect the value of data to just appear out of thin air. Data isn’t fissile material. It doesn’t spontaneously reach critical mass and start producing insights."
The first part of this point is right. The value of data doesn't just appear out of thin air. It's given value via its usage. And that requires intentionality and planning and good ideas (which shouldn't stop you from collecting before that point!).
That said, in my experience, there does exist a point in certain scenarios where data reaches a "critical mass." And that critical mass often has to do with coverage. A gas station may be hard-pressed to predict a whole lot about you based solely on the snacks you buy. They only see a small fraction of your purchases. Especially compared to something like Facebook + Acxiom. Their data set covers so many purchases and ad impressions that they can engage in modeling efforts your local gas station can't touch.
I've seen this at MailChimp. Three years ago, we didn't have enough eCommerce360 data to conduct a whole lot of ecomm modeling. But with the success of our many shopping cart integrations and the growth of our company in general, our accumulated ecommerce data set has reached a critical mass. All of a sudden, value has appeared out of thin air; in other words, I can entertain modeling and product ideas that use the data that I never would have entertained three years ago when user coverage and historical depth hadn't built up yet.
Data is an asset
Like I said, I'm biased, because data is my life and livelihood. But I've seen company after company, my own, other tech companies, Fortune 500s, hell even the government, benefit from collecting and using large, oldish datasets.
If you're on the fence, I'd recommend finding a cheap, lazy way to store data that you suspect might be interesting to your company later. It doesn't have to be stored all structured and beautiful. It doesn't need to be accessed quickly. Hell, pull out that old Jaz drive and shove text files onto it.
But don't let down future-you once your business has changed or grown when, all of a sudden, that dusty data serves a purpose. If you don't start collecting until you know precisely what you want, I hope you're in the time machine business.