While having enough data is always good, reaching for more and more of it might not be better for your company. A couple of Andreessen Horowitz researchers question the benefits of endlessly accumulating data.
Here is an excerpt from an interesting blog post on boingboing.net:
In a smart, eye-opening essay, Martin Casado and Peter Lauten from the VC firm Andreessen Horowitz dismantle the idea that data benefits from “network effects” and that it presents any kind of “moat” to protect businesses: instead, the VCs demonstrate how collecting data gets more expensive, and less useful, over time.
To understand why, think of Netflix’s data-collection, performed in service to its famous recommendation engine, which suggests programs you might enjoy based on the preferences of people who are similar to you. When Netflix is starting out, it needs to develop a “minimum viable corpus” in order to produce recommendations, but once that data is in place, new data produces diminishing returns in recommendations. Going from 100 to 1,000,000 users allows Netflix to dramatically improve its recommendations, but going from 1,000,000 to 1,000,100 (or even 2,000,000) produces very little new benefit.
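To make the diminishing-returns point concrete, here is a small Python sketch. It is only an illustration: the saturating curve `n / (n + half_point)` and the `half_point` value of 10,000 users are assumptions chosen for the example, not anything measured by Netflix or taken from the essay.

```python
def quality(n_users: int, half_point: int = 10_000) -> float:
    """Hypothetical recommendation quality between 0 and 1.

    Assumes a simple saturating curve: quality reaches 0.5 at
    `half_point` users and flattens out after that.
    """
    return n_users / (n_users + half_point)


def gain(before: int, after: int) -> float:
    """Quality gained by growing the user base from `before` to `after`."""
    return quality(after) - quality(before)


if __name__ == "__main__":
    # The three growth steps mentioned in the excerpt.
    steps = [
        (100, 1_000_000),
        (1_000_000, 1_000_100),
        (1_000_000, 2_000_000),
    ]
    for before, after in steps:
        print(f"{before:>9,} -> {after:>9,} users: gain = {gain(before, after):.8f}")
```

Under these assumptions the first step accounts for nearly all of the achievable quality, while doubling from one million to two million users adds only a fraction of a percent.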
Meanwhile, adding in all that data is expensive: first, because once everyone who already understands why they might subscribe to Netflix is a customer, Netflix has a much harder job of convincing the remaining population that it’s worth their while to join up (their “cost of user-acquisition” goes up). Second, the computational costs of incorporating new data into a prediction model don’t necessarily go down with volume, so the cost of recomputing the model when you add your 1,000,001st user isn’t necessarily cheaper than when you add your 101st user (it might even be more expensive), and since the new user adds less value to the model than previous users did, the real costs of new users’ data (relative to the benefits) are constantly going up. Add to that the other costs associated with new data: accurate labeling, more noise to lose the signal in, higher security costs…
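The rising cost relative to benefit can be sketched the same way. This example assumes the same hypothetical saturating quality curve and a flat cost of 1.0 per added user; both are illustrative assumptions. The excerpt suggests real costs may even increase, which would only make the trend steeper.

```python
def quality(n_users: int, half_point: int = 10_000) -> float:
    """Hypothetical recommendation quality between 0 and 1 (assumed curve)."""
    return n_users / (n_users + half_point)


def cost_per_unit_gain(n_users: int, cost_per_user: float = 1.0) -> float:
    """Cost of adding one more user divided by the quality that user adds.

    `cost_per_user` is an assumed flat figure covering acquisition and
    recomputation; it is not a real number from any company.
    """
    marginal_gain = quality(n_users + 1) - quality(n_users)
    return cost_per_user / marginal_gain


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        ratio = cost_per_unit_gain(n)
        print(f"user #{n + 1:>9,}: cost per unit of quality = {ratio:,.0f}")
```

Even with the per-user cost held constant, the price of each extra unit of model quality climbs steeply as the user base grows, because each new user contributes less than the one before.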
And if that wasn’t bad enough, the advantage your data brings you declines over time as the data gets stale (video tastes, traffic patterns, and other sources of business advantage change with time), and as time goes by, your competitors are able to assemble their own “minimum viable corpuses,” further eroding that advantage.
You can read the full Casado – Lauten article here.