Small Data: or, Decisions Under Massive Uncertainty
"Whenever I see a forecast written out to two decimal places, I cannot help but wonder if there is a misunderstanding of the limitations of the data, and an illusion of precision." - Barry Ritholtz
Big data is easy.
Now, that’s not to say that there aren’t significant technical challenges to leveraging big data. Processing files too big for one computer’s memory and storing data in fault-tolerant yet efficient ways are hard problems that engineers are paid large amounts of money to solve. However, large-scale enterprise solutions have been available for each of them for the better part of ten years. Applying these solutions is as easy as paying a competent coder to clean the data and try a series of models until one appears to work.
On the other hand, decision-making when faced with almost no data is still fundamentally a qualitative task. When you are considering whether to invest in a small company a few days out from its IPO, putting money on the result of the U.S. presidential election, or pricing a small company’s new product, large quantities of data are not available to learn from. In these situations, there is still no substitute for expertise and human intuition.
At Invrea, we refer to this second group of tasks as small data tasks. In this post, I aim to illuminate the fundamental differences between small data and big data problems, and give some insight on why the solutions to each problem must look and feel completely different.
On Certainty and the McFlurry
Ask yourself this fundamental question: how much data, and how much accuracy, do you need to make a decision?
If a complex model tells you that there is a 99% probability of profit following your plan, do you recommend that your client follow your plan? If a weather forecaster tells you there is a 10% chance of precipitation, do you reconsider going to the beach? If an advisor tells you that there is a 75% chance an individual will pay back their loan in full, do you allow them to take out a loan?
This is not a question that has a straightforward, general-purpose answer. Your decision on whether to proceed in any given case will depend on the probability of success as well as the benefits of success, the consequences of failure, reputation risk, and dozens of other qualitative factors that cannot in general be enumerated. Nevertheless, working with uncertainty means confronting this question head-on every time a major decision is made.
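One way to make this trade-off concrete is an expected-value calculation. The sketch below is purely illustrative: the function names and the loan figures (a gain of 100 if the loan is repaid, a loss of 1,000 if it is not) are assumptions invented for this example, and it deliberately ignores reputation risk and the other qualitative factors.

```python
def expected_value(p_success, gain, loss):
    """Expected payoff of a decision that wins `gain` with probability
    `p_success` and otherwise loses `loss` (both given as positive numbers)."""
    return p_success * gain - (1.0 - p_success) * loss


def break_even_probability(gain, loss):
    """Success probability at which the expected payoff is exactly zero."""
    return loss / (gain + loss)


# Hypothetical loan: earn 100 in interest if repaid, lose 1,000 if not.
ev = expected_value(0.75, gain=100, loss=1000)
threshold = break_even_probability(gain=100, loss=1000)
print("expected value at a 75% chance of repayment:", ev)
print("break-even probability:", round(threshold, 2))
```

Under these made-up stakes, a 75% chance of repayment has a negative expected payoff, and the lender would need roughly a 91% chance just to break even. With different stakes the same 75% could be attractive, which is why the probability alone never settles the decision.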
This issue of certainty is the primary axis that separates big data problems from small data problems. To generalize, solutions to big data problems are so accurate that they are considered to be certain. When Wal-Mart assesses the average age of its customers, it doesn’t have to worry about sample sizes or biases; it can simply do the math. When McDonald’s wants to understand the demographic that buys a McFlurry, it can run algorithms that segment its customer base into distinct groups, and report the results.
To ask for answers without big data at your disposal, you must accept that the answers you receive will be uncertain. Ask yourself: how do you think McDonald’s planned their budgets in advance of releasing the McFlurry? To plan a budget requires projecting sales, yet for a product that has not been released, no sales data is available. Once again, a combination of qualitative and quantitative reasoning is required; perhaps similar products have been released before, perhaps a trial run was performed in a small city somewhere in Ohio, perhaps running slightly under budget for a few months is allowed, perhaps it would be a disaster.
The point is, big data problems can be straightforwardly solved with a computer. Certain answers can be produced and used in decisions. Small data problems inherently generate uncertain answers; using these to make decisions requires a rare combination of intuition, statistics, and business experience.
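To see why the same observed rate is nearly certain at one scale and very uncertain at another, here is a minimal sketch using a Beta-Binomial model with a uniform prior. The counts (12 buyers out of 40 in a small trial, versus 12,000 out of 40,000) are invented for illustration, the function name is hypothetical, and the interval is approximated by simulation rather than computed exactly.

```python
import random


def posterior_interval(successes, trials, draws=20000, seed=0):
    """Approximate 90% credible interval for an unknown rate, assuming a
    uniform Beta(1, 1) prior and binomial data, via Monte Carlo sampling."""
    rng = random.Random(seed)
    a = 1 + successes             # posterior Beta shape parameters
    b = 1 + (trials - successes)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    low = samples[int(0.05 * draws)]
    high = samples[int(0.95 * draws) - 1]
    return low, high


# Same observed rate (30%), very different amounts of data.
small = posterior_interval(12, 40)
big = posterior_interval(12000, 40000)
print("small trial interval:", small)
print("large dataset interval:", big)
```

Both datasets point to a purchase rate of about 30%, but the interval from the small trial is many times wider. With the large dataset the rate can be treated as known; with the small one, the width of the interval is itself part of the answer and has to be carried into the decision.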
On Data, Algorithms, and Factories
Imagine big data algorithms as a factory. This factory takes in a gigantic volume of disparate kinds of materials, and uses these to produce one very specific kind of output. This factory’s floor is completely computerized, but its strategic decisions are made by a committee of humans sitting on the top floor. This factory’s quality is virtually guaranteed – regular examinations are conducted to ensure that manufacturing defects are reduced to an absolute minimum.
Big data algorithms are factories: they take in a huge volume of different types of data and output simple, human-readable insights whose quality is assured, and they are planned and managed by a committee of data scientists. Statistical models are the blueprints used to build these factories; Hadoop and Apache Spark are the construction companies and builders. To be a data scientist is to oversee the building of these factories, and to tweak them continually until quality can be assured.
Like these factories, big data algorithms should feel vast, impersonal, and complex to build and maintain. They consist of floors and floors of interlocking parts, machines operating independently yet controlled by a central intelligence, but through it all somehow guaranteeing near-perfect results. However, the algorithms are fundamentally simple: anyone can look at the data coming in, and see the conclusions rolling out. Their designs are complex, but their purposes are straightforward: manufacture a very specific type of conclusion from data.
On the other hand, small data algorithms are more bespoke. View them as one-floor manufacturing plants manned entirely by humans. These humans can process fewer raw materials, but the kinds of output that they construct can be far more complex and much more tailored to the type of input received. And to make decisions under uncertainty, these complex products are exactly what you need.
A group of twenty skilled humans can’t build a million Volkswagens in a year – leave that to the robots. However, they can spend a year building the fastest Formula One car in the world. Similarly, a big data algorithm is best for making a straightforward decision using twenty terabytes of data, while a small data algorithm is best for making a complex decision using twenty kilobytes of data.
Where does Scenarios fit in?
Invrea Scenarios is the world's only Excel plugin that can solve small-data problems. In terms of the above analogy, Scenarios is the small factory of skilled humans, and it puts the user in charge of the plant. The user produces designs - statistical models - in the form of Excel spreadsheets, which are implemented by the factory. Unlike any other product, Scenarios does not just use those designs to generate results - it learns better designs from the data itself, and uses those 'bespoke' designs to generate results that will help the user make better decisions. For a brief explanation of the methods used to achieve this, see our earlier blog post on Bayes' theorem here.
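As a generic illustration of what learning from data with Bayes' theorem can look like, and not a description of how Scenarios works internally, the sketch below weighs two candidate models against a small dataset. The model names, the rates, the prior weights, and the observed counts are all hypothetical.

```python
from math import comb


def binomial_likelihood(rate, successes, trials):
    """Probability of seeing `successes` out of `trials` at a given rate."""
    return comb(trials, successes) * rate**successes * (1 - rate) ** (trials - successes)


def posterior_over_models(priors, rates, successes, trials):
    """Bayes' theorem over a finite set of candidate models:
    posterior is proportional to prior times likelihood, then normalized."""
    unnormalized = {
        name: priors[name] * binomial_likelihood(rates[name], successes, trials)
        for name in priors
    }
    total = sum(unnormalized.values())
    return {name: weight / total for name, weight in unnormalized.items()}


# Two hypothetical designs for a product's purchase rate, equally likely a priori.
priors = {"optimistic": 0.5, "pessimistic": 0.5}
rates = {"optimistic": 0.4, "pessimistic": 0.2}

# Hypothetical small dataset: 12 purchases observed among 40 customers.
posterior = posterior_over_models(priors, rates, successes=12, trials=40)
print(posterior)
```

With only forty observations, neither model is ruled out; the data merely shifts some weight toward the one that fits better. That residual uncertainty is what a small data method has to report instead of a single confident answer.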
Scenarios does not tell the user which decision to take (although it can be made to). At Invrea, our business is not replacing or invalidating intuition and experience; our business is augmenting domain expertise with the tools of probabilistic programming, machine learning, and artificial intelligence. If you think our description of small data problems matches a problem you need solved, request to download the new release of Scenarios here.
Invrea Scenarios helps a lot with making these kinds of predictions, but it doesn't stop there. The plugin can model uncertainty and make predictions given our assumptions and new data for business decisions, insurance claims, consulting cases, etc. If you can model your decision as relationships between cells in an Excel spreadsheet, then it's quite likely that Scenarios can help. The team at Invrea is dedicated to opening this kind of machine learning to every industry possible. If you would like more information, a more detailed demo, or some help setting up a worksheet of your own, we'd love to lend a hand. You can find us at this email.
The alpha version of Invrea Scenarios is free. You can request a download link here.