Disclaimer: this post is a follow-up to our previous blog post, which discussed the distinction between 'small data' and 'big data' problems, and the importance of that distinction. You can read that post here; I recommend doing so before continuing further.

As our inclusion of the quotes above implies, the team at Invrea believes that amid all the hype about 'big data' tasks and artificial intelligence, a crucial piece of the discussion is often missing. This post attempts to add that piece to the popular perception of statistics and machine learning: a discussion of models themselves, as objects, in addition to the data that they are given. Bringing the nature of models into the discussion will make it clear how Invrea can discuss Excel plugins and artificial intelligence in the same sentence - without changing the subject in between, as unlikely as that may sound.

What is a model?

A model is simply a procedure that takes in some data, and outputs predictions. Notice that this is an extremely broad definition. When an expert looks at data and makes conclusions, such as when a physician studies a brain scan and decides upon a diagnosis, they are running a model. When a credit card company runs a massive computer program to analyze all transactions and flag a small selection as suspicious for later human analysis, that program makes its decisions according to a model. A model can take the form of a program, an intuition, a guess, a procedure, an algorithm, a formula, a hypothesis, a theory, or a law.

In the previous blog post, a division was drawn between models that make predictions based on big data, and models that make predictions based on small data. Data was compared to the raw materials sent into a factory, and predictions to the output of that factory. Models, meanwhile, were visualized as the blueprints used to draw up these factories - the rules that turn the raw data into the finished predictions. Every model that is applied in practice exists somewhere on the spectrum between small data and big data, but this is not the full story.

In order to solve real-world problems, it is also necessary to classify models according to their complexity. Complex models require more mathematical skill to dream up, and more programming skill to apply in practice. Because of this, in each situation we prefer to use the simplest model that will suffice, that is, the simplest one that gives us the accuracy we need. Nevertheless, real-world problems often require extremely complex models.

In the factory analogy above, a simple model would correspond to a factory whose inputs look a lot like its outputs, and where the internal processing within the factory is kept to a minimum. For example, sawmills are relatively simple. In brief, logs are debarked, sorted, sawed, trimmed, dried, and then smoothed. While technology has allowed production to ramp up massively in terms of scale and speed, the basic outline of the sawmill hasn’t changed since the invention of the circular saw in the late 1700s.

Now compare the humble sawmill to a plant that fabricates semiconductor devices for use in computer chips. This plant takes thousands of disparate chemicals as raw materials, runs them through a number of deposition, purification, and extrusion steps, and in the end creates a product that looks nothing like any of its raw materials. Precision down to the molecular level is required, errant vibrations or flecks of dust can ruin a circuit beyond repair, and billions of dollars are invested in each plant. The semiconductor fabrication plant is much more analogous to a complex model.

How are data and model complexity related?

We had previously categorized models into those that make use of big data and small data; now we categorize models into simple and complex as well. All models exist somewhere within this two-dimensional spectrum, giving us the following rough map:

As you can see, we’ve now classified models into four distinct groups. As the purpose of a model is to make predictions and to solve a problem, we can discuss these groups of models in terms of the problems that they are each intended to solve.

The bottom left group - refer to it as Group One - includes problems that require very little data and a very simple model. These problems are simple to describe and understand, require very little mathematical background, and can often even be solved by hand, without using a computer. An archetypal example might be as follows: you want to determine the average height of men in America, and are given thirty samples. If these samples are independent and precision isn't critical, you can just compute the average of these thirty samples and stop there. For a more formal label, we can refer to these problems as classical statistics.
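To make the Group One example concrete, here is a minimal sketch in Python of averaging thirty samples. The heights are invented purely for illustration, and the function name is our own. The standard error is included as a rough indication of how precise the average is, under the same assumption that the samples are independent.

```python
import math
import statistics


def mean_with_standard_error(samples):
    """Return the sample mean and the standard error of that mean."""
    n = len(samples)
    mean = statistics.fmean(samples)
    # Sample standard deviation (divides by n - 1).
    std_dev = statistics.stdev(samples)
    return mean, std_dev / math.sqrt(n)


# Thirty hypothetical heights in centimetres, invented for illustration only.
heights_cm = [170.0 + (i % 5) for i in range(30)]

mean_height, standard_error = mean_with_standard_error(heights_cm)
print(f"mean = {mean_height:.1f} cm, standard error = {standard_error:.2f} cm")
```

Nothing here needs more than a pocket calculator, which is what makes this a Group One problem.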

Next we move to Group Two, the bottom right group. These problems make use of gigantic amounts of data, but perform relatively simple tasks with them. They are what is traditionally referred to as 'big data' tasks. These problems can be understood with very little mathematical background, but solving them probably requires the use of large networks of computers. An example might be computing the average of trillions of numbers, or doing linear regression from one trillion input data points to one billion output data points.
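The averaging example shows why Group Two problems are simple in principle but demanding in practice. A minimal sketch, assuming the data arrives in chunks that could each live on a different machine: every chunk is reduced to a sum and a count, and only those small summaries need to be combined. The function names and the tiny chunks below are hypothetical stand-ins for a real cluster and trillions of numbers.

```python
def summarize_chunk(chunk):
    """Reduce one chunk to (sum, count); each machine could do this alone."""
    return sum(chunk), len(chunk)


def combine_summaries(summaries):
    """Merge per-chunk (sum, count) pairs into one overall mean."""
    total = 0.0
    count = 0
    for chunk_sum, chunk_count in summaries:
        total += chunk_sum
        count += chunk_count
    if count == 0:
        raise ValueError("cannot average zero data points")
    return total / count


# Tiny hypothetical chunks standing in for data spread across many machines.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

overall_mean = combine_summaries(summarize_chunk(c) for c in chunks)
print(overall_mean)
```

The mathematics is the same as in Group One; the difficulty lies in moving and coordinating the data.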

The penultimate group is Group Three, on the top left. These are tasks that require heavy mathematics expertise, but for which not much data is available. At Invrea, we call these 'small data' tasks; while the classical statistics problems discussed above also use little data, we focus on Group Three because Group One problems are largely already solved. These problems were discussed in more detail in the previous blog post, which you can find here. For an archetypal Group Three problem, imagine commissioning a study to determine the relationship between dosage of a drug and the likelihoods of various side effects. The nature and degree of these relationships are unknown to you, and the number of individuals is small. At Invrea, solving these types of problems is currently our primary interest.
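As a rough illustration of why small data calls for more careful modelling, here is a minimal sketch of one deliberately simple approach to the drug study: treating each dosage group separately and putting a uniform Beta(1, 1) prior on the probability of a side effect. This is our assumption for the sake of the example, not a description of how Invrea Scenarios works, and the trial numbers are invented. A realistic Group Three model would also link the dosage groups to one another.

```python
import math


def side_effect_posterior(events, patients, prior_a=1.0, prior_b=1.0):
    """Posterior mean and standard deviation of a side-effect probability.

    Assumes a Beta(prior_a, prior_b) prior and a binomial likelihood,
    so the posterior is Beta(prior_a + events, prior_b + patients - events).
    """
    a = prior_a + events
    b = prior_b + (patients - events)
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(variance)


# Hypothetical trial: dose in mg -> (patients with side effect, patients).
trial = {10: (1, 8), 20: (2, 8), 40: (5, 8)}

for dose_mg, (events, patients) in trial.items():
    mean, std_dev = side_effect_posterior(events, patients)
    print(f"{dose_mg} mg: estimated rate {mean:.2f} +/- {std_dev:.2f}")
```

With only eight patients per group, the uncertainty on each estimate stays large, which is exactly the situation in which the choice of model matters most.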

What does this have to do with Artificial Intelligence?

The human brain receives - according to some estimates - tens or hundreds of megabytes of data per second, and probably stores at least dozens of petabytes of information (a petabyte is one million gigabytes). The brain is constantly analyzing this data and using it to make complex decisions and predictions, from determining the geometry of objects surrounding it, to playing chess, to composing songs. The human brain copes with a veritable firehose of data and runs models more complex than any others we have seen in the Universe - this places it firmly in Group Four, the top right group, where enormous amounts of data meet highly complex models. If a model could be built and implemented that replicated these capabilities in hardware or software, it would be reasonable to refer to it as artificial intelligence.

Invrea Scenarios cannot give you that; no one can. Roughly, Scenarios is a tool designed to solve problems situated within the following bounds:

Which problems this includes exactly will have to wait until future blog posts. In the meantime, you can read about the fundamental connection between model complexity, data size, AI, and Bayesian statistics here; however, it would be an excellent idea to first review our earlier blog post introducing Bayes' theorem, which you can find here.

The alpha version of Invrea Scenarios is free. You can request a download link here.