Why Small Data Is the New Big Data

By GAM Investments

by Dr Chris Longworth and Silvia Stanescu – GAM Systematic

 

Over the last decade, we have seen significant advances in machine learning across a wide range of fields. In many cases, this has come from applying very complex models – often containing tens of thousands of parameters – to extremely large datasets, commonly containing millions of examples. These applications are often described as ‘big data’ problems.

However, there is a related category of problems where the amount of data available to train a machine learning model is fundamentally limited, which we refer to as ‘small data’ problems. Small data problems are very common in finance and need their own approach, since in most cases techniques designed to solve big data problems simply do not work well when applied to small datasets.

One example of a small data problem is the study of large earthquakes. High-quality historical records of earthquakes start around 1900. Since then, however, there have been only around 100 earthquakes of magnitude 8.0 or greater worldwide, as shown in Figure 1. Importantly, the issue is not that we did not look hard enough for data. We already have the complete dataset, but it is small.

Figure 1: Locations of the largest earthquakes since 1900

Source: GAM, USGS

 

The tell-tale signs of small data

There are a number of signs that one might be working with small data:

  • Time series: If each data point is tied to a specific point in time, there is a high chance of a small data problem. This is especially likely when the data is only available periodically, which is common for economic data.
  • Rarity: Does the data represent real world events, and do those events occur rarely in nature? This is the earthquake situation outlined above.
  • Aggregate: Is the data aggregated? If the data represents whole countries, or is already a global aggregate, there is likely to be a small data problem. With the exception of astronomical data, we normally only have data from one planet to work with.
  • Correlated: If the data contains a high degree of internal structure or correlation, it is likely to contain fewer independent samples than its raw size suggests, particularly if the dataset is noisy.
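
To make the last point concrete, the sketch below estimates how many independent observations a correlated series is roughly worth. It is a minimal illustration and not the method used in the white paper. It assumes the series behaves like a first-order autoregressive, or AR(1), process, for which a common rule of thumb is n_eff = n(1 - ρ)/(1 + ρ), where ρ is the lag-1 autocorrelation. The function names and the simulated series are hypothetical and are not taken from real financial data.

```python
import random


def lag1_autocorrelation(x):
    """Sample lag-1 autocorrelation of a sequence of numbers."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    num = sum(a * b for a, b in zip(d[:-1], d[1:]))
    den = sum(a * a for a in d)
    return num / den


def effective_sample_size(x):
    """Rule-of-thumb count of independent observations.

    Assumption: the series is approximately AR(1), so that
    n_eff = n * (1 - rho) / (1 + rho).
    """
    rho = lag1_autocorrelation(x)
    # Clamp rho to [0, 0.999]: this sketch ignores negative correlation
    # and avoids a degenerate result when rho is close to 1.
    rho = min(max(rho, 0.0), 0.999)
    return len(x) * (1.0 - rho) / (1.0 + rho)


def simulate_ar1(n, rho, seed=0):
    """Simulated AR(1) series (illustrative data only)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        x.append(rho * x[-1] + rng.gauss(0.0, 1.0))
    return x


if __name__ == "__main__":
    independent = simulate_ar1(n=1000, rho=0.0)
    correlated = simulate_ar1(n=1000, rho=0.9)
    print("independent series:", round(effective_sample_size(independent)))
    print("correlated series: ", round(effective_sample_size(correlated)))
```

Under this assumption, a true ρ of 0.9 leaves roughly 5% of the nominal sample size, so 1,000 strongly correlated points carry about as much information as some 50 independent ones. The estimate from any one simulated series will differ somewhat from that figure, because ρ itself has to be estimated from the data.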

It turns out that many problems in finance satisfy all of these criteria. Finance consists of both big data and small data problems and the challenge is to be able to differentiate one from the other. In our latest white paper, we discuss in greater depth some examples of small data problems in finance and outline some of the approaches that can be applied to address these challenges.

Click here to view the GAM Systematic white paper in full.

Click here to view the video on the same topic, presented by Chris Longworth and Silvia Stanescu.