The views and opinions expressed in this article are those of the thought leader as an individual, and are not attributed to CeFPro or any particular organization.
By Adam Behrman, Head of Model Risk, Investors Bank
What challenges arise with building models on historic data?
First, consider the completeness and accuracy of your historical data. The data available to you today is the product of prior decisions about what would be collected, in what format, at what frequency, and so on. As storage became exponentially cheaper and technology improved, more granular and more frequent data was captured, and better controls and tools were put in place to improve accuracy.
There is an inverse relationship between the length of the look-back period and the availability of the data you want to analyse. A common pitfall is data whose form changes in a non-obvious way: look out for a field that was previously captured as a string and is now represented as a number, sometimes within the same column. Unique to historical data is that the sample is difficult or impossible to re-collect, since by definition the observation occurred in the past. To verify or re-evaluate a data point from scratch you would need a direct historical record, for example a video of the occurrence, or the ability to recompute it from other directly recorded data points. Both are rarely possible, so the historical data you have is the best you can rely on.
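A quick automated check can surface fields whose representation drifted over time. The sketch below (the `credit_score` field and its values are hypothetical, purely for illustration) flags any column that mixes Python types, a common symptom of a silent change in how the data was captured:

```python
import pandas as pd

# Hypothetical extract: the same "credit_score" field was stored as a
# string in older records and as a number in newer ones.
records = pd.DataFrame({
    "as_of_date": ["2005-06-30", "2012-06-30", "2019-06-30"],
    "credit_score": ["720", 695, 710.0],
})

def mixed_type_fields(df):
    """Return columns whose values span more than one Python type."""
    return [col for col in df.columns
            if df[col].map(type).nunique() > 1]

print(mixed_type_fields(records))  # -> ['credit_score']
```

A check like this is cheap to run on each historical vintage as it is loaded, and catches the string-versus-number pitfall before it silently corrupts a model's inputs.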
Second, modelers must be aware of temporal effects on their data. Some modeling approaches, such as time series forecasting, are explicit about how they treat relationships over time. Less obvious is the a priori decision that defines the unit of time. Analysis of open-high-low-close stock prices provides excellent examples of the effects of changing that definition. In high-frequency trading, traders initially sampled at small, fixed intervals of clock time, such as seconds. But because of the repeating rhythm of the trading day, volume and activity are concentrated in the periods immediately around the market open and close, with far less activity around lunchtime. This multi-modal distribution of activity produced poor models with biased estimates. Traders found they could improve their analysis by moving first to tick-based bars, where each period contains a fixed number of ticks (say, 1,000), and eventually to volume-based and dollar-based bars. Dollar-based bars in particular have been shown to produce samples with better statistical properties for modeling. Taken in the other direction, investment managers with long-term horizons benefit from reducing the frequency of their observations, ignoring intra-day or even daily movements that can introduce significant noise.
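The dollar-bar idea can be sketched in a few lines: instead of closing a bar when the clock advances, close it once roughly a fixed dollar value has traded. The function name, the toy tape of (price, volume) ticks, and the $5,000 bar size below are all illustrative assumptions, not a production implementation:

```python
def dollar_bars(trades, bar_size):
    """Group (price, volume) ticks into bars that each contain roughly
    `bar_size` dollars of traded value, rather than a fixed clock interval."""
    bars, prices, dollars = [], [], 0.0
    for price, volume in trades:
        prices.append(price)
        dollars += price * volume
        if dollars >= bar_size:  # close the bar once enough value has traded
            bars.append({"open": prices[0], "high": max(prices),
                         "low": min(prices), "close": prices[-1],
                         "dollar_value": dollars})
            prices, dollars = [], 0.0
    return bars  # any leftover ticks remain in the open, unclosed bar

# Toy tape of ticks; $5,000 of traded value per bar.
tape = [(10.0, 100), (10.1, 200), (10.2, 150), (9.9, 300), (10.0, 100)]
print(dollar_bars(tape, 5_000))
```

Because each bar represents a comparable amount of economic activity, busy periods around the open and close simply produce more bars, rather than distorting the statistics of fixed clock intervals.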
Modelers may be comfortable with approaches that have empirically provided a good, stable fit, but a disruptive event can change the appropriateness of the approach. A good example is short-rate models: Vasicek (1977) and Hull-White (1990) permit negative interest rate outcomes, whereas Cox-Ingersoll-Ross (CIR, 1985) does not. Prior to the Great Recession, it was widely accepted that interest rates would not go negative. Afterwards, negative rates were encountered in global markets and their consideration became a necessity. Modelers who had relied on CIR or another zero-bounded approach have had to weigh the implications of that decision and whether an alternative approach is more prudent, given the changing behaviour of the underlying data.
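The structural difference is visible in a simple Euler simulation of the two dynamics. In Vasicek, the diffusion term is independent of the rate level, so paths can cross zero; in CIR, the diffusion scales with the square root of the rate and vanishes at zero (here discretized with truncation at zero to preserve nonnegativity). The parameter values below are illustrative assumptions, not a calibrated model:

```python
import numpy as np

def simulate_short_rate(model, r0=0.01, a=0.3, b=0.005, sigma=0.04,
                        dt=1 / 252, steps=2520, seed=7):
    """Euler simulation of a one-factor short-rate model (illustrative only)."""
    rng = np.random.default_rng(seed)
    r = np.empty(steps + 1)
    r[0] = r0
    for t in range(steps):
        dw = rng.normal(0.0, np.sqrt(dt))
        if model == "vasicek":
            # Diffusion independent of the level: the rate can go negative.
            r[t + 1] = r[t] + a * (b - r[t]) * dt + sigma * dw
        elif model == "cir":
            # Diffusion scales with sqrt(r) and is truncated at zero,
            # so the simulated rate stays nonnegative.
            rt = max(r[t], 0.0)
            r[t + 1] = max(rt + a * (b - rt) * dt
                           + sigma * np.sqrt(rt) * dw, 0.0)
    return r

vasicek = simulate_short_rate("vasicek")
cir = simulate_short_rate("cir")
print("Vasicek min:", vasicek.min(), "CIR min:", cir.min())
```

With the same random draws, the Vasicek path is free to dip below zero while the CIR path is floored at it, which is precisely the property that became a liability once negative rates appeared in real markets.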
All this is to say, you really must understand your data, and particularly with historical data, its relationship to time.