A time series is a sequence of numbers where each number is tied to a particular moment in time. One of the most iconic examples of a time series comes from the stock market: the daily closing price of a particular stock. Modern tech companies have entire businesses built around time series, where daily user activity and hourly server demand are cornerstones of the business. Nor is it just tech companies; nearly all companies have daily sales volumes or resource allocations - time series that sit at the center of their business strategy.

Time series span a wide field of applications, from speech recognition to cyber-attack identification to maintenance predictions for industrial robotics. The most common application is forecasting: predicting the range of future values of the sequence.

Forecasting is a tricky area of data science for a simple reason: in real-world scenarios, perfect predictions are impossible. That doesn’t mean forecasting is worthless, however, and the field has specialized tools and methods that allow for the production of reasonable forecasts in most cases.

As with all data science questions, it is important to understand what actions are intended to be taken as a result of the forecast outputs. There are two common variations on forecasting. The first is trying to predict the single most likely outcome. The second is trying to prepare for more extreme events. While both can be done simultaneously, there is generally a tradeoff between how accurate a forecast is on average and how well it captures maximum and minimum events. It is also important to disentangle goal setting from forecasting. While forecasting can be used to guide goal decisions, goals often have political and psychological elements that models don’t account for.

The most basic approach to the uncertainties of forecasting is generating probabilistic forecasts. First, a point forecast is produced. This is the single most likely outcome and generally the most familiar to users. In addition, probabilistic forecasts usually include an upper and lower bound at a certain confidence level: for example, with 90% probability sales will fall between 100 and 120 units, with a most likely point forecast of 110 units. Many users of forecasts fail to take full advantage of probabilistic forecasting, which helps businesses better prepare for fluctuations in the market.
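One simple way to produce such bounds is empirically, from the spread of a model's past errors. A minimal sketch, with made-up residuals standing in for whatever error history a real model would have:

```python
# Hypothetical history of past forecast errors (actual minus forecast).
residuals = sorted([-12, -8, -3, 0, 2, 5, 7, 9, 11, 13])
point_forecast = 110

# Crude 90% interval: add the most extreme observed errors to the
# point forecast (with only 10 residuals, the 5th/95th percentiles
# are effectively the minimum and maximum).
lower = point_forecast + residuals[0]
upper = point_forecast + residuals[-1]
print(lower, point_forecast, upper)  # 98 110 123
```

Real probabilistic models estimate these quantiles more carefully, but the idea is the same: the interval reflects how wrong similar forecasts have been before.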

From a technical side, dealing with uncertainty starts with improved data preparation. While there are many possible ways of handling the data, there are two main categories of preprocessing for time series forecasting: transformations and decompositions.

A common transformation is smoothing, usually via a moving average, which can take a rough and chaotic pattern and smooth it into a more predictable wave shape. Transformations usually sacrifice some data specificity in return for a more stable and predictable series. Decompositions, on the other hand, break data down into multiple components, for example allowing a trend component to be handled by a linear regression while the remaining patterns are handled by another method.
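A minimal sketch of a moving-average smoothing transformation; the window size of 3 is an arbitrary choice for illustration:

```python
def moving_average(series, window=3):
    """Smooth a series by replacing each point with the average
    of `window` consecutive observations."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

noisy = [10, 14, 9, 15, 11, 16, 12]
print(moving_average(noisy))  # [11.0, 12.67, 11.67, 14.0, 13.0] (rounded)
```

Note the output is shorter than the input: smoothing trades away the endpoints and the fine detail in exchange for a more stable series.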

If there is one thing to remember about time series data handling, however, it is to be extremely careful not to accidentally leak future information into the past. Modeling features might be created based on recent insights from the data, then retroactively applied to earlier historical data. This can give models hints about what later happened, leading to unreasonably good results in historical testing - results which will not be reproducible when the model is applied to predicting the true future.

The same problem applies to cross validation, which is usually how a model is chosen. It involves evaluating a model across multiple tests to see how consistent it is. Given data from 2014 to 2015, predict 2016; then given data from 2014 to 2020, predict 2021. Models that consistently do well on the past are usually the safest bet for the future.
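This rolling-origin style of validation, where the training window always ends before the test period, can be sketched as:

```python
def rolling_origin_splits(periods, min_train=2):
    """Yield (train, test) pairs where the model always predicts the
    period immediately after its training window - never in between."""
    for end in range(min_train, len(periods)):
        yield periods[:end], periods[end]

years = list(range(2014, 2022))
for train, test in rolling_origin_splits(years):
    print(f"train on {train[0]}-{train[-1]}, predict {test}")
# train on 2014-2015, predict 2016
# ...
# train on 2014-2020, predict 2021
```

The key property is that no split ever asks the model to "predict" a year it has already seen data beyond.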

There are potential problems in cross validation, too. If you give a model data from 2020 and 2022, then ask it to predict 2021, it will usually do quite well at predicting what happened in between. However, that is not representative of how well it performs when blindly asked to predict the future. Another common issue is a data definition change. If, say, two departments were merged in 2018 and the data after that point is different, it might be necessary to discard the older data. Usually it is safest to be aggressive in removing problematic data, although older information may sometimes still be useful to a model, as many models are capable of - indeed expect to be - handling changes in patterns over time. Forecasting is a domain with plenty of potential pitfalls, but fortunately a well-built forecasting platform can guide users away from most of them.

Before looking at the models, it is important to look at the scoring criteria by which a best model is chosen. There is a nearly infinite number of possible metrics, but a few are quite common. MAE is the mean absolute error and the simplest metric to use. An MAE of 10 for a stock price forecast would mean your forecast is, on average, off by $10 from the real value. MAPE and SMAPE, variations on a theme, are another pair of metrics with easy-to-interpret values: they express how far off the forecast is, on average, as a percentage. An SMAPE of 0.05 would mean the forecast is usually 5% off the real value. SMAPE is especially valuable when comparing forecasts of different scales - an MAE of 15 could be very good for a product selling thousands of units a day, but very bad for a product that sells only a handful of units. SMAPE removes that ambiguity, with 5% always meaning 5% of whatever the volume is. RMSE is similar to MAE but more heavily penalizes large errors - preferring forecasts with many small mistakes over a forecast with just a few, but very large, mistakes.
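A minimal sketch of these three metrics, with illustrative actual and forecast values:

```python
def mae(actual, forecast):
    """Mean absolute error: average size of the miss, in units."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    """Root mean squared error: squaring penalizes large misses more."""
    return (sum((a - f) ** 2 for a, f in zip(actual, forecast))
            / len(actual)) ** 0.5

def smape(actual, forecast):
    """Symmetric MAPE: each error is scaled by the mean magnitude
    of the actual and forecast values, so scale drops out."""
    return sum(2 * abs(f - a) / (abs(a) + abs(f))
               for a, f in zip(actual, forecast)) / len(actual)

actual   = [100, 110, 120]
forecast = [98, 115, 118]
print(mae(actual, forecast))    # 3.0  -> off by 3 units on average
print(rmse(actual, forecast))   # ~3.32 -> the 5-unit miss weighs more
print(smape(actual, forecast))  # ~0.027 -> roughly 2.7% off on average
```

Notice RMSE exceeds MAE here purely because one error (5 units) is larger than the others; on a forecast with uniform errors the two would agree.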

Among the lesser-known metrics, SPL, scaled pinball loss, also called quantile loss, is most useful for judging the accuracy of probabilistic upper/lower bound forecasts, while other metrics usually only look at the point forecast. For all the metrics mentioned here, and for most metrics in general, a smaller value indicates a better, more accurate forecast. Advanced metrics exist that penalize over- and under-estimation differently, evaluate aggregations of multivariate forecasts, and assess the directional and shape accuracy of time series.
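A minimal sketch of pinball loss for a single quantile. The asymmetry is the point: a 95th-percentile (upper bound) forecast is punished heavily when the actual value breaks above it, but only lightly for being too cautious:

```python
def pinball_loss(actual, forecast, q):
    """Quantile (pinball) loss at quantile q in (0, 1).
    High q: undershooting (actual > forecast) is penalized q-to-(1-q)
    more heavily than overshooting."""
    total = 0.0
    for a, f in zip(actual, forecast):
        total += q * (a - f) if a >= f else (1 - q) * (f - a)
    return total / len(actual)

# An upper-bound forecast of 120 at q=0.95:
print(pinball_loss([110], [120], 0.95))  # 0.5  -> actual safely below bound
print(pinball_loss([130], [120], 0.95))  # 9.5  -> actual broke the bound
```

Averaged over many points, a well-calibrated 95% bound minimizes this loss; SPL additionally scales the result so series of different volumes are comparable.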

For most people, the fun begins with the actual model algorithms used to predict the future. These are loosely classed into three groups: naive models, statistical models, and machine learning models.

Naive models are those where almost no calculation is made. It is from this group that arguably the single best ‘one size fits all’ model comes: the last value naive. This model simply predicts that today, tomorrow, and every future day will be equal to yesterday’s value. It works surprisingly well on everything from stock markets to weather, which contrasts with other models that, while often better on some data types, lack the broad versatility of this approach. The last value naive is often used as a baseline forecast for comparing new methods, and is incredibly fast to calculate.
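The entire model fits in one line:

```python
def last_value_naive(history, horizon):
    """Forecast every future step as the most recent observed value."""
    return [history[-1]] * horizon

prices = [101, 103, 102, 105]
print(last_value_naive(prices, horizon=3))  # [105, 105, 105]
```

Any proposed model that cannot beat this baseline on held-out data is adding complexity without adding value.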

At the highest level, statistical models are usually about trying to tune a mathematical formula to fit a series’ behavior. The simplest example is a linear regression, where the intercept and slope of a line are fitted to the observed data. These models can be incredibly powerful and explainable, and for series generated by physical processes - astronomical sequences, for example - they are often the best choice. Unfortunately, human behavior is often too erratic to be well modeled by even complex formulations.

Among statistical models, Prophet is a standout. While rarely offering the best possible accuracy, it is remarkably versatile and usually one of the safest choices for good models. It does well with seasonal data, but struggles when there is little training history, or there are large data definition changes.

That leaves the last group, machine learning algorithms, under which the trendy deep learning approaches fall. There have been many valiant attempts to conquer time series forecasting with deep learning: LSTMs, Transformers, DeepAR, Temporal Fusion Transformers, N-BEATS, and so on. Elsewhere in machine learning, the M5 competition was won not by neural nets but by a simpler LightGBM model coupled with plenty of data engineering. Generally these methods are much slower to train and predict, though they are theoretically more capable of handling subtle data patterns.

There is really no ‘one size fits all’ model for forecasting. Different data patterns favor different models. In particular, be wary of any new neural network architecture that claims to outperform all comers. Usually the most accurate results actually come from using naive, statistical, and machine learning models all together in an ensemble, although such ensembles may be too slow and expensive to build for everyday applications.
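An ensemble can be as simple as averaging model outputs point by point. A minimal sketch, with hypothetical forecasts standing in for a naive and a linear-trend model:

```python
def ensemble_mean(*forecasts):
    """Combine several models' forecasts by averaging each step."""
    return [sum(vals) / len(vals) for vals in zip(*forecasts)]

naive_fc  = [105, 105, 105]  # e.g. last value naive
linear_fc = [107, 109, 111]  # e.g. a fitted trend line
print(ensemble_mean(naive_fc, linear_fc))  # [106.0, 107.0, 108.0]
```

Production ensembles typically weight members by their cross-validation scores rather than averaging them equally, but even the plain mean often beats every individual member.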

Instead of spending endless hours trying different models, the most practical use of development time is usually searching for proper regressors. Regression can be a confusing word, as it is used to mean slightly different things in different parts of the data world. In this case, however, regressors refer to additional information - additional time series - that provide useful signals to guide a forecast.

A simple example of using regressors is predicting the water level of a water reservoir fed by a large river. A regressor on how much rain has fallen will be very useful in determining water level. Going further, breaking down rainfall by distance from the reservoir might be an additional step. Nearby rain will immediately raise the water level, while rain far upstream from the source may take days before it affects the water level. Snowfall, regional temperatures, cloud coverage, and other weather conditions will also be important features to consider.

This example is also worth considering for the difference between parallel and future regressors. Parallel regressors have no future information available, while future regressors are known for some time in advance. As weather forecasts are readily available, it is possible to incorporate rainfall as a future regressor, using predicted rainfall amounts. The problem, of course, is that the incorporated weather forecasts carry potentially significant error of their own, which then increases the error of the reservoir prediction. Simulation forecasting can also be utilized in this scenario. For simulations, the model is trained on the known rainfall, then run with different future rainfall inputs to determine the effects of possible rainfall scenarios on the levels of the reservoir.
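A toy illustration of simulation forecasting. The hypothetical linear response below (each mm of rain adds a fixed inflow, with a fixed daily drawdown) stands in for whatever trained model would be used in practice; all numbers are invented:

```python
def simulate_level(current_level, rainfall_scenario,
                   inflow_per_mm=0.8, daily_drawdown=2.0):
    """Run a (hypothetical) reservoir model forward under one
    assumed sequence of daily rainfall amounts."""
    level, levels = current_level, []
    for rain_mm in rainfall_scenario:
        level = level + inflow_per_mm * rain_mm - daily_drawdown
        levels.append(level)
    return levels

scenarios = {"dry": [0, 0, 0], "normal": [3, 2, 4], "storm": [25, 30, 10]}
for name, rain in scenarios.items():
    print(name, simulate_level(100.0, rain))
# dry    -> levels fall steadily
# storm  -> levels climb sharply
```

The same trained model, run across a handful of plausible rainfall scenarios, gives planners a range of outcomes rather than a single guess.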

As the example makes clear, regressors are extremely important to forecasting. However, use of regressors is often limited by data engineering challenges. Automating and maintaining a system for collecting weather forecast data, and other possible inputs, is often challenging or costly. The production challenges of forecasting are greater than those of many other data science systems due to the living, constantly updated, dynamic nature of the systems. A model designed to detect cats or dogs in videos will likely work unchanged for years to come, but a year-old forecast is mostly useless.

Forecasting requires a flexible and dynamic mindset. Uncertainty and potential mistakes are everywhere. However, the end result - better guidance for the future - is well worth the effort.