In real-world forecasting applications, data commonly contains null values, that is, values that are missing for certain points in time. Values can be missing for several reasons. For example, a transaction may not have occurred, or a device or service that monitors data may have malfunctioned. In demand planning use cases, data may be missing because no sale took place or because the item was out of stock.
This article serves as a guide to help Forecaster users deal with situations in which their datasets include missing values or 'empty' cells.
It's important to differentiate between a true zero and a missing value. A dataset with many missing values (a sparse dataset) is also different from a cold start scenario, where little or no data exists because a certain product is new to the market.
Many missing values in a dataset may impair forecast accuracy, especially when the gaps fall in the more recent (later) part of the time series. We recommend keeping missing values below 30% per time series (per item). Forecaster limits the missing values per item to 50% of the historical data. If a dataset contains more than 50% missing values, Forecaster displays a message indicating that too many values are missing.
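The two thresholds above can be checked before a dataset is loaded. The following is a minimal sketch in plain Python, not part of Forecaster itself. It assumes each item's history is a list in which a missing period is represented by `None` and a true zero by `0.0`; the names `missing_share`, `classify`, and the sample items are hypothetical.

```python
from typing import Optional, Sequence

RECOMMENDED_MAX = 0.30  # recommended ceiling quoted in the text above
HARD_LIMIT = 0.50       # per-item limit quoted in the text above


def missing_share(series: Sequence[Optional[float]]) -> float:
    """Return the fraction of periods whose value is missing (None)."""
    if not series:
        raise ValueError("series must contain at least one period")
    return sum(value is None for value in series) / len(series)


def classify(series: Sequence[Optional[float]]) -> str:
    """Label one item's history against the two thresholds."""
    share = missing_share(series)
    if share > HARD_LIMIT:
        return "too many missing values"
    if share >= RECOMMENDED_MAX:
        return "above recommendation"
    return "ok"


# Hypothetical sample data: 0.0 is a true zero and does not count as missing.
history = {
    "item_a": [12.0, None, 9.0, 0.0, 11.0, 10.0, 8.0, 13.0, 7.0, 9.0],
    "item_b": [None, 4.0, None, None, 5.0, None, 0.0, None, None, 6.0],
}
for item, values in history.items():
    print(item, f"{missing_share(values):.0%}", classify(values))
# item_a 10% ok
# item_b 60% too many missing values
```

Items labeled "above recommendation" may still be accepted, but they are good candidates for the handling options described in this article.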
For datasets that originate in Anaplan modules, Forecaster assumes that records set to zero are true zeros and treats them as such. In addition, when a custom time dimension is used (that is, when the time dimension is based on a list of timestamps), records with missing timestamps are also treated as zeros rather than as missing values.
There are several ways to deal with missing values:
- Use the ‘___exclude_value’ column so that missing values are not interpreted as zero (see Anapedia for more details). When the ‘___exclude_value’ column is set to 1 for a record, Forecaster automatically fills in that record's value with the mean of the values around it.
- Manually review and fill in missing values.
- Aggregate the data by reducing its frequency (e.g., instead of a daily frequency, aggregate the data to a weekly level).
- Aggregate multiple distinct items into a category of items based on the item hierarchy or on other dimensions such as location (for example, combining multiple cities at the state level).
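Two of the options above, the exclude flag and frequency aggregation, can be illustrated with a short sketch. It is not Forecaster's implementation: the exact definition of "the values around it" is not given here, so the code assumes it means the nearest unflagged value on each side. The function names and sample data are hypothetical.

```python
from typing import List, Optional, Sequence


def fill_excluded(values: Sequence[float], exclude: Sequence[int]) -> List[float]:
    """Replace flagged points with the mean of the nearest unflagged neighbours.

    Assumption: "values around it" means the closest unflagged value on each
    side; at the edges only one neighbour exists and is used on its own.
    """
    if len(values) != len(exclude):
        raise ValueError("values and exclude must have the same length")
    filled = list(values)
    for i, flag in enumerate(exclude):
        if flag != 1:
            continue  # unflagged records, including true zeros, are kept
        left = next((values[j] for j in range(i - 1, -1, -1) if exclude[j] != 1), None)
        right = next((values[j] for j in range(i + 1, len(values)) if exclude[j] != 1), None)
        neighbours = [v for v in (left, right) if v is not None]
        if neighbours:
            filled[i] = sum(neighbours) / len(neighbours)
    return filled


def aggregate(values: Sequence[Optional[float]], periods: int = 7) -> List[Optional[float]]:
    """Sum consecutive blocks of `periods` points; a block with no data stays missing."""
    totals: List[Optional[float]] = []
    for start in range(0, len(values), periods):
        known = [v for v in values[start:start + periods] if v is not None]
        totals.append(sum(known) if known else None)
    return totals


# The second record is flagged as missing; the fourth is a true zero.
daily_sales = [10.0, 0.0, 14.0, 0.0, 12.0]
exclude_flag = [0, 1, 0, 0, 0]
print(fill_excluded(daily_sales, exclude_flag))  # [10.0, 12.0, 14.0, 0.0, 12.0]

# Fourteen daily points collapse into two weekly totals.
two_weeks = [3.0, None, 4.0, None, None, 2.0, 1.0] + [None] * 7
print(aggregate(two_weeks))  # [10.0, None]
```

Summing only the known days makes a weekly series denser, but it understates weeks in which real activity went unrecorded, so it suits cases where most gaps reflect periods with no activity.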
Another way to deal with sparse datasets is to use robust forecasting algorithms such as LightGBM, TimesFM, and DeepAR. These algorithms may take longer to train and typically require more historical data, but they can better handle sparsity in time series data. When it's not clear which algorithm to choose, it's best to choose Ensemble. Forecaster then evaluates different algorithms and chooses the one that gives the best forecast for each item.
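The idea of choosing the best algorithm per item can be sketched as follows. This is an illustration only: how Forecaster scores and compares algorithms is not described here, so the code assumes a single backtest error per item and algorithm, where lower is better. The function `pick_best` and the error figures are made up for the example.

```python
from typing import Dict


def pick_best(errors: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """For each item, return the algorithm with the lowest error."""
    return {item: min(by_algo, key=by_algo.get) for item, by_algo in errors.items()}


# Made-up backtest errors (lower is better), for illustration only.
backtest_errors = {
    "item_a": {"LightGBM": 0.18, "TimesFM": 0.22, "DeepAR": 0.20},
    "item_b": {"LightGBM": 0.35, "TimesFM": 0.27, "DeepAR": 0.31},
}
print(pick_best(backtest_errors))
# {'item_a': 'LightGBM', 'item_b': 'TimesFM'}
```

Because the choice is made per item, different items in the same dataset can end up with different algorithms.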
While time series data often have many missing values, Forecaster can help you get the most out of historical data to make accurate forecasts.