PlanIQ - How to manage NULL values?
In real-world forecasting applications, it is common for datasets to contain null values, which represent missing values for specific points in time. There can be multiple reasons for the presence of missing values. For example, a transaction may not have occurred, or a device or service that monitors data may have malfunctioned. In demand planning use cases, the reason for missing data may be due to a lack of a sale or an out-of-stock situation.
This article serves as a guide to help PlanIQ users deal with scenarios in which their datasets include missing values or ‘empty’ cells.
It is important to differentiate between a true zero and a missing value. Moreover, a dataset with many missing values (a sparse dataset) is different than a cold start scenario where little or no data exists because a certain product is new to the market. For further information, please refer to the New Product Introduction article.
Many missing values in a dataset may impair the forecast accuracy. This is especially true for more recent (later) data in the time series. Our recommendation is to not have more than 30% of missing values per time series (per item). PlanIQ limits the missing values per item to 50% in the historical data. If a dataset contains more than 50% missing values, PlanIQ will display an error message indicating that too many values are missing.
PlanIQ assumes that datasets that originate in Anaplan modules with records set to zero are true zeros and therefore will be treated as such. In addition, in instances where a custom time dimension is used (i.e. where the time dimension based on a list of timestamps), records with missing timestamps will be treated as zeros as well (rather than missing).
There are multiple ways to deal with missing values. For instance:
- Use the ‘___exclude_value’ column so that missing values are not interpreted as zero (see Anapedia for more details). If a value is marked with 1 using the ‘___exclude_value’ column, PlanIQ will automatically fill it in with the mean of the values around it.
- Manually review and fill in missing values.
- Aggregate the data by using reducing its frequency (e.g. instead of a daily frequency, aggregate the data to a weekly level).
- Aggregate multiple distinct items to a category of items based on item hierarchy or other dimensions such as location (for example, combining multiple cities to a state level).
One more way to deal with sparse datasets is to use robust forecasting algorithms such as CNN-QR and DeepAR+. These algorithms take longer to train and typically require more historical data, but are better suited to handle sparsity in time series data. In scenarios where it is unclear which algorithm to choose, it is recommended to choose Anaplan AutoML; PlanIQ will then explore different algorithms and select the one producing the most accurate forecasts for most of the items.
Finally, when using related data, and future values are missing, CNN-QR should be chosen as the forecasting algorithm since other algorithms expect each related time series to contain future values (with a length equivalent to the forecast horizon).
While time series datasets will often contain many missing values, PlanIQ enables users to maximize the value of their datasets and generate accurate forecasts using the aforementioned approaches.
Got feedback on this content? Let us know in the comments below.
Contributing authors: Nitzan Paz, Christophe Keomanivong, Frankie Wolf, Timothy Brennan, Andrew Martin, and Evgenya Kontorovich.