PlanIQ and related data

YanivD
edited April 22 in Best Practices

Author: Evgy Kontorovich, Sr. Director Product Management at Anaplan.

Background

Customers frequently ask what can be used as related data. When thinking about related data, we can start with different types of “internal” time-dependent inputs:

  • Historical and future promotions
  • Stockouts
  • Price fluctuations
  • Holidays and special events

The actual data may differ depending on the specific use case. For example, when forecasting cloud or software license costs, related data can take the form of the number of engineers or the number of employees.

“External” factors are also use-case dependent and include, among other things:

  • Weather
  • Traffic and mobility data
  • Currency exchange rates
  • Interest rates
  • Commodity prices
  • Consumer price indexes

It may also be helpful to provide further related information to help the algorithms learn better from the past.

Lastly, it may be useful to provide related data as an additional input even if historic seasonality may be sufficient to predict similar future behavior. We will further explore this scenario in one of the examples in this article.

Before we dive into the examples, I would like to remind you that the MVLR and Anaplan Prophet algorithms provided by PlanIQ actually look at several types of input:

  • Related data provided by the user – when provided.
  • Holiday calendar, provided by the user in the form of related data or taken from the built-in PlanIQ holiday calendars.
  • Historical behavior of the particular item (time series) – historical trend and seasonality components.

On top of this information, PlanIQ performs automated analysis of the data in order to identify leading and lagging indicators. All of that information is then used as part of feature selection, where the algorithms pick only the relevant information to use in the forecast. You can review this article* for detailed information about this topic.
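The article does not describe how PlanIQ detects leading and lagging indicators, so the following is only a generic illustration of the idea, not PlanIQ's method. It correlates a target series with copies of a related series shifted by 0 to `max_lag` periods and reports the shift with the strongest relationship. The function names and the sample numbers are made up for this sketch.

```python
def pearson(x: list, y: list) -> float:
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def best_lag(target: list, related: list, max_lag: int) -> int:
    """Return the lag k (in periods) where related[t - k] tracks target[t] most closely."""
    scores = {}
    for k in range(max_lag + 1):
        # Pair related[t - k] with target[t] by trimming the ends that have no partner.
        scores[k] = pearson(related[: len(related) - k], target[k:])
    return max(scores, key=lambda k: abs(scores[k]))


# Made-up data: a flag in the related series lifts the target two periods later.
related = [0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
target = [10, 10, 10, 10, 15, 10, 10, 15, 10, 10, 10, 15]
lag = best_lag(target, related, max_lag=3)
print(f"strongest relationship at a lag of {lag} periods")
```

A lag of 0 would mean the related series only matters in the period where it appears; a positive lag means it acts as a leading indicator for the target.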

From the perspective of this article, we need to remember that the algorithms look at your historical data as well as related data and historical behaviors.

Another thing to remember is that it is important to provide related data not only for the known past but also into the future. Most of PlanIQ algorithms (except for CNN-QR) would ignore related data if it’s not provided in the future for the entire length of forecast horizon.
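A quick way to catch this before running a forecast is to check that the related series has a value for every period of the horizon. The sketch below is plain Python, not a PlanIQ feature; the helper names and the sample promotion flag are hypothetical, and it assumes monthly data keyed by the first day of each month.

```python
from datetime import date


def add_months(d: date, n: int) -> date:
    """Return the first day of the month that is n months after d."""
    total = d.year * 12 + (d.month - 1) + n
    return date(total // 12, total % 12 + 1, 1)


def missing_future_periods(related: dict, last_actual: date, horizon: int) -> list:
    """List the forecast months for which the related series has no value."""
    needed = [add_months(last_actual, i) for i in range(1, horizon + 1)]
    return [m for m in needed if related.get(m) is None]


# Hypothetical related series: a promotion flag filled in only through June 2023.
promo = {add_months(date(2022, 1, 1), i): 0 for i in range(18)}
gaps = missing_future_periods(promo, last_actual=date(2022, 12, 1), horizon=12)
print(f"{len(gaps)} forecast months have no related data, starting {gaps[0]}")
```

An empty result means the series covers the whole horizon. Note that an explicit 0 counts as a value here, while a missing period does not.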

It is not difficult to provide future information for events that are “internal” to your business, for example promotions or changes in price. It is more difficult when external factors are used. In this case, our recommendation is to first forecast the external factor and then use that forecast as input into the main forecast. Even if the external forecast is only directionally accurate, it lets you generate a forecast based on an assumption that you control. It also lets you perform what-if analysis, where you adjust the external data forecast and observe the impact of the changes on the main forecast.
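The two-step idea can be sketched outside Anaplan as follows. This is an illustration only: it stands in a simple seasonal-naive projection (repeat the last 12 months) for whatever forecast you would actually build for the external factor, and the exchange-rate values and function names are made up.

```python
def seasonal_naive(history: list, horizon: int, season: int = 12) -> list:
    """Project a series forward by repeating its last full season."""
    if len(history) < season:
        raise ValueError("need at least one full season of history")
    last_season = history[-season:]
    return [last_season[i % season] for i in range(horizon)]


def what_if(forecast: list, pct_change: float) -> list:
    """Shift every forecast value by a percentage, e.g. -0.05 for 5% lower."""
    return [round(v * (1 + pct_change), 4) for v in forecast]


# Made-up monthly exchange-rate history (illustrative values only).
fx_history = [1.10, 1.11, 1.12, 1.12, 1.13, 1.14, 1.13, 1.12, 1.11, 1.10, 1.09, 1.08]

fx_base = seasonal_naive(fx_history, horizon=12)  # step 1: forecast the external factor
fx_low = what_if(fx_base, -0.05)                  # scenario: rate 5% below the baseline

# Step 2: past plus future values form one related series for the main forecast.
related_base = fx_history + fx_base
related_low = fx_history + fx_low
```

Running the main forecast once with `related_base` and once with `related_low` would show how sensitive it is to the exchange-rate assumption.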

Dataset

For the discussion in this article I used champagne sales data (as a single time series) provided by this Kaggle dataset*. It covers monthly sales of champagne between 1964 and 1972. I modified the timestamps in order to “move” the data into the present and used the most recent 5 years of it (from 2018 to 2023).

The experiments

When looking at the historical data, it’s evident that it has a clear yearly seasonality pattern, with sales peaks around December and drops in sales around August.

It can be observed that the overall trend is positive with a peak at the end of 2021, followed by a drop or downward trend. In general, 2021 had higher actuals than the years before and after. This is also evident in the Seasonality and Trend analysis data provided by PlanIQ.

We will explore the impact of two experiments:

  • Forecasting based on historical data only.
  • Forecasting based on both historical and related data.

For the experiments, I started the forecasts from January 2021 and used actuals until December 2020. The forecast horizon was 12 months. In each of several iterations, I added 12 months of actuals and projected an additional 12 months of forecast.
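The resulting backtest windows can be listed with a few lines of Python. The function name is hypothetical, and the count of three iterations is an inference from the 2021 to 2023 range discussed in this article rather than a number stated in it.

```python
def rolling_windows(first_year: int, iterations: int) -> list:
    """Build yearly backtest windows with a 12-month forecast horizon.

    Each window is (last month of actuals, first forecast month, last forecast month),
    written as "YYYY-MM" strings.
    """
    windows = []
    for i in range(iterations):
        year = first_year + i
        windows.append((f"{year - 1}-12", f"{year}-01", f"{year}-12"))
    return windows


for actuals_until, start, end in rolling_windows(2021, 3):
    print(f"actuals until {actuals_until} -> forecast {start} to {end}")
```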

Experiments with MVLR

We will observe the last forecast period, between January 2023 and December 2023. When based on historical data only, the MAPE of the P2 forecast (0.5 quantile) was 10.72%, which is pretty accurate. Looking at explainability, we can see that the algorithm rightfully decided that the yearly seasonality identified by PlanIQ is the most impactful factor in our forecast. The second most impactful factor is the “exponential upwards trend” identified by PlanIQ.
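For readers who want to reproduce the metric on exported actuals and forecasts, MAPE (mean absolute percentage error) in its textbook form can be computed as below. This is a sketch of the standard definition; whether PlanIQ handles edge cases such as zero actuals the same way is an assumption, and the sample numbers are made up.

```python
def mape(actuals: list, forecasts: list) -> float:
    """Mean absolute percentage error, in percent.

    Periods where the actual is 0 are skipped, because the percentage
    error is undefined there.
    """
    pairs = [(a, f) for a, f in zip(actuals, forecasts) if a != 0]
    if not pairs:
        raise ValueError("no non-zero actuals to evaluate")
    return 100 * sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)


# Made-up numbers, only to show the arithmetic: errors of 10%, 10% and 0%.
print(round(mape([100, 200, 400], [110, 180, 400]), 2))
```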

Another experiment that I performed included related data that I generated. I used two types of related inputs:

  • Holidays – indicating the holiday peaks in December and drops in August. In this case I used numeric values where regular months had a value of 1, December had a value of 100 and August a value of ‘-20’. The numbers are arbitrary.
  • Covid19 – indicating months of the COVID pandemic. Months of 2020 and 2021 had a value of 100 and other months had a value of 0.
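The two related series described above are easy to generate programmatically. The sketch below builds them for 2018 through 2023 using the values from this article; the function and field names are hypothetical, and in practice the rows would live in an Anaplan module rather than a Python list.

```python
def holidays_value(month: int) -> int:
    """Arbitrary markers from the article: December peak, August dip, 1 otherwise."""
    if month == 12:
        return 100
    if month == 8:
        return -20
    return 1


def covid_value(year: int) -> int:
    """100 for the months of 2020 and 2021, and an explicit 0 everywhere else."""
    return 100 if year in (2020, 2021) else 0


rows = [
    {"period": f"{year}-{month:02d}",
     "holidays": holidays_value(month),
     "covid19": covid_value(year)}
    for year in range(2018, 2024)
    for month in range(1, 13)
]

for row in rows[6:9]:  # July to September 2018
    print(row)
```

Every month gets a value in both series, including the 0s outside the COVID period, so no period is left blank.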

When using this related data to perform the same forecasts, I was able to reduce the MAPE of P2 (0.5 quantile) from 10.72% to 5.21%, which is a significant improvement.

Even visually, we can see that the P2 curve is much closer to actuals during 2023. In terms of explainability, seasonality is still the most impactful factor, but we can see a much bigger impact of other inputs, including COVID. It is possible that introducing the COVID-related time series makes the algorithm more sensitive to the trend changepoint that happened after 2021.

Experiments using Anaplan Prophet

When forecasting based on the same data using Anaplan Prophet, we see similar behavior. Based on historical data only, the MAPE of the P2 forecast (0.5 quantile) between January 2023 and December 2023 is quite high, at approximately 44%.

The most important factor here is yearly seasonality, but we can see that the forecast for 2023 is pretty much on the same level as in 2021, and this is causing the high MAPE value. In this case, the customer could decide to use values between P1 (0.1 quantile) and P2 (0.5 quantile) as the forecast baseline.

When introducing related data as specified above, the MAPE of P2 is reduced to 27%. In this case, the impact of August “lows” and December “highs” is higher, and it seems that the algorithm is giving the related holidays data more weight than seasonality alone. Additionally, it seems that the COVID-related data also helps the algorithm be more sensitive to the trend changepoint, which reduces the over-forecasting in the majority of 2023 months. The customer could decide to use values between P1 (0.1 quantile) and P2 (0.5 quantile) as the forecast baseline, based on the specific use case.
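One simple way to pick “values between” two quantile forecasts is a linear blend per period. This is an assumption about how a planner might do it, not a PlanIQ feature, and the blended series is a planning baseline rather than a true quantile. The function name and numbers are made up.

```python
def blend_quantiles(p1: list, p2: list, weight: float) -> list:
    """Linear blend per period: weight 0.0 returns P1, weight 1.0 returns P2."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    if len(p1) != len(p2):
        raise ValueError("P1 and P2 must cover the same periods")
    return [low + weight * (high - low) for low, high in zip(p1, p2)]


# Made-up quantile forecasts for two months.
p1 = [80, 90]    # 0.1 quantile
p2 = [100, 120]  # 0.5 quantile
baseline = blend_quantiles(p1, p2, weight=0.5)  # halfway between P1 and P2
print(baseline)
```

A weight closer to 0 leans toward P1, which suits cases where the algorithm is known to over-forecast.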

Comparing MVLR and Anaplan Prophet

In this experiment, we saw that for 2023 MVLR outperformed Anaplan Prophet both with and without related data. However, if we look at the 2022 data, Anaplan Prophet was more accurate than MVLR. In both cases the algorithms over-forecasted. This is probably because 2021 actuals were much higher than in previous years and both algorithms clearly tried to continue the trend. MVLR probably gave more “importance” to the trend, while Anaplan Prophet was more “reserved” in its predictions and eventually provided a more accurate forecast.

Summary

In this article we have explored various impacts that related data can have on your forecast. A few points can be highlighted:

  • Even if your historical data has strong seasonal patterns, it may still be useful to amplify the impact of those patterns with additional related data (just like I did with holidays in this example), especially when the data contains several patterns, such as both seasonality and trend.
  • Even though we’re not in the COVID era anymore, some of your historical data may be impacted by COVID, and it may make sense to flag months / weeks impacted by COVID and lockdowns in order to indicate to the engine that certain parts of history were not the norm.
  • Different algorithms may perform differently at different time periods. Some of our customers choose a winning algorithm based on past performance across several time periods. This helps them ensure that, over time, the selection of the winning algorithm is more stable and predictable.
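The last point can be sketched as a small selection rule: average each algorithm's MAPE over several backtest periods and pick the lowest. The function name is hypothetical, and the scores below are placeholder numbers for two unnamed algorithms, not results from this article.

```python
def pick_winner(mape_by_algorithm: dict) -> str:
    """Return the algorithm with the lowest average MAPE across periods."""
    averages = {
        name: sum(scores) / len(scores)
        for name, scores in mape_by_algorithm.items()
    }
    return min(averages, key=averages.get)


# Placeholder MAPE values (percent) for three backtest periods.
scores = {
    "Algorithm A": [12.0, 30.0, 9.0],   # best in two periods, poor in one
    "Algorithm B": [15.0, 14.0, 16.0],  # never best by much, never far off
}
print(pick_winner(scores))
```

With these placeholders, Algorithm A wins two of the three periods but Algorithm B has the lower average (15.0 versus 17.0), so the rule favors the more consistent performer.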

The experimentation was based on MVLR and Anaplan Prophet specifically because we wanted to explore the impact of related data. With larger data collections, you are invited to try out DeepAR+ and CNN-QR, as well as Anaplan AutoML and Amazon Ensemble.

*Note: you must join the PlanIQ group to view the starred articles. Feel free to join!

Comments

  • Great article @EvgyK! This helps conceptualize what I believe most PlanIQ partner practitioners find a bit more abstract (related data).

    Per your note:

    "Another thing to remember is that it is important to provide related data not only for the known past but also into the future. Most of PlanIQ algorithms (except for CNN-QR) would ignore related data if it’s not provided in the future for the entire length of forecast horizon."

    Could you share the forward looking screenshot of your related data (Holidays, Covid)? Curious to see how you modeled this specifically on the Covid data since this will only exist in historical periods of your dataframe. I.e., a Covid "forecast" of 0 for 2022/2023/2024 does not mean null, correct?

  • Thanks for the comment.

    So I did actually put 0s in the COVID related data and when I'm configuring the export action I'm making sure to "include empty values" so that all the 0s are part of that data:

    This is how the overlay of historical sales data and related data looks. I used +100 to indicate COVID periods. I also used a combination of positive / negative values to amplify the impact of holidays (summer vacations, Christmas).

  • Excellent! Appreciate the response. Very creative work. Thank you for sharing.

  • Hi Evgyk,

    I am curious: is related data only applied to the period it appears in, or can the related data be considered a leading / lagging indicator?

    Example 1: I tag period 3 with a promotion. I know that periods 2 and 4 will be impacted by the promotion… ideally I would like Anaplan to look at this history and push sales out of period 2 into the promotional period and pull sales from period 4 into the promotional period, based only on tagging period 3.

    Example 2: I plan to sell 30 luxury cars in period 3 (indicated in related data) and I expect to sell attachment products 1 to 2 periods later. I would like to provide my historical data for the attachment products and include related data for actuals / forecast of the luxury car sales… again, the sales in this case would lag.

    In both examples related data is provided not in the same period as the sales occurred historically. Hopefully this makes sense.

  • MVLR and Anaplan Prophet have an ability to automatically identify leading and lagging indicators - for more information -

  • Thanks, very helpful… super cool enhancement.