RANK | Model Optimization Team series

Hello! Welcome to the Model Optimization team's monthly blog post. This is an eight-part series, following the topics we see occurring most frequently in slow customer models. Our aim is to inform and guide how to build with best practice intentions for these model building topics.

Each month, two of our analysts will take you through how to spot and improve these formula structures! We hope these can help with your model performance.

All the best - Cianna Bramley (Model Optimization Manager)

……

This month's blog is brought to you by Ara May Baral and Lip Chean Soh.

This month, we’re going to tackle one of Anaplan's single-threaded functions — the RANK function — which we also see in our Model Optimization cases. In this article, we will learn why the RANK function can be problematic when it comes to speedy calculations, how to find it, and how to improve it.

Why is this bad?

Before we jump into why the RANK function can be bad for performance, let’s look at how Anaplan calculates formulas.

Anaplan uses a Hyperblock engine, which means that we can perform calculations in parallel at block level. This is also how multithreading happens in Anaplan; the engine can get a line item and split it into data blocks which will go to different calculation threads. This split allows the calculation to be calculated faster. Think of it as doing a task with multiple people — a task will be finished faster due to people dividing the tasks. To learn more about Hyperblock, you can check out this link.

However, some functions in Anaplan cannot split calculations; these are what we call single-threaded functions. The RANK, RANKCUMULATE, ISFIRSTOCCURRENCE, and CUMULATE with three parameters are all functions which are single-threaded (but in this article, we will only dive into RANK functions). The single-threaded functions are bad for performance because they tend to calculate without any help from other threads, making the calculations much longer compared to other functions that calculate multi-threaded. Especially when the RANK function is used with large lists, this will result in more work in the thread and eventually lead to poor performance. Think of it as doing a task with only one person; the task will be finished much slower as compared to doing it with multiple people.

Another case in which the RANK function is bad is when a RANK function has a fourth parameter, which is the ‘Include value’ argument in Boolean format. When this argument is FALSE, the function will give a NaN (not a number) value. This NaN value will not cause any performance issues in the RANK function, but it will have a significant performance issue if this NaN value is referenced by other line items.

How do we find it?

  1. Manual inspection: In the Modules section of the model, there's a tab called ‘Line Items’ that shows all the line items used in the model. These line items can be extracted so that it is easier to filter the line items that use RANK function. Under Modules > Line Item, click the ‘Export’ option to extract all the line items. Click the ‘Run Export’ to extract the line items in .csv format.
    In addition, in the blueprint section of the module, the line items can be observed if the formula uses RANK function. Make sure that the Blueprint View icon is highlighted so that we can see the list of line items that uses RANK function.
  2. MAPs: This report is provided by your Business Partner, which can also identify Planual infractions that include using RANK function.
  3. Model Optimization report: In our Model Optimization reports, we analyze the Best Practices for model building as well as Planual violations, which includes RANK.

How do we improve it?

In this article, there'll be three examples that uses RANK function and how to improve it.

Example 1

Let's look at the line item highlighted in red that is ranking the Volumes:

Formula:
Example 1 – RANK = RANK(Volumes, ASCENDING, SEQUENTIAL)

The syntax of RANK is:
RANK(Source values [, Direction] [, Equal value behavior] [, Include value] [, Ranking groups])

We need to check first the dimensions used in this line item:

As much as possible, we want to calculate RANK function across the smallest number of dimensions so that we can help this single-threaded function to only calculate over smaller lists instead of larger ones. Observing the ‘Applies To’ of the ‘source value’ argument of RANK, we can see that the ‘Volume’ line item is only applied to G3 Location, and P2 Product — the Month timescale doesn’t matter in this Volume line item. Therefore, we can set the ‘Applies To’ of ‘Example 1 – RANK’ line item to G3 Location and P2 Products.

The line item becomes:

We can also apply the idea of decreasing the dimension, if possible, to other functions that are single-threaded in nature such as RANKCUMULATE, ISFIRSTOCCURRENCE, and CUMULATE with three parameters.

Example 2

Here's another example of RANK that uses ‘Include value’ and ‘Ranking group’ arguments.

Formula:

Let's try checking the dimensions used within this formula:

We need to check if it is possible to reduce the dimensions. However, in this example, we cannot reduce the dimensions further. Even though the ‘Ranking Group’ line item only uses Month timescale, we cannot reduce the dimensions for ‘Example 2 – RANK’ as other line items (Revenue, Boolean Check) use G3 Location and P2 Products together with Month timescale.

What do we need to do then?

The ‘Example 2 – RANK’ line item has a fourth parameter, which is the ‘Boolean Check’ line item. If this ‘Boolean Check’ is FALSE, the result will be NaN. To prevent that, we will create a formula that will only process the TRUE value in the fourth parameter and force the NaN value to zero, using the IF THEN scenario. With this, we are helping the RANK function to only process the needed values.

The ‘Example 2 – RANK’ line item becomes:

However, we can still look at the line items used in the ‘Example 2 – RANK’ line item:

The RANK example (highlighted in blue) may be improved further by identifying better ranking criteria. As we see in this example, the fourth parameter includes a Boolean Check (highlighted in orange) that returns TRUE if the Revenue > 300. If we use Revenue > 300, the RANK formula might return ranking of 1 to 2000, assuming there are 2000 data points that fit the Revenue > 300 criteria. Therefore, we recommend the model builders to have a discussion with their business users on what will be the practical threshold for a more sensible ranking group to be created. Do the business users need ranking from 1 to 2000? If not, what's a more practical threshold so that the RANK formula returns a ranking from 1 to perhaps 100 (Revenue > 300 might be changed to Revenue > 2000 instead). Note that the criteria can be dynamic across different dimensions too, since different business entities might have different thresholds.

Another variation of this scenario is when the fourth parameter is TRUE, same as the example below:

Since the fourth parameter is TRUE, this formula only means that it will rank all the Revenue values with the given conditions – therefore, a NaN value will not happen in this example. The best possible approach is to assess the dimensions used in the line items involved in the formula and reduce the dimensions if possible.

Example 3

In this example, the line item uses all the parameters of the RANK function, wherein the line item will do a RANK() after the first RANK check.

Formula:

Checking the dimensions:

We cannot reduce the dimensions of the RANK further due to all the dimensions (G3 Location, P2 Products, Month timescale) are necessary in the calculation.

However, we can see in the formula that the ‘RANK(Revenue, ASCENDING, SEQUENTIAL, Boolean Check, Ranking Group)’ is repeated two times. By this, we can apply the idea of Calculate Once and Reference, which is also one of the Best Practices in model building. Instead of repeating the calculation, we can create an intermediate line item to only calculate the function, and then reference it to the example.

Step 1: Create ‘Intermediate Line Item’. This line item will have a number format, and will use the G3 Location, P2 Products, Time (Month) dimensions. This line item is where the RANK formula will be calculated, such that it will only rank the Revenue line item that is included in the ‘Boolean Check’ line item:

If the fourth parameter is FALSE, we should set the value to 0. There might be a case wherein a high number, such as 99999, will be used as a replacement for the value that is not included in the rank. However, it is still recommended to use zero as a false value as this high number might be reached if there are large list combinations.

Step 2: Replace the original RANK functions to reference the intermediate line item.
The line item now becomes:

In summary

It is important to know that there are functions that calculate in single-threaded, such as RANK. Especially with these functions, we must always assess first if there is a way to reduce the dimensions used in the function so that we can also help it to only calculate the important values. The approaches in the examples above can help you in improving RANK functions in your models.

…..

Check out the other articles in the series: