OEG Best Practice: Formula optimization in Anaplan
Taking a few simple steps can give you significant performance improvements in your Anaplan model. Anaplan was designed to be built by business users, with simple formula constructs and no coding required. Formulas should not need to be long, complex or hard to create and understand. All too often, we see formulas that run to pages and pages, with inefficient and repeated constructs.
My previous article outlines the benefits of breaking down formulas in to manageable chunks, making them easier to understand, but more importantly, allowing the Anaplan engine to work more efficiently. The article can be found in the Formula structure for performance article.
I’d like to expand on that principle here.
As you may remember, PLANS is the over-arching Anaplan modeling standard, and the applicable areas here are P–Performance and N–Necessary. Actually, a word that I like to use is ‘appropriate.’ Create formulas that only use the necessary or ‘appropriate dimensionality,' and you will achieve better performance.
When creating a formula there is a very important question that I’d like you to ask:
“What elements of this line item expression actually affect the calculation?” If any of the lists or dimensions in the line item ‘Applies to’ do not apply an element of a formula, the expression does not belong in this module. Following the DISCO methodology, this formula should be housed in a separate module.
See the Best practices for module design article.
This is probably best illustrated with an example:
Let’s assume we have a module that is dimensioned by Company, Products, and Months. To allow us to connect to another part of the model, we need to create a combined text string that is the code of the Company and the code of the Product. We could create a text formatted line item with the following formula:
This will give the following result:
However, let’s look at the components of that formula more closely.
What affects the result of this expression? By the nature of the ITEM() formula, it can only relate to the Company list and nothing else. But that expression is contained in a line item dimensioned by Company, Product, and Month.
What is happening in the background (in the Anaplan engine) is as follows:
The same result is calculated for ALL Products and ALL months even though the expression does not relate to those dimensions.
The same happens for the second expression, CODE(ITEM(Product)).
The same result is returned for ALL Companies for ALL months.
And finally, let’s see what the “_“ is doing:
The text is “calculating” for ALL Companies, for ALL Products, for ALL months.
This is a classic case of over calculating. We do not need to calculate any of these expressions by month, and each expression only needs to calculate for the applicable list, or dimension. This is what I meant by “appropriate dimensionality"; we should only calculate expressions with dimensions that are appropriate to affect the result.
There is a simple rule to apply.
If the dimensions of the line item are greater than the appropriate dimensions for the expression, then the expression should be done elsewhere.
While it is possible to remove some of the dimensionality and create a subsidiary view, it is best practice to house this in a module of its own; it is easier to audit and maintain, and following DISCO, modules should have a clear and defined purpose that can be explained in a simple sentence.
So, taking the example from above, the first CODE() expression should be done in a module that only contains Company (e.g. Company Details), and the second CODE() expression, in a module only dimensioned by Product (Product Details). Following the DISCO methodology, you should have System modules for the Company and Products lists, so this is the natural place to house the new line items.
We can then bring them together in the target module.
This will give significant savings in calculation time. However, we can make this even more efficient.
If you remember Duncan Pearson’s “Ask me Anything” presentation, he discussed the memory impact of text strings.
We cannot avoid the use of strings within Anaplan, but we can ensure that the “joins” are done in the most efficient way to minimize the impact. In the example above we have three “joins” (Customer Code, “_” and Product Code).
We noted above that the Code expression only varied by the list dimension. The Code calculations are now fine, and as optimal as can be, but the “_” which joins the two strings is effectively still being created for all dimensions; Company, Product and month.
When you need to join text together in this way, it is likely there will be a common pattern used elsewhere within your model. In our example, the Company code could be the prefix for many other text strings. So, following the rule of “calculate once, reference many times,” we should put the &”_” join in the module by Company. As a rule of thumb, make the join in the smaller of the lists. In our case, we have 1,000 Companies and 3,500 Products, so it is more efficient to add this as a suffix to Company Code.
It is likely that the Code for Company line item might be needed in isolation, so it is better to create a new line in the “Company details” and calculate the string there.
The line item formula for Code to use = Code&”_”.
The final step is to replace the formula in the ultimate target with a reference to this intermediate line item.
The following chart shows the improvement in calculation time for the various options discussed above. All of the line items are within the original target module.
- Code Inefficient is the first calculation structure.
- Code Optimized is the second example, but still with three joins.
- Code Prefix refers to creating the join in the Product Details module (“_1274”) and joining to the Company code.
- Code Suffix is joining the “840_” with the Product code (as shown above).
As you can see, the Code Suffix line item is significantly more efficient than joining the three strings together in the target module. This approach does not materially add to the cell count of a mode but as illustrated, has a hugely beneficial effect on model performance.
One final step to ensure that calculations are fully optimized is to always consider whether the summary options are applicable for the calculations. In our example, by default, Text items default to None because combinations at the parent or total levels do not make sense. The code of the Company item is not applicable to its parent. So, for Text formatted items, there is nothing more to do. However, always check if the Summary options are needed.
See the Reduce calculations for better performance article for a reminder of how to reduce calculations in your models.
In summary, thinking logically about what is needed for each part of a calculation and applying small, simple changes can lead to significant improvements in model performance.
Author David Smith.
Great Article! Aren't these results affected by CODE & _ & CODE being executed also with extra month dimension? or was it run with extra subset on the line item without months? It's hard to believe that optimized formula with CompanyCode & _ & ProductCode is running 6x longer than CompanyCode_ & ProductCode.
Does it mean that CompanyCode & _ & _ & ProductCode would run 12x longer?
It is very unclear when this inefficiency is coming from as concatenation is being done anyway and "_" is a constant, so no calculation should be required.
Some extra explanation would be appreciated, but still - this is a fantastic article - please do more of these .1
Thanks for the feedback
There were no subsets in the example. But there are two factors at play here:
1. The number of cells on which the join is being performed
2. The number of joins performed
With CompanyCode & _ & ProductCode, the "_" is being "calculated" for 86M cells (Company, Product and Month). Although there are no calculations for "_", the text string itself accounts for a lot of memory - Duncan's Ask me Anything session (referenced in the article) has the full explanation. There are also three joins being performed in the final result module
In CompanyCode_ & ProductCode, the first "join" is only running on 1015 cells (number of Companies) and there are only two joins in the final result module
Any, yes, in CompanyCode & _ & _ & ProductCode, this is very bad; it has 2x 86Ms worth of "_"s and there are four joins!
So, in summary, minimise the number of joins and do them in the smallest module possible4
Just to add some additional insight - If your target module contains time (as in our example), it is very likely that the code structures will not vary by time, so it would be better to create a module by Customers and Products and construct the join in this module. Then in the target module, refer to the new code line item from the newly created line item.4
Hi @DavidSmith - Thanks for explaining in detail 🙂 I tried this approach for few of Lineitems.But I noticed this approach will not bring down the cellcount much. Is that correct?0
It is not always about bringing the cell count down, and in this case, the cell count might go up slightly if you needed to create new line items to perform the various parts of the formula.
However, it is much more efficient for the engine, so even if the cell count is higher, the performance will be way better.
For more explanation on why it is better to split line items and model more naturally, it is worth reading the following two blogs I wrote last year
I hope that helps
Just to expand on this a little. We often we this sort of construct using time formulas in modules dimensioned by Time and many other dimensions. It might be that you are using a subsidiary view, so the cell count is already optimised. As you might already know, we don't advocate Subsidiary views, unless by exception, and in this case, why are you doing the calculation in this module?. Do you have it repeated in another module? What if you or someone else in the future wants to build the same formula?; the formula will probably then be re-entered and worst case it wouldn't be set up as a subsidiary view.
So, I would still split the formula up and move it to a module dimensioned purely by time - that is the logical place for it, and conforms to PLANS
As this is only related by time, the effect will be marginal; it will increase the cell count slightly and hardly touch performance, but it is a good practice. The more you start modelling like this, the more you start thinking about the constructs of the formulae and where the component parts reside
Now, whenever you need to refer to the resulting formula, you point to the Time module - It has done the calculation once, and will only ever calculate once!
If you haven't already, please do look at the Planual as it outlines all things best practice
Cheers and happy modelling3