OEG Best Practice: Formula structure for performance

Thinking through the results of a modeling decision is a key part of ensuring good model performance; in other words, making sure the calculation engine isn’t overtaxed. This article highlights some ideas for how to lessen the load on the calculation engine.

Formulas should be simple; a formula that is nested, or uses multiple combinations, uses valuable processing time. Writing a long, involved formula makes the engine work hard. Seconds count when the user is staring at the screen. Simple is better. Breaking up formulas and using other options helps keep processing speeds fast.

You must keep a balance when using these techniques in your models, so the guidance is as follows:

Break up the most commonly changed formula
Break up the most complex formula
Break up any formula you can’t explain the purpose of in one sentence

Formulas with many calculated components

The structure of a formula can have a significant bearing on the amount of calculation that happens when inputs in the model are changed. Consider the following example of a calculation for the Total Profit in an application. There are five elements that make up the calculation: Product Sales, Service Sales, Cost of Goods Sold (COGS), Operating Expenditure (Op EX), and Rent and Utilities. Each of the different elements is calculated in a separate module. A reporting module pulls the results together into the Total Profit line item, which is calculated using the formula shown below.

What happens when one of the components of COGS changes? Since all the source components are included in the formula, when anything within any of the components changes, this formula is recalculated. If there are a significant number of component expressions, this can put a larger overhead on the calculation engine than is necessary.

There is a simple way to structure the module to lessen the demand on the calculation engine. You can separate the input lines in the reporting module by creating a line item for each of the components and adding the Total Profit formula as a separate line item. This way, changes to the source data only cause the relevant line item to recalculate.

For example, a change in the Product Sales calculation only affects the Product Sales and the Total Profit line items in the Reporting module; Services Sales, Op EX, COGS and Rent & Utilities are unchanged. Similarly, a change in COGS only affects COGS and Total Profit in the Reporting module.

Keep the general guidelines in mind. It is not practical to have every downstream formula broken out into individual line items.

Plan to provide early exits from formulas

Conditional formulas (IF/THEN) present a challenge for the model builder in terms of what is the optimal construction for the formula, without making it overly complicated and difficult to read or understand. The basic principle is to avoid making the calculation engine do more work than necessary. Try to set up the formula to finish the calculations as soon as possible.

Always put first the condition that is most likely to occur. That way the calculation engine can quit the processing of the expression at the earliest opportunity.

Here is an example that evaluates Seasonal Marketing Promotions:

The summer promotion runs for three months and the winter promotion for two months.

There are more months when there is no promotion, so this formula is not optimal and will take longer to calculate.

This is better, as the formula will exit after the first condition more frequently.

There is an even better way to do this. Following the principles from above, add another line item for no promotion.

And then the formula can become:

This is even better because the calculation for No Promo has already been calculated, and Summer Promo occurs more frequently than Winter Promo.

It is not always clear which condition will occur more frequently than others, but here are a few more examples of how to optimize formulas:

FINDITEM formula
The Finditem element of a formula will work its way through the whole list looking for the text item, and if it does not find the referenced text, it will return blank. If the referenced text is blank, it will also return a blank. Inserting a conditional expression at the beginning of the formula keeps the calculation engine from being overtaxed.
1. IF ISNOTBLANK(TEXT) THEN FINDITEM(LIST,TEXT) ELSE BLANK
2. Or
3. IF BLANK(TEXT) THEN BLANK ELSE FINDITEM(LIST,TEXT)
Use the first expression if most of the referenced text contains data and the second expression if there are more blanks than data.

LAG, OFFSET, POST, etc.
If in some situations there is no need to lag or offset data, for example, if the lag or offset parameter is 0. The value of the calculation is the same as the period in question. Adding a conditional at the beginning of the formula will help eliminate unnecessary calculations:
1. IF lag_parameter = 0 THEN 0 ELSE LAG(Lineitem, lag_parameter, 0)
2. Or
3. IF lag_parameter <> 0 THEN LAG(Lineitem, lag_parameter, 0) ELSE 0
The use of formula a or b will depend on the most likely occurrence of 0s in the lag parameter.

Booleans
Avoid adding unnecessary clutter for line items formatted as BOOLEANS. There is no need to include the TRUE or FALSE expression, as the condition will evaluate to TRUE or FALSE.
1. Sales>0
2. Instead of
3. IF Sales > 0 then TRUE ELSE FALSE

Author David Smith.

Comments

PaulRitner

Great recommendations!

Hayk

Hi David,

Thanks for your post.

I have a question regarding FINDITEM tip. For example we know that we'll have max 5% text fields as a BLANK. With your proposal we save FINDITEM lookup time for those cases only (which is 5%) and we make engine to compute IF statements for 95% of other cells. Doesn't it make calculation slower in such cases?

Thanks,

Hayk

DavidSmith

Hi Hayk

Thanks for your question and your interest in the article.

The check for a blank is very fast, but wether this outweighs the time to traverse the list is more difficult to prove. If the list is very large (say over 1 million items), I think it would be worth the check. For smaller lists with a small percentage of blank, it is probably not needed.

As with a lot of modelling decisions, the answer is often "it depends", we tried to propose a Best Practice that would cover most situations and provide a solution or way of thinking if you noticed some performance degredation. We have seen some huge performance gains from putting in the checks.

What I would say, it is always best to check. If you remove the formula, hit enter and then paste it again and hit enter, this will force a full recalculation of the line item. So, time it with the check and then time it without.

Let me know what you find out, but we are going to run some internal tests to see if there is a threshold after which it makes a significant difference.

It is an interesting question and there are some functions that have a built in check for blanks, but FINDITEM isn't one of then; we have asked if we can include the built in check, which will negate the need for this checking.

Thank you again, because this kind of conversation prompts discussion and makes the product better.

Hayk

Hi David,

Thank you for explanation, it is always good to know some background.

I've tested what you proposed on a flat-list module with 14K cells (1400 of them are blank). Also created a module with 200mln cells based on that flat-list. And can't see the difference in both cases. Guess it should be measured with some tools.

Thanks,
Hayk

mjpearlman

These recommendations are great, but far too often formulas are not this simple to break up or reeningeer to optimize in such a clean manner. When formulas get really long and complex, it is difficult to know where and how to break it up into manageable pieces. Is it possible to get more complex examples than the ones above and provide some guidance on best practices on how deal with them? I have seen one at my clients that is nearly three pages long in a word document and trying to figure a way to break it apart makes my head spin.

DavidSmith

@mjpearlman

I agree that we often see much more complex formulae; I have spent many a day trying to understand and optimise horrible formulae. 3 pages is quite short! I've seen a 50 page formula.

There is no easy way to break it down, but my advice would be as follows:

1. Copy the formula to Word or Excel, for analysis. I hate myself for saying that, but often it is easier to split out the elements of the formula. Y

2. Look repeated expressions. From above, using FIND to highlight repeated words/expressions works well. These are candidates for splitting into separate line items. We often see the same parts of formulae repeated.

eg. IF X then a / b * c ELSE IF Y THEN a / d * c ELSE 1

Here a * c is common to both conditions and therefore can be split out and referenced. So create a new line item = a * c and the formula becomes:

"new line item" / IF X THEN b ELSE IF Y THEN d ELSE 1

or even better split that into two

create another line item = IF X THEN b ELSE IF Y THEN d ELSE 1

The original formula then is:

"new line item" / "another line item"

3. Break out conditions when there are multiple IFs. In a lot of cases, these can often be replaced by lookups (especially when IF(ITEM... is involved), but splitting out the various conditions makes it easier to de-bug and test and well as improving performance

4. Look for "patterns, variables and parameters". We often see the parts of a formula that follow similar strucutres and have numeric values as forms of parameters, e.g *3, *4, *5. These can often be split out and reconfigured to be more efficient

Also remember that if you do split out the line items, it is unlikely that you will need the summary options turned on, so the cell count of the additional line items can be minimised

I hope that helps

David

karank_*4578

@DavidSmith

with Finditem is there a difference between

IF ISNOTBLANK(TEXT) THEN FINDITEM(LIST,TEXT) ELSE BLANK

IF ISNOTBLANK(FINDITEM(LIST,TEXT)) THEN FINDITEM(LIST,TEXT) ELSE BLANK

in-terms of performance ?

DavidSmith

@karank

It is always better in terms of performance to split components out so the first one will be more efficient.

Also, you can see that the expression FINDITEM(list, text) is repeated

"calc once, reference many times"

Also you could even get rid of the ISNOTBLANK if you know there are no blanks in the data

FINDITEM(Item, Text) is fine in that case

Hope that helps

David

kumarshubham

Instead of IF else, can we use Factor or Multiplier,

For eg. We have the formula IF Weekly_Factor=1 then Weekly_Revenue else Monthly_Revenue

can we re-structure it to Weekly_Revenue*Weekly_Factor+Monthly_Revenue*Monthly_Factor

will it help in improving Performance

AnasB

Thank you @DavidSmith for this great article.

There is a small Typo in the Lag paragraph:

IF lag_parameter = 0 THEN 0 ELSE LAG(Lineitem, lag_parameter, 0)

Should be:

IF lag_parameter = 0 THEN Lineitem ELSE LAG(Lineitem, lag_parameter, 0)

And

IF lag_parameter <> 0 THEN LAG(Lineitem, lag_parameter, 0) ELSE 0

Should be:

IF lag_parameter <> 0 THEN LAG(Lineitem, lag_parameter, 0) ELSE Lineitem

saivm

Hi @DavidSmith

Regarding the formula with many calculated components, I have a question about the performance and calculation mechanisms when using formulas that reference line items either within the same module or across different modules. Specifically, I wonder if there are any key differences when summing line items in these two scenarios and if adding intermediary line items could impact performance.

Scenario 1: Cross-Module Line Item References
In this case, I’m creating a formula that sums line items from different modules. For example:

Formula A: Module X.Line Item 1 + Module Y.Line Item 2 + Module Z.Line Item 3

Scenario 2: Same-Module Line Item References
Here, I’m summing line items that are all part of the same module:

Formula B: Line Item 1 + Line Item 2 + Line Item 3 (all within the same module)

In both cases, I am summing multiple line items within the formula. However, I’m curious if there is any difference in the calculation mechanism or performance impact between referencing line items in the same module versus referencing line items from different modules especially when the dimensions are similar.

When creating intermediary line items to simplify or reuse calculations (e.g., creating a new line item that refers to an existing line item in the same or a different module), could this lead to daisy chain issues or unnecessary performance inefficiencies? For example, if I create a new line item that refers to another source line item and then reference the new line item within another formula, does this create redundant calculations or slow down performance, particularly in larger models?

rob_marshall

@saivm

I will answer for David as David left several years ago. To answer your question, there is not difference between getting calcs from within the same module or to another module. The trick though is to limit the number of calculations within the model. So, if you are bringing in calcs into a module just to be cleaner (also potentially daisy chaining), that is adding calculations the model is having to do.

jamancuso

@rob_marshall

Just to confirm, the change in the 2nd "if then" doesn't actually improve performance right? At the end of the day, the same # calcs will run for timesum(xxxx) and timesum(yyyy), regardless if summer promo occurs more. I've always assumed that was the case, and validated model calculation effort. Highest occurrence for "if then" would only matter for nested "if thens" and excluding the last "if then" in a nested situation (the bold applies to this situaton). Let me know if I'm thinking about this correctly

rob_marshall

@jamancuso

Can you help me out with which IF statement you are referring to? If it is the one with the TimeSums(), you should never use the timesum function when you have Time in the Applies To. This article was written in 2018 before we started performance testing the functions as well before The Planual came out (2019).

jamancuso

@rob_marshall Makes sense about the timesums and this article being from quite some time ago, just happened to stumble upon this when I was searching for a particular article. I was just referencing the very last enhancement suggested.

Going from:

if not summer promo and not winter promo then 0 else

if winter promo then timesum(xxxx) else timesum(yyyy)

To:

if no promo then 0 else

if summer promo then timesum(yyyy) else timesum(xxxx)

For the first if then, I get why there's a benefit making a change there. But for the second if then, the article indicates "This is even better because the calculation for No Promo has already been calculated, and Summer Promo occurs more frequently than Winter Promo." I don't think this provides a benefit because summer promo timesum and winter promo timesum calculate the same amount of times regardless of the order.

rob_marshall

@jamancuso

There is one more Summer Promo than Winter.

jamancuso

@rob_marshall Maybe I'm missing something obvious. But won't the summer promo & winter promo bools calculate 12 times (once for each month). And then timesum for summer promo will calculate 3 times and the timesum for winter promo will calculate 2 times, regardless of which one comes first. Am I missing something obvious?

rob_marshall

@jamancuso

You need those as that is transactional data for the Winter and Summer Promos, you need those by the months to know when to do the sums. What David was trying to point out is for the early exit and not doing the sums if there isn't a promo for that month (7 don't have promos while 5 months do). Since there are 7 months without a promo, then the early exit will be "hit" more vs. the 5 that have promos.