OEG Best Practice: Formula structure for performance
Thinking through the results of a modeling decision is a key part of ensuring good model performance; in other words, making sure the calculation engine isn’t overtaxed. This article highlights some ideas for how to lessen the load on the calculation engine.
Formulas should be simple; a formula that is nested, or uses multiple combinations, uses valuable processing time. Writing a long, involved formula makes the engine work hard. Seconds count when the user is staring at the screen. Simple is better. Breaking up formulas and using other options helps keep processing speeds fast.
You must keep a balance when using these techniques in your models, so the guidance is as follows:
- Break up the most commonly changed formula
- Break up the most complex formula
- Break up any formula you can’t explain the purpose of in one sentence
Formulas with many calculated components
The structure of a formula can have a significant bearing on the amount of calculation that happens when inputs in the model are changed. Consider the following example of a calculation for the Total Profit in an application. There are five elements that make up the calculation: Product Sales, Service Sales, Cost of Goods Sold (COGS), Operating Expenditure (Op EX), and Rent and Utilities. Each of the different elements is calculated in a separate module. A reporting module pulls the results together into the Total Profit line item, which is calculated using the formula shown below.
What happens when one of the components of COGS changes? Since all the source components are included in the formula, when anything within any of the components changes, this formula is recalculated. If there are a significant number of component expressions, this can put a larger overhead on the calculation engine than is necessary.
There is a simple way to structure the module to lessen the demand on the calculation engine. You can separate the input lines in the reporting module by creating a line item for each of the components and adding the Total Profit formula as a separate line item. This way, changes to the source data only cause the relevant line item to recalculate.
For example, a change in the Product Sales calculation only affects the Product Sales and the Total Profit line items in the Reporting module; Services Sales, Op EX, COGS and Rent & Utilities are unchanged. Similarly, a change in COGS only affects COGS and Total Profit in the Reporting module.
Keep the general guidelines in mind. It is not practical to have every downstream formula broken out into individual line items.
Plan to provide early exits from formulas
Conditional formulas (IF/THEN) present a challenge for the model builder in terms of what is the optimal construction for the formula, without making it overly complicated and difficult to read or understand. The basic principle is to avoid making the calculation engine do more work than necessary. Try to set up the formula to finish the calculations as soon as possible.
Always put first the condition that is most likely to occur. That way the calculation engine can quit the processing of the expression at the earliest opportunity.
Here is an example that evaluates Seasonal Marketing Promotions:
The summer promotion runs for three months and the winter promotion for two months.
There are more months when there is no promotion, so this formula is not optimal and will take longer to calculate.
This is better, as the formula will exit after the first condition more frequently.
There is an even better way to do this. Following the principles from above, add another line item for no promotion.
And then the formula can become:
This is even better because the calculation for No Promo has already been calculated, and Summer Promo occurs more frequently than Winter Promo.
It is not always clear which condition will occur more frequently than others, but here are a few more examples of how to optimize formulas:
FINDITEM formula
The Finditem element of a formula will work its way through the whole list looking for the text item, and if it does not find the referenced text, it will return blank. If the referenced text is blank, it will also return a blank. Inserting a conditional expression at the beginning of the formula keeps the calculation engine from being overtaxed.- IF ISNOTBLANK(TEXT) THEN FINDITEM(LIST,TEXT) ELSE BLANK
Or
- IF BLANK(TEXT) THEN BLANK ELSE FINDITEM(LIST,TEXT)
Use the first expression if most of the referenced text contains data and the second expression if there are more blanks than data.
LAG, OFFSET, POST, etc.
If in some situations there is no need to lag or offset data, for example, if the lag or offset parameter is 0. The value of the calculation is the same as the period in question. Adding a conditional at the beginning of the formula will help eliminate unnecessary calculations:- IF lag_parameter = 0 THEN 0 ELSE LAG(Lineitem, lag_parameter, 0)
Or
- IF lag_parameter <> 0 THEN LAG(Lineitem, lag_parameter, 0) ELSE 0
The use of formula a or b will depend on the most likely occurrence of 0s in the lag parameter.
Booleans
Avoid adding unnecessary clutter for line items formatted as BOOLEANS. There is no need to include the TRUE or FALSE expression, as the condition will evaluate to TRUE or FALSE.- Sales>0
Instead of
- IF Sales > 0 then TRUE ELSE FALSE
Author David Smith.
Comments
-
Great recommendations!
2 -
Hi David,
Thanks for your post.
I have a question regarding FINDITEM tip. For example we know that we'll have max 5% text fields as a BLANK. With your proposal we save FINDITEM lookup time for those cases only (which is 5%) and we make engine to compute IF statements for 95% of other cells. Doesn't it make calculation slower in such cases?
Thanks,
Hayk
2 -
Hi Hayk
Thanks for your question and your interest in the article.
The check for a blank is very fast, but wether this outweighs the time to traverse the list is more difficult to prove. If the list is very large (say over 1 million items), I think it would be worth the check. For smaller lists with a small percentage of blank, it is probably not needed.
As with a lot of modelling decisions, the answer is often "it depends", we tried to propose a Best Practice that would cover most situations and provide a solution or way of thinking if you noticed some performance degredation. We have seen some huge performance gains from putting in the checks.
What I would say, it is always best to check. If you remove the formula, hit enter and then paste it again and hit enter, this will force a full recalculation of the line item. So, time it with the check and then time it without.
Let me know what you find out, but we are going to run some internal tests to see if there is a threshold after which it makes a significant difference.
It is an interesting question and there are some functions that have a built in check for blanks, but FINDITEM isn't one of then; we have asked if we can include the built in check, which will negate the need for this checking.
Thank you again, because this kind of conversation prompts discussion and makes the product better.
4 -
Hi David,
Thank you for explanation, it is always good to know some background.
I've tested what you proposed on a flat-list module with 14K cells (1400 of them are blank). Also created a module with 200mln cells based on that flat-list. And can't see the difference in both cases. Guess it should be measured with some tools.
Thanks,
Hayk1 -
These recommendations are great, but far too often formulas are not this simple to break up or reeningeer to optimize in such a clean manner. When formulas get really long and complex, it is difficult to know where and how to break it up into manageable pieces. Is it possible to get more complex examples than the ones above and provide some guidance on best practices on how deal with them? I have seen one at my clients that is nearly three pages long in a word document and trying to figure a way to break it apart makes my head spin.
0 -
I agree that we often see much more complex formulae; I have spent many a day trying to understand and optimise horrible formulae. 3 pages is quite short! I've seen a 50 page formula.
There is no easy way to break it down, but my advice would be as follows:
1. Copy the formula to Word or Excel, for analysis. I hate myself for saying that, but often it is easier to split out the elements of the formula. Y
2. Look repeated expressions. From above, using FIND to highlight repeated words/expressions works well. These are candidates for splitting into separate line items. We often see the same parts of formulae repeated.
eg. IF X then a / b * c ELSE IF Y THEN a / d * c ELSE 1
Here a * c is common to both conditions and therefore can be split out and referenced. So create a new line item = a * c and the formula becomes:
"new line item" / IF X THEN b ELSE IF Y THEN d ELSE 1
or even better split that into two
create another line item = IF X THEN b ELSE IF Y THEN d ELSE 1
The original formula then is:
"new line item" / "another line item"
3. Break out conditions when there are multiple IFs. In a lot of cases, these can often be replaced by lookups (especially when IF(ITEM... is involved), but splitting out the various conditions makes it easier to de-bug and test and well as improving performance
4. Look for "patterns, variables and parameters". We often see the parts of a formula that follow similar strucutres and have numeric values as forms of parameters, e.g *3, *4, *5. These can often be split out and reconfigured to be more efficient
Also remember that if you do split out the line items, it is unlikely that you will need the summary options turned on, so the cell count of the additional line items can be minimised
I hope that helps
David
8 -
with Finditem is there a difference between
IF ISNOTBLANK(TEXT) THEN FINDITEM(LIST,TEXT) ELSE BLANK
VS
IF ISNOTBLANK(FINDITEM(LIST,TEXT)) THEN FINDITEM(LIST,TEXT) ELSE BLANK
in-terms of performance ?
0 -
It is always better in terms of performance to split components out so the first one will be more efficient.
Also, you can see that the expression FINDITEM(list, text) is repeated
"calc once, reference many times"
Also you could even get rid of the ISNOTBLANK if you know there are no blanks in the data
FINDITEM(Item, Text) is fine in that case
Hope that helps
David
1 -
Instead of IF else, can we use Factor or Multiplier,
For eg. We have the formula IF Weekly_Factor=1 then Weekly_Revenue else Monthly_Revenue
can we re-structure it to Weekly_Revenue*Weekly_Factor+Monthly_Revenue*Monthly_Factor
will it help in improving Performance
0 -
Thank you @DavidSmith for this great article.
There is a small Typo in the Lag paragraph:
IF lag_parameter = 0 THEN 0 ELSE LAG(Lineitem, lag_parameter, 0)
Should be:
IF lag_parameter = 0 THEN
Lineitem
ELSE LAG(Lineitem, lag_parameter, 0)
And
IF lag_parameter <> 0 THEN LAG(Lineitem, lag_parameter, 0) ELSE 0
Should be:
IF lag_parameter <> 0 THEN LAG(Lineitem, lag_parameter, 0) ELSE
Lineitem
1