Performance Tuning - Using Subsets
Continuing on from an earlier post in the pursuit of model optimisation and performance gains (see here), I was hoping to get some clarification on another set of queries, relating to subsets.
I've seen some collateral insinuating that numerous subsets and/or subsets with most of the list flagged as part of the subset can be detrimental to performance; namely the following resources,
- Model Load, Save, and Rollback
What wasn't quite clear were the reasons why they lead to a decrease in performance in certain circumstances?
- we have a larger model (~77GB)
- most of our numbered lists (which typically have ~20k objects) have a 'core' subset which contains 98% of list itself with,
- several other subsets that are sporadically flagged depending on what type of classification they have.
In the below example, our #Lease list uses 'Subset 1' as the 'core' list with other subsets being used as additional classifications for specific use cases (and to also reduce sparsity).
The justification for using a 'core' subset as the main list is that we may need to de-subset certain leases/reduce sparsity at a point in time, and having the #Lease list being used in the 'Applies to..' wouldn't facilitate this.
Hopefully my explanation makes sense, but in short, I'm looking to understand whether the above scenario and implementation can have a negative impact on model performance.
To answer your questions, the answer is, sadly, it depends. With that said, sparsity is not a necessarily a bad thing. Take a look at these two articles written by David Smith which dispels the sparsity myths (The Truth About Sparsity). You are correct, many people use subsets to decrease the overall module/line item cell count, but there could be diminishing returns if you have to do those same or similar calculations on numerous subsets. Remember the C in DISCO (calculation modules). It is best to use more space for calculation modules (where the logic is clear/understandable/maintainable) where the calcs are done one time versus multiple times all over the place. Simple is often better. Also, with the use of SYS modules (and booleans), you can create logic with early exits which will help the models overall performance. On to your specific questions:
- Correct, utilizing a SYS module can help with this where a subset is not really needed.
- Correct, it depends but really you need to look at what the subsets are really used for? If for calculation purposes, then examine the logic and see if it is repeated for other Leases or other subsets. If the subset is used dashboards and helping users pick the appropriate member, that is a different story.
- Evaluating if the cell count is 0. Short answer is no. What people often forget and what I believe you are alluding to, is when you create a line item and put logic in it, that formula is not kicked off one time, but is really kicked of the number of cells of the list item. For example, if I have a line item with a formula parent(item(Product)) and that line item contains 1,000,000 cells, that means parent(item(Product)) is getting kicked off 1,000,000 times. There is no need for that which is where the SYS modules come into play. It is much better to have a SYS module with the parent already defined in a "flat" module (only dimensionalized by Product) and just reference that line item. It is much better to reference a line item 1 million times than it is to calculate it 1 million times.
- it depends on the list size. For this example, let's use the formula FindItem(). If the list is 100,000 members and the cell count of line item is 1,000,000, the formula is reading a list of 100,000 members 1 million times. So again, this goes back to #3.
- Early exists are your friend here as well as booleans. Again, subsidiary views aren't bad from a performance standpoint, but they are bad in that the logic is now hidden away in a module. For example, let's say Johnny and Mary are both model builders and are working on similar, if not the same logic. If Johnny puts the logic in subsidiary view, Mary will not know where to look, so she will likely recreate it. So now, the model is doubling up the logic which takes time. On the other hand, if Johnny had created a module for the calculation, then Mary would be able to able to see it, add to it if her logic was slightly different, and just reference it. Again, this not only speaks to performance of the model, but it cleans up the model so it is easier to maintain.
Hopefully I have answered your questions and this helps.
Great question! Essentially, the system sees the subset as a different list. So, in your example, you have one list with 3 subsets, but the system sees it as 4 separate lists. This is why subsets need to be carefully considered, especially on very large lists. When a new member is added (and there is a parent on the list), all data will get re-aggregated 4 times (once for the main list as well as the three subsets) in every module using the lists. So, if you remove the parent of the list (say for a transactional list or a currency list), no aggregation is needed because it is essentially flat.
Hope this helps,
Thanks for the response, @rob_marshall - interesting; I guess given the function of using subsets, multiple aggregation paths makes sense.
I imagine any data change (not just list object adds) is also propagated in the same way, if an object is utilised across multiple subsets. It also leads to the somewhat complex question of sparsity vs subsets, as we rely on subsets to reduce cell count.
With that said,
1. I assume that subsets which are small, or even single object subsets are more suited to being calculated via a line item using boolean checks for applicability and SUM in a separate module (which effectively bypasses the natural time/list aggregation)?
2. Similar to above and likely a 'it depends' question, but would models be better off with less subsets and more sparsity (with more intelligent boolean / early exists) to reduce calculation overhead?
3. Does the engine still evaluate line items where the cell count is nil / 0? Context for this query is that we have multiple use cases for our main model and when these use cases are not active, we disable calculations by clearing the applicable subset (but retain all line item calculations).
4. When using a list as a line 'Format' (not the Applies To), is there any discernible difference in calculation overhead between using the full list vs a subset; let's assume that the line is calculated and not used as a user-selectable pick-list?
5. Whilst I realise subsidiary views are frowned upon and best practice is to keep dimensions common, we typically use a subset of common dimensions within modules to reduce unnecessary cell count and calculation overhead, e.g. If the module is #Lease, we'll also include several lines as #Lease: Subset 2 because not all leases are relevant for all calculations. Is this likely to have the inverse effect of diminishing performance as the model is re-aggregating across the multiple subsets?
Appreciate there's a few questions in there; trying to dispel as much myth as we can before we invest the time into making changes that may not help (or may even be detrimental).
Amazing - really appreciate the detailed response @rob_marshall !
Based on your response and in the context of our model architecture, (1) and (2) are definitely the most onerous of the lot. It's also good to know in the instance of (3) that the evaluation is determined by the applicable cells/objects.
The linked materials also appear to be directly related to our scenario so I'll give them a suss as well.
And appreciate that for many of these scenarios, it'll depend; I imagine it'd rather subjective at times as well depending on the builder, but your input has definitely helped us in honing in closer to the crux of our performance woes.