Planual Explained - Day 3

"Rule 1.06-02": Article 1, Chapter 6, Rule 2: "Don't use subsets on large lists." If a subset would contain more than 75% of a list, it is better to create a separate list instead. Creating subsets on large lists goes against the "Performance" pillar of PLANS.

 

Here is how it was done in the pre-Planual era: without checking the size of a list, we used to create subsets, thinking it saved space and helped with model optimization. Little did we know that such a large subset could cause a performance hit while delivering no space savings at all. For example, on List A with 10,000,000 transactions, a subset with 75% occupancy would be created in the belief that it saved the space of the other 25% of list items.

 

What is wrong with this method? First, we need to understand what subsets really are. Subsets are essentially lists within lists. Subset items consume as much space as list items do (500 bytes per item), even if the list or subset is not used as a dimension in any module. When a large list with a top level contains a subset and both are used in modules, performance suffers: the system has to aggregate the data not only for the list but also for the subset, and re-aggregate in every module where that list and subset are used as dimensions. Performance also takes a hit when you add or remove subset items on such lists.

There is also a myth that ALL subsets help with space optimization. That is not true. Here is my analysis.

 

A list with 10,000,000 items consumes 5,000,000,000 bytes of space, roughly 4.7GB. If we add a subset with 75% occupancy of the original list, the subset will contain at least 7,500,000 items and consume an additional 3,750,000,000 bytes, roughly 3.5GB. A list that originally consumed 4.7GB now consumes about 8.2GB (4.7GB from the original list plus 3.5GB from the subset). Model builders have to make a judicious call on whether that subset can save 3.5GB over the course of model building, which in turn depends on how many times the subset is referenced and at how many intersections. Let's see what happens when this list and/or subset is used as a dimension in a module.
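The arithmetic above can be sketched in a few lines of Python (a back-of-the-envelope check, assuming the post's figure of ~500 bytes per list item):

```python
# Space cost of a large list and a 75%-occupancy subset,
# assuming ~500 bytes per list/subset item (figure from the post).
BYTES_PER_LIST_ITEM = 500

list_items = 10_000_000
subset_items = int(list_items * 0.75)              # 7,500,000

list_bytes = list_items * BYTES_PER_LIST_ITEM      # 5,000,000,000
subset_bytes = subset_items * BYTES_PER_LIST_ITEM  # 3,750,000,000

def gb(b):
    return b / 1024**3

print(f"List alone:   {gb(list_bytes):.1f} GB")    # ~4.7 GB
print(f"Subset alone: {gb(subset_bytes):.1f} GB")  # ~3.5 GB
print(f"Combined:     {gb(list_bytes + subset_bytes):.1f} GB")  # ~8.1 GB
```

Note the combined figure is ~8.1GB when computed exactly; the 8.2GB in the text comes from adding the rounded component figures (4.7 + 3.5).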

 

 

Line Item      Format        Bytes/Cell   If List Used     If Subset Used   Diff (In MB)
Line Item 1    Number        8 Bytes        80,000,000       60,000,000     20
Line Item 2    Number        8 Bytes        80,000,000       60,000,000     20
Line Item 3    Time Period   4 Bytes        40,000,000       30,000,000     10
Line Item 4    Time Period   4 Bytes        40,000,000       30,000,000     10
Line Item 5    List          4 Bytes        40,000,000       30,000,000     10
Total                                      280,000,000      210,000,000     70

  

Note: Based on a simple module with a single dimension. "If List Used" and "If Subset Used" are in bytes.

 

As you can see, using the subset in a module saved 70MB of space across 5 line items. The subset has to save 3.5GB of space to break even, which in turn depends on how many line items and modules are dimensionalized by it.
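Using the figures above, the break-even point is easy to estimate (a sketch using the table's totals and the post's convention of 10^6 bytes per MB):

```python
# Break-even: how many single-dimension modules like the one in the
# table must use the subset before its own storage cost is repaid?
subset_cost_bytes = 7_500_000 * 500        # 3,750,000,000 (~3.5 GB)
saving_per_module_bytes = 70_000_000       # 70 MB saved per module (table total)

modules_to_break_even = subset_cost_bytes / saving_per_module_bytes
print(round(modules_to_break_even))        # ~54 such modules
```

In other words, this subset would need to replace the full list in roughly 54 comparable five-line-item modules just to pay for itself.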

 

Here is how it should be done the Planual way: for large lists, create a separate list altogether instead of a subset.

 

This has several benefits:

  1. The system will not have to aggregate data for both the list and the subset across modules.
  2. Only one list is impacted on import.

 

 


Answers

  • Kudos @Misbah 
    I would have loved to see a list of Cons to creating a new list instead of using a subset as well.

    I clearly see the argument of size/performance (great job); however, as I mentioned before, with the three competing forces (performance, usability, and sustainability) you often have to make a judgment call.
    Having a subset as a separate list will require additional maintenance to make sure that any changes done to the original list are reflected in the new would-have-been subset.

     

    My question is: at what point (of list size, for example) would the performance hit be large enough that the benefits of creating a new list outweigh the benefits of maintaining a single list?
    This way modelers would make an informed decision when deciding to create a new list instead of using a subset.

  • @einas.ibrahim 

     

    Great, great question. I agree with you that Anaplan has to quantify the word "Large" in the rule, but I don't think it is as easy as it looks, because the more times a large subset is referenced, the higher the performance impact. They might have to set aside the reference count and come up with a number based on size alone.

     

    I personally am not a big fan of subsets. On your point about maintenance: it is a question of 1 import in a process vs. 2 imports in a process (or 2 processes run sequentially). As long as it is automated, I don't see any issue here.

     

    I don't see any cons to creating a list apart from having two lists to maintain. If you have any, please feel free to add them.

  • @einas.ibrahim ,

     

    Agree, great question.  Not only should the size of the subset (as well as the list) be taken into consideration, but also how often it will be used (many times vs. only a few).  Again, this is not a hard and fast rule; you will have to take the entire process into consideration.  For smaller lists, like indicators or "Fake Versions", having multiple subsets should not be an issue.

     

    So, how this rule came about...We had a client who was importing 500k transactional members a month into a list of 10M members.  They needed to do logic on only the newly added members, so they created a subset.  Well, this subset contributed to performance issues because the process contained the following actions:

    • Load to list and add the subset (one action) - this created multiple aggregations of "both" lists (the main list as well as the subset).  In actuality, I believe the aggregation on a list is one of the slower aggregation methods in Anaplan.
    • Logic was done in a module on the subset
    • Subset was removed - again, this kicked off aggregations that were not needed.

    To get around this, we did a couple of things:

    • removed the top level of the transactional list as summations were not needed, so this actually removed the automatic aggregations
    • used the same boolean from the source and instead of turning on a subset, we used a properties module and turned on a boolean line item.

    In using the boolean line item, we were able to do the same logic without the performance issues the subset presented us, while saving additional space since the subset was no longer being created (500k * 500 bytes).

     

    Hope this helps,

    Rob