The Truth About Sparsity: Part 1

DavidSmith

Throughout my five-plus years at Anaplan, the conventional wisdom has been that we should eliminate sparsity in order to model efficiently. Over the last year, as part of the PLANS standards initiative, we have been critically re-evaluating all of the existing best practices and techniques, and sparsity was included in that review. In this article (part one), I will explain what sparsity is and why you shouldn't be scared of it.

What is Sparsity?

Anaplan is a multi-dimensional planning platform, based on modules that hold data and calculations. These modules are primarily made up of lists (or dimensions) that describe the different aspects of the data. In most cases these modules contain more than one list, making them multi-dimensional.

Let’s use an example. Assume we have a module containing the following dimensions:

  • Customers
  • Products
  • Channels
  • Timescale of 2 years by month

If every customer sold every product, in every channel, every month, we would have dense data (in this scenario, 100 percent dense). However, in the real world this situation is extremely rare: it’s very likely that some products are only sold by some customers, and that not all channels or months will contain data points.

In a multi-dimensional paradigm, the proportion of zero or null cells (the “gaps”) is how we define the level of sparsity. A sparse dataset (stored in a module) is one where there are more gaps than actual data or calculation results.
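To put a number on it, here is a quick sketch in Python; the figures are purely illustrative, not from a real model:

```python
# Illustrative only: a module with 1,000,000 possible cell combinations,
# of which just 200,000 actually hold data.
populated_cells = 200_000
possible_cells = 1_000_000

sparsity = 1 - populated_cells / possible_cells
print(f"{sparsity:.0%} sparse")   # 80% sparse
```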

Why is Sparsity a problem?

So, taking our example above, let’s assume we have the following items for each dimension:

  • 100 customers
  • 1000 products
  • 10 channels
  • 26 time periods (2 * 12 months + 2 year totals)

A single line item using this combination will require 26 million cells (100 * 1000 * 10 * 26). The list below shows the amount of memory those cells require for each data type:

  • Boolean: 1 byte = 0.026Gb
  • List, Date, Time: 4 bytes = 0.104Gb
  • Numeric: 8 bytes = 0.208Gb
  • Text: 8 bytes = 0.208Gb

These values are the same regardless of how big the number is or how many characters are in the list format or text string.
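To make the arithmetic explicit, here is a small sketch in Python, purely for illustration; the byte sizes are the ones listed above:

```python
# Cell count for the example module: Customers x Products x Channels x Time
customers, products, channels, time_periods = 100, 1_000, 10, 26
cells = customers * products * channels * time_periods   # 26,000,000 cells

# Memory per cell by data type (bytes), as per the list above
bytes_per_cell = {"Boolean": 1, "List/Date/Time": 4, "Numeric": 8, "Text": 8}
for data_type, size in bytes_per_cell.items():
    gb = cells * size / 1_000_000_000   # decimal gigabytes
    print(f"{data_type}: {gb:.3f}Gb")
# Boolean: 0.026Gb, List/Date/Time: 0.104Gb, Numeric: 0.208Gb, Text: 0.208Gb
```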

The majority of the data in Anaplan modules contains numbers, so you can see why, with a lot of data, the size of a model can increase quite dramatically as we add more items to a list or extend the model timescale. This is the primary reason sparse modules are deemed a problem: they are large!

Historically, proactively modeling to manage sparsity was a key technique. As we evolved our thinking through deeper analysis and gained a better understanding of the way our customers use the Anaplan platform, we moved away from the approach of keeping model sizes as small as possible to reduce workspace usage.

In practice, that historical technique meant that instead of creating modules with a lot of sparsity, modelers created new structures by combining lists so that only the valid combinations in the dataset were represented. The resulting hierarchy has many names (flattened, numbered, concatenated, combined), but the result is the same and looks something like this:

Customer 1

  • Product 1

    • Channel 1
    • Channel 3
  • Product 2

    • Channel 5
    • Channel 2
  • Product 3

    • Channel 4
    • Channel 10

Customer 2

  • Product 4

    • Channel 7
    • Channel 9
  • Product 5

    • Channel 6
    • Channel 8

This approach results in a much smaller cell count because the resulting module now only contains two dimensions (the combined hierarchy list and time). The combined hierarchy is dense, as only valid combinations exist, and it eliminates most of the zero values (although there may still be a little sparsity in the month dimension). This is an effective technique for reducing the size of a model, but it does come with some issues, which I will touch on in part two of this article.
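To give a feel for the difference, compare the cell counts of the two approaches; the number of valid combinations below is a made-up figure, just for illustration:

```python
# Hypothetical: only 5,000 of the 1,000,000 possible customer-product-channel
# combinations (100 * 1,000 * 10) actually trade.
possible_combinations = 100 * 1_000 * 10
valid_combinations = 5_000          # assumed purely for illustration
time_periods = 26

multi_dimensional_cells = possible_combinations * time_periods   # 26,000,000
combined_hierarchy_cells = valid_combinations * time_periods     # 130,000
print(multi_dimensional_cells // combined_hierarchy_cells)       # 200x fewer cells
```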

But first, let me dispel some other myths associated with sparsity.

Myth 1 - The bigger the model the worse it will perform

This is totally untrue. Yes, the larger the model, the more data it will contain, but there is no direct correlation between size and performance. The performance of a model is based on the design of the model and its calculation structures, not simply its size. To give you some context, the worst-performing model we have seen in the field was less than 1Gb in size but took more than thirty minutes to open. We have many models that are more than 100Gb and open in under two minutes. Well-designed models perform at scale.

Myth 2 – Sparse modules are inefficient

I would partly agree with this, but only in relation to module size (see Myth 3). Sparse modules are not inefficient when it comes to calculations. Anaplan’s engine (the Hyperblock) is designed to work with multi-dimensional structures. At its heart, the Directed Acyclic Graph (D.A.G.) indexes data to calculate only what is needed when upstream data points have changed. So, once the model is open — and a full calculation has been performed — the calculations thereafter are very efficient. Anaplan doesn’t re-calculate zeros as there is nothing to do; nothing has changed.
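To make the idea concrete, here is a toy dependency-graph sketch in Python. This is not Anaplan's engine, just an illustration of recalculating only what an upstream change actually affects:

```python
# Toy illustration of dependency-graph recalculation: each node recalculates
# only when one of its upstream inputs actually changes.
class Node:
    def __init__(self, name, formula=None, inputs=()):
        self.name = name
        self.formula = formula          # function of upstream values, or None for raw data
        self.inputs = list(inputs)      # upstream Node objects
        self.value = 0
        self.dependents = []            # downstream nodes to notify on change
        for node in self.inputs:
            node.dependents.append(self)

    def set_value(self, value):
        """Change a raw data point and push the change downstream."""
        if value != self.value:         # writing the same value triggers nothing
            self.value = value
            for node in self.dependents:
                node.recalculate()

    def recalculate(self):
        new_value = self.formula(*[n.value for n in self.inputs])
        if new_value != self.value:     # an unchanged result stops the ripple early
            self.value = new_value
            for node in self.dependents:
                node.recalculate()

# Usage: revenue = price * volume
price = Node("price")
volume = Node("volume")
revenue = Node("revenue", formula=lambda p, v: p * v, inputs=(price, volume))
price.set_value(10)     # revenue is re-evaluated once
volume.set_value(0)     # value is already 0, so nothing downstream recalculates
```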

Myth 3 – Multi-dimensional modules are always bigger than those with a combined hierarchy list

This relates to the technique I referred to earlier. This is not as black and white as the previous two points, but, in certain circumstances, this premise is entirely false. Let me give you an example from the field.

One of the long-standing techniques for dealing with transactional data was to combine all of the “dimensions” into a single entry, or transaction, list. Using a unique key that includes the time period (e.g. month), the records are imported and stored in Anaplan as single rows in a one-dimensional module. The resulting module would look something like this:
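For illustration only (the keys and values below are invented):

```python
# Invented example rows: the unique key concatenates customer, product,
# channel, and month, and every row holds a real value.
transactions = {
    "C001_P0017_CH03_Jan 20": 1250,
    "C001_P0017_CH03_Feb 20": 980,
    "C002_P0450_CH07_Jan 20": 3100,
}
# There are no gaps anywhere, so the module itself is 100 percent dense; but
# note that every distinct key is also an item in the transaction list.
```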

The module and structure are 100 percent dense; job well done! Well, no.

We looked at the impact of storing data in a two-dimensional module by removing the timescale from the key and adding time as a dimension in the data module. The results were surprising!

Have a look at the size of the data:

  • 2 years of data by month
  • 3 elements to create uniqueness
  • 80 percent sparse
  • Model 1 used a transaction key that included the date
  • Model 2 was multi-dimensional with time as a dimension

You can see that not only did Model 2 open more than 90 percent faster than Model 1, but it was also more than 75 percent smaller. How can that be, given the adage that sparse multi-dimensional modules take up more space than dense modules? Well, the reason is that list items themselves also make up part of the model size. I didn’t mention it earlier when discussing the cell count calculation because lists don’t add to the cell count itself, but each list item uses 500 bytes, so the bigger the list, the more space used.

Look at the size of the unique transaction list in each of the two models. Removing the date from the key resulted in the creation of only 300K unique transactions rather than 7.3M. In turn, this results in a much smaller Transaction Details module. The sparse module calculates more efficiently than the large dense module, and, combined with the smaller list, we see improved model-opening time.
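To put rough numbers on the list storage alone, using the 500-byte figure and the list sizes quoted above:

```python
# Approximate space taken by the unique transaction list itself,
# at roughly 500 bytes per list item.
bytes_per_list_item = 500
model_1_items = 7_300_000   # Model 1: the key includes the month
model_2_items = 300_000     # Model 2: the key excludes the month; time is a dimension

for items in (model_1_items, model_2_items):
    print(f"{items * bytes_per_list_item / 1_000_000_000:.2f}Gb")
# Model 1: ~3.65Gb of list items; Model 2: ~0.15Gb, before any cells are counted
```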

Join me for part two, where I will discuss the pros and cons of multi-dimensional structures versus combined hierarchies.


Comments

  • @DavidSmith Thank you! This upends a lot of our original notions of best planning practices. One other potential myth I've heard is that a hyphen format for zero-value lines uses slightly less memory than a zero. Is that true?

  • Thanks for the feedback @Darin_Seibel. I would wait until part two for the full picture.

    Regarding your question, I doubt it. The - is just a display format, so the number is still the number!

    I will check and let you know.

    David

  • @DavidSmith Great article! I appreciate your approach with examples and comparison metrics.
    Transactional load vs. dimensional is always a big question for many Anaplanners (even experienced ones), as you could implement it either way.
    I see a lot of experienced people prefer a pure transactional approach (using Anaplan as a database) - but that is not what we should do.
    Let's prepare some more articles or study courses that could help experienced people design more optimal solutions.

  • @nikolay_denisov

    Thank you!

    As we revise our Platform learning, we are incorporating different techniques, so we will ensure we cover this and similar topics as we finalise the content.

  • When data/properties are loaded as list properties, are the sizes you listed above (byte count based on format) still the same or are they different?

  • List Properties are sized the same as line items, so there is no difference.

  • @DavidSmith - in your summary table, you did note that import time is slower with the dimensioned module versus the flat table. I take this to mean that we still believe importing using "2D" tables is faster, correct? Do you know if there is any import timing difference when importing to text-formatted line items versus list-formatted line items?

  • Yes, whilst the import itself is quicker, you have to bear in mind the size impact and the corresponding time to open the model. So the summary from that table is that it is better to sacrifice a little on import time but gain a lot on model size and opening time.

    I recently did a test loading different formats and there was very little difference. Formatted lists and text took slightly longer than numbers, dates, and Booleans, but not significantly.

    What we do know is that loading from a module to a module is much faster than loading from a text file to a module (even after the file has been loaded to the cloud). The difference is all in the read (i.e. the collection of data). In a test of 12 months' worth of data for 700,000 rows, the text file took 36s to import, whereas the equivalent dataset from a module took 5s.

    Stay tuned for part two shortly!!

  • @DavidSmith Thanks a lot for this article. It is extremely clear and it helps a lot to review the way we design. Actually, I always think about sparsity once I see the size of my model growing, but I need to review my methodology entirely!

  • @DavidSmith the result with flat data with the time dimension applied is surprising and counter-intuitive. Thanks for the article, I will keep it in mind when building.

    PS - I think there's a typo in the LI size calculation. 26 mln cells of numeric data would be ~0.194Gb, and List, Date, Time should be around 0.097Gb.

  • @Hayk

    Thank you; I've corrected the Gb figures, although I have slightly different numbers (they are approximations anyway!)

    But, to your point, it was a surprise initially. The key area that is overlooked is the 500 bytes that each list item takes up. That accounts for a lot of the model size (but not the cell count)!

    David

  • Thanks again for sharing! Thanks @AndrewN for sharing this article!!

  • @DavidSmith great guidance for a beginner like me! Thanks.

  • @Khanna

    Thank you and I'm glad it helped.

    The most important thing you can do when starting out with Anaplan is to think differently. Anaplan is not like other tools or spreadsheets. It has elements of those, but you shouldn't build Anaplan that way. Leave behind the techniques you used in other products and try to understand multi-dimensionality.

    Finally, think about "what am I actually trying to solve", not "what did I do before", and that will help you.

    Oh, and read the Planual!

    https://community.anaplan.com/t5/Planual/ct-p/Planual

    David

  • @DavidSmith thanks for the feedback, I will follow the Planual!

  • @DavidSmith - Great article. The example really helps to understand the point here. Thanks!

  • @DavidSmith Very useful! Thank you for sharing.

  • Great stuff here. I think the list member size allotment is extremely overlooked. It's something I was unaware of. Thanks for sharing @DavidSmith