The Truth About Sparsity: Part 1
Throughout my five-plus years at Anaplan, the conventional wisdom has been that we should eliminate sparsity in order to model efficiently. Over the last year, as part of the PLANS standards initiative, we have been critically re-evaluating all of the existing best practices and techniques, and sparsity was part of that review. In this article (part one), I will explain what sparsity is and why you shouldn’t be scared of it.
What is Sparsity?
Anaplan is a multi-dimensional planning platform, based on modules that hold data and calculations. These modules are primarily made up of lists (or dimensions) that describe the different aspects of the data. In most cases these modules contain more than one list, making them multidimensional.
Let’s use the following as an example. Assume we have a module containing the following dimensions:
- Customers
- Products
- Channels
- A monthly timescale covering 2 years
If every customer sold every product, in every channel, every month, we would have dense data: in this scenario, 100 percent dense. However, in the real world this situation is extremely rare. It’s very likely that some products are only sold by some customers, and that not all channels or months will contain data points.
In a multi-dimensional paradigm, the proportion of zero or null cells (the “gaps”) defines the level of sparsity. A sparse dataset (stored in a module) is one where there are more gaps than entered data or calculated results.
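To put a number on it, here is a minimal sketch (the figures are hypothetical, purely for illustration): sparsity is simply the share of the full dimensional cross-product that holds no data.

```python
# Hypothetical figures: a module whose dimensions multiply out to 50,000 cells,
# of which only 6,000 actually contain data or calculated results.
total_cells = 50_000
populated_cells = 6_000

sparsity = 1 - populated_cells / total_cells
print(f"Sparsity: {sparsity:.0%}")   # -> Sparsity: 88%
```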
Why is Sparsity Deemed a Problem?
So, taking our example above, let’s assume we have the following items for each dimension:
- 100 customers
- 1000 products
- 10 channels
- 26 time periods (2 * 12 months + 2 year totals)
A single line item using this combination will require 26 million cells (100 * 1000 * 10 * 26). The following shows the amount of memory those 26 million cells require for each data type:
- Boolean: 1 byte = 0.026 GB
- List, Date, Time: 4 bytes = 0.104 GB
- Numeric: 8 bytes = 0.208 GB
- Text: 8 bytes = 0.208 GB
These values are the same regardless of how big the number is or how many characters are in the list format or text string.
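If you want to check the arithmetic, here is a minimal sketch of the calculation above, using the byte sizes quoted in the list:

```python
# Cell count for one line item, and the memory that count implies per data type.
cells = 100 * 1_000 * 10 * 26            # customers * products * channels * time periods
bytes_per_cell = {"Boolean": 1, "List/Date/Time": 4, "Numeric": 8, "Text": 8}

for data_type, size in bytes_per_cell.items():
    print(f"{data_type}: {cells * size / 1_000_000_000:.3f} GB")
# -> Boolean: 0.026 GB, List/Date/Time: 0.104 GB, Numeric: 0.208 GB, Text: 0.208 GB
```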
The majority of the data in Anaplan modules is numeric, so you can see why, with a lot of data, the size of a model can increase quite dramatically as we add more items to a list or extend the model timescale. This is the primary reason sparse modules are deemed a problem: they are large!
Historically, proactively modeling for sparsity management was a key technique. As we evolved our thinking through deeper analysis and gained a better understanding of the way our customers used the Anaplan platform, we moved away from the approach of keeping model sizes as small as possible to reduce workspace usage.
In practice, this meant that instead of creating modules with a lot of sparsity, modelers created new structures by combining lists for only the valid combinations that existed in the dataset. The resulting hierarchy has many names—flattened, numbered, concatenated, and combined—but the result is the same and looks something like this:
Customer 1
- Product 1
  - Channel 1
  - Channel 3
- Product 2
  - Channel 5
  - Channel 2
- Product 3
  - Channel 4
  - Channel 10
Customer 2
- Product 4
  - Channel 7
  - Channel 9
- Product 5
  - Channel 6
  - Channel 8
This approach results in a much smaller cell count because the resulting module now only contains two dimensions (the combined hierarchy list and time). This combined hierarchy is dense as only valid combinations exist and it eliminates most of the zero values (although there may still be a little sparsity in the month dimension). This is an effective technique for reducing the size of a model, but it does come with some issues, which I will touch on in part two of this article.
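To make the size difference concrete, here is a minimal sketch; the number of valid combinations is an assumption for illustration, not a figure from the example above.

```python
# Full cross-product versus a combined hierarchy holding only valid combinations.
customers, products, channels, periods = 100, 1_000, 10, 26
valid_combinations = 40_000                               # assumed count of real combinations

full_cells = customers * products * channels * periods    # 26,000,000
combined_cells = valid_combinations * periods              # 1,040,000
print(f"Cell count reduction: {1 - combined_cells / full_cells:.0%}")   # -> 96%
```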
But first, let me dispel some other myths associated with sparsity.
Myth 1 – The bigger the model, the worse it will perform
This is totally untrue. Yes, the larger the model, the more data it will contain, but there is no direct correlation between size and performance. The performance of a model is based on the design of the model and its calculation structures, not simply its size. To give you some context, the worst-performing model we have seen in the field was less than 1 GB in size but took more than thirty minutes to open. We have many models that are more than 100 GB and open in under two minutes. Well-designed models perform at scale.
Myth 2 – Sparse modules are inefficient
I would partly agree with this, but only in relation to module size (see Myth 3). Sparse modules are not inefficient when it comes to calculations. Anaplan’s engine (the Hyperblock) is designed to work with multi-dimensional structures. At its heart, a Directed Acyclic Graph (DAG) indexes the data so that only what is needed is recalculated when upstream data points change. So, once the model is open and a full calculation has been performed, the calculations thereafter are very efficient. Anaplan doesn’t re-calculate zeros; there is nothing to do because nothing has changed.
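As a toy illustration of the idea (this is a sketch of dependency-driven recalculation in general, not the Hyperblock itself):

```python
# Each item lists the items it depends on; editing one value only forces the
# items downstream of it to recalculate, and everything else is left untouched.
depends_on = {
    "Revenue": ["Price", "Volume"],
    "Margin":  ["Revenue", "Cost"],
}

def needs_recalc(changed, graph):
    """Return every item that must recalculate after `changed` is edited."""
    hit = set()
    for item, inputs in graph.items():
        if changed in inputs:
            hit |= {item} | needs_recalc(item, graph)
    return hit

print(needs_recalc("Volume", depends_on))   # {'Revenue', 'Margin'}; Price and Cost untouched
```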
Myth 3 – Multi-dimensional modules are always bigger than those with a combined hierarchy list
This relates to the technique I referred to earlier. This is not as black and white as the previous two points, but, in certain circumstances, this premise is entirely false. Let me give you an example from the field.
One of the long-standing techniques for dealing with transactional data was to combine all “dimensions” into a single entry (or transaction) list. Using a unique key that includes the time period (e.g. month), the records are imported and stored in Anaplan as single rows in a one-dimensional module. The resulting module would look something like this:
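(A purely illustrative sketch; the key format and values below are assumed.)

```python
# Each imported record becomes one item in a transaction list, keyed on every
# dimension including the month, with the measure stored against that single key.
transactions = {
    "C001|P0456|Online|Jan 24": 1_250,
    "C001|P0456|Online|Feb 24": 1_900,
    "C002|P0013|Retail|Jan 24":   400,
}
# A one-dimensional module: one row per key, and no empty cells anywhere.
```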
The module and structure are 100 percent dense; job well done! Well, no.
We looked at the impact of storing data in a two-dimensional module by removing the timescale from the key and adding time as a dimension in the data module. The results were surprising!
Have a look at the size of the data:
- 2 years of data by month
- 3 elements to create uniqueness
- 80 percent sparse
- Model 1 used a transaction key that included the date
- Model 2 was multi-dimensional with time as a dimension
You can see that not only did Model 2 open more than 90 percent faster than Model 1, but it was also more than 75 percent smaller. How can that be, given the adage that sparse multi-dimensional modules take up more space than dense modules? Well, the reason is that list items themselves also make up part of the model size. I didn’t mention it earlier when discussing the cell count calculation because lists don’t add to the cell count itself, but each list item uses approximately 500 bytes, so the bigger the list, the more space used.
Look at the size of the unique transactions list in each of the two models. Removing the date from the key resulted in the creation of only 300K unique transactions rather than 7.3M. In turn, this results in a much smaller Transaction Details module. The sparse module calculates more efficiently than the large dense module, and, combined with the smaller list, we see improved model-opening time.
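As a rough back-of-the-envelope check on the list overhead alone, using the figures above and the approximate 500 bytes per list item:

```python
# List items cost roughly 500 bytes each, before a single cell is counted.
bytes_per_list_item = 500

model_1_list_gb = 7_300_000 * bytes_per_list_item / 1_000_000_000   # ~3.65 GB of list items
model_2_list_gb =   300_000 * bytes_per_list_item / 1_000_000_000   # ~0.15 GB of list items
print(f"Model 1 list: {model_1_list_gb:.2f} GB; Model 2 list: {model_2_list_gb:.2f} GB")
```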
Join me for part two, where I will discuss the pros and cons of multi-dimensional structures versus combined hierarchies.