Loading Transactional Data into Data Hub
Hi
1. Why do we create a concatenated list to load transactional data into the Data Hub? Instead, we could create individual lists as dimensions and load the data.
2. Why do we create hierarchy lists as flat lists in the Data Hub?
Thanks
Ibrahim
Answers
-
Excellent question. I'll answer these directly, but first I HIGHLY recommend you read this Data Hub best practice article by @rob_marshall . In it, @rob_marshall answers both your questions and a lot more that you're likely to be considering as you build out Data Hub processes. You might also consider reading @DavidSmith 's article on the truth about sparsity. Even if dimensionalizing time creates a bunch of empty cells, your model will very likely be faster AND smaller than if you don't dimensionalize time.
- Concatenated List - Transaction modules almost always use one list dimension (sometimes two, but very, very rarely). If time (day and above) is part of the transaction identifier, then it should be dimensionalized too; use a time range if you have to. The bottom line is this: each list item will take a minimum of 500 bytes plus the space required to store the item itself. If you build the list from a concatenated code using a well-crafted delimiter (I like the pipe "|"), you will be able to load the transactions much faster and you will very likely save a lot of space. Then create a system module dimensioned by that same list (but not by time or any other dimension) to separate out the real list dimensions; there is a sketch of this idea after this list. Remember: calculate once, refer often!
- Flat Lists - The DH uses flat lists because there should not be any need to aggregate data there. Planning and reporting should be done in the spoke application, and you can use system modules to build the structured hierarchy in the spoke; see the second sketch below. Aggregations in the Data Hub are generally unnecessary, even for auditing, so keep the DH as clean as possible. Some people will add a structured list in the DH for testing purposes, but they quickly remove it once the test is complete.
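To make the concatenated-key idea concrete, here is a minimal Python sketch of what happens around the load. The record fields (product_code, customer_code), the sample data, and the helper names are all hypothetical; the split_key function only mirrors what an Anaplan system module would do once with text functions so every other module can simply refer to it.

```python
# Minimal sketch: build one concatenated code per transaction before
# loading, then split it back out (the "system module" step).
# Field names and sample records are hypothetical.

DELIMITER = "|"  # a well-crafted delimiter that never appears in the codes

transactions = [
    {"product_code": "P100", "customer_code": "C055", "amount": 250.0},
    {"product_code": "P100", "customer_code": "C078", "amount": 120.0},
]

def build_key(record: dict) -> str:
    """Concatenate the dimension codes into one unique list-item code."""
    return DELIMITER.join([record["product_code"], record["customer_code"]])

def split_key(key: str) -> list[str]:
    """Mirror of the system module: recover the component codes once,
    so downstream logic refers to this result instead of re-parsing."""
    return key.split(DELIMITER)

for record in transactions:
    key = build_key(record)
    print(key, split_key(key))  # e.g. P100|C055 ['P100', 'C055']
```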
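And to illustrate the flat-list point, the sketch below (again Python, with hypothetical codes and names) shows a flat list whose items simply carry a parent code as a property. The Data Hub stores only this flat mapping; it is the spoke model that turns the mapping into a structured hierarchy.

```python
# Minimal sketch: a flat list in the Data Hub (no hierarchy, just an
# item -> parent-code mapping) rebuilt as a hierarchy in the spoke.
# Codes and names are hypothetical.

flat_cost_centres = {
    "CC100": {"name": "London Sales", "parent_code": "R10"},
    "CC200": {"name": "Paris Sales",  "parent_code": "R20"},
    "CC300": {"name": "Berlin Sales", "parent_code": "R20"},
}

# Spoke model: group children under their parents to form the hierarchy.
hierarchy: dict[str, list[str]] = {}
for code, item in flat_cost_centres.items():
    hierarchy.setdefault(item["parent_code"], []).append(code)

print(hierarchy)  # {'R10': ['CC100'], 'R20': ['CC200', 'CC300']}
```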
Hope this helps. Try to read @rob_marshall 's and @DavidSmith 's articles. I promise you'll walk away with a completely new perspective.
2 -
Firstly, @JaredDolich has provided a first-rate response, but I'm sensing that the explanation may be a little too long and somewhat too technical.
The primary reason comes from the nature of multidimensional data.
Imagine that every data point in a list is duplicated by the number of items in each list added as a dimension. A module can therefore become massive very quickly, as each added dimension multiplies the number of possible combinations by the number of items in that dimension.
The problem is that, for the vast majority of possible combinations across multiple dimensions, data will never exist. Unlike some other applications, Anaplan still reserves memory for these cells even if no data is ever added.
This is what we mean when we refer to sparsity: there are too many empty data points. To minimise sparsity, we create a unique reference for each actual combination, so the list contains only those combinations of dimensions that will ever hold data. We achieve this by concatenating all the relevant codes into a unique master code for that single reference. We complement this by creating list properties, which are themselves formatted as the required lists and populated with the individual list items for those dimensions.
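A quick worked example may help show the scale. The dimension sizes below are hypothetical round numbers, but the arithmetic is the point: a fully dimensionalized module reserves space for every possible combination, while the concatenated list holds only the combinations that actually occur.

```python
# Minimal sketch of why sparsity matters.
# Dimension sizes and transaction count are hypothetical.

products, customers, days = 1_000, 500, 365

dense_cells = products * customers * days  # every possible combination
actual_transactions = 200_000              # combinations that really occur

print(f"dense module cells: {dense_cells:,}")            # 182,500,000
print(f"concatenated list:  {actual_transactions:,}")
print(f"cells never used:   {1 - actual_transactions / dense_cells:.1%}")  # ~99.9%
```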
Hierarchies are not created in the Data Hub because we do not run any modelling there, so the functionality afforded to us by hierarchies is redundant. Hierarchies consume more memory than flat lists because data must aggregate up them, and since they are not required, they should not be created.
I hope this adds to @JaredDolich 's explanation and provides you with more context.
3 -
Hello @tflenker and @abhaymalhotra
This post directly answers the question we raised about concatenating the key for transactional data in the DH.
Also, read @rob_marshall 's article about the Data Hub (particularly the sections under: