You may have heard about a model called a data hub, but perhaps you aren’t confident that you understand the fundamentals, primary functions, or considerations when architecting one. There are three main advantages to incorporating a data hub:
Additionally, the administrator can ensure the correct granularity of data in the spoke model(s) when using a data hub. For example, the source system may only contain transactional data at the daily level, but the planners may need the data aggregated to the month. The data hub can summarize the data and export only the data needed.
The following information is designed to further define a data hub and support you in your journey of building your own.
First, we need to define what a data hub is. This can be split into four sections:
There are several ways your Anaplan architecture could look, depending on the number of workspaces you currently have and the type of security your company requires. The following are illustrations of common architectures.
Master Hub Model: Across Workspaces
The most common, and recommended, architecture is when the data hub is in its own workspace. Not only does this have the advantage of not interfering with the other models, but it also adds an additional security layer, with a segregation of duties. In this view, the Anaplan Workspace Admin(s) can limit the access to the data hub workspace to only the people who require it.
Master Hub Model: Within a Workspace
The simplest architecture places the data hub in the same workspace as your spoke models. While this can be done, it is not best practice: there is no segregation of duties, and heavy loads from the source system can cause performance issues. Additionally, when adding users, the Anaplan Workspace Administrator (Admin) would need to ensure that spoke model users don’t have access to the data hub, and that data hub users don’t have access to the spoke models.
Multiple Data Hubs
Finally, the data hub doesn’t necessarily have to be the only model in the workspace. You can have additional data hubs, if needed.
There are six main elements to think about when architecting a data hub:
One of the cornerstones of The Anaplan Way is data (process, model, and deployment being the others), which is critical to all implementations. You will need to know what data is needed for a given use case. Consider the following common data questions, which need to be answered in order to be successful:
After all data questions have been answered, shift your focus to the source system and consider the following:
Usually, the largest lists are those containing transactional data. There can be millions of transactional IDs with several list properties defined. First, properties should not be defined on a transactional list (or on any list), with the exception of Display Name, because properties count toward workspace memory. Second, instead of loading metadata to list properties (such as Cost Center and Account), try to incorporate it into the code. If the transactional data defines an amount at the intersection of Cost Center and Account for a particular month, use the Cost Center code and the Account code concatenated together (for example, 0100_57000). Not only will this decrease your list size, but it will also create a healthier model.
In the below example, the model builder did not create a custom code, but rather used a combination of properties to make the record unique, which included the date/period, as well as the transactional amount. Notice the original number of records vs. the number of records after a custom code was created.
Incorporating the date/time period and the transactional amount into the unique key greatly inflated the list size, which grew with the number of years that were loaded. This not only made the model bigger, but also caused poor model opening performance. See the Appendix for a simple worked example to explain further.
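The effect of the key choice on list size can be illustrated with a small Python sketch. This is not Anaplan code, and the rows, the `custom_code` helper, and the period format are all hypothetical; it only counts how many list members each keying approach would create for the same data.

```python
from collections import defaultdict

# Hypothetical transactional rows: (cost_center, account, period, amount).
rows = [
    ("0100", "57000", "2023-01", 120.0),
    ("0100", "57000", "2023-02", 95.5),
    ("0100", "57000", "2023-03", 101.25),
    ("0200", "57000", "2023-01", 40.0),
    ("0200", "57000", "2023-02", 40.0),
    ("0200", "61000", "2023-03", 12.75),
]


def custom_code(cost_center: str, account: str) -> str:
    """Concatenate the member codes, as in the 0100_57000 example."""
    return f"{cost_center}_{account}"


# Inflated approach: period and amount are part of what makes a record unique,
# so every row becomes its own list member.
inflated_members = {
    f"{cc}_{acct}_{period}_{amount}" for cc, acct, period, amount in rows
}

# Custom-code approach: one list member per Cost Center / Account combination.
coded_members = {custom_code(cc, acct) for cc, acct, _, _ in rows}

# The amounts are then held against (code, period) rather than in the list.
amounts = defaultdict(float)
for cc, acct, period, amount in rows:
    amounts[(custom_code(cc, acct), period)] += amount

print(len(inflated_members), "members when period and amount are in the key")
print(len(coded_members), "members with the custom code")
```

With the custom code, loading another year of data adds values against existing members instead of adding new members to the list.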
Learn more about sparsity in the two-part series The Truth about Sparsity: Part 1 and The Truth About Sparsity: Part 2.
Similar to transactional lists, flat lists are not part of a hierarchy and are a series of records grouped in a list, like Products, Companies, Cost Centers, or Employees. Each one is your “legend” or “anchor” for all metadata about a unique record. Again, the only property that should be defined is a Display Name, if needed. It is best practice, from a model builder’s perspective, to suffix the name with “Flat” or “- Flat”. This makes it clear whether a list is flat or part of a hierarchy (Employee – Flat, Cost Center – Flat, Product – Flat). These lists can be used for data validation, which will be described later in this article.
Ideally, you should have three types of modules in the data hub:
If you can devise a custom code where all of the attributes of the data are accounted for, you can greatly increase the performance of your data load, especially on very large data volumes. It is actually faster to use formulas to derive the data from the custom code than it is to load the data. Why? A couple of reasons. First, when data is loaded, the load is triggering the change log, and every change is being recorded in the model history. Second, loading data to another module is an additional action. If you didn’t need that action, you would save processing time.
In the example below, the exact same data was loaded four different ways:
Load Performance
Notice, the best performing data load was the last one, Import to List, Trans, Calculate Attribute (multiple line items), where the parsing out of the data was spread over multiple line items. This is due to the fact that the data load was able to take advantage of Anaplan’s multithreading capabilities. The worst performing data load occurred when data was loaded to the Attribute module because, due to the sheer size of the data, a save had to be performed.
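As a rough illustration of deriving attributes from the custom code rather than loading them, the following Python sketch parses a code of the form 0100_57000 into its parts. In Anaplan this would be done with formulas on line items; the `Attributes` type, the `derive_attributes` function, and the underscore separator are assumptions made for the example.

```python
from typing import NamedTuple


class Attributes(NamedTuple):
    cost_center: str
    account: str


def derive_attributes(code: str, separator: str = "_") -> Attributes:
    """Split a concatenated code into the attributes it was built from."""
    cost_center, account = code.split(separator, 1)
    return Attributes(cost_center, account)


# Only the code is loaded; each attribute is calculated from it on demand.
loaded_codes = ["0100_57000", "0200_61000"]
derived = {code: derive_attributes(code) for code in loaded_codes}
print(derived["0100_57000"].account)
```

Because nothing but the code is loaded, there is no second import action for the attributes and no attribute changes to record in the model history.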
One of the most important concepts to remember when exporting data is to use a view from a module. Lists should not be exported because you lose control over what you export. It is either all or nothing. By using views, you can employ a filter (should always be a Boolean) to render exactly which data needs to be exported. If you need more than one filter, combine both into one line item and use that line as the filter. You will have much better performance if you are only using one Boolean line item as a filter vs. having multiple filters defined.
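The single-filter idea can be sketched in Python as follows. The `Row` fields and the two conditions are hypothetical; the point is only that the conditions are combined once into a single Boolean, and the export view tests that one flag.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Row:
    code: str
    amount: float
    is_current_period: bool  # hypothetical first condition
    is_validated: bool       # hypothetical second condition

    @property
    def export_filter(self) -> bool:
        """One combined Boolean, standing in for a single filter line item."""
        return self.is_current_period and self.is_validated


def export_view(rows: List[Row]) -> List[Row]:
    """Return only the rows the single filter lets through."""
    return [row for row in rows if row.export_filter]


rows = [
    Row("0100_57000", 120.0, True, True),
    Row("0200_57000", 40.0, True, False),
    Row("0200_61000", 12.75, False, True),
]
print([row.code for row in export_view(rows)])
```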
Another important concept to remember is to only export detailed information, as there is no reason to export parent information (quarter, year, etc.). Not only will you get warnings when exporting parent information, but the performance of the export will decrease because the system will have to create a debug log. The goal is to make sure a debug log is not created, all green checks, so if there ever is an issue, you will know it truly is an issue that needs attention.
Text-formatted line items in the data hub should not be brought into the spoke model as text; instead, map them to list-formatted line items in the spoke model (text to list-formatted line item). The goal is to reduce the number of text-formatted line items in the spoke model.
Some say they need to do validation in the spoke model, therefore they need to import the data as text. Actually, this is false, because the validation should have already been done in the data hub, so there should be no need to do the validation again.
Lastly, you should think about what really needs to be exported. Do you really need to export historical data that hasn’t been changed? Instead, just export the newly loaded data, or delta data. This can be accomplished by using one of two methods:
A few tips and tricks to be aware of include the following:
Why should hierarchies not be in the data hub? To answer that question, you need to understand why hierarchies are used in the first place. Essentially, hierarchies are only needed to aggregate data for analytical purposes, and since users will not normally log in to the data hub, the lists simply take up space. With that said, it is perfectly okay to create the hierarchies for testing purposes to ensure your actions from the meta modules are building the hierarchies correctly, but as soon as the actions have been verified, you can remove the list structures from the data hub. A case can be made that certain implementations may need the hierarchies created in the data hub to validate data from several sources. If this is the case in your implementation, be sure to use the hierarchies only for validation.
In addition to the above, there are two more reasons to not have hierarchies built in the data hub—cluttered data, and spoke models that pull data from the lists.
Data hubs need to be clean and clutter free to ensure optimal performance, which also makes it easier for the administrators to understand exactly what data is stored in the data hub. Additionally, when you have lists—especially hierarchical lists—spoke model builders will sometimes build their lists from the lists within the data hub instead of from a view. It is best practice to always build lists from views from within a module so the action can benefit from filters (there are no filters when importing from lists).
Analytical modules should not be in the data hub since end users don’t normally access the data hub. There really isn’t a reason to have products by versions by time in the data hub; that belongs in the spoke model. Remember, the data hub should only be used to store data from the source system(s).
Within your nightly data load process, do not delete and reload data, including the list structures. If you have a proper code, you shouldn’t need to do this. Additionally, not only does this impact the overall performance of the process (adding an additional action to delete the list, which then deletes all data associated with that list), but the process is essentially filling up the change log with the exact same data that it had before the delete. When a certain threshold is surpassed, the model will require a save, thus taking up even more time. Ultimately, you are forcing the model to re-aggregate all of the data, instead of just the new data.
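The difference between reloading everything and updating by code can be sketched with a plain Python dictionary. This is only an analogy for the idea, not a description of how Anaplan imports work internally; the `apply_load` function and the sample values are hypothetical.

```python
from typing import Dict


def apply_load(store: Dict[str, float], incoming: Dict[str, float]) -> int:
    """Update values by code and return how many entries actually changed."""
    changed = 0
    for code, value in incoming.items():
        if store.get(code) != value:
            store[code] = value
            changed += 1
    return changed


# Existing data, keyed by a stable code.
hub = {"0100_57000": 120.0, "0200_57000": 40.0}

# Nightly file: one unchanged value, one changed value, one new code.
nightly = {"0100_57000": 120.0, "0200_57000": 45.0, "0200_61000": 12.75}

print(apply_load(hub, nightly), "changes recorded")
```

A delete and reload of the same file would touch all three entries on the way in, after first removing the two existing ones, even though only two values are genuinely different.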
Lastly, if you know you will have to do a lot of transformations on your data (consolidating multiple source systems or your data is not clean), think about creating a Data Validations model. This model’s sole purpose would be to clean the data and then feed the data to the data hub, thus keeping the transformations to a minimum in the data hub as well as keeping the data hub clean.
Use Case: Transaction Data is by Store and SKU and Month
Bad Way
This is the screenshot of the bad way:
Notice the repetition of the attributes. STR07 and SKU031 are repeated each month.
Good Way
These are the screenshots of the good way:
The breakdown below shows the model’s list sizes, line items, and the associated member usage of the various structures. The improvement has two main causes: each list member accounts for approximately 500 bytes, and the attributes are repeated for every month in the transaction data (as mentioned above).
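The arithmetic behind this can be sketched in Python. The store, SKU, and month counts are hypothetical and are not the figures from the example above; the sketch also assumes that every Store and SKU combination has a transaction in every month, and it uses the approximate 500 bytes per list member mentioned in the text.

```python
BYTES_PER_LIST_MEMBER = 500  # approximate figure, as quoted above

# Hypothetical volumes, chosen only to keep the arithmetic easy to follow.
stores, skus, months = 10, 50, 12

# Bad way: one list member per Store / SKU / Month record.
bad_members = stores * skus * months
# Good way: one list member per Store / SKU code; the month comes from Time.
good_members = stores * skus

bad_list_bytes = bad_members * BYTES_PER_LIST_MEMBER
good_list_bytes = good_members * BYTES_PER_LIST_MEMBER

# Store and SKU attributes are held once per list member in either design,
# so the bad way repeats them for every month.
bad_attribute_cells = bad_members * 2
good_attribute_cells = good_members * 2

print(f"List overhead: {bad_list_bytes:,} bytes vs {good_list_bytes:,} bytes")
print(f"Attribute cells: {bad_attribute_cells:,} vs {good_attribute_cells:,}")
```

Under these assumptions both the list overhead and the attribute cell count shrink by a factor equal to the number of months loaded.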
Hopefully, this article has shed some light on data hubs, how they should be used, and what you can do to ensure they perform at their peak level. Remember, analyze the data to understand what makes the row unique and use that as the code. Every list should have a code—every list!
Hi @rob_marshall - question on list hierarchies. For a store network have the Store, Region, Division. So would have three lists and then a module dimensioned on Store with Region / Division as line items that would be updated via import from source and then this would flow into the spoke models as a hierarchied list starting at the top and going down?
Hey @andrewtye,
Yes, ideally you would like for Store, Region, and Division to be flat lists in the data hub and the transactional data coming into a module with the "transactional" key/list to be concatenated with the members to make the row unique.
Then, you would have a "properties" module defining all master data for that unique row.
Lastly, you would have views defined on the properties module to create the hierarchy in the spoke model.
Hope this helps,
Rob
Thanks @rob_marshall !
This is a fantastic article, thanks for sharing.
When you did your analysis regarding "Loading data vs using Formulas", was file type accounted for?
I know zipped files are loaded into Anaplan much more quickly than normal csv/txt if they are large and just wondered if this was included in processing time.
Also, it would be good to have some clarity over "upload" time and "processing" time, I'm not sure if there is any distinction between them in an upload process but I would assume so, and I presume this article focuses on the processing time?
Best,
Callum
@CallumW ,
Thanks for reading it, I am glad you liked it. Regarding the file types, this was not part of the article nor the actual loading of a file into Anaplan, more of loading the data to the line items when the file had already been uploaded vs. using a formula to determine the data (from the code). Does that help and clarify?
Thanks,
Rob
Appreciate the quick reply @rob_marshall .
I guess my question would then be whether there is any processing benefit in using different file types when loading into line items, or is there a process that converts to "Anaplan language" upon upload and therefore before loading into line items, thus no benefit?
i.e. the only benefit of loading different file types would be seen on the upload stage before Anaplan loads into line items?
Apologies for the confusion!
In asking around (Ben Speight), there is no difference in uploading a csv file vs. a text file and there is no difference in the reading of the data from the file and importing them to line items. Now, if you are using Anaplan Connect, I believe the data gets zipped on the way into Anaplan but if you are using a browser, the data does not get compressed.
Hope this helps,
Rob
Appreciate the follow-up, @rob_marshall !