If you’re familiar with Anaplan, you’ve probably heard the buzz about having a data hub and wondered why it’s considered a “best practice” within the Anaplan community. Wonder no more. Below, I will share four reasons why you should spend the time to build a data hub before Anaplan takes your company by storm.
1. Maintain consistent hierarchies
Hierarchies are a common list structure built by Anaplan and come in a variety of options depending on use case, e.g., product hierarchy, cost center hierarchy, and management hierarchy, just to name a few. These hierarchies should be consistent across the business whether you’re doing demand planning or financial planning. With a data hub, your organization has a higher likelihood of keeping hierarchies consistent over time since everyone is pulling the same structure from one source of truth: the data hub.
2. Data Optimization
As you expand the use of Anaplan across multiple departments, you may find that you only need a portion of a list, rather than the entire list. For instance, you may want the full list of employees for workforce planning purposes, but only a portion of the employees for incentive compensation calculations. With a data hub, you can distribute only the pertinent information. You can filter the list of employees to build the employee hierarchy in the incentive compensation model while having the full list of employees in the workforce planning model. Keep them both in sync using the data hub as your source of truth.
3. Separate duties by roles and responsibilities
An increasing number of customers have asked about roles and responsibilities with Anaplan as they expand internally. In Anaplan, we recommend each model have a separate owner. For example, an IT owner for the data hub, an operations owner for the demand planning model, and a finance owner for the financial planning model. The three owners combined would be your Center of Excellence, but each has their separate roles and responsibilities for development and maintenance in the individual models.
4. Accelerate future builds
One of the main reasons many companies choose Anaplan is for the platform’s flexibility. Its use can easily and quickly expand across an entire organization. Development rarely stops after the first implementation. Model builders are enabled and excited to continue to bring Anaplan into other areas of the business. If you start by building the data hub as your source of truth for data and metadata, you can accelerate the development of future models since you already have defined the foundation of the model, the lists, and dimensions.
As you begin to implement, build, and roll out Anaplan, starting with a data hub is a key consideration. In addition to this, there are many other fundamental Anaplan best practices to consider when rolling out a new technology and driving internal adoption.
Imports are blocking operations: To maintain a consistent view of the data, the model is locked during the import, and concurrent imports run by end-users will need to run one after the other and will block the model for everyone else. Exports are blocking for data entry while the export data is retrieved, and then the model is released. During the blocking phase, users can still navigate within the model.
Rule #1 Carefully Decide If You Let End-Users Import (And Export) During Business Hours
Imports executed by end-users should be carefully considered, and if possible, executed once or twice a day. Customers more easily accept model updates at scheduled hours for a predefined time—even if it takes 10+ minutes—and are frustrated when these imports are run randomly during business hours.
Your first optimization is to adjust the process and run these imports by an administrator at a scheduled time, and then let the user based know about the schedule.
Rule #2 Use a Saved View
The first part of any import (or export) is retrieving the data. The time it takes to open the view directly affects the time of the import or export.
Always import from a saved view—NEVER from the default view. And use the naming convention for easy maintenance.
Ensure the view is using optimized filters with a single Boolean value per axis.
Hide the line items that are not needed for import; do not bring extra columns that are not needed.
If you have done all of the above, and the view is still taking time to complete, consider using the Tabular Multi Column export and filter "in the way out." This has been proven to improve some sluggish exports.
Rule #3 Mapping Objective = Zero Errors or Warning
Make sure your import executes with no errors or warnings as every error takes processing time. The time to import into a medium-to-large list (>50k) is significantly reduced if no errors are to be processed.
In the import definition, always map all displayed line items (source→target) or use the "ignore" setting. Don't leave any line item unmapped.
Rule #4 Watch the Formulas Recalculated During the Import
If your end-users encounter poor performance when clicking a button that triggers an import or a process, it is likely due to the recalculations that are triggered by the import, especially if the action creates or moves items within a hierarchy.
You will likely need the help of the Anaplan Model Optimization team to identify what formulas are triggered after the import is done and to get a performance check on these formulas to identify which one takes most of the time. Usually, those fetching many cells such as SUM, LOOUKP, ANY, or FINDITEM are likely to be responsible for the performance impact. Speak to your Business Partner for more details on the Model Optimization services available to you.
To solve such situations, you will need to challenge the need for recalculating the formula identified each time a user calls the action.
Often, for actions such as creations, moves, assignments done in WFP or Territory Planning, many calculations used for reporting are triggered in real-time after the hierarchy is modified by the import, and are not necessarily needed by users until later in the process.
The recommendation is to challenge your customer and see if these formulas can be calculated only once a day, instead of each time a user runs the action. If this is acceptable, you'll need to rearchitect your modules and/or formulas so that these heavy formulas get to run through a different process run daily by an administrator and not by each end-users. If not, you will need to look at the formulas more closely to see what improvements can be made. Remember, breaking formulas up often helps performance.
Rule #5 Don't Import List Properties
Importing list properties takes more time than importing these as a module line item. Review your model list impacted by imports, and look to replace list properties with module line items when possible. Use a system module to hold these for the key hierarchies, as per D.I.S.C.O.
Rule #6 Get Your Data Hub
Hub and spoke: Setup a data hub model, which will feed the other production models used by stakeholders.
It will prevent production models to be blocked by a large import from an external data source. But since data hub to production model imports will still be blocking operations, carefully filter what you import, and use the best practices rules listed above.
All import, mapping/transformation modules required to prepare the data to be loaded into planning modules can now be located in a dedicated data hub model and not in the planning model. This model will then be smaller and will work more efficiently.
Try and keep the transaction data history in the data hub with a specific analysis dashboard made available for end users; often, the detail is not needed for planning purposes, and holding this data in the planning model has a negative impact on size, model opening times, and performance.
As a reminder of the other benefits of a data hub not linked to performance:
Better structure, easier maintenance: data hub helps keep all the data organized in a central location.
Better governance: Whenever possible put this Data Hub on a different workspace. That will ease the separation of duties between production models and meta-data management, at least on actual data and production lists. IT departments will love the idea to own the data hub and have no one else be an administrator in the workspace.
Lower implementation costs: A data hub is a way to reduce the implementation time of new projects. Assuming IT can load the data needed by the new project in the data hub, then business users do not have to integrate with complex source systems but with the Anaplan data hub instead.
Rule #7 Incremental Import/Export
This can be the magic bullet in some cases. If you export on a frequent basis (daily or more) from an Anaplan model into a reporting system, or write back to the source system, or simply transfer data from one Anaplan model to another, you have ways to only import/export the data that have changed since the last export.
Use a Boolean line item to identify records that have changed and only import those.
You can interact with the data in your models using Anaplan's RESTful API. This enables you to securely import and export data, as well as run actions through any programmatic way you desire. The API can be leveraged in any custom integration, allowing for a wide range of integration solutions to be implemented. Completing an integration using the Anaplan API is a technical process that will require significant action by an individual with programming experience.
Visit the links below to learn more:
Anaplan API Guide
You can also view demonstration videos to understand how to implement APIs in your custom Integration client. The below videos show step-by-step guides of sequencing API calls and exporting data from Anaplan, importing data into Anaplan, and running delete actions and Anaplan processes.
API sequence for uploading a file to Anaplan and running an import action is as follows:
API sequence for running an export action and downloading a file from Anaplan is as follows:
API sequence for running an Anaplan process and a delete action is as follows:
You may have heard about a model called a data hub, but perhaps you aren’t confident that you understand the fundamentals, primary functions, or considerations when architecting one. There are three main advantages to incorporating a data hub:
Single source of truth: Stores all transactional data from the source system.
Data validations: Ensures all data is correct and valid before the data gets to the spoke model(s).
Performance: It is always faster to load data from a model rather than a file.
Additionally, the administrator can ensure the correct granularity of data in the spoke model(s) when using a data hub. For example, the source system may only contain transactional data at the daily level, but the planners may need the data aggregated to the month. The data hub can summarize the data and export only the data needed.
The following information is designed to further define a data hub and support you in your journey of building your own.
Table of Contents
Definition of the Data Hub
First, we need to define what a data hub is. This can be split into four sections:
Use cases: The data hub should be the first model built, whether you have a single use or multiple use cases. The data should be automatically refreshed on a schedule, whether it is nightly, weekly, monthly, etc., from the source system—often an Enterprise Data Warehouse (EDW). All modules and views that create hierarchies or lists should be stored in the data hub, which enables your models in having one version of truth, as well as reducing the duplication of data.
Model connectivity: Anaplan Connect , one of our 3 rd party vendors (Informatica Cloud, Dell Boomi, Mulesoft, or SnapLogic), or our REST API can be used to automate the loading of data to the data hub from the source system, as well as transferring data from the data hub to the spoke model(s). Additionally, transitional data should not be loaded directly into the spoke module, especially if there is a large volume of data.
Functions: Often, simple ETL (Extract, Transform, and Load) functions can be utilized within your data hub to transform the data for the spoke model(s). This is helpful when consolidating data from multiple sources where you have different “codes” and need a mapping module to ensure the correct data gets mapped correctly.
Team: The management of the data hub should have a designated team of experts who understand what data is stored in the data hub (to ensure duplication doesn’t happen), as well as the how and when the data gets loaded.
Anaplan Architecture with a Data Hub
There are several ways your Anaplan architecture could look, depending on the number of workspaces you currently have and the type of security your company requires. The following are illustrations of common architectures.
Master Hub Model: Across Workspaces
The most common, and recommended, architecture is when the data hub is in its own workspace. Not only does this have the advantage of not interfering with the other models, but it also adds an additional security layer, with a segregation of duties. In this view, the Anaplan Workspace Admin(s) can limit the access to the data hub workspace to only the people who require it.
Master Hub Model: Within a Workspace
The simplest depiction is where your data hub is within the same workspace as your spoke models. While this can be accomplished, it is not best practice as there is no segregation of duties and there is a possibility, upon heavy loads from the source system, of performance issues. Additionally, when adding users, the Anaplan Workspace Administrator (Admin) would need to ensure users don’t have access to the data hub, as well as any users of the data hub not having access to the spoke models
Multiple Data Hubs
Finally, the data hub doesn’t necessarily have to be the only model in the workspace. You can have additional data hubs, if needed.
Factors to Consider When Implementing a Data Hub
There are six main elements to think about when architecting a data hub:
Exporting data to spoke model(s).
One of the cornerstones of The Anaplan Way is data (process, model, and deployment being the others), which is critical to all implementations. You will need to know what data is needed for a certain use case. Consider the following, common, data questions that need to be answered in order to be successful:
What granularity of the data is needed?
How much history is needed? How much history do you have?
Does the source system only have transactional data, but the use case needs the data at the month level? Can the source system do the aggregation for you?
After all data questions have been answered, shift your focus to the source system and consider the following:
Consider the source system. Where is the data coming from? What is the source system, and is it a trusted environment? Is it Excel? Typically, you should stay away from Excel as the “source” because Excel cannot be audited.
Define the data source owners. Who has access to this data? Who is preparing it? Are they part of the project? These are often-overlooked questions that are critical to success. Ideally, the data source owners need to be part of the project from the start to understand the file specifications and prepare the initial load of the data, as well as towards the end of the project to do a final load of the data.
Define file specifications. How many files will be needed? Typically, you will need master data, as well as transactional data. Instead of having one file with all of this data, determine if the data can be split between different files (one for transactional, one for the unique members of the master data). It will be better for Anaplan (for performance reasons) to split these to reduce warnings during the data load process.
Analyze the data. Understand what makes each record unique (date/period and transactional amounts should not be part of this), and make sure the data owners don’t give you everything (Select * From Employee) when you only need five columns. Remember, it is better to ask for additional columns midway through the project than getting all columns in the beginning and only using a select amount.
Consider custom codes in the source system. Find more on this in the transactional lists section. This is a great trick for transactional data. After you have analyzed the data to understand what makes each record/row unique, concatenate the “codes” of the metadata into one transactional code, but remember, you will need to be under the 60-character threshold.
Define the schedule. When is the data available? Is the data on a certain schedule? What is the schedule required with this use case?
Determine the ETL medium to be used. Will Anaplan Connect be sufficient, will one of our 3 rd parties be used, or will a more custom application be needed, such as REST API? Does your company already have this experience inhouse, or will training be required? These will need to be factored into all data stories.
Usually, the largest lists are those containing transactional data. There can be millions of transactional ID’s with several list properties defined. First, properties should not be defined on a transactional list (or any list, except for Display Name, as they do get accounted for in the workspace memory). Secondly, instead of loading metadata to list properties (Cost Center and Account as properties), try to figure out a way to incorporate them into the code. If the transactional data is defining a transactional amount at the intersection of Cost Center and Account for a particular month, attempt to use the code of the Cost Center and the code of the Account concatenated together (0100_57000). Not only will this decrease your list size, but it will also create a healthier model.
In the below example, the model builder did not create a custom code, but rather used a combination of properties to make the record unique, which included the date/period, as well as the transactional amount. Notice the original number of records vs. the number of records after a custom code was created.
By incorporating the date/time period, as well as the transactional amount, it inflated the list size exponentially based on the number of years that were loaded. Doing this not only caused the model to be bigger, but also caused poor model opening performance. See the Appendix for a simple worked example to explain further.
Learn more about sparsity in the two-part series The Truth about Sparsity: Part 1 and The Truth About Sparsity: Part 2 .
Similar to transactional lists, flat lists are not part of a hierarchy and are a series of records grouped in a list, like Products, Companies, Cost Centers, or Employees. These are your “legends” or “anchor” for all metadata about this unique record. Again, the only property that should be defined is a Display Name, if needed. It is best practice, from a model builders’ perspective, to suffix the name with “Flat” or “- Flat”. This helps identify whether the list is part of a hierarchy or flat list (Employee – Flat, Cost Center – Flat, Product – Flat). These lists can be used for data validation, which will be described later in this article.
Ideally, you should have three types of modules in the data hub:
Transactional: A Transactional module will store the transactional data by the time series, whether that be by day, week, month, quarter, or year. The only data, or line items, should be transactional data. No other line items should be defined. Additionally, to keep the size down, make sure the summaries on the line items are turned off, or None, as there is no reason to sum the data within the module.
System: System (SYS) modules, or the “S” in DISCO , do not have time associated with them and should only be dimensionalized by the same list (Employee Flat, Cost Center Flat, Product Flat). These modules store the metadata or attributes about the list item that doesn’t change over time, for example the employee’s start date. Another example of a SYS module would be any kind of mapping that is required, whether it be SYS Time Filter module or a mapping from one source system to another.
Export modules: If the data from the source system is being loaded at a lower granularity than needed in the spoke model(s), export modules can aggregate the data to the specified need (month, quarter, or year level), which will lead to more efficient data load performance to the spoke model(s). Additionally, it is better to only load the granularity of data needed instead of loading all data to the spoke model, but only using a portion of it.
Loading Data vs. Using Formula’s in SYS Modules
If you can devise a custom code where all of the attributes of the data are accounted for, you can greatly increase the performance of your data load, especially on very large data volumes. It is actually faster to use formulas to derive the data from the custom code than it is to load the data. Why? A couple of reasons. First, when data is loaded, the load is triggering the change log, and every change is being recorded in the model history . Second, loading data to another module is an additional action. If you didn’t need that action, you would save processing time.
In the example below, the exact same data was loaded four different ways:
Import Properties to a List: A list was created with all attributes, including the transactional data, and was loaded to list properties (not best practice and against DISCO).
Import to List, Attribute, and Trans: A list was created, the transactional data was loaded to a transactional module, and all of the attributes were loaded to a SYS Attribute module.
Import to List, Trans, Calculate Attribute: A list was created, the transactional data was loaded to a transactional module, but the SYS Attribute model was calculated using two different methods:
One Line Item: Using FINDITEM() with several functions parsing out the information from within the FINDITEM(). For example, FINDITEM(Cost Center, RIGHT(LEFT(Trans Details.Code, '2nd Group’), 3)).
Multiple Line Items: Parsing of the member spread across multiple line items and using FINDITEM() with only the list and code as the parameter. First, you do the parsing to get the correct piece of the code (one line item), and then the FINDITEM() of that code (2 nd line item).
Notice, the best performing data load was the last one, Import to List, Trans, Calculate Attribute (multiple line items), where the parsing out of the data was spread over multiple line items. This is due to the fact that the data load was able to take advantage of Anaplan’s multithreading capabilities. The worst performing data load occurred when data was loaded to the Attribute module because, due to the sheer size of the data, a save had to be performed.
Exporting to Spoke Models
One of the most important concepts to remember when exporting data is to use a view from a module. Lists should not be exported because you lose control over what you export. It is either all or nothing. By using views, you can employ a filter (should always be a Boolean) to render exactly which data needs to be exported. If you need more than one filter, combine both into one line item and use that line as the filter. You will have much better performance if you are only using one Boolean line item as a filter vs. having multiple filters defined.
Another important concept to remember is to only export detailed information, as there is no reason to export parent information (quarter, year, etc.). Not only will you get warnings when exporting parent information, but the performance of the export will decrease because the system will have to create a debug log. The goal is to make sure a debug log is not created, all green checks, so if there ever is an issue, you will know it truly is an issue that needs attention.
Line items in the data hub formatted as text should not be exported as text, but actually as list formatted line items in the spoke model (text->list formatted line item). The goal is to reduce the number of text formatted line items in the spoke model.
Some say they need to do validation in the spoke model, therefore they need to import the data as text. Actually, this is false, because the validation should have already been done in the data hub, so there should be no need to do the validation again.
Lastly, you should think about what really needs to be exported. Do you really need to export historical data that hasn’t been changed? Instead, just export the newly loaded data, or delta data. This can be accomplished by using one of two methods:
From the source system, request IT to only send the updated information, not the full load every time. Additionally, request IT to create a column in the source file with a hardcoded value of “TRUE.” This will tell Anaplan which row is new or has been updated and can be used as a filter for an export. Just know, before the import of the source data gets loaded, make sure the first action within the process clears out the previous true records (set this up via a view using a filter where the view only shows members with a value of true).
Utilize the current period function to only export the current period data. In the SYS Time Filter module, create a line item named Current Period with the formula CURRENTPERIODSTART(). In the export views, filter the data on this line item.
Tips and Tricks
A few of tips and tricks to be aware of include the following:
Hierarchies should not be in the data hub.
Analytical modules should not be in the data hub.
Do not delete and reload lists.
Data Validations Model
Why should hierarchies not be in the data hub? To answer that question, you need to understand why hierarchies are used in the first place. Essentially, hierarchies are only needed to aggregate data for analytical purposes, and since users will not normally login to the data hub, the lists essentially take up space. With that said, it is perfectly okay to create the hierarchies for testing purposes to ensure your actions from the meta modules are building the hierarchies correctly, but as soon as the actions are working correctly and have been verified, you can remove the list structures from the data hub. A case can be made that certain implementations may need the hierarchies created in the data hub for validation purposes of several sources. If this is the case in your implementation, just be sure to only use the hierarchies for validation purposes.
In addition to the above, there are two more reasons to not have hierarchies built in the data hub—cluttered data, and spoke models that pull data from the lists.
Data hubs need to be clean and clutter free to ensure optimal performance, which also makes it easier for the administrators to understand exactly what data is stored in the data hub. Additionally, when you have lists—especially hierarchical lists—spoke model builders will sometimes build their lists from the lists within the data hub instead of from a view. It is best practice to always build lists from views from within a module so the action can benefit from filters (there are no filters when importing from lists).
Analytical modules should not be in the data hub since end users don’t normally access the data hub. There really isn’t a reason to have products by versions by time in the data hub, that belongs to the spoke model. Remember, the data hub should only be used to store data from the source system(s).
Within your nightly data load process, do not delete and reload data, including the list structures. If you have a proper code, you shouldn’t need to do this. Additionally, not only does this impact the overall performance of the process (adding an additional action to delete the list, which then deletes all data associated with that list), but the process is essentially filling up the change log with the exact same data that it had before the delete. When a certain threshold is surpassed, the model will require a save, thus taking up even more time. Ultimately, you are forcing the model to re-aggregate all of the data, instead of just the new data.
Lastly, if you know you will have to do a lot of transformations on your data (consolidating multiple source systems or your data is not clean), think about creating a Data Validations model. This model’s sole purpose would be to clean the data and then feed the data to the data hub, thus keeping the transformations to a minimum in the data hub as well as keeping the data hub clean.
Use Case: Transaction Data is by Store and SKU and Month
The code for the Transaction list is a three-part code Store_SKU_Month
Attributes for Store, SKU and Month are imported as Text and matched against the Store list, SKU list and Time period respectively
An additional line item is needed for the Store and SKU code (for export).
This is the screenshot of the bad way:
Notice the repetition of the attributes. STR07 and SKU031 are repeated each month.
Two data files
Unique combinations of Store and SKU (two-part code)
Store SKU code by month for the quantity.
The transaction details are stored in a module dimensioned by Transactions
The Store and SKU attributes are calculated using the “_” delimiter
The quantity is stored in a module dimensioned Transactions and by month
The additional line item is needed for the Store and SKU code (for export). This is a subsidiary view in the module as it is not dimensionalized by Time.
These are the screenshots of the good way:
Below lists out the breakdown of the model in terms of List size, Line items and the associated member usage of the various structures. The main reasons for the improvement are because lists themselves account for approximately 500b for each member and also there is repetition of the attributes per “month” in the transaction data (as mentioned above).
Hopefully, this article has shed some light on data hubs, how they should be used, and what you can do to ensure they perform at their peak level. Remember, analyze the data to understand what makes the row unique and use that as the code. Every list should have a code—every list!