Streamlining travel data: A comprehensive approach to trip parsing and expense linking

StevenK
edited December 2023 in Blog

Author: Steven Kraplin is a Certified Master Anaplanner and Senior Financial Analyst at The MathWorks.

In a recent project, I was faced with the task of designing a model that could efficiently consolidate travel data from various systems. The goal was to organize the data into trips and link all related expenses incurred to those trips. However, the initial data provided posed several challenges. Not only did it lack a corresponding index, but it was also disorganized, making it difficult to parse and identify individual trips. Additionally, apart from explicit round-trip travel, multiple records needed to be analyzed to consolidate their aggregate into a trip.

Complexity of the data

For example, imagine Person A books a flight from Boston to Seattle but later decides to change their schedule, resulting in multiple flight bookings. Subsequently, they fly from Seattle to Denver and eventually return to Boston. This sequence of flights could represent either a single round-trip (single record) or a combination of individual segments. Travel preferences can also vary across regions. While individuals in the United States may predominantly choose to fly, those in EMEA might prefer taking a train or renting a car. All these variations need to be considered as valid travel data when constructing a comprehensive "trip" profile.

Step 1: Load the data

I started by loading the disjointed data into lists using a combination of nearly all columns to create a unique identifier. Due to the structure of the underlying data and the variability of columns, each type had to get loaded into its own corresponding list with each column mapped to a property.

For transformation and clean-up purposes I then created a module for each type of travel data referencing everything that had been loaded into a list.

In my model, the primary hierarchy consisted of the following: L1 Region, L2 Business Unit, L3 Employee, L4 Employee Department (as an employee can move departments), L5 Trip, and L6 Travel Item.

Step 2: Create the trips

It was important to do some initial data prep before parsing the available information into “trips”. I had to systematically identify an employee’s home major airport was using their home city as well as all possible airports within X miles they might travel to/from.

Due to the individual’s ability to rebook or modify tickets, the “last transaction date” version of each ticket should be the one to be parsed to create a trip, otherwise there would be many duplicates of the same information. By leveraging the record locator element in the flight data, it was reasonably easy to identify the latest record that ultimately resulted in a flight.

The easiest records to parse were the round-trip flights since they represented a single record that needed processing; these were instantly turned into “trips” with no manipulation. The origin/destination didn’t matter as long as they returned back to their origin airport.

From there, I was able to identify any starting one-way trips that corresponded to a starting point of their home airport or any airport within X miles. Running this created many open-ended trips that needed a corresponding return flight. Finding a return flight using a formula turned out to be impossible because you need to use the corresponding date of the origin flight.

The way I designed the code ended up the key to the remainder of the process to create the trips, identify return flights, and link expenses to their corresponding trips. I used the L4 code mentioned above, added a “|” and a corresponding number based on the ranking of each possible trip record coming from airline tickets (and later other trip sources). This resulted in a code that could be systematically incremented to leverage the equivalent of a search when it comes to identifying corresponding expenses. However, due to the nature of the data and the duplication of records for changes to a flight, when the trips were created there may be gaps in the sequence (e.g. XXXX|1, XXXX|5 etc.).

Note: The above was repeated for Train and Rental car travel as well with little to no deviation.

Step 3: Looping through the process

The initial problem was that there was no easy way to formulaically identify the return trip to all trips outstanding (with only a start). Instead, I designed a process that looped from trip 1 to X, where X represents the highest possible trip count in the system.

The process did the following (at a high-level):

  1. Creating the code: Updated the unique trip counter code by employee.
  2. Checking the trip: Checked if a corresponding trip already exists and identified its start/end points.
  3. Checking trip duration: The system determined if any train, air, or rental car records fell within the predetermined trip length and returned the employee to their starting position.
  4. Flagging eligible records: Within the current iteration of the code, flagging if there’s an eligible return record as well as whether a record falls into an existing round-trip.
  5. Updating the relevant information: The trip name is updated if a return record is found. Records are linked to a trip if they fell in within the appropriate time frame.
  6. Incrementing trip counter: The trip counter was increased by one to move on to the next trip.

The above had to be built to touch every available data set, resulting in a consolidated 50-action process. Executing the above process manually would be extremely time-consuming and frustrating. Imagine having to click through 500 iterations, waiting 10 seconds each time for completion, and resetting the trip counter to one whenever data was loaded.

To overcome this challenge I tested both CloudWorks and Anaplan Connect. Due to CloudWorks limitations (and slower response time) I opted to use Anaplan Connect for my solution. Anaplan Connect initiates the process, executing it up to X number of times, where X represents the highest trip counter available for processing. This eliminates the need for manual intervention and repetitive clicking.

With each loop, a live dashboard provides a real-time update on the creation of trips and the processing of corresponding expenses. This enabled an admin to have efficient tracking of the entire process. A corresponding workflow dashboard walks the admin through the end-to-end process, much of which we eventually hope to fully automate.

Implementing the above process yielded impressive results. With a decade worth of data, approximately 95% of all records can now be parsed automatically and attributed to a trip. The remaining 5% may require some form of manual intervention or involve data that should have been marked as "personal" in some capacity. Nevertheless, this solution has significantly reduced the manual effort involved in trip processing and analysis, saving valuable time and resources.

Step 4: The clean up

As I had mentioned earlier, due to the possible gaps in the trip counter, the maximum number of trips an employee might have taken across all employees in relation to the created code is inflated resulting in unnecessary runs of the above looped process.

To reduce this counter and eliminate the gaps, I designed a module using the L4 list combined with days to check if a trip code exists. If it exists, the module finds its next cumulative value (starting from 1) and then a single run process replaces the identified L5 and L6 codes based on this new value in both the source modules and lists (using Name as the unique identifier).

Once this process is run, all gaps between corresponding trips are removed and the looping processes takes significantly less time to complete.

Conclusion

I found building this model incredibly educational as I’ve never attempted to loop through the same set of data hundreds of times in order to parse it with each iteration. It’s important to note that there are always many different ways to build a model, my attempt used Anaplan’s native functionality and its core ability to processes so many actions in seconds while combining it with a looping mechanism.

I hope others can find this as inspiration for their own builds, whether for travel analysis or something even more exciting!