How do you preserve history?

Following the lambda architecture principles, you discover that the recipe for preserving history is relatively simple … now that storage is cheap and processing speed increases every day.

The first thing to establish is that the “master dataset” (or “Persistent Staging Area”) is immutable. This implies that we only ever add new data to it. We never delete or replace existing data. So, the data you used to create a financial report in 2014 still exists in its original raw form, and will continue to be available.
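The append-only idea fits in a few lines of code. This is a minimal sketch, not a production design: the `AppendOnlyDataset` class name and the JSON-lines file format are my own illustrative choices, not something prescribed by the lambda architecture.

```python
import json
import time


class AppendOnlyDataset:
    """Minimal sketch of an immutable master dataset.

    Records can only be appended; nothing is ever updated or deleted,
    so any past state of the data can be reconstructed later.
    """

    def __init__(self, path):
        self.path = path

    def append(self, record):
        # Each fact is written with its arrival timestamp. Opening the
        # file in mode "a" (append) guarantees existing lines are never
        # touched.
        entry = {"received_at": time.time(), "data": record}
        with open(self.path, "a") as f:
            f.write(json.dumps(entry) + "\n")

    def records(self, as_of=None):
        # Replay the log, optionally only up to a point in time. This
        # replay is what lets an old report be reproduced exactly.
        with open(self.path) as f:
            for line in f:
                entry = json.loads(line)
                if as_of is None or entry["received_at"] <= as_of:
                    yield entry["data"]
```

A correction never overwrites a row; it arrives as a new record, and downstream code decides how to interpret the sequence.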

The second thing to establish, for the transformation code, is that you need to use version control (GitHub, etc.) to keep every version of the transformations you ever ran in production. You should already be doing that for plenty of good reasons, but it’s an essential component of preserving history in the architecture I propose.
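One lightweight way to connect the two ideas is to stamp every batch output with a fingerprint of the transformation code that produced it. The sketch below is an assumption of mine, with hypothetical names; in a real deployment you would record the Git commit hash of the deployed code rather than hashing bytecode.

```python
import hashlib


def run_with_provenance(transform, rows):
    """Run a transformation and record which version of the code
    produced the output.

    Here we fingerprint the function's compiled bytecode so the
    example stays self-contained; in practice the stamp would be the
    Git commit hash of the code checked out in production.
    """
    code_version = hashlib.sha256(transform.__code__.co_code).hexdigest()[:12]
    output = [transform(row) for row in rows]
    # The stamp travels with the result, so any report can be traced
    # back to the exact transformation code that built it.
    return {"code_version": code_version, "result": output}


def double_amount(row):
    # Hypothetical transformation, purely for illustration.
    return {**row, "amount": row["amount"] * 2}
```

With the raw input immutable and the code version recorded, re-running the stamped version against the same data reproduces the report byte for byte.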

With those two things in place you can reproduce any report that was created in the past, and you can “try” different processing scenarios until you find the one that works for your business. From this point onwards, you can be as creative as you need to be.

A possible scenario would be to work with the end users to decide which batch processes should never be replaced by new versions. For example, instead of building a process called “financial_process” and changing it over time, I would create “financial_process_2014”, “financial_process_2015”, and “financial_process_2016”, and keep all of them alive and available in production in case we need to run them again. Each of those processes would read from the immutable master dataset and produce the exact same output every time.
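Sketched in code, keeping the yearly processes alive side by side might look like the following. The process logic shown is purely illustrative (I am inventing the 2015 adjustment rule for the example); only the naming pattern comes from the scenario above.

```python
def financial_process_2014(dataset):
    # 2014 rules, frozen: report gross amounts only.
    return sum(r["amount"] for r in dataset if r["year"] == 2014)


def financial_process_2015(dataset):
    # 2015 rules, frozen: a hypothetical 10% adjustment was introduced
    # this year, so the logic lives in its own never-modified function.
    return sum(r["amount"] * 1.10 for r in dataset if r["year"] == 2015)


# Every version stays deployed. Any of them can be re-run at any time
# against the immutable master dataset and will give the same answer.
PROCESSES = {
    "financial_process_2014": financial_process_2014,
    "financial_process_2015": financial_process_2015,
}


def rerun(name, master_dataset):
    return PROCESSES[name](master_dataset)
```

New business rules mean a new function, never an edit to an old one; combined with the immutable input, each process is a pure function whose output can always be regenerated.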

Keeping the raw immutable master data and all the versions of the transformation code will protect you against pretty much anything.

As you can see, I am slowly moving away from the complex temporal data models that can also support these same scenarios. I know that temporal models work; I have built them many times in the past. But you have to balance complexity against rewards. A data architecture based on immutability will let you experiment with different data models, evaluate the benefits, and really prove that a particular flavor of temporal data model is the right solution for a specific scenario … or you may discover that you don’t need all that complexity.


Written on September 30, 2016