What is a successful Data Warehouse (Part 1)?

In Data Warehousing, users and engineers perceive success differently. In Part 1, let’s talk about the success factors from the point of view of our users.

Early Start (quick feedback)

The real requirements appear when real users start using the real data warehouse solution. No project methodology (or weeks of meetings) changes this fact; it is human nature. That said, we should do everything we can to slice the work into small increments and start delivering the first few features as soon as possible. We want this valuable feedback sooner rather than later. A short feedback cycle also makes the users feel more involved and more willing to help when they see their feedback turn into deliverables in a matter of days instead of months.

Complete (raw and transformed data)

If you leave the raw data behind in the source, you have already applied human decisions to the data while transforming it on its way to the data warehouse. If you discover that your assumptions were wrong, you have to redo a very expensive process of re-extracting directly from all your sources. Keeping the raw data in the data warehouse is also very useful when you are reporting on data quality. The raw data will mostly stay hidden from the business user in a master repository, but it will be available very quickly if needed. If you don’t provide a quick and easy way to look at raw data, the end-users will build their own raw data pipelines.
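
To make this concrete, here is a minimal sketch (in Python, with invented paths and an invented business rule) of landing the untouched extract first, so transformations can be redone from the warehouse’s own raw copy instead of re-extracting from the source:

    import json
    from datetime import date
    from pathlib import Path

    RAW_ZONE = Path("warehouse/raw")  # hypothetical master (raw) area

    def land_extract(source_name: str, rows: list) -> Path:
        """Persist the untouched extract first; transformations read from
        here, so a wrong assumption never forces a re-extract at the source."""
        RAW_ZONE.mkdir(parents=True, exist_ok=True)
        raw_file = RAW_ZONE / f"{source_name}_{date.today():%Y%m%d}.json"
        raw_file.write_text(json.dumps(rows))
        return raw_file

    def transform(raw_file: Path) -> list:
        """Apply the (revisable) business rules to the raw copy."""
        rows = json.loads(raw_file.read_text())
        return [r for r in rows if r.get("status") != "CANCELLED"]  # example rule

    raw = land_extract("orders", [{"id": 1, "status": "OPEN"},
                                  {"id": 2, "status": "CANCELLED"}])
    print(transform(raw))  # the rules can change; the raw copy stays put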

Accurate

Without accuracy, you guarantee the death of your data warehouse in a very short timeframe. You need to at least wrap the most complex and important transformations/calculations in good automated tests that you run every time you change or add a feature. You need to create the most important specifications using real-life examples that the business can relate to. You will also use those examples to build your automated test library.
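
As a minimal sketch of this idea (the calculation and figures are invented), a business-provided example can be turned directly into a regression test that runs on every build:

    # net_revenue() is a hypothetical stand-in for one of your most
    # important warehouse calculations.
    def net_revenue(gross: float, discount_pct: float, returns: float) -> float:
        return gross * (1 - discount_pct / 100) - returns

    def test_net_revenue_matches_finance_example():
        # Real-life example supplied by the business: one specific invoice.
        assert net_revenue(gross=1000.0, discount_pct=10.0, returns=50.0) == 850.0

    test_net_revenue_matches_finance_example()  # rerun on every change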

Clear Context

Ambiguous data elements will force your end-users to do a lot of mental gymnastics and may also produce inaccurate results on reports. You need to establish the context of your data warehouse and work with your end-users to define attribute names and table names that relate to them. You also need to define the appropriate grain of their information so that querying their data becomes a natural process. Work on making attributes more precise for reporting by using clear labels instead of bare “Yes” or “No” flags. You can also enrich the context by applying textual disambiguation techniques to your semi-structured data.
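
A small sketch of the labeling idea (column and label names are invented): during the transformation step, decode the ambiguous flag into a label that reads naturally on a report:

    # Map an ambiguous Y/N flag to labels end-users recognize.
    ACTIVE_LABELS = {"Y": "Active customer", "N": "Closed account"}

    def decorate(row: dict) -> dict:
        out = dict(row)
        out["customer_status"] = ACTIVE_LABELS.get(out.pop("active_flag", None), "Unknown")
        return out

    print(decorate({"customer_id": 7, "active_flag": "Y"}))
    # {'customer_id': 7, 'customer_status': 'Active customer'}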

Frequently Updated (data and features)

We want to embrace the feedback we get from the end-users, so we need a development structure that handles this feedback and can update our data warehouse features at the rhythm of the business. We also want a development process and a data architecture that support agility so we can frequently add and deliver new attributes. If our end-users feel that their update requests are falling into a black hole, they will start building their own data pipelines. Adding metadata-driven automation to your data warehouse will also help deliver new features quickly and accurately.
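
As one hedged illustration of metadata-driven automation (table and column names are invented), load statements can be generated from a metadata catalog, so adding an attribute means editing metadata rather than hand-writing another statement:

    # One generator, every table: new attributes are added in the metadata.
    COLUMN_METADATA = {
        "dim_customer": ["customer_id", "customer_name", "customer_status"],
        "dim_product":  ["product_id", "product_name", "category"],
    }

    def build_load_sql(table: str) -> str:
        cols = ", ".join(COLUMN_METADATA[table])
        return f"INSERT INTO {table} ({cols}) SELECT {cols} FROM staging_{table}"

    for table in COLUMN_METADATA:
        print(build_load_sql(table))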

Fast Querying

Usability is a must for a data warehouse solution. Our users will understand that some queries need to run longer, but it is very important that they feel that 90% of their queries return results as fast as possible. You have to stay alert on this: monitor your visualization tools and adjust the architecture, the design or the data model to keep query times in the zone that makes sense for your business.
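
The monitoring can start as simply as this sketch: time every query and flag the ones that fall outside your acceptable zone (the 2-second threshold here is an assumption, not a recommendation):

    import time
    from contextlib import contextmanager

    SLOW_QUERY_SECONDS = 2.0  # hypothetical boundary of the acceptable zone
    timings = []

    @contextmanager
    def timed(query_name: str):
        start = time.perf_counter()
        yield
        elapsed = time.perf_counter() - start
        timings.append((query_name, elapsed))
        if elapsed > SLOW_QUERY_SECONDS:
            print(f"SLOW: {query_name} took {elapsed:.2f}s")

    with timed("monthly_sales_by_region"):
        time.sleep(0.1)  # stand-in for the real query execution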

Fast (enough) Load

Depending on the business you are in and the goals you have, a wide range of data refresh requirements may occur, from monthly batches to real-time updates. In many instances, batch processing is all that is required. You can also do mini-batch processing at very frequent intervals and still use a fairly simple process. The decision to go “real-time” is a very important one, and it adds a lot of complexity to your architecture: it requires its own layer (the Speed Layer) and needs the serving layer to accommodate both the Speed and Batch layers. Overall, the important thing is to find what “good enough” speed is for your situation.
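
Here is a minimal mini-batch sketch (fetch_since() is a hypothetical source call): poll at a frequent interval and load only rows newer than a stored watermark, which is far simpler than a true Speed Layer and often fast enough:

    from datetime import datetime, timedelta

    watermark = datetime(2016, 10, 20)

    def fetch_since(ts: datetime) -> list:
        # stand-in for "SELECT ... WHERE updated_at > :ts" against the source
        return [{"id": 1, "updated_at": ts + timedelta(minutes=5)}]

    def run_mini_batch() -> None:
        global watermark
        rows = fetch_since(watermark)
        if rows:
            # ... load the rows into the warehouse here ...
            watermark = max(r["updated_at"] for r in rows)
            print(f"loaded {len(rows)} rows, watermark now {watermark}")

    run_mini_batch()  # schedule every few minutes instead of going real-time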

Robust (restartable, recomputable)

The end-users expect that the data warehouse will always be up and running when they need it. They also expect that it will get back up quickly if a crash happens. As architects, we need to design our processes so that the solution to a crash is a simple restart. In today’s architectures, we can no longer afford designs that require complex table manipulations to reset the data warehouse to a previous state for re-processing. We need to embrace designs like the Lambda Architecture that make our solutions more robust by simplifying the process flow.
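
A sketch of what “the fix is a simple restart” can look like (paths and the transform are invented): each run recomputes a whole partition from raw data and swaps it in atomically, so the run is idempotent and a crash mid-run never leaves state to untangle:

    import json
    from pathlib import Path

    def rebuild_partition(day: str, raw_dir: Path, out_dir: Path) -> None:
        """Idempotent: running this twice yields exactly the same partition."""
        raw_rows = json.loads((raw_dir / f"{day}.json").read_text())
        result = [r for r in raw_rows if r.get("valid", True)]  # example transform
        tmp = out_dir / f"{day}.json.tmp"
        tmp.write_text(json.dumps(result))
        tmp.replace(out_dir / f"{day}.json")  # atomic swap: all-or-nothing

    # After any failure, recovery is just: rebuild_partition(day, raw_dir, out_dir)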

Positive Impact (adding value)

It is very important that the data delivered adds value to the business. The best way to make sure of this is to manage a backlog of data warehouse stories and prioritize it by value and feasibility. To embrace changes in business priorities and eliminate waste, help the business prioritize just a few stories at a time, just enough to keep the development team busy until the next prioritization discussion. By focusing on feasible, valuable features you have a better chance of delivering something that users will find useful. Hopefully, useful data will make a positive impact on our end-users and ultimately on the business itself.

Frequently Used

Succeeding on all the previous points should generate enough buzz to bring end-users into the data warehouse … but it is still possible that old ways of doing things slow down the acceptance of the solution. At this point, you may have to work on decommissioning older data feeds and, of course, train people on the new data. If your team develops the most valuable stories and then nobody uses the data warehouse, you have a major problem on your hands. It could be a gap between priorities and the people doing the work. It could be that they need just a little more data (the next few stories) to get fully started. This is why you don’t just let the business prioritize stories while you watch with your arms crossed. Get involved, learn as much as you can about the business, push the boundaries a little, study the industry to generate new ideas and sit with the people doing the work. Watch, listen and learn.

Easy to Change (well-crafted)

Well-crafted data architectures support changes and maintenance gracefully. The days of an all-integrated data architecture based on one all-powerful RDBMS are a thing of the past. Today, systems tend to decouple:

  • the collection of data
  • the storage of data
  • the presentation of data.

This decoupling gives you the freedom to build each layer in the most efficient manner using the most appropriate technologies. Decoupling also supports agility: it lets you start with a simple solution that evolves gracefully, replacing components over the next few years without disturbing the whole pipeline (without major re-architecting phases).
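
A small sketch of that decoupling (all names invented): each layer sits behind its own small interface, so swapping one layer’s technology later does not disturb the others:

    from typing import Protocol

    class Storage(Protocol):               # the seam between the layers
        def write(self, rows: list) -> None: ...
        def read(self) -> list: ...

    class InMemoryStorage:                 # could later be HDFS, S3, an RDBMS...
        def __init__(self) -> None:
            self._rows = []
        def write(self, rows: list) -> None:
            self._rows.extend(rows)
        def read(self) -> list:
            return list(self._rows)

    def collect() -> list:                 # collection: only knows how to gather
        return [{"id": 1, "amount": 42}]

    def present(store: Storage) -> None:   # presentation: only knows how to read
        for row in store.read():
            print(row)

    store = InMemoryStorage()
    store.write(collect())
    present(store)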

Compliant and Auditable

For Data Warehouses supporting financial reporting, you must have a clear set of patterns for tracking history and for proving the good health of your data transformations. You may have to prove this sitting in a meeting with auditors, and that is not the moment to look unsure about the data architecture. You need to find a way for older reports to always provide the same results, now and forever. The Lambda Architecture helps a lot in this area because processes always run on a full set of master (raw) data and generate the same output every time. Combined with good code management practices, you can always retrieve an old process, run it against the master data and get the same output.
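
A hedged sketch of that audit property (the report and figures are invented): a pure function over the immutable master data yields the same report every time, and tagging runs with the code version lets you rerun last year’s logic on demand:

    CODE_VERSION = "report-v1.3.0"  # hypothetical tag from source control

    def monthly_report(master_rows: list, month: str) -> dict:
        """Pure: output depends only on master data + code, never prior state."""
        rows = [r for r in master_rows if r["month"] == month]
        return {"version": CODE_VERSION, "month": month,
                "total": sum(r["amount"] for r in rows)}

    master = [{"month": "2016-09", "amount": 100},
              {"month": "2016-09", "amount": 250}]
    print(monthly_report(master, "2016-09"))  # identical output on every rerun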

Living Documentation

Your end-users will ask for a data dictionary or a source-to-target mapping document; this is a fact. Those documents have a very bad habit of becoming obsolete very quickly, and it is labor-intensive to update them once they fall behind the code. As a near-universal Data Warehousing rule, the code ends up being “the” document about the transformations.

A way around that is to describe the data transformations as automated tests readable by your end-users … before you code. By writing readable tests that are directly executable against your code, you are in fact building living documentation. Your team and the end-users discuss the stories and describe them as automated tests. The team then codes to make the tests pass. The tests are added to a growing library of regression tests that gets executed each time we build new features (or change existing ones). That library of tests ensures that both the new features and the older ones work.
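
As a sketch of what such a readable, table-driven test can look like (the shipping rule and figures are invented, close in spirit to the home-grown testing functions mentioned below):

    def shipping_cost(order_total: float) -> float:
        return 0.0 if order_total >= 50 else 7.95

    # One readable row per business example, agreed on before coding.
    EXAMPLES = [
        # order_total, expected_shipping, business note
        (49.99, 7.95, "just under the free-shipping threshold"),
        (50.00, 0.00, "free shipping starts at exactly $50"),
        (120.0, 0.00, "large orders always ship free"),
    ]

    def run_examples() -> None:
        for order_total, expected, note in EXAMPLES:
            actual = shipping_cost(order_total)
            assert actual == expected, f"{note}: got {actual}"
        print(f"{len(EXAMPLES)} documented examples pass")

    run_examples()  # part of the regression library run on every build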

Data Warehousing still has a lot to learn from application development on this. The existing testing frameworks are not as mature when working with databases. Also, to make the tests readable, you have to use some type of language like Gherkin or some tabular format. After trying many frameworks, I am currently writing my own testing functions as part of our development. This is still an area open for some major improvements … and we have no choice really: our users need some documentation at some point.

Written on October 20, 2016