I am an ardent proponent of architectural principles as a way of guiding the work of solution architects. Based on solid TOGAF (The Open Group Architecture Framework) practice, one of my frequent ‘go-to’s is the principle that data should always be accessible.
However, having recently spent some time looking at the area of big data, I have become increasingly convinced that there is at least one design trend from this area that we should apply to other data problems: “data is immutable.”
Nathan Marz, a leading thinker in real-time analytics and creator of Storm (the real-time data processing tool recently adopted by Apache), has based his Lambda architecture pattern on this tenet. All well and good, you may say, if I’m processing billions of tweets every hour, but that’s so far away from my world as to be completely irrelevant. My data changes. Well, of course it does, but if you look closely at the advantages of a world where nothing changes, you may want to reconsider your view.
For me, the four key advantages of immutable data are:
Today we are well versed and practiced in the back-up and recovery of systems (especially around the release of new code). However, we have all faced situations where an undiscovered problem has been left festering for an extended period, corrupting good data and making a forensic assessment of the business impact almost impossible, because the current state is the product of many updates over time. In such instances, our solution is to try to recover from the last known ‘good’ state and replay. However, wouldn’t it be far easier if our original data hadn’t changed and we could review each change individually in the data store itself? Even if the logic is correct and we have stray data values, we can at least exclude them individually if appropriate.
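To make the recovery argument concrete, here is a minimal Python sketch, assuming a simple ledger of balance changes. The `Event` type, the `balances` function and the sample data are all hypothetical, invented purely for illustration. Because nothing is ever updated in place, a stray value can be excluded by its identifier and the state simply re-derived, with no restore-and-replay.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)  # frozen: an event cannot be modified once created
class Event:
    event_id: int
    account: str
    delta: int  # signed change to the balance (hypothetical unit)


def balances(events: Iterable[Event], exclude: frozenset = frozenset()) -> dict:
    """Derive balances by folding over the full, unchanged history."""
    totals: dict = {}
    for e in events:
        if e.event_id in exclude:
            continue  # skip events judged to be bad; the log itself is untouched
        totals[e.account] = totals.get(e.account, 0) + e.delta
    return totals


# Illustrative history only; event 3 stands in for a stray value.
LOG = (
    Event(1, "alice", 100),
    Event(2, "bob", 50),
    Event(3, "alice", -9999),
    Event(4, "alice", 25),
)

as_recorded = balances(LOG)
corrected = balances(LOG, exclude=frozenset({3}))
print(as_recorded)  # {'alice': -9874, 'bob': 50}
print(corrected)    # {'alice': 125, 'bob': 50}
```

Both views come from the same four events, so the business impact of the bad value is simply the difference between them.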
Audit trails usually only store manually entered data, and automated intra-day data is often discarded. This means that if you suddenly find discrepancies in your data, it can be extremely difficult to discover and rectify the root cause of the problem.
With immutable data you are guaranteed to have all the data you need available to solve the issue and ensure that it doesn’t occur again. It may not be easy to manage all that data, but at least you have it.
In solution development, as in other areas of life, we always strive to keep things as simple as possible. When dealing with data, we are faced with a choice between making the write complex or making the read complex. Qualitatively, writing (having to cope with locking, hashing, incremental versioning and the like) is inherently more complex than reading (working out what the correct value is either now or in the past), so it makes sense to keep the write as simple as possible.
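As a rough illustration of that trade-off, the sketch below shows a hypothetical in-memory `AppendOnlyStore` (the class, its method names and the integer timestamps are assumptions, not a real library). The write path is a single append, and all the work of deciding the correct value, now or in the past, moves to the read path.

```python
from typing import Optional


class AppendOnlyStore:
    """Hypothetical store: writes only append, reads resolve the value."""

    def __init__(self) -> None:
        self._facts: list = []  # each fact is (timestamp, key, value)

    def write(self, timestamp: int, key: str, value: str) -> None:
        # The entire write path: no in-place update of earlier facts.
        self._facts.append((timestamp, key, value))

    def read(self, key: str, as_of: Optional[int] = None) -> Optional[str]:
        # The read path does the work: newest fact at or before `as_of`.
        candidates = [
            (ts, value)
            for ts, k, value in self._facts
            if k == key and (as_of is None or ts <= as_of)
        ]
        if not candidates:
            return None
        return max(candidates, key=lambda c: c[0])[1]


store = AppendOnlyStore()
store.write(1, "limit", "100")
store.write(5, "limit", "250")

print(store.read("limit"))           # 250
print(store.read("limit", as_of=3))  # 100
print(store.read("limit", as_of=0))  # None
```

A linear scan is used here only to keep the sketch short; a real store would index the facts, but the write would remain an append.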
We are not locked into a subset of functions across the data. Given access to immutable data, we can construct results or statistics as they were at any previous point in time, or aggregate the data in a new way and generate data for backtesting or analytics.
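The following sketch shows what this flexibility might look like, assuming a small set of immutable trade records. The `Trade` type, its fields and the figures are hypothetical. The same unchanged records answer a point-in-time question and support a grouping that nobody planned for when the data was written.

```python
from collections import Counter
from typing import NamedTuple


class Trade(NamedTuple):  # tuples are immutable; fields are illustrative
    day: int
    desk: str
    amount: int


TRADES = (
    Trade(1, "fx", 10),
    Trade(1, "rates", 20),
    Trade(2, "fx", 5),
    Trade(3, "rates", 7),
)


def total_as_of(trades, day: int) -> int:
    """Total as it stood at the end of `day`, rebuilt from raw records."""
    return sum(t.amount for t in trades if t.day <= day)


def totals_by(trades, field: str) -> dict:
    """Aggregate amounts by any field of the record."""
    totals: Counter = Counter()
    for t in trades:
        totals[getattr(t, field)] += t.amount
    return dict(totals)


print(total_as_of(TRADES, 1))     # 30
print(total_as_of(TRADES, 3))     # 42
print(totals_by(TRADES, "desk"))  # {'fx': 15, 'rates': 27}
print(totals_by(TRADES, "day"))   # {1: 30, 2: 5, 3: 7}
```

Had only a running total been stored, neither the historical figure nor the per-desk breakdown could be recovered afterwards.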
Given the above rationale, perhaps it’s time that we review our architectural principles. My encounter with big data has taught me that there are a number of benefits to working with immutable data. In my opinion, we should include the architectural principle that “data is immutable” and make exceptions where we need to apply storage and processing constraints. So, “data is immutable” is going into my boilerplate principles list. Let’s see how this goes!