The Half-Life of Data and Compounding
While there is much talk (hype?) about “Big Data”, there are a couple of aspects that its advocates (mainly vendors offering technology and services) seem to neglect, or at least underplay. These relate to the quality of the raw data and its eventual usefulness.
So often one hears complaints about the poor quality of data and the need for costly remediation exercises, in large part because senior people do not understand, or do not want to acknowledge, some truisms until they contribute to a crisis.
My degree was in Physics with a good dose of embedded mathematics, and it is from that learning that I will draw my illustrations, along with considerable experience of building data functions and delivering change reliant upon data. As they say, you may take the boy out of science, but it is harder to take the science out of the boy!
The Half-Life of Data
My first illustration is something I call the “Half-Life of Data”. In science there are elements that undergo radioactive decay. Through this process the atoms change and become something different from what you started with and wanted. The half-life is the time it takes for half the material you start with to decay, and it is a constant for a given element. For example, if you started with 100kg of element X that had a half-life of ten days, after ten days you would only have 50kg of element X (ie 50% of 100kg), and in a further ten days you would only have 25kg of element X (ie 50% of 50kg). Note that the portion that has decayed has not disappeared, but rather turned into something else that you may not want and cannot use in the same way you would use element X.
In nature the half-life of different elements varies from minute fractions of a second to hundreds of thousands of years. Carbon-14 dating uses this property to age archaeological samples.
Well, many elements of data have something like a half-life too. Take things like telephone numbers or residential addresses. It is not unlikely that someone will change their mobile phone every couple of years, either because they change job and get a new phone, or because they change their personal provider, or indeed because they move house (for a landline). So, taking a set of mobile phone numbers, if you do nothing to maintain them, one might expect half of them to be wrong within something like 3–5 years (I am guessing, but I hope you get the idea). If one looked at a captured age (ie not calculated from a birth date), then I would expect half to be wrong within six months. My experience suggests that email addresses are probably somewhere in between those two.
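The analogy above can be made concrete with the standard half-life formula: after t years, the expected fraction of records still correct is 0.5 raised to the power of t divided by the half-life. A minimal sketch (the half-life figures are the article's rough guesses, not measurements):

```python
# Expected fraction of a data element still correct after `years`,
# using the standard radioactive-decay formula 0.5 ** (t / half_life).
def fraction_valid(years: float, half_life_years: float) -> float:
    return 0.5 ** (years / half_life_years)

# Illustrative half-lives only, echoing the guesses in the text.
ILLUSTRATIVE_HALF_LIVES = {
    "mobile_number": 4.0,   # guessed at 3-5 years above
    "captured_age": 0.5,    # half wrong within six months
    "email_address": 2.0,   # "somewhere in between"
}

for element, hl in ILLUSTRATIVE_HALF_LIVES.items():
    print(f"{element}: {fraction_valid(2, hl):.0%} expected still valid after 2 years")
```

With a ten-year half-life, half the records are wrong after ten years and three-quarters after twenty, exactly as with element X above.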
Of course some elements, if captured correctly, will not change, eg date of birth. That said, one would not want to rely on the date of birth alone to know whether a person is currently alive; you need additional information for that.
The point I am making is that data decays too, unless you put effort into maintaining it, and the bigger the dataset the more costly the maintenance. Understanding the decay rate of an individual element, and the effort put into maintaining it, helps one better understand the reliance one can place upon that element when it is used in analysis. Just finding a technology that allows you to access, analyse and report on bigger datasets does not mean that you will have good information coming out. The old maxim of “GIGO”, or garbage in, garbage out, is still so true.
Erosion Through Compounding
My second point relates to the mathematical concept of compounding, or multiplication. Here, the compounding of a number of quality measures produces a lower overall quality. This happens because a quality value cannot be greater than 1 (ie perfection is 100%, or 1), and multiplying numbers less than one produces an even smaller number.
Put simply, if I use two data elements to produce an analysis, and we think the quality of one is 90% and the other 80%, simple maths suggests that only 72% of the resultant analysis will be correct, ie 90% * 80%. If a third data element with only 50% quality were used, the result would have only 36% accuracy, ie 72% * 50%.
Now, I know that skilled mathematicians will point out that this assumes no correlation between the quality of the data elements, but it is sufficient for illustrative purposes: the more imperfect data elements you use, the more unreliable the answers you will produce.
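The compounding argument is just repeated multiplication. A short sketch, assuming (as the text does) that the element qualities are independent:

```python
from functools import reduce

def combined_quality(qualities):
    """Multiply per-element quality scores (each in [0, 1])
    to estimate the overall reliability of an analysis."""
    return reduce(lambda a, b: a * b, qualities, 1.0)

print(round(combined_quality([0.9, 0.8]), 2))       # 0.72, as in the text
print(round(combined_quality([0.9, 0.8, 0.5]), 2))  # 0.36
```

Note that each extra imperfect element can only pull the product down, never up, which is the whole point of the illustration.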
Don’t just buy the technology
By definition, Big Data is likely to use more data elements and larger datasets to produce what are offered as new insights. I am sure that these can be produced, but before embarking upon an expensive endeavour I recommend that you at least consider the impact of these two effects on your desired outcome.
To my mind, the result of this deliberation may lead you to one of three things:
- You choose to use a smaller dataset that you can have greater assurance about;
- You understand and accept the cost you will have to invest in data maintenance in order to fight the natural rate of decay; or
- You acknowledge the quality of each data element (maybe with some metadata about each element's half-life, etc) and reflect that in your answers.
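The third option can be sketched in code by combining the two ideas in this piece: hold half-life metadata against each element, discount its quality for the decay since it was last verified, then compound the results. Everything here, from the class name to the figures, is an illustrative assumption rather than a prescribed design:

```python
from dataclasses import dataclass

@dataclass
class ElementMetadata:
    quality_at_capture: float    # measured quality when last verified
    half_life_years: float       # estimated decay rate (a guess, per the text)
    years_since_verified: float

    def current_quality(self) -> float:
        # Discount the captured quality by half-life decay since verification.
        decay = 0.5 ** (self.years_since_verified / self.half_life_years)
        return self.quality_at_capture * decay

elements = [
    ElementMetadata(0.95, 4.0, 2.0),   # eg a mobile number
    ElementMetadata(0.99, 50.0, 2.0),  # eg a date of birth, near-stable
]

# Compound the decayed qualities to caveat the analysis, as above.
overall = 1.0
for e in elements:
    overall *= e.current_quality()
print(f"Estimated reliability of the combined analysis: {overall:.0%}")
```

Publishing a figure like this alongside each answer is one way of "reflecting it in your answers" rather than presenting every insight as equally trustworthy.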
I would be interested to hear what anyone else has been thinking along these lines.