Big Data Pitfalls: The Amateur Data Scientist

We know of the potential for big data in financial services to detect fraud, to hunt down rogue traders, to upsell services: the list is endless. Depending on who you talk to, big data is anything from the key to unlocking unlimited prosperity to a solution in search of a problem. I find such polar debates largely pointless; as with all interesting issues, the answer is much more nuanced than that.

Our Miraculous Ability to Detect Patterns

One thing that does concern me is the rise of Big Data “exploration” tools.  It’s not that I don’t like the technology - I think it’s amazing.  The visualisations are elegant and impressive.  The performance is enviable, the ability to “do data science” is robust, and yet the approach is often wrong.

The challenge with visual data exploration tools is this:  We human beings are great at detecting patterns.  We recognize friends by the backs of their heads, but we also see floating spoons on Mars, and transform coat hooks into inebriated and belligerent cephalopods.  In other words, we sometimes see things that simply aren’t there.

On the flip-side of this, complex and multi-dimensional numerical patterns are well beyond our ability to grasp visually. It’s difficult to visually depict more than four dimensions or variables on a data point (X, Y, Z coordinates + colour gradient).  Humans are also limited in the number of data points they can accurately process at once – hence all the controversy around offside decisions in football.  For these reasons, humanity has developed robust mathematical tools to help find patterns using both deterministic and probabilistic methods - but they’re not perfect.

Deterministic tools, which presume that all information is known, often fail when confronted with complex phenomena. Probabilistic tools presume a measure of “unknowns” and attach probabilities to them; however, such models are difficult to use and the results are hard to interpret.

For the programmer or business analyst venturing into the world of data science, beware!  You’ve just enough skill to be dangerous!

The Skills Lacuna

The data integration specialist isn’t a data scientist, and the person calling themselves a data scientist may not be a qualified statistician.

Imagine you’re helping a financial institution’s trading arm build out a scalable and highly available platform to support high velocity, high volume, and wide varieties of data (3-Vs of Big Data).  They wanna do Big Data!

You’re a clever programmer and naturally think statisticians are either lazy or crippled by a lack of programming skill. You design a program to search every set of data for correlations with every other dataset in the organisation, as well as external datasets, public information, and so on. Of course you’re hoping to find that piece of predictive magic, a secret sauce for stock trades that would generate a wellspring of cash. A budding statistician, you decide on a 95% confidence level (roughly 2-sigma, or a 5% significance level, as it is properly called).
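To get a feel for the scale of such a search, here is a minimal Python sketch that counts how many correlation tests an all-pairs scan implies. The sizes (200 datasets, up to 24 lags) and the function name `count_tests` are my own illustrative assumptions, not figures from any real project.

```python
from math import comb


def count_tests(n_series: int, max_lag: int = 0) -> int:
    """Number of correlation tests in an all-pairs scan.

    Each unordered pair of series is tested once at lag 0, plus
    max_lag lags in each direction (A leading B, and B leading A).
    """
    pairs = comb(n_series, 2)
    return pairs * (1 + 2 * max_lag)


# Hypothetical sizes, chosen only for illustration.
print(count_tests(200))              # 19900
print(count_tests(200, max_lag=24))  # 975100
```

Even a modest collection of datasets turns into tens of thousands of tests, and allowing lagged comparisons pushes that towards a million.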


You’ve discovered so many new predictive variables; perhaps you’ve found that the 23rd-lagged quarterly Namibian Consumer Price Index is a near perfect indicator of the current price of US steel.  You believe that your new formula for predicting US steel prices is 95% accurate.

Big Mistake.

Without getting too technical, this 95% figure does not mean that 19 out of every 20 correlations you find are “real”. It means that when no relationship exists at all, a single test will still report one about 1 time in 20 (5%): a false positive. Now when you’re doing hundreds of tests for correlation, you’ll potentially get dozens of false positives, and nothing in the output tells you which ones they are. This is why particle physicists at CERN’s Large Hadron Collider, searching for a Higgs-like boson using a technique that requires an extraordinary number of tests, held out for the 5-sigma level of certainty. At 5 sigma, there’s only about a 0.00003% chance (roughly 1 in 3.5 million) that random noise alone would produce a signal that strong in a given test.
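The effect is easy to reproduce. The sketch below is my own illustration, not anyone's production method: it tests 1,000 columns of pure random noise against a random "target" series and counts how many clear the 5% threshold. The function names, sample sizes and seed are assumptions, the p-value uses the Fisher z approximation rather than an exact test, and the Bonferroni correction shown for contrast is one common remedy that I am adding for comparison.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def p_value(r, n):
    """Approximate two-sided p-value for r (Fisher z, normal approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def one_sided_tail(sigma):
    """Probability that a standard normal variable exceeds `sigma`."""
    return 0.5 * math.erfc(sigma / math.sqrt(2))


def count_hits(n_tests, n_obs, alpha, seed=0):
    """Count 'significant' correlations between a noise target and noise candidates."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    hits = 0
    for _ in range(n_tests):
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        if p_value(pearson_r(target, candidate), n_obs) < alpha:
            hits += 1
    return hits


n_tests, n_obs = 1000, 100  # illustrative sizes only
print("hits at 5%:", count_hits(n_tests, n_obs, 0.05))
print("hits with Bonferroni:", count_hits(n_tests, n_obs, 0.05 / n_tests))
for sigma in (2, 5, 7):
    print(f"{sigma}-sigma one-sided tail: {one_sided_tail(sigma):.3g}")
```

Every series here is noise, so every "hit" is a false positive. At the 5% threshold you should see roughly 50 of them; once the threshold is divided by the number of tests, you should see almost none.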

Where does this leave us?

First, and foremost, I’m not denouncing exploratory research.  Such work is important and the basis of most of the key breakthroughs in human history. 

I’m simply saying that without knowing your tools and methods, you’re bound to make simple mistakes, and many of today’s data exploration tools let you do a lot of heavy lifting without knowing much about what you’re doing.

If your organisation is so “bleeding edge” that new ideas are immediately implemented, STOP.  You’re bound to fall into these or the many other traps in the Big Data Jungle. 

You can avoid many of these problems by including Big Data Integration experts, statisticians, industry experts, and operations staff in the conversation.

I’d also encourage everyone thinking about their Big Data future to do some academic exploration in statistics; courses like these may be a great start -


[Image: comic strip]

Comments: (1)

Ketharaman Swaminathan - GTM360 Marketing Solutions - Pune, 03 November 2015, 10:43

The guy in your comic strip seems to be regretting that he attended the statistics class. By not being able to claim that Correlation Equals Causation, he has forfeited many opportunities to earn 15 minutes of fame. No harm picking up a technique or two from How To Lie With Big Data to make a sensational claim that gets past a couple of publications and gets you your 15 minutes of fame. As long as the reader is no more knowledgeable than the writer, you're good.