I've been asked several times about the difference between Data Mining and Predictive Analytics. Well, that's not strictly true. I've been asked only once, and even then only through the following tweet from @IBMAnalytics, which reached me as one of its 15.7K followers:
Replies ranged from "depends" - will some people begin all their answers this way? - to "Predictive Analytics is a form of Data Mining that does blah blah blah".
Since I wasn't too satisfied with any of these definitions, I came up with one of my own:
Data Mining uncovers relationships between measurable variables whereas Predictive Analytics surmises outcomes from measurable variables.
Although all forms of data analysis are colloquially called "mining of data", there are significant differences between Data Mining and Predictive Analytics, as the above definition suggests.
Both branches of analytics are grounded in a huge amount of mathematical theory dating back several decades. However, since this isn't a mathematics journal, I'll skip all that and restrict myself to using some real-world business scenarios to illustrate the difference between the two technologies.
Data Mining (DM)
Sales of cigarettes and detergent rose and fell in tandem in Norfolk, Virginia.
- Sales of both products are measurable variables.
- No one specifically asked the DM software to find any relationship between the sales of these two products. In fact, apart from the most clairvoyant amongst us, no one would even suspect that there was any such relationship. I can hear marketers and brand managers say, "Cigarette and mint sales could be related. Ditto washing machine and detergent. But how can there be any relationship between cigarettes and detergent?"
- The DM software crunches sales data of many products - not just cigarettes and detergents - in Norfolk to uncover this hitherto unknown relationship between just these two product categories.
- Even after they receive the software's insight, brand managers might question it: "Why are sales of cigarettes and detergent linked?" This is typically how DM works. It focuses on correlation ("what") and not cause and effect ("why").
- Data Mining generally operates on aggregate data gathered over a period of time.
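To make the "crunch everything and see what moves together" idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the weekly sales figures are made up, the names `pearson`, `strongest_pairs` and `weekly_sales` are hypothetical, and a plain Pearson correlation stands in for whatever technique a real DM product would use.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series has no defined correlation; treat as none
    return cov / sqrt(var_x * var_y)


def strongest_pairs(sales, threshold=0.9):
    """Check every pair of products and keep those that move together.

    Nobody tells the function which pairs to look at: it tries them all,
    which is the point of the exercise.
    """
    pairs = []
    for a, b in combinations(sorted(sales), 2):
        r = pearson(sales[a], sales[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return sorted(pairs, key=lambda p: -abs(p[2]))


# Made-up weekly unit sales for one city (illustrative only).
weekly_sales = {
    "cigarettes": [120, 135, 110, 150, 160, 125],
    "detergent": [80, 92, 74, 101, 108, 85],
    "mints": [40, 38, 45, 41, 39, 44],
    "batteries": [20, 25, 22, 19, 24, 21],
}

for a, b, r in strongest_pairs(weekly_sales):
    print(f"{a} and {b} move together (r = {r:.2f})")
```

Note that the output only says which products rise and fall in tandem. Nothing in it explains why, which is exactly the "what, not why" limitation described above.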
Predictive Analytics (PA)
John Doe's credit card was used fraudulently at D-MART Mumbai just now. (D-MART is a leading Every Day Low Price retailer in India.)
- Fraud is an outcome.
- The PA software starts with geographical location, spend velocity, category spends and other measurable variables related to John Doe, the cardholder in question. It then computes an index, based on which it flags the transaction as genuine or fraudulent.
- Predictions are generally the result of cause-and-effect relationships.
- The PA software starts with the following inputs ("cause"): (a) the cardholder's billing address is Norfolk, Virginia, USA; (b) the last genuine transaction on this card happened in McCarthy Mall in Norfolk two hours ago; (c) it takes at least 14 hours to travel from Norfolk, USA to Mumbai, India. By joining these dots, the software infers that John Doe couldn't have reached Mumbai from Norfolk so soon and draws the conclusion ("effect") that John Doe's card was used by a fraudster in Mumbai.
- Predictive Analytics generally operates in real time on individual transactions.
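The "joining the dots" step in the John Doe example can be sketched in a few lines of Python. This is a simplified illustration, not how any particular fraud engine works: it covers only the travel-time rule, whereas real PA software would fold many more variables into its index. The `Transaction` class, the `MIN_TRAVEL` table and the `is_suspicious` function are hypothetical names, and the timestamps are invented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Transaction:
    card_id: str
    city: str
    timestamp: datetime


# Assumed minimum travel times between city pairs (illustrative only).
MIN_TRAVEL = {
    frozenset({"Norfolk", "Mumbai"}): timedelta(hours=14),
}


def is_suspicious(last_genuine, current):
    """Flag a transaction if the cardholder could not have travelled
    from the last genuine location to the current one in the time elapsed."""
    if last_genuine.city == current.city:
        return False
    min_travel = MIN_TRAVEL.get(frozenset({last_genuine.city, current.city}))
    if min_travel is None:
        return False  # no travel data for this pair; this rule cannot decide
    elapsed = current.timestamp - last_genuine.timestamp
    return elapsed < min_travel


# Last genuine swipe in Norfolk, then a swipe in Mumbai two hours later.
norfolk = Transaction("john-doe", "Norfolk", datetime(2024, 1, 1, 10, 0))
mumbai = Transaction("john-doe", "Mumbai", datetime(2024, 1, 1, 12, 0))

print(is_suspicious(norfolk, mumbai))  # True: 2 hours is less than 14
```

Unlike the Data Mining case, the check runs on a single transaction as it happens, and the reason for the verdict can be read straight off the rule.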
Based on the above illustration, if you get the feeling that Predictive Analytics is "more scientific" than Data Mining, you're not alone. Business folks have always been uncomfortable accepting DM conclusions that don't make business sense. Paraphrasing Hamlet's entreaty to Horatio, Data Mining practitioners have retorted with "there are more things in bits and bytes than are dreamt of in your MBA". Without taking sides, I'm not surprised that Data Mining has not yet reached the business mainstream despite being around for such a long time and having garnered support from many vocal flagbearers, including Tibco CEO Vivek Ranadivé.
However, that doesn't mean that you should forget about Data Mining and divert your entire analytics budget to Predictive Analytics.
I'll explain why in a future blog post. Spoiler alerts: Predictive Analytics and false positives; Data Mining's "two second advantage"; and making analytics strategic. Stay tuned.