
Data Quality in Machine Learning


We regularly see and hear phrases like “data is the lifeblood of an organisation” or “the world’s most valuable resource is no longer oil, but data”. There is no denying that data is an incredibly valuable resource. But a theme that is overlooked in many articles, or mentioned only in passing, is the importance of data quality.

Technology by itself is not a panacea. You can have any technology you like, and you can have as much data as you like, but if you don’t have high-quality data you are taking an immense risk.

This short paper starts by looking at two different types of data, quantitative and qualitative, and then looks at the challenges of using this data in Machine Learning applications.

 

Quantitative vs Qualitative Data

Quantitative data and the results stemming from it are applauded by many as being “scientific” and more “valuable” than non-quantitative data. However, quantitative data is not without faults and limitations. Firstly, quantitative data often produces a binary result, for example a “yes” or “no” answer. This may then be used to make decisions without understanding the true meaning of that answer. This approach can result in decisions that do not lead to the optimal result, and even in opportunities being missed. Secondly, there have been many papers written expounding the benefits of quantitative data, and it is reasonable to assume many more similar papers will be written in the future. Sometimes we fall into the trap of believing if something is said enough times it must be true or at least have an element of truth. Thirdly, it is often assumed that a strong correlation is synonymous with absolute certainty. We sometimes say we have found a correlation with 95% certainty and focus on the 95%. We forget this also means there is a 5% chance the correlation does not exist.
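
That last point can be made concrete with a small simulation. The sketch below (assuming a Python environment with NumPy and SciPy installed; the sample sizes and seed are arbitrary choices for illustration) repeatedly tests two variables that are independent by construction; at a 95% confidence threshold, roughly 5% of the trials still report a “significant” correlation.

```python
import numpy as np
from scipy.stats import pearsonr

# Draw two *independent* variables many times and count how often a
# "statistically significant" correlation (p < 0.05) appears by chance.
rng = np.random.default_rng(42)
trials = 1_000
false_positives = 0
for _ in range(trials):
    x = rng.normal(size=100)
    y = rng.normal(size=100)  # independent of x by construction
    _, p_value = pearsonr(x, y)
    if p_value < 0.05:
        false_positives += 1

# Roughly 5% of trials report a correlation that is not really there.
print(f"Spurious 'significant' correlations: {false_positives / trials:.1%}")
```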

Qualitative data also suffers from faults: the bias of the researcher, the difficulty or impossibility of replicating results, and the considerable cost and time needed to generate it.

Whether you are using quantitative or qualitative data, the quality of that data is key. No matter what technology you use to slice and dice the data, rubbish data generates rubbish results.
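
A minimal sketch of “rubbish in, rubbish out” (assuming Python with scikit-learn; the dataset, noise level and model are arbitrary choices for illustration, not anything prescribed by this article): corrupting a share of the labels measurably drags down a model’s accuracy.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Simulate poor-quality data by randomly corrupting 30% of the labels.
rng = np.random.default_rng(0)
y_noisy = y.copy()
flip = rng.random(len(y)) < 0.3
y_noisy[flip] = rng.integers(0, 3, size=flip.sum())

for name, labels in [("clean labels", y), ("30% corrupted labels", y_noisy)]:
    score = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5).mean()
    print(f"{name}: mean cross-validated accuracy = {score:.2f}")
```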

There are many articles extolling the ability of Machine Learning and AI to improve decision making using qualitative or unstructured data, in addition to quantitative data.

It would make the lives of so many people easier if data came in nice, easy-to-use, structured packages. An article by Forbes estimated that less than 20% of data is structured. The fact that so much data is unstructured makes life a little more challenging. Machine Learning models such as BERT from Google provide an excellent means of making sense of many unstructured data sets.
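
As a taste of what that looks like in practice, here is a minimal sketch (an assumption on my part: it uses the Hugging Face transformers library and the bert-base-uncased checkpoint, neither of which the article specifies) asking BERT to fill in a masked word in raw text.

```python
from transformers import pipeline

# BERT's masked-language-model head guesses a hidden word from raw text,
# a small example of a model extracting meaning from unstructured data.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("Data quality is the [MASK] of machine learning."):
    print(f"{prediction['token_str']!r} (score: {prediction['score']:.3f})")
```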

 

Data in Machine Learning

Machine learning is dependent on data: quantitative, qualitative, structured and unstructured. More importantly, Machine Learning is dependent on good-quality data. The importance of data is illustrated by looking at the high-level Machine Learning process (steps 3 to 5 are sketched in code after the list).

 

Step 1. Data collection

Step 2. Data annotation

Step 3. Ingest data into the model

Step 4. Train the model

Step 5. Evaluate results

Step 6. Additional classification of data / fine-tuning

Step 7. Seek additional data to enhance the model
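
To make steps 3 to 5 concrete, here is a minimal sketch (assuming Python with scikit-learn; the toy dataset and model are illustrative stand-ins for data you would actually collect and annotate in steps 1 and 2).

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Step 3: ingest data (a toy dataset stands in for collected, annotated data)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 4: train the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 5: evaluate results
print(f"Accuracy: {accuracy_score(y_test, model.predict(X_test)):.2f}")
```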

 

Step 2 in this process provides an example of the importance of data. Annotation of data is very expensive and time consuming, but also critical to the success of the machine learning application. One challenge that is overlooked at this stage is the variation in understanding of text by those carrying out the classification. For example, if one person’s background allows them to use elaborated code and another uses restricted code, they are likely to interpret the same text differently. You can try to overcome some of these challenges by having guidelines, having data reviewed several times, and then reaching a consensus. But this adds to the cost and time to build a production Machine Learning application.
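
One common way to quantify how much annotators disagree is an inter-annotator agreement statistic such as Cohen’s kappa. The sketch below (the labels are invented for illustration, and the article does not prescribe this particular metric) uses scikit-learn’s implementation.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators classifying the same ten texts.
annotator_a = ["pos", "neg", "pos", "pos", "neg", "pos", "neg", "neg", "pos", "pos"]
annotator_b = ["pos", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "pos", "neg"]

# Cohen's kappa corrects raw agreement for agreement expected by chance;
# values well below 1.0 suggest guidelines or consensus rounds are needed.
print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")
```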

Another key challenge is selection bias throughout this process, or even reverse engineering data to generate desired results. The issue of bias is not new, and many approaches have been implemented to reduce selection bias, including taking care in selecting the learning model, taking care in selecting the training data, and so on. The success of these attempts to reduce or eliminate bias is questionable in many instances.
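
One simple, widely used guard against one form of selection bias is a stratified split that preserves class proportions between training and test data. The sketch below assumes a scikit-learn workflow with a made-up imbalanced label set; it is an illustration, not a claim about the methods the article has in mind.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 90% of examples in class 0, 10% in class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the class ratio identical in the train and test sets,
# a simple guard against selection bias introduced by the split itself.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(f"Test-set positive rate: {y_test.mean():.2f}")  # ~0.10, matching the population
```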

In conclusion, Machine Learning offers huge potential, but before even considering which Machine Learning technology to use, attention must be paid to the quality of data and how you can source it. It is also worth looking at the return on investment (ROI), as good-quality data is not always cheap.

 


Comments: (2)

Ketharaman Swaminathan - GTM360 Marketing Solutions - Pune, 08 September 2020, 12:24

Re. "Sometimes we fall into the trap of believing if something is said enough times it must be true or at least have an element of truth.", sadly it's not a trap but a law:( 

Called "Clear's Law of Recurrence", it states "The number of people who believe an idea is directly proportional to the number of times it has been repeated - even if the idea is false."

Going by this law, the more people take this post at face value, the less the chance that it would be caught out.

Tejasvi Addagada - Fortune 500 financial service provider - Mumbai, 19 September 2020, 16:52

The first Data Quality challenge is most often the acquisition of right data for Machine Learning Enterprise Use cases.

Even though the business objective is clear, data scientists may not be able to find the right data to use as inputs to the ML service/algorithm to achieve the desired outcomes.

As any data scientist will tell you, developing the model is less complex than understanding and approaching the problem/use-case the right way. Identifying appropriate data can be a significant challenge. You must have the “right data.”

More broadly speaking, Coverage can be categorized under the Completeness dimension of Data Quality, and is called the Record Population concept within the Conformed Dimensions standard. This should be one of the first checks performed before proceeding to other Data Quality checks.

 
