Aspects of data quality an Artificial Intelligence unit has to focus on for better outcomes

04 July 2021

Tejasvi Addagada

Enterprise Data Head

Fortune 500 financial service provider

Clean data is crucial to getting a usable outcome from machine learning models. Often as much as 40% of the time is spent on understanding and cleaning data before modeling and delivering an outcome.

Scale and diversity of data are another important aspect when planning a model. A major question is whether the data is accurate enough to give a usable outcome. -> Accuracy measures the degree of equivalence to real-world information.
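One way to make accuracy measurable is to compare recorded values against a trusted reference source. The sketch below is only an illustration under assumptions: the function name `accuracy_rate`, the field names and the sample records are all hypothetical, and a real check would need a reference that is itself verified.

```python
def accuracy_rate(records, reference, key, field):
    """Share of comparable records whose `field` equals the reference value."""
    ref_by_key = {r[key]: r[field] for r in reference}
    # Only records that also exist in the reference can be judged.
    comparable = [r for r in records if r[key] in ref_by_key]
    if not comparable:
        return None  # nothing to compare against
    matches = sum(1 for r in comparable if r[field] == ref_by_key[r[key]])
    return matches / len(comparable)


# Hypothetical sample data: a CRM extract and a verified address source.
crm = [
    {"customer_id": 1, "postcode": "SW1A 1AA"},
    {"customer_id": 2, "postcode": "EC2V 7HH"},
    {"customer_id": 3, "postcode": "M1 1AE"},
    {"customer_id": 4, "postcode": "B1 1AA"},  # not in the reference, so skipped
]
verified = [
    {"customer_id": 1, "postcode": "SW1A 1AA"},
    {"customer_id": 2, "postcode": "EC2V 7HN"},
    {"customer_id": 3, "postcode": "M1 1AE"},
]

rate = accuracy_rate(crm, verified, "customer_id", "postcode")
print(f"Postcode accuracy: {rate:.0%}")
```

Records without a reference counterpart are left out of the ratio rather than counted as wrong, which is a design choice; they arguably belong to a coverage or completeness measure instead.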

Machine-learning models are easy to plan, but data, the prime constituent of AI, often lacks planning. The basic predictive efficiency of AI models is defined by the diversity, scale and quality of the input data. -> Coverage & Availability ensure that the required coverage of data across customers, products and other dimensions is not missed, and that the data is also made available for use.
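Coverage can be checked by listing the combinations of dimension values that a model expects to see and finding those with no data at all. The following is a minimal sketch; the dimension names, their values and the helper `coverage_gaps` are assumptions made for illustration.

```python
from itertools import product


def coverage_gaps(records, dimensions):
    """Return expected dimension combinations that have no record."""
    names = list(dimensions)
    expected = set(product(*(dimensions[n] for n in names)))
    observed = {tuple(r[n] for n in names) for r in records}
    return sorted(expected - observed)


# Hypothetical dimensions and training records.
dimensions = {
    "segment": ["retail", "sme"],
    "product": ["loan", "card"],
}
training_rows = [
    {"segment": "retail", "product": "loan"},
    {"segment": "retail", "product": "card"},
    {"segment": "sme", "product": "loan"},
]

gaps = coverage_gaps(training_rows, dimensions)
print(gaps)  # [('sme', 'card')]
```

A gap here only says that no record exists for a combination; whether enough records exist for each combination is a separate question of scale.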

Most of the data held by information aggregators or large institutions is not consistent across systems and processes, nor is it consistently formatted across the organization. -> Structural Consistency & Semantic Consistency help bring consistency in format and meaning. A product code in one dataset might not mean the same thing in another.
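The product-code example can be sketched as two separate checks: a structural one (does the code follow the agreed format?) and a semantic one (does the same code mean the same thing in both systems?). The code format, the two system dictionaries and the function names below are hypothetical.

```python
import re

# Hypothetical house format: two capital letters, a hyphen, three digits.
PRODUCT_CODE_PATTERN = re.compile(r"[A-Z]{2}-\d{3}")


def structurally_invalid(codes):
    """Codes that do not follow the agreed format."""
    return sorted(c for c in codes if not PRODUCT_CODE_PATTERN.fullmatch(c))


def semantic_conflicts(system_a, system_b):
    """Codes present in both systems whose meaning differs."""
    shared = system_a.keys() & system_b.keys()
    return {
        code: (system_a[code], system_b[code])
        for code in sorted(shared)
        if system_a[code] != system_b[code]
    }


# Hypothetical code tables from two systems.
lending = {"HL-001": "home loan", "PL-002": "personal loan"}
cards = {"HL-001": "high-limit card", "PL-002": "personal loan", "cc7": "credit card"}

bad_format = structurally_invalid(cards)
conflicts = semantic_conflicts(lending, cards)
print(bad_format)  # ['cc7']
print(conflicts)   # {'HL-001': ('home loan', 'high-limit card')}
```

Note that `HL-001` passes the structural check in both systems and still fails the semantic one, which is why the two dimensions are worth measuring separately.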

Data in a typical financial services landscape is usually spread across disparate systems. These systems create, acquire, store, maintain and archive data in varied ways. The challenge lies in integrating and aggregating that data, perhaps in a common data lake, as an input to AI-based services. -> Integration is in fact an enabler of data management that makes data available in the required feature set or dataset, on demand.

AI is driving the need to build real-time data flows across institutions to access essential data. Real-time data flows are still a distant goal for most organizations. The next challenge is architecting data flows that make streaming data available to AI-based services with less information lag. -> Lineage & Currency are imperative to ensuring that the most recent data is available.
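Currency can be expressed as the lag between a record's last update and the moment it is consumed. The sketch below flags records whose lag exceeds a tolerance; the 15-minute threshold, the `updated_at` field and the sample feed are assumptions for illustration, not a recommended setting.

```python
from datetime import datetime, timedelta, timezone


def stale_records(records, as_of, max_lag):
    """Return records whose last update is older than max_lag at time as_of."""
    return [r for r in records if as_of - r["updated_at"] > max_lag]


# Hypothetical feed, evaluated at a fixed point in time.
as_of = datetime(2021, 7, 4, 12, 0, tzinfo=timezone.utc)
feed = [
    {"account": "A1", "updated_at": datetime(2021, 7, 4, 11, 58, tzinfo=timezone.utc)},
    {"account": "A2", "updated_at": datetime(2021, 7, 4, 9, 0, tzinfo=timezone.utc)},
    {"account": "A3", "updated_at": datetime(2021, 7, 3, 12, 0, tzinfo=timezone.utc)},
]

stale = stale_records(feed, as_of, max_lag=timedelta(minutes=15))
stale_accounts = [r["account"] for r in stale]
print(stale_accounts)  # ['A2', 'A3']
```

Timezone-aware timestamps are used throughout so that feeds from systems in different regions can be compared on the same clock.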

We are referring not just to internal data, which can be partly trusted, but also to external and public data, which is required to scale the data for AI use. Organizations will have to fix irregularities such as missing and invalid data. -> Completeness ensures that no information, attributes or features are missing. Often, substituting missing information doesn't provide the desired outcomes from a model.
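Completeness is commonly reported per attribute, as the share of records in which the attribute is populated. A minimal sketch follows, assuming that `None` and the empty string both count as missing; the attribute names and sample records are hypothetical.

```python
def completeness_by_attribute(records, attributes):
    """Share of records with a populated value, for each attribute."""
    total = len(records)
    result = {}
    for attr in attributes:
        # Assumption: None and "" are treated as missing; 0 is a valid value.
        populated = sum(1 for r in records if r.get(attr) not in (None, ""))
        result[attr] = populated / total if total else None
    return result


# Hypothetical customer records.
customers = [
    {"id": 1, "income": 52000, "email": "a@example.com"},
    {"id": 2, "income": None, "email": ""},
    {"id": 3, "income": 0, "email": "c@example.com"},
    {"id": 4, "income": 61000},  # email attribute absent altogether
]

scores = completeness_by_attribute(customers, ["income", "email"])
print(scores)  # {'income': 0.75, 'email': 0.5}
```

Reporting the gap rather than silently imputing it keeps the decision about substitution with the modeling team, in line with the caution above about substituted values.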

Data Governance Gives the Direction to an Organization to Embrace AI

Organizations would also want to monetize their data because it is proprietary, while AI would necessitate that this data be shared with competitors to reach minimum requirements of efficiency. Monetization and data sharing need to be addressed with clear direction and guidance. -> Corporate Guidance & Policy provide direction on how to manage data in certain scenarios to achieve better availability, quality, meaning & simplified context.

Nevertheless, financial services industry participants are making large-scale investments in Artificial Intelligence. Regulators, however, are eyeing substantial uncertainties in the use of AI in banking and financial institutions that need to be regulated through guidance in the form of policy. Collaborative solutions built on shared datasets will radically increase the accuracy, timeliness and performance of non-competitive functions. But is there governance, guidance and oversight over the collaboration of data? -> Data Governance

Not all data is fit for purpose or contextual to an AI use case. Consider an insurance firm that uses alternative data, such as channel usage characteristics, rather than traditional and passive data to price insurance products for cyber security risk. The vast sources of external and alternative internal data (perhaps unstructured) might not be relevant to the context of the outcome that the model would provide. This makes it even more important to simplify and understand the data better before applying it for a purpose. -> Relevancy

Summarizing the Enablers for Data Management to be Used for AI

Data Quality: Accuracy, Completeness, Validity, Currency, Availability, Coverage, Structural and Semantic Consistency
Data Governance: Corporate Guidance & Policy
Content Management: Relevancy, Lineage
Architecture: Data Integration, Aggregation