BigData Lake for Financial Services - Need to stress on Platform Governance

26 July 2020

Tejasvi Addagada

Enterprise Data Head

Fortune 500 financial service provider

As banks and insurance firms have already embraced data lakes for their artificial intelligence and machine learning capabilities, it is important to look for a continuous return on investment from the platform.

If a data lake is not well maintained, it can turn into a swamp in which data consumers struggle to find usable data. Most of these challenges can be addressed by active platform governance of the data lake.

A data lake, built on a distributed file system, hosts authoritative copies of source data in a variety of formats: structured data, semi-structured formats such as JSON and XML, and unstructured data such as images and audio.

Technical debt that accumulates as business use-cases are added often leads to higher up-front costs during migration and higher maintenance costs for existing data.

Lack of trust in data often leads consumers to load their own copies onto the data lake even though the data may already exist there. In addition, without self-service discovery capabilities, other consumers might not be able to find the right dataset.
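One way to gauge how far this has gone is to look for byte-identical copies of the same file. The sketch below is only an illustration: it walks a local directory as a stand-in for the lake's file system, and the function names `file_digest` and `find_duplicate_copies` are hypothetical. It will not catch copies that have been reformatted or partially transformed.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicate_copies(root: Path) -> list[list[Path]]:
    """Group files under root whose content is byte-identical."""
    by_digest: dict[str, list[Path]] = defaultdict(list)
    # Sort so the result is deterministic regardless of directory order.
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    # Keep only digests shared by more than one file.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Each returned group is a candidate for consolidation into a single shared dataset.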

The technology operating model of a data lake should focus on the following aspects of data management:

Data Cataloging – Once data has been ingested and pipelines have been built, knowledge of where the data came from is often no longer available. A catalog should record what data exists in the lake, where it originated and the business context in which it is used.
Data Reuse – Before ingesting data, it is advisable to check through discovery whether the lake already covers it. If a data asset exists, it should be reused.
Data Redundancy – Maintaining multiple copies of the same data for different use-cases drives up data management costs, including the costs of data quality and metadata management.
Investment can instead be made in a Business Information Model rather than in maintaining redundant data on the cluster.
Physical replication of a data asset across multiple data nodes is a best practice, configured for reliability and fault tolerance. It is different from data providers or consumers maintaining separate copies of the same asset.
Authoritative Copy Certification – When the data lake has been active for some time and holds multiple copies of the same logical asset, it is advisable to identify an authoritative copy and certify it for others to provision.
Data Archival & Deletion – Because it comes towards the end of the data life-cycle, this step is often ignored. Defining the active period for each use-case helps the data management team archive or delete data that no longer needs to be maintained.
Data Quality – Data may not be of sufficient quality to produce reliable outcomes from artificial intelligence or machine learning models. The focus must be on profiling the data, understanding its characteristics and monitoring quality through rules. Cleansing should be applied not just to the copies but also to the authoritative sources.
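The cataloging, reuse and certification points above can be pictured with a small in-memory model. This is a minimal sketch under stated assumptions, not a real catalog product: the `DatasetEntry` and `Catalog` classes, their fields and the `ingest_or_reuse` method are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class DatasetEntry:
    name: str            # physical dataset name in the lake (hypothetical)
    logical_asset: str   # business-level asset the dataset represents
    source_system: str   # where the data came from
    owner: str           # accountable team
    certified: bool = False


class Catalog:
    def __init__(self) -> None:
        self._entries: dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        """Record a dataset together with its origin and owner."""
        if entry.name in self._entries:
            raise ValueError(f"{entry.name} is already registered")
        self._entries[entry.name] = entry

    def find(self, logical_asset: str) -> list[DatasetEntry]:
        """Discovery: list datasets for a logical asset, certified copy first."""
        matches = [e for e in self._entries.values()
                   if e.logical_asset == logical_asset]
        return sorted(matches, key=lambda e: not e.certified)

    def certify(self, name: str) -> None:
        """Mark one dataset as the authoritative copy of its logical asset."""
        target = self._entries[name]
        for e in self._entries.values():
            if e.logical_asset == target.logical_asset:
                e.certified = False
        target.certified = True

    def ingest_or_reuse(self, entry: DatasetEntry) -> DatasetEntry:
        """Reuse an existing dataset if one covers the asset, else register."""
        existing = self.find(entry.logical_asset)
        if existing:
            return existing[0]
        self.register(entry)
        return entry
```

In this sketch a consumer calls `ingest_or_reuse` instead of loading data directly, so a second copy is only created when discovery finds nothing for that logical asset.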

External

This content is provided by an external author without editing by Finextra. It expresses the views and opinions of the author.
