Banks and insurance firms have already embraced data lakes for their artificial intelligence and machine learning capabilities, so it is important to keep looking for a continuous return on investment from the platform.
If a data lake is not well maintained, it can turn into a swamp in which data consumers struggle to find usable data. Most of these challenges can be addressed by putting active platform governance in place for the data lake.
A data lake, as a distributed file system, hosts authoritative copies of source data in a variety of formats: structured data, semi-structured formats such as JSON and XML, and unstructured data such as images and audio.
Technical debt that accumulates across business use-cases often leads to higher up-front costs during migration and higher maintenance costs for existing data.
A lack of trust in the data often leads consumers to load their own copies onto the data lake even though the data might already exist there. In addition, without self-service discovery capabilities, other consumers might not be able to find the right dataset.
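The discovery gap described above can be illustrated with a minimal sketch of a catalog lookup that a consumer might run before landing another copy of a dataset. Everything here is hypothetical and chosen only for illustration: the `CatalogEntry` fields, the `find_assets` helper, the asset names and the paths. A real lake would rely on a dedicated catalog service rather than an in-memory list.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical catalog record for one data asset in the lake."""
    name: str
    source_system: str
    path: str
    tags: set = field(default_factory=set)
    certified: bool = False  # True once marked as the authoritative copy


def find_assets(catalog, keyword):
    """Return entries whose name contains the keyword or whose tags include it.

    Certified (authoritative) entries are listed first.
    """
    keyword = keyword.lower()
    hits = [
        e for e in catalog
        if keyword in e.name.lower() or keyword in {t.lower() for t in e.tags}
    ]
    # sorted() is stable, so entries keep catalog order within each group.
    return sorted(hits, key=lambda e: not e.certified)


# Made-up catalog contents for the sketch.
catalog = [
    CatalogEntry("claims_raw_copy", "claims_system", "/lake/sandbox/claims", {"claims"}),
    CatalogEntry("claims_master", "claims_system", "/lake/curated/claims",
                 {"claims", "policy"}, certified=True),
    CatalogEntry("branch_locations", "crm", "/lake/curated/branches", {"branch"}),
]

matches = find_assets(catalog, "claims")
for entry in matches:
    print(entry.name, entry.path, "certified" if entry.certified else "uncertified")
```

If the lookup returns a certified entry, the consumer can provision from it instead of ingesting a new copy. An empty result suggests the data is genuinely new to the lake.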
The technology operating model of a data lake should focus on the following aspects of data management:
- Data Cataloging – Once data has been ingested and pipelines have been built, knowledge of where the data came from is often no longer available. A catalog should record what data exists in the lake, where it originated, and the business context in which it is used.
- Data Reuse – Before ingesting data, it is always advisable to check through discovery whether the data is already covered. If a data asset exists, it should be reused.
- Data Redundancy – Maintaining multiple copies of the same data for different use-cases drives up data management costs, including the cost of data quality and metadata management.
  - Investment can be made in a Business Information Model rather than in maintaining redundant data on the cluster.
  - Physical replication of a data asset across multiple data nodes is a best practice, configured for reliability and fault tolerance. It is different from data providers or consumers maintaining separate copies of the same asset.
- Authoritative Copy Certification – When the data lake has been active for some time and multiple copies of the same logical asset exist, it is advisable to identify an authoritative asset and certify it for others to provision.
- Data Archival & Deletion – Because it comes towards the end of the data life-cycle, this step is often ignored. Defining the active period for each use-case helps the data management team archive data that no longer needs to be maintained.
- Data Quality – Data might not be of sufficient quality to produce a useful outcome from artificial intelligence or machine learning models. The focus must be on profiling the data, understanding its characteristics, and monitoring quality through rules.
Cleansing should be applied not just to the copies but also to the authoritative sources.
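As a rough illustration of the profiling and rule-based monitoring mentioned under Data Quality, the sketch below profiles one column and reports the pass rate of two rules. It is a simplified example under stated assumptions, not a recommended tool: the `profile` and `check_rules` helpers, the field names and the sample records are all invented for this sketch.

```python
def profile(rows, column):
    """Summarise one column: row count, null count, distinct non-null values."""
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
    }


def check_rules(rows, rules):
    """Return, per rule name, the fraction of rows that pass the rule."""
    if not rows:
        return {name: None for name in rules}
    return {
        name: sum(1 for r in rows if rule(r)) / len(rows)
        for name, rule in rules.items()
    }


# Illustrative records; field names and values are made up for this sketch.
rows = [
    {"policy_id": "P1", "premium": 120.0},
    {"policy_id": "P2", "premium": None},
    {"policy_id": "P2", "premium": -5.0},
    {"policy_id": None, "premium": 80.0},
]

# Each rule is a predicate over a single record.
rules = {
    "policy_id_present": lambda r: r["policy_id"] is not None,
    "premium_non_negative": lambda r: r["premium"] is not None and r["premium"] >= 0,
}

premium_profile = profile(rows, "premium")
pass_rates = check_rules(rows, rules)
print(premium_profile)
print(pass_rates)
```

Running the same rules against both the authoritative source and its copies makes it easier to see whether a quality problem originates upstream or was introduced in a copy.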