Data Lakes May be Failing Banks, but it Doesn’t Need to be This Way

The reality for many banks is that data lakes are becoming data swamps, with business users unable to access the large volumes of machine-generated, clickstream, and externally generated data that lie in their depths.

Despite the good intentions behind them, data lakes have become bottlenecks at most banks simply because it’s too difficult for business users to retrieve the information stored within them. The end result is that data lakes fail to support the speed to market that analytics-driven product and service innovation demands.

So what’s the root cause of this large-scale failure, and how can banks get to a state where they can focus on building game-changing analytics and improving data quality? Here, I investigate three common data lake failures at banks and share some tested advice on what banks can do to prevent data lake disasters from the start, or to fix data swamps for good.

Problem #1: Inexperienced Engineers & No Real Hadoop Experience

Data lakes at most banks could do with better design and implementation. Simply put, even with their large pools of engineers, banks often struggle to find the Hadoop experience that successful data lake implementations require.

The proof is usually in the pudding: when banks are struggling to get value out of their data lakes, it often turns out the engineering team was using Hadoop technologies for the first time in the build. It’s hardly surprising – Hadoop technologies are relatively new and don’t have much in common with traditional technology stacks.

This lack of appropriate engineering skills is leading to data lake implementations with poor architectural design, poor integration, poor scalability and poor testability. Combined, these shortcomings create unstable data lake environments that often end up unusable, costing banks millions and wasting the opportunity to turn valuable data into positive business outcomes.

Unfortunately, many banks also lack the knowledge to identify which skills make a great data lake engineer. Familiarity with the relevant technologies, for example Spark and HBase, is not enough on its own, and hiring on that basis alone may bring in inexperienced engineers.

The Fix

Banks need to invest in data engineers who have the background and experience to truly understand Hadoop technologies. It’s also advisable to invest in consultants who have experience with some of the many data lake platforms on the market to ensure efficient deployment and delivery of ROI.

Problem #2: Missing Foundational Capabilities and Immature Operating Models

There is a general tendency to underestimate the complexity of data lake solutions from a technical and engineering perspective. Every data lake should expose a good number of technical capabilities, including self-service data ingest, data preparation, data profiling, data classification, data governance, data lineage, metadata management, global search and security.

Just remember that while some tools can help to deploy these capabilities, they do not provide complete implementations of the solutions.

The Fix

Data must be moved into data lakes in a very considered way. Has it been cleansed? Has it been validated, profiled and indexed? And has it been secured and tracked by extensive metadata? The problem is that many banks perform these tasks as afterthoughts. That is an expensive and time-intensive way to build a data lake, and one that often leads to a solution that degrades data quality.
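As a rough illustration of asking those questions before data lands in the lake rather than afterwards, here is a minimal Python sketch of a pre-ingest check. Everything in it is an assumption made for illustration: the names `IngestReport` and `check_batch`, the rule that a row is rejected when a required field is empty, and the shape of the metadata record. A real platform would apply far richer cleansing, profiling and security rules.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IngestReport:
    """Hypothetical metadata record kept alongside a batch before ingest."""
    source: str
    row_count: int
    null_counts: dict      # per-field count of missing values (simple profiling)
    rejected_rows: list    # indexes of rows that failed validation
    checked_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    @property
    def accepted(self) -> bool:
        # In this sketch a batch is accepted only when no row was rejected.
        return not self.rejected_rows


def check_batch(source, rows, required_fields):
    """Validate and profile a batch of dict records before it enters the lake.

    Assumed rule: a row is rejected when any required field is None or "".
    """
    null_counts = {name: 0 for name in required_fields}
    rejected = []
    for index, row in enumerate(rows):
        missing = [name for name in required_fields if row.get(name) in (None, "")]
        for name in missing:
            null_counts[name] += 1
        if missing:
            rejected.append(index)
    return IngestReport(source, len(rows), null_counts, rejected)
```

The point of the sketch is the ordering: the report exists before the batch is written, so profiling and tracking metadata are produced as part of ingest instead of being reconstructed later.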

Problem #3: Banks Aren't Focusing on Data Governance 

When it comes to data, many banks do not have a documented set of processes that outline how essential data assets are to be managed, and how data quality is to be maintained. This is a huge error, and one that removes what should be an essential focus for any bank: data quality and accountability.

It should be no surprise that the lack of data governance is the root cause of many data lake failures. Banks need to consider how widely data is accessed across their organisations, and through how many applications. During the initial phase of any data lake implementation, there is often not enough focus on how to organise and control data. Given that data is to be accessed by multiple users through several applications, governance is essential.

The Fix

Banks need to think before they move data into a data lake: governance can’t be an afterthought. Hadoop is a simple storage system that is amenable to all sorts of control mechanisms, so data should be ingested according to a plan that leads with governance.
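One way to picture a plan that leads with governance is a catalog in which every dataset must be registered, with an accountable owner and an access policy, before any of its data is ingested. The Python sketch below is only an illustration under that assumption. `DatasetPolicy`, `GovernanceCatalog` and the role names are hypothetical, and the sketch does not represent any Hadoop API or a particular data lake product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetPolicy:
    """Hypothetical governance record for one dataset."""
    name: str
    owner: str                 # accountable data owner
    classification: str        # e.g. "public", "internal", "restricted"
    allowed_roles: frozenset   # roles permitted to read the dataset


class GovernanceCatalog:
    """In-memory stand-in for a governance catalog (illustration only)."""

    def __init__(self):
        self._policies = {}

    def register(self, policy):
        # Accountability first: a dataset without an owner cannot be registered.
        if not policy.owner:
            raise ValueError(f"{policy.name}: a data owner is required")
        self._policies[policy.name] = policy

    def can_ingest(self, dataset_name):
        # Ingest is allowed only for datasets that already have a policy.
        return dataset_name in self._policies

    def can_read(self, dataset_name, role):
        policy = self._policies.get(dataset_name)
        return policy is not None and role in policy.allowed_roles
```

Note that nothing here constrains the format of the data itself. The raw records can still be stored as they arrive, while ownership, classification and access rules are settled up front.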

These are only three of the major dilemmas many banks face as they struggle to build or fix custom-engineered data lakes. As a new generation of data lakes emerges that focuses on tactical data ingestion planning, data quality and security, banks will realise new opportunities to start from a better foundation, or to get the help available to turn data lakes into valuable resources.

Hadoop engineering will mature as banks learn from previous failures and hire new talent with more experience. Those banks that can achieve this vision of making data lakes user-centric and useful will be at the cutting edge of analytics-driven product and service innovation, safeguarding what is surely one of the most significant technology investments and ensuring ROI is delivered quickly and efficiently.



Comments: (3)

Ketharaman Swaminathan - GTM360 Marketing Solutions - Pune 24 May, 2017, 18:58

In principle these problems are not unique to data lakes. In one form or the other, they've plagued data mart, data warehouse and virtually every analytics initiative since the dawn of data. To me, it's important to deep dive into the root causes of these problems. I found this HBR article useful in that context. If companies decide upfront the degree of Data Offence v. Data Defence they want to strike for a given data management initiative, these initiatives will automatically enjoy a higher success rate.

On a side note, I respectfully disagree with Fix #3. By definition, data lake is a "storage repository that holds a vast amount of raw data in its native format". IMO, subjecting data to prior governance runs counter to the way in which a data lake should be built.

A Finextra member 25 May, 2017, 20:47

Point #3 isn't really about "subjecting data to prior governance" but about having proper governance capabilities built-in, within the Big Data platform, which doesn't go against the schema-on-read concept. It is actually a highly recommended best practice for turning a Data Lake into a Data Reservoir.

Ketharaman Swaminathan - GTM360 Marketing Solutions - Pune 26 May, 2017, 12:45

Okay, thanks, so what does "data should be ingested according to a plan that leads with governance" mean then? For the moment, I'm still on Data Lake, so Data Reservoir - a term I'm hearing for the first time - best practices is a different thing.

Retired Member

Member since

19 Mar 2009


This post is from a series of posts in the group:

Data Management 101

A community blog about data and how to manage it
