The reality for many banks is that data lakes are becoming data swamps, with business users unable to access the large volumes of machine-generated, clickstream, and externally generated data that lives amongst their depths.
Regardless of their good intentions, data lakes have become bottlenecks at most banks simply because it’s too difficult for business users to retrieve the information stored within them. The end result is that data lakes fail to support the speed-to-market demands of analytics-driven product and service innovation.
So what’s the root cause of this large-scale failure, and how can banks get to a state where they can focus on building game-changing analytics and improving data quality? Here, I investigate three of the most common data lake failures happening at banks, while also sharing some tested advice on exactly what banks can do to prevent data lake disasters from the start, or to fix data swamps for good.
Problem #1: Inexperienced Engineers & No Real Hadoop Experience
Data lakes at most banks could do with better design and implementation. Simply put, even amongst their large pools of engineers, banks often struggle to find the Hadoop experience required for a successful data lake implementation.
The proof is usually in the pudding: when banks are struggling to get value out of their data lakes, it often turns out the engineering team was using Hadoop technologies for the first time in the build. It’s hardly surprising – Hadoop technologies are relatively
new and don’t have much in common with traditional technology stacks.
This lack of appropriate engineering skills leads to data lake implementations with poor architectural design, poor integration, poor scalability and poor testability. Combined, these shortcomings create unstable data lake environments that often end up useless, losing millions and missing the opportunity to turn valuable data into positive business outcomes.
Unfortunately, many banks also lack the knowledge to identify which skills make a great data lake engineer. Familiarity with the relevant technologies, for example Spark and HBase, is not enough on its own, and hiring on that basis alone often means hiring inexperienced engineers.
Banks need to invest in data engineers who have the background and experience to truly understand Hadoop technologies. It’s also advisable to invest in consultants with experience of the many data lake platforms on the market, to ensure efficient deployment and delivery of ROI.
Problem #2: Missing Foundational Capabilities and Immature Operating Models
There is a general tendency to underestimate the complexity of data lake solutions from a technical and engineering perspective. Every data lake should expose a core set of technical capabilities: self-service data ingestion, data preparation, data profiling, data classification, data governance, data lineage, metadata management, global search and security.
Just remember that while some tools can help to deploy these capabilities, none of them provides a complete, out-of-the-box implementation.
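As an illustration of how much still ends up being built by hand, even a basic capability such as data profiling typically becomes custom engineering on top of Spark. A minimal sketch, assuming a PySpark environment and a hypothetical accounts table in the curated zone:

```python
# A basic data-profiling pass over a curated table: null counts and
# distinct counts per column, plus total row count. The path and the
# table layout are hypothetical.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("profile_accounts").getOrCreate()

df = spark.read.parquet("/lake/curated/accounts")  # hypothetical path

total_rows = df.count()
profile = df.select(
    [F.count(F.when(F.col(c).isNull(), 1)).alias(f"{c}_nulls") for c in df.columns]
    + [F.countDistinct(F.col(c)).alias(f"{c}_distinct") for c in df.columns]
).collect()[0].asDict()

print(f"rows={total_rows}")
for metric, value in profile.items():
    print(f"{metric}={value}")
```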
Data must be moved into data lakes in a considered way. Has it been cleansed? Has it been validated, profiled and indexed? Has it been secured and tracked by extensive metadata? The problem is that many banks perform these tasks as afterthoughts, which is an expensive and time-intensive way to build a data lake, and one that steadily erodes data quality.
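To make "considered" concrete: a landing job should validate data before it enters the curated zone, and record what it loaded as metadata rather than as an afterthought. A minimal sketch along those lines, assuming PySpark and hypothetical paths, column names and a metadata table:

```python
# Hypothetical landing job: validate a raw extract, then write it to the
# curated zone together with a small lineage/metadata record.
from datetime import datetime, timezone
from pyspark.sql import SparkSession, Row
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("ingest_transactions").getOrCreate()

raw = spark.read.option("header", "true").csv("/lake/raw/transactions/2024-05-01")

# Basic validation before anything lands in the curated zone.
required = {"transaction_id", "account_id", "amount", "booking_date"}
missing = required - set(raw.columns)
if missing:
    raise ValueError(f"Schema check failed, missing columns: {missing}")

null_keys = raw.filter(F.col("transaction_id").isNull()).count()
if null_keys > 0:
    raise ValueError(f"Validation failed: {null_keys} rows with a null transaction_id")

raw.write.mode("append").parquet("/lake/curated/transactions")

# Record what was ingested, from where, and when - simple lineage metadata.
meta = spark.createDataFrame([Row(
    dataset="transactions",
    source="/lake/raw/transactions/2024-05-01",
    row_count=raw.count(),
    ingested_at=datetime.now(timezone.utc).isoformat(),
)])
meta.write.mode("append").parquet("/lake/metadata/ingestion_log")
```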
Problem #3: Banks Aren't Focusing on Data Governance
When it comes to data, many banks do not have a documented set of processes that outline how essential data assets are to be managed, and how data quality is to be maintained. This is a huge error, and one that removes what should be an essential focus for any bank: data quality and accountability.
It should be no surprise that the lack of data governance is the root cause of many data lake failures, with no enterprise-wide focus on the organisation and control of data. Banks need to consider how widely data is accessed across their organisations, and by how many applications. During the initial phase of any data lake implementation, there is often not enough focus on how to organise and control data. Given that data will be accessed by multiple users through several applications, governance is essential.
Banks need to think before they move data into a data lake – governance can’t be an afterthought. Hadoop is a simple storage system that is amenable to all sorts of control mechanisms, so data should be ingested according to a plan that leads with governance.
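One practical way to lead with governance is to gate every ingestion job on a registration check: if a dataset has no documented owner, classification and retention period in the catalogue, it simply does not land in the lake. A minimal sketch, assuming a hypothetical JSON catalogue maintained by the data governance team:

```python
# Hypothetical governance gate: refuse to ingest a dataset that has not been
# registered with an owner, a classification and a retention period.
import json

REQUIRED_FIELDS = ("owner", "classification", "retention_days")

def check_registered(dataset_name: str, catalogue_path: str = "governance/catalogue.json") -> dict:
    """Return the catalogue entry for a dataset, or raise if governance data is missing."""
    with open(catalogue_path) as f:
        catalogue = json.load(f)

    entry = catalogue.get(dataset_name)
    if entry is None:
        raise RuntimeError(f"'{dataset_name}' is not registered in the data catalogue - ingestion blocked")

    missing = [field for field in REQUIRED_FIELDS if not entry.get(field)]
    if missing:
        raise RuntimeError(f"'{dataset_name}' is missing governance fields {missing} - ingestion blocked")

    return entry

# Example: call this at the top of any landing job before moving data.
# entry = check_registered("transactions")
```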
These are only three of the major dilemmas many banks face as they struggle to build or fix custom-engineered data lakes. As a new generation of data lakes emerges with a focus on deliberate data ingestion planning, data quality and security, banks will have the opportunity to start from a better foundation, or to get the help they need to turn existing data lakes into valuable resources.
Hadoop engineering will mature as banks learn from previous failures and hire new talent with more experience. The banks that realise this vision of user-centric, useful data lakes will be at the cutting edge of analytics-driven product and service innovation, safeguarding what is surely one of their most significant technology investments and ensuring ROI is delivered quickly and efficiently.