
The Truth About Hadoop

The press often talks about Hadoop as if it were a single technology or even a product. This is far from the truth and creates a false expectation in the industry about the complexity of using Hadoop in real-life situations. I thought I would clarify this and provide a quick overview of the key components of Hadoop and how they can be used in the financial services industry.

To start, Hadoop is not a single technology or product; it is a framework of related open source technologies that provides reliable and scalable distributed computing.

Hadoop is based upon two core technologies:

  • HDFS (Hadoop Distributed File System) – a distributed file system. We have used HDFS, for example, to store XML files representing a huge archive of trade data. HDFS permits years and years of data to be stored, current and archived data together, something a single file system or database is unable to do.
  • MapReduce – a programming model for distributed computing. MapReduce provides a Java API to break an “embarrassingly parallelizable” job into pieces and distribute those pieces across a number of servers. We have used MapReduce to improve the performance of ETL processes, for example the daily loading, standardization, and enrichment of tens of millions of trades collected from across a large investment bank.
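To make the map/shuffle/reduce model concrete, here is a minimal single-process sketch in Python. This is emphatically not Hadoop code (real jobs use the Java API and run across a cluster), and the trade records are made up, but the three phases are the same ones Hadoop runs at scale:

```python
from collections import defaultdict

# Hypothetical toy records standing in for parsed trade data: (trade_id, desk, notional).
trades = [
    ("T1", "rates", 1_000_000),
    ("T2", "fx", 250_000),
    ("T3", "rates", 500_000),
    ("T4", "credit", 750_000),
]

def map_phase(record):
    """Map: emit a (key, value) pair per record -- here, notional keyed by desk."""
    trade_id, desk, notional = record
    yield (desk, notional)

def shuffle(pairs):
    """Shuffle: group all values by key (Hadoop does this between map and reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the values for one key -- here, total notional per desk."""
    return (key, sum(values))

mapped = [pair for record in trades for pair in map_phase(record)]
results = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(results)  # {'rates': 1500000, 'fx': 250000, 'credit': 750000}
```

Because each map call touches only its own record, and each reduce call touches only one key's values, Hadoop can run thousands of them in parallel on different machines.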

With HDFS and MapReduce we can do a lot, but there are many other tools in the Hadoop shed which are necessary to build a complete, enterprise-ready solution:

  • Sqoop – bulk transfers of data between HDFS and relational databases
  • Flume – collection and streaming of data into HDFS
  • Oozie – coordination and scheduling of MapReduce jobs so they can be linked together to create a more complex workflow
  • Zookeeper – maintenance of configuration information and synchronizing services across Hadoop
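To give a flavour of how these tools are driven, a Sqoop import from a relational database into HDFS is a single command. The connection string, table name, credentials, and target directory below are all hypothetical placeholders, not a real configuration:

```shell
# Pull a (hypothetical) trade_archive table from a relational database into HDFS,
# splitting the copy across 4 parallel map tasks.
sqoop import \
  --connect jdbc:mysql://db.example.com/trading \
  --username etl_user -P \
  --table trade_archive \
  --target-dir /data/trades/archive \
  --num-mappers 4
```

Under the hood, Sqoop generates a MapReduce job to do the transfer, which is why it parallelizes so naturally.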

HDFS and MapReduce are also very “low-level” technologies which aren’t always very easy to use, especially for non-technical folk. Many tools have been built on top of them to provide a simpler interface, in particular for data warehousing and data analytics:

  • HBase – a distributed database for storing huge volumes of tabular data (“millions of columns, billions of rows”).
  • Hive – data warehouse infrastructure for ad-hoc querying. With its SQL-like query language (HiveQL), Hive is very useful in the hands of analysts who already know SQL.
  • Pig – a data flow execution framework and procedural language (Pig Latin) for data processing and analysis. Like Hive, Pig compiles its scripts into MapReduce jobs at execution time and is a good way to get away from low-level programming.
  • Mahout – a scalable machine learning and data mining library. Mahout provides algorithms for clustering, classification, and collaborative filtering and is helpful in sorting through huge volumes of data to extract insight.
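Mahout itself is a Java library that distributes these algorithms across a cluster. Purely to illustrate the kind of clustering it provides (nothing below is Mahout's API), here is a tiny single-machine k-means sketch on made-up one-dimensional trade sizes:

```python
import random

def kmeans_1d(points, k, iterations=20, seed=0):
    """Toy 1-D k-means: alternately assign each point to its nearest centroid,
    then move each centroid to the mean of the points assigned to it."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep a centroid in place if no points were assigned to it.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious groups of trade sizes: small tickets and large block trades.
sizes = [1, 2, 3, 100, 101, 102]
print(kmeans_1d(sizes, k=2))  # [2.0, 101.0]
```

Mahout's value is that the same assign-and-average iteration can be expressed as MapReduce jobs and run over datasets far too large for one machine's memory.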

As you can see, Hadoop isn’t one thing – it’s a collection of tools and technologies. When a client says “we want to use Hadoop”, I know there are still a lot of technology choices remaining. We work with our clients to understand their problem and design a solution which brings all the different pieces together. It definitely isn’t as easy as the press makes it seem.

 


Comments: (2)

Ketharaman Swaminathan - GTM360 Marketing Solutions - Pune | 27 October, 2014, 14:53

Great article. I used to think Hadoop was a powerful software but, from some of the use cases it's used for, I began to have doubts. So much so that I was tempted to write an article that took a dig at Hadoop by asking "What Is The Difference Between Hadoop And Excel?" and answering "Hadoop Thinks It's Excel But Excel Doesn't Think It's Hadoop!". I'll forthwith ditch that article!

That said, any idea what a news item like "Hadoop Raises $MM Funding" means? If there isn't a single entity called Hadoop, where does all that money go? 

A Finextra member | 29 October, 2014, 15:33

Great overview.  The technology is very powerful in the hands of people that know what they are trying to accomplish and what KPIs they intend to use to measure success.


This post is from a series of posts in the group:

Innovation in Financial Services

A discussion of trends in innovation management within financial institutions, and the key processes, technology and cultural shifts driving innovation.

