
The Truth About Hadoop

27 October 2014

Retired member

The press often talks about Hadoop as if it were a single technology or even a product. This is far from the truth and creates a false expectation in the industry about the complexity of using Hadoop in real-life situations. I thought I would clarify this and provide a quick overview of the key components of Hadoop and how they can be used in the financial services industry.

To start, Hadoop is not a single technology or product; it is a framework of related open source technologies that provides reliable and scalable distributed computing.

Hadoop is based upon two core technologies:

HDFS (Hadoop Distributed File System) – a distributed file system. We have used HDFS, for example, to store XML files representing a huge archive of trade data. HDFS allows years of data, current and archived alike, to be stored together, something a single file system or database cannot do at this scale.
MapReduce – a programming model for distributed computing. MapReduce provides a Java API to break an “embarrassingly parallel” job into pieces and distribute them across a number of servers. We have used MapReduce to improve the performance of ETL processes, for example the daily loading, standardization, and enrichment of tens of millions of trades collected from across a large investment bank.
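
To make the MapReduce model a little more concrete, here is a minimal sketch in plain Python. It is not the Hadoop Java API and runs in a single process; it only illustrates the three phases (map, shuffle, reduce) that the framework spreads across servers. The trade records, field names, and function names are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict

# Hypothetical trade records; a real job would read these from HDFS.
TRADES = [
    {"desk": "rates", "notional": 5_000_000},
    {"desk": "fx", "notional": 1_250_000},
    {"desk": "rates", "notional": 2_000_000},
    {"desk": "equities", "notional": 750_000},
    {"desk": "fx", "notional": 250_000},
]


def map_trade(trade):
    """Map phase: emit one (key, value) pair per input record."""
    yield trade["desk"], trade["notional"]


def shuffle(pairs):
    """Shuffle phase: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_desk(desk, notionals):
    """Reduce phase: collapse all values for one key into a single result."""
    return desk, sum(notionals)


def run_job(records):
    """Run map, shuffle, and reduce in sequence (in-process, for illustration)."""
    pairs = (pair for record in records for pair in map_trade(record))
    grouped = shuffle(pairs)
    return dict(reduce_desk(desk, values) for desk, values in grouped.items())


totals = run_job(TRADES)
print(totals)
```

Because each call to `map_trade` looks at one record only, and each call to `reduce_desk` looks at one key only, both phases can be split across many machines without coordination. That independence is what makes a job a good fit for MapReduce.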

With HDFS and MapReduce we can do a lot, but there are many other tools in the Hadoop shed which are necessary to build a complete, enterprise-ready solution:

Sqoop – bulk transfers of data between HDFS and relational databases
Flume – collection and streaming of data into HDFS
Oozie – coordination and scheduling of MapReduce jobs so they can be linked together to create a more complex workflow
ZooKeeper – maintenance of configuration information and synchronization of services across Hadoop
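
The idea of linking jobs into a workflow, as described for Oozie above, can be sketched as a dependency graph whose jobs run only after the jobs they depend on. The snippet below is an illustration in plain Python using the standard library `graphlib` module (Python 3.9+); it is not Oozie's workflow definition format, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs; each job maps to the set of jobs it must wait for.
WORKFLOW = {
    "load_trades": set(),
    "load_reference_data": set(),
    "standardize": {"load_trades"},
    "enrich": {"standardize", "load_reference_data"},
    "report": {"enrich"},
}


def execution_order(workflow):
    """Return one valid order in which every job follows its dependencies."""
    return list(TopologicalSorter(workflow).static_order())


order = execution_order(WORKFLOW)
print(order)
```

Several orders can be valid, since independent jobs such as the two loads may run in either sequence or in parallel. A scheduler only has to guarantee that no job starts before its prerequisites finish.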

HDFS and MapReduce are also very “low-level” technologies which aren’t always very easy to use, especially for non-technical folk. Many tools have been built on top of them to provide a simpler interface, in particular for data warehousing and data analytics:

HBase – a distributed database for storing huge volumes of tabular data (“millions of columns, billions of rows”).
Hive – data warehouse infrastructure for ad-hoc querying. With its SQL-like syntax, Hive is very useful in the hands of analysts who have already mastered SQL.
Pig – data flow execution framework and procedural language which enables data processing and analysis. Like Hive, Pig generates MapReduce jobs at execution time and is a good way to get away from low-level programming.
Mahout – a scalable machine learning and data mining library. Mahout provides algorithms for clustering, classification, and collaborative filtering and is helpful in sorting through huge volumes of data to extract insight.
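
The “millions of columns, billions of rows” description of HBase makes more sense once you picture a sparse table: each row stores only the columns it actually has. The following is a rough sketch of that idea in plain Python, not the HBase client API. The class, row keys, and column names are hypothetical, and real HBase features such as column families as a storage unit, cell versions, and distribution across servers are left out.

```python
from collections import defaultdict


class SparseTable:
    """Toy wide-column table: row key -> {column name -> value}."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Store a value; only columns that are written take up space."""
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        """Read a cell, returning `default` if the row or column is absent."""
        return self._rows.get(row_key, {}).get(column, default)

    def columns(self, row_key):
        """List the columns present for one row, in sorted order."""
        return sorted(self._rows.get(row_key, {}))


table = SparseTable()
table.put("trade-001", "trade:notional", 5_000_000)
table.put("trade-001", "trade:currency", "USD")
table.put("trade-002", "settlement:date", "2014-10-27")

print(table.columns("trade-001"))
print(table.get("trade-002", "trade:notional"))
```

Two rows in the same table can have entirely different columns, which is why a very wide logical schema does not translate into a very wide physical one.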

As you can see, Hadoop isn’t one thing – it’s a collection of tools and technologies. When a client says “we want to use Hadoop”, I know there are still a lot of technology choices to be made. We work with our clients to understand their problem and design a solution which brings all the different pieces together. It definitely isn’t as easy as the press makes it seem.

External

This content is provided by an external author without editing by Finextra. It expresses the views and opinions of the author.


Comments: (2)

Ketharaman Swaminathan Founder and CEO at GTM360 Marketing Solutions

27 October 2014

Great article. I used to think Hadoop was a powerful software but, from some of the use cases it's used for, I began to have doubts. So much so that I was tempted to write an article that took a dig at Hadoop by asking "What Is The Difference Between Hadoop And Excel?" and answering "Hadoop Thinks It's Excel But Excel Doesn't Think It's Hadoop!". I'll forthwith ditch that article!

That said, any idea what a news item like "Hadoop Raises $MM Funding" means? If there isn't a single entity called Hadoop, where does all that money go?


A Finextra member

29 October 2014

Great overview. The technology is very powerful in the hands of people that know what they are trying to accomplish and what KPIs they intend to use to measure success.
