Managing the 'Big' in Big Data

My first post on the topic of “Big Data” discussed the context of the "four Vs" of Big Data in financial services: Volume, Velocity, Variety & Value. With the latest developments in the European debt crisis – including credit rating downgrades in key ‘AAA’ nations – echoing across the Eurozone and to the United States, the importance and necessity of managing the large amounts of data tied to risk exposures is apparent now more than ever.

As I mentioned in my previous post, the efficient allocation of capital and prime risk-adjusted performance depend primarily on the ability to gain a holistic view of exposures and positions, which requires rapid, timely, and aggregated access to large amounts of financial data that are growing exponentially in an interconnected, complex, and global economy. The challenge that many countries – and systems – are facing is how to keep up with the sheer amount of data compounding every day, hour, minute, and even second, while managing core tasks such as regulatory reporting and analytics that are paramount to maintaining operations.

So, for this second installment, I'd like to focus on the first and seemingly most obvious of the “four Vs” – volume – and talk about the technologies associated with processing very large amounts of data, including overall patterns and approaches.

Two main patterns are apparent when dealing with large volumes. The most obvious is parallelism, and while we spend a lot of effort as an industry parallelizing computation, data parallelism remains a challenge and is the focal point of most current solutions. Additionally, it's becoming apparent that in many cases compute grids are bottlenecked on data access. Therefore, the pattern of moving compute tasks to the data, rather than moving large amounts of data over the network, is also becoming paramount.
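To make the data parallelism pattern concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a production design: the record layout (counterparty, exposure), the function names, and the sample trades are all hypothetical, and threads stand in for what would really be separate processes or machines. The data is split into partitions, each partition is summarized independently, and only the small summaries are combined.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(records, n_parts):
    """Split records into n_parts slices (round-robin)."""
    return [records[i::n_parts] for i in range(n_parts)]

def partial_exposure(chunk):
    """Summarize one partition: total exposure per counterparty."""
    totals = {}
    for counterparty, amount in chunk:
        totals[counterparty] = totals.get(counterparty, 0.0) + amount
    return totals

def merge(partials):
    """Combine the per-partition summaries into one result."""
    combined = {}
    for part in partials:
        for counterparty, amount in part.items():
            combined[counterparty] = combined.get(counterparty, 0.0) + amount
    return combined

def parallel_exposure(records, n_parts=4):
    chunks = partition(records, n_parts)
    # Threads are used only to keep the sketch short; a real system would
    # place each partition on its own process or node.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(partial_exposure, chunks))
    return merge(partials)

# Hypothetical (counterparty, exposure) records
trades = [
    ("BankA", 100.0),
    ("BankB", 50.0),
    ("BankA", 25.0),
    ("BankC", 10.0),
    ("BankB", 5.0),
]
exposure = parallel_exposure(trades, n_parts=2)
```

The point of the design is that `merge` only ever sees one small dictionary per partition, so the cost of combining results does not grow with the number of raw records.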

Several technical approaches combine these patterns, parallelizing both data and computation while bringing the compute tasks closer to the data. Let's examine them in more detail:

Engineered Machines

Regardless of product specifics, the philosophy of pre-engineering a machine for high-throughput data processing is becoming prominent in the industry. Pre-engineering combines data and compute parallelization with partitioning, compression, and a high-bandwidth backplane to provide very high-throughput data processing capabilities. Some of these pre-engineered machines are actually able to send query and analytics execution to the storage nodes that hold the data, thus radically minimizing data movement.

Integrated Analytics

Whether using engineered machines or not, the concept of performing analytics right on the data management system is a very powerful one, again following the philosophy of moving the compute to the data rather than the other way around. Whether it's ROLAP, MOLAP, predictive or statistical analytics, today's relational database management systems are capable of doing a lot right where the data is located. And in doing so, several of them are actually integrating the data parallelism mechanisms with the analytical engines themselves.
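As a small illustration of running the analytics where the data lives, the sketch below uses Python's built-in `sqlite3` module as a stand-in for a full relational database. The `positions` table, its columns, and the sample rows are assumptions made up for this example. It contrasts pulling every row to the client with letting the database do the aggregation.

```python
import sqlite3

# In-memory database standing in for the data management system
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE positions (desk TEXT, asset_class TEXT, market_value REAL)"
)
conn.executemany(
    "INSERT INTO positions VALUES (?, ?, ?)",
    [
        ("NY", "equity", 120.0),
        ("NY", "rates", 80.0),
        ("LDN", "equity", 200.0),
        ("LDN", "rates", 50.0),
        ("LDN", "equity", 30.0),
    ],
)

# Approach 1: move every row to the client and aggregate there.
rows = conn.execute(
    "SELECT desk, asset_class, market_value FROM positions"
).fetchall()
client_side = {}
for desk, asset_class, value in rows:
    client_side[asset_class] = client_side.get(asset_class, 0.0) + value

# Approach 2: let the database aggregate; only one row per asset class
# crosses the boundary between database and client.
in_database = dict(
    conn.execute(
        "SELECT asset_class, SUM(market_value) FROM positions "
        "GROUP BY asset_class"
    ).fetchall()
)

rows_moved_client_side = len(rows)
rows_moved_in_database = len(in_database)
```

Both approaches give the same totals, but the second moves two rows instead of five. With billions of positions rather than five, that difference is what dominates the run time.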

The combination of high-throughput analytics with engineered machines has enabled several financial firms to dramatically reduce the time it takes for analytical workloads to run. Whether it's EOD batch processing, on-demand risk calculation, or pricing and valuations, firms are able to do a lot more in much less time, directly affecting the business by enabling continuous, on-demand data processing.

Data Grids

Unlike compute grids, data grids focus on the challenge of parallelizing data management, and some of them provide the ability to ship compute tasks to the nodes holding the data in memory rather than sending data to compute nodes as most compute grids do. This is again based on the principle that it's cheaper to ship a compute task than it is to move large amounts of data across the wire. Several capital market firms have been using data grids to centralize market data as well as positions data across desks and geographies, and some go even further than that to execute certain analytics on the nodes where the data resides, achieving a real-time view of exposures, P&L, and other calculated metrics.
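The sketch below shows the idea of shipping a task to the nodes that hold the data. It is a toy, single-process model: `GridNode`, `DataGrid`, and the sample positions are hypothetical names and values invented for illustration, and no real data grid product's API is implied. Each node keeps its slice of positions in memory, runs the task locally, and returns only a small partial result.

```python
import zlib

class GridNode:
    """Hypothetical grid node holding a slice of positions in memory."""

    def __init__(self):
        self.positions = {}  # position_id -> (desk, pnl)

    def put(self, position_id, value):
        self.positions[position_id] = value

    def execute(self, task):
        """Run a task against local data; return only its small result."""
        return task(self.positions)

class DataGrid:
    """Hypothetical grid that routes keys to nodes by a stable hash."""

    def __init__(self, n_nodes):
        self.nodes = [GridNode() for _ in range(n_nodes)]

    def _owner(self, key):
        # crc32 is stable across runs, so a key always maps to the same node
        index = zlib.crc32(key.encode("utf-8")) % len(self.nodes)
        return self.nodes[index]

    def put(self, key, value):
        self._owner(key).put(key, value)

    def map_reduce(self, task, combine):
        """Ship `task` to every node, then combine the partial results."""
        return combine(node.execute(task) for node in self.nodes)

def pnl_by_desk(local_positions):
    """Task shipped to each node: P&L per desk for the local slice."""
    totals = {}
    for desk, pnl in local_positions.values():
        totals[desk] = totals.get(desk, 0.0) + pnl
    return totals

def merge_totals(partials):
    """Combine per-node totals into a grid-wide view."""
    merged = {}
    for partial in partials:
        for desk, pnl in partial.items():
            merged[desk] = merged.get(desk, 0.0) + pnl
    return merged

grid = DataGrid(n_nodes=3)
grid.put("pos-1", ("NY", 10.0))
grid.put("pos-2", ("LDN", -4.0))
grid.put("pos-3", ("NY", 2.5))
grid.put("pos-4", ("TKY", 7.0))
grid.put("pos-5", ("LDN", 1.0))

desk_pnl = grid.map_reduce(pnl_by_desk, merge_totals)
```

What travels across the wire in this model is the function and a handful of per-desk totals, never the positions themselves.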


NoSQL

The concept of schema-less data management (which is what NoSQL is really all about) has been steadily gaining momentum in recent years. At its core is the notion that developers can be more productive by circumventing the need for complex schema design during the development lifecycle of data-intensive applications, especially when the data lends itself to being modeled in key-value pairs (e.g. time series data). Despite being based on different principles, many of these technologies essentially follow a philosophy similar to that of data grids: they distribute data horizontally across many nodes and model it in an object-oriented rather than a relational manner.
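To show what schema-less, key-value modeling of time series can look like, here is a minimal sketch. The `ShardedKeyValueStore` class, the symbols, and the quote values are hypothetical and chosen only for illustration; real NoSQL stores differ widely in their APIs and guarantees. Keys are (symbol, timestamp) pairs, values are free-form dictionaries with no fixed schema, and all entries for a symbol are placed on the same shard.

```python
import zlib

class ShardedKeyValueStore:
    """Hypothetical schema-less store, sharded horizontally by symbol."""

    def __init__(self, n_shards):
        self.shards = [{} for _ in range(n_shards)]

    def _shard_for(self, symbol):
        # Stable hash: every entry for a symbol lands on the same shard
        index = zlib.crc32(symbol.encode("utf-8")) % len(self.shards)
        return self.shards[index]

    def put(self, symbol, timestamp, value):
        # No schema: `value` can be any dictionary
        self._shard_for(symbol)[(symbol, timestamp)] = value

    def get(self, symbol, timestamp):
        return self._shard_for(symbol).get((symbol, timestamp))

    def scan(self, symbol, start, end):
        """Return (timestamp, value) pairs in [start, end], oldest first."""
        shard = self._shard_for(symbol)
        keys = sorted(
            key for key in shard
            if key[0] == symbol and start <= key[1] <= end
        )
        return [(key[1], shard[key]) for key in keys]

store = ShardedKeyValueStore(n_shards=4)
# ISO-8601 timestamps sort correctly as plain strings
store.put("EURUSD", "2011-11-30T14:54:00", {"bid": 1.3441, "ask": 1.3443})
store.put("EURUSD", "2011-11-30T14:53:00", {"bid": 1.3439, "ask": 1.3442})
# A record with different fields is accepted without any schema change
store.put("EURUSD", "2011-11-30T14:55:00", {"last": 1.3445, "venue": "X"})
store.put("GBPUSD", "2011-11-30T14:54:00", {"bid": 1.5601, "ask": 1.5604})

window = store.scan("EURUSD", "2011-11-30T14:53:00", "2011-11-30T14:54:59")
```

Sharding by symbol keeps a range scan for one instrument on a single shard, which is one common trade-off; sharding by symbol and time bucket would spread hot instruments more evenly at the cost of multi-shard scans.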

It is important to keep in mind that the NoSQL debate is not necessarily a debate at all: NoSQL technologies become much more powerful when combined with traditional data warehousing and business intelligence tools. I tend to view these technologies as a continuum rather than in dialectical opposition.

In my future posts, I’ll delve into some more details on this topic – particularly in relation to Hadoop, which is quickly becoming a de facto standard within this realm – and continue the discussion on the “four Vs” of Big Data.

Comments: (2)

A Finextra member 30 November, 2011, 14:54

We see huge Big Data movement in web analytics, SEO, CRM, all these horizontal solutions, but what are the Big Data concrete examples and successful business cases specifically in commercial and investment banking?

Agreed that NoSQL complements and doesn't replace relational and OLAP databases.

A Finextra member 30 November, 2011, 18:59

Good point. Here are some examples:


 - Rogue trading detection - based on transaction and accounting records
 - Monte Carlo simulation and other predictive analytics (both for strategy development and risk management)
 - VaR calculation across asset classes - pulling positions data from different systems and analyzing them against pricing data
 - Time series analytics - storing the raw data in HDFS and transforming it to the required format using MapReduce

 - Loan risk analytics and profiling (retail and commercial banking)
 - Fraud detection - internet and credit card fraud detection based on POS records; AML
 - Data-driven products (credit cards, loans, etc.), based on detailed customer analytics across all interactions
 - Mainframe offloading - moving sequence files to HDFS and executing batch calculations and statistical analysis on the Hadoop cluster
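As a rough illustration of the time series example in the list above, here is the map, shuffle, and reduce flow written in plain Python. This is only a sketch of the pattern, not Hadoop code: the tick format ("symbol,timestamp,price"), the function names, and the sample ticks are assumptions invented for the example. Raw ticks are mapped to (symbol, minute) keys, grouped by key, and reduced to one open/high/low/close bar per key.

```python
from collections import defaultdict

def map_tick(line):
    """Map step: emit ((symbol, minute), (timestamp, price)) for one tick."""
    symbol, timestamp, price = line.split(",")
    minute = timestamp[:16]  # keep "YYYY-MM-DDTHH:MM"
    yield (symbol, minute), (timestamp, float(price))

def shuffle(pairs):
    """Shuffle step: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_bar(key, values):
    """Reduce step: turn the ticks of one minute into an OHLC bar."""
    ordered = sorted(values)  # sorts by timestamp first
    prices = [price for _, price in ordered]
    bar = {
        "open": prices[0],
        "high": max(prices),
        "low": min(prices),
        "close": prices[-1],
    }
    return key, bar

# Hypothetical raw tick lines, deliberately out of order
raw_lines = [
    "ABC,2011-11-30T14:54:05,10.0",
    "ABC,2011-11-30T14:54:40,10.4",
    "ABC,2011-11-30T14:54:20,9.8",
    "ABC,2011-11-30T14:55:01,10.1",
    "XYZ,2011-11-30T14:54:30,55.0",
]

mapped = (pair for line in raw_lines for pair in map_tick(line))
bars = dict(reduce_bar(key, values) for key, values in shuffle(mapped).items())
```

On a real cluster the map and reduce functions would run on the nodes holding the file blocks, and the framework would handle the shuffle between them.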