My first post on the topic of “Big Data” discussed the “four Vs” of Big Data in financial services: Volume, Velocity, Variety, and Value. With the latest developments in the European debt crisis – including credit rating downgrades of key ‘AAA’
nations – echoing across the Eurozone and the United States, the importance and necessity of managing the large amounts of data tied to risk exposures is more apparent now than ever.
As I mentioned in my previous blog, the efficient allocation of capital and optimal risk-adjusted performance depend primarily on the ability to gain a holistic view of exposures and positions, which requires rapid, timely, and aggregated access to large amounts
of financial data that are growing exponentially in an interconnected, complex, and global economy. The challenge that many countries – and systems – are facing is how to keep up with the sheer amount of data compounding every day, hour, minute, and even
second, while managing core tasks such as regulatory reporting and analytics that are paramount to maintaining operations.
So, for this second installment, I'd like to focus on the first and seemingly most obvious of the “four Vs” – volume – and talk about the technologies associated with processing very large amounts of data, including overall patterns and approaches.
Two main patterns are apparent when dealing with large volumes. The most obvious is parallelism, and while we spend a lot of effort as an industry parallelizing computation, data parallelism remains a challenge and is the focal point of most current solutions.
Additionally, it's becoming apparent that in many cases data access is the bottleneck for compute grids. Therefore the pattern of moving compute tasks to the data, rather than moving large amounts of data over the network, is also becoming paramount.
Several technical approaches combine these patterns, parallelizing both data and computation while bringing the compute tasks closer to the data. Let's examine them in more detail:
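The two patterns can be sketched in a few lines of Python. This is purely illustrative (no vendor's API): the position data and desk names are made up, each list stands in for a partition held on a separate node, and the point is that only small partial aggregates – not the raw rows – travel back over the "network".

```python
# Sketch: data parallelism plus "move the compute to the data".
# Each partition represents rows held on a separate node; we ship a
# small aggregation function to each partition and move back only the
# partial results, not the underlying data.
from concurrent.futures import ProcessPoolExecutor

# Hypothetical positions data, pre-partitioned across three "nodes".
PARTITIONS = [
    [("desk_a", 1_000_000.0), ("desk_b", -250_000.0)],
    [("desk_a", 500_000.0), ("desk_c", 750_000.0)],
    [("desk_b", 125_000.0), ("desk_c", -400_000.0)],
]

def local_exposure(partition):
    """Runs where the data lives: aggregates one partition locally."""
    totals = {}
    for desk, notional in partition:
        totals[desk] = totals.get(desk, 0.0) + notional
    return totals  # small partial result, cheap to ship back

def merge(partials):
    """Central step: combine the small per-node aggregates."""
    combined = {}
    for partial in partials:
        for desk, notional in partial.items():
            combined[desk] = combined.get(desk, 0.0) + notional
    return combined

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=3) as pool:
        partials = list(pool.map(local_exposure, PARTITIONS))
    print(merge(partials))
```

Note how the expensive step (scanning every row) is parallelized across partitions, while the cheap step (merging a handful of totals) is the only centralized work.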
Regardless of product specifics, the philosophy of pre-engineering a machine for high throughput data processing is becoming prominent in the industry. Pre-engineering combines data and compute parallelization with partitioning, compression and a high-bandwidth
backplane to provide very high throughput data processing capabilities. Some of these pre-engineered machines are actually able to send query and analytics execution to the storage nodes that hold the data, thus radically minimizing data movement.
Whether using engineered machines or not, the concept of performing analytics right on the data management system is a very powerful one, again following the philosophy of moving the compute to the data rather than the other way around. Whether it's ROLAP,
MOLAP, predictive, or statistical analytics, today's relational database management systems are capable of doing a lot right where the data is located. And in doing so, several of them are actually integrating the data parallelism mechanisms with the analytical engines themselves.
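As a minimal sketch of in-database analytics, here SQLite stands in for a full RDBMS (the table and trade values are invented for illustration): the aggregation runs inside the database engine, and only the summary rows cross the boundary to the application.

```python
# Sketch: push the analytics into the database rather than pulling
# raw rows out. SQLite is a stand-in for a production RDBMS.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (desk TEXT, notional REAL)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?)",
    [("rates", 1_000_000.0), ("rates", -250_000.0), ("fx", 500_000.0)],
)

# The database engine does the heavy lifting; the application fetches
# only a few summary rows, not every trade.
rows = conn.execute(
    "SELECT desk, SUM(notional), COUNT(*) FROM trades GROUP BY desk"
).fetchall()
for desk, total, n in rows:
    print(desk, total, n)
```

The same principle scales up: the more of the GROUP BY, window-function, or statistical work the engine performs in place, the less data has to move.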
The combination of high-throughput analytics with engineered machines has enabled several financial firms to dramatically reduce the time it takes for analytical workloads to run. Whether it's end-of-day (EOD) batch processing, on-demand risk calculation, or pricing and
valuations, firms are able to do a lot more in much less time, directly affecting the business by enabling continuous, on-demand data processing.
Unlike compute grids, data grids focus on the challenge of parallelizing data management, and some of them provide the ability to ship compute tasks to the nodes holding the data in memory rather than sending data to compute nodes as most compute grids do.
This is again based on the principle that it's cheaper to ship a compute task than it is to move large amounts of data across the wire. Several capital market firms have been using data grids to centralize market data as well as positions data across
desks and geographies, and some go even further than that to execute certain analytics on the nodes where the data resides, achieving a real-time view of exposures, P&L, and other calculated metrics.
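The data-grid pattern can be sketched as follows. This is a hypothetical toy, not any grid product's API: positions are partitioned across in-memory "nodes" by instrument key, and a calculation is shipped to the node that owns the data instead of streaming the positions out.

```python
# Sketch of a key-partitioned in-memory data grid that executes
# tasks on the node owning the data. All names are illustrative.

class GridNode:
    """One in-memory partition of the grid."""
    def __init__(self):
        self.positions = {}  # instrument -> list of (qty, price)

    def put(self, instrument, qty, price):
        self.positions.setdefault(instrument, []).append((qty, price))

    def execute(self, instrument, task):
        # The task runs *on the node*; only its result leaves.
        return task(self.positions.get(instrument, []))

class DataGrid:
    def __init__(self, n_nodes=4):
        self.nodes = [GridNode() for _ in range(n_nodes)]

    def _owner(self, instrument):
        # Hash-partitioning: each key has exactly one owning node.
        return self.nodes[hash(instrument) % len(self.nodes)]

    def put(self, instrument, qty, price):
        self._owner(instrument).put(instrument, qty, price)

    def execute(self, instrument, task):
        return self._owner(instrument).execute(instrument, task)

def market_value(positions):
    """Shipped to the owning node: exposure for one instrument."""
    return sum(qty * price for qty, price in positions)

grid = DataGrid()
grid.put("AAPL", 100, 150.0)
grid.put("AAPL", -40, 152.0)
print(grid.execute("AAPL", market_value))  # 100*150.0 - 40*152.0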
The concept of schema-less data management (which is what NoSQL is really all about) has been steadily gaining momentum in recent years. At its core is the notion that developers can be more productive by circumventing the need for complex schema design
during the development lifecycle of data-intensive applications, especially when the data lends itself to being modeled in key-value pairs (e.g. time series data). Despite being based on different principles, many of these technologies essentially follow a
similar philosophy to data grids: they distribute data horizontally across many nodes and model it in an object-oriented rather than a relational manner.
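A toy illustration of the schema-less, key-value idea for time-series data (illustrative only, not any particular NoSQL product; the symbols and fields are invented): records are just values under composite keys, there is no upfront schema design, and new fields can appear without a migration.

```python
# Sketch: schema-less key-value storage for tick data. The "schema"
# lives in the application, not the store.
store = {}

def put(symbol, ts, **fields):
    # Composite key, arbitrary value shape -- no table definition needed.
    store[(symbol, ts)] = fields

put("EURUSD", "2011-12-09T09:30:00Z", bid=1.3365, ask=1.3367)
# A new field ("venue") appears on the next tick with no migration step.
put("EURUSD", "2011-12-09T09:30:01Z", bid=1.3364, ask=1.3366, venue="EBS")

# Range-style read: all ticks for a symbol, selected by key prefix.
ticks = {k: v for k, v in store.items() if k[0] == "EURUSD"}
print(len(ticks))  # 2
```

The natural fit with time series is visible in the key: `(symbol, timestamp)` pairs map directly onto how the data is queried, without forcing it through a relational design first.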
It is important to keep in mind that the NoSQL debate is not necessarily a debate at all; NoSQL technologies become much more powerful when combined with traditional data warehousing and business intelligence tools. I tend to view these
technologies as points on a continuum rather than in dialectic opposition.
In my future posts, I’ll delve into some more detail on this topic – particularly in relation to Hadoop, which is quickly becoming a de facto standard within this realm – and continue the discussion on the “four Vs” of big data.