Over the past several months, I’ve been talking about the four “V’s” of Big Data, touching on two categories at the core of the definition of this much-talked-about term. I’ve detailed strategies for managing the most literal component of Big Data – volume;
and provided insight and tactics on how to tackle its many types of structured and unstructured content, otherwise known as variety. In this blog installment, I'd like to combine the remaining two “V's” – velocity and value – for reasons that I'll explain
in a moment. But first, let's examine them separately:
As much as Big Data is about volume, velocity is in fact an even more challenging aspect, especially when it comes to our industry. While volumes of data have certainly grown exponentially over the years, the speed at which data is being thrown at our systems
has accelerated even more. While some argue that certain market volumes are stabilizing, there is no end in sight to the increasing velocity with which we have to process data. Whether stemming from the low-latency race on the Capital Markets side, or from
expanded use of "ubiquitous computing" on the banking side, data is being generated at ever increasing speed. But there's yet another side to velocity which is causing a fundamental shift in the industry – the speed at which data needs to be analyzed.
The recent regulatory focus on data is shifting compliance from periodic standardized reporting to on-demand analytics, which necessitates slicing the data in "real time." The latter term is used loosely in this context (differentiated from real-time systems),
referring to analytics that are performed against current data, rather than “stale” data views typical in legacy reporting systems.
The need to go beyond legacy capabilities became apparent in 2008, when many firms were tasked with figuring out their exposure to certain counterparties, across all interactions, asset classes, geographical locations, etc. The heroic efforts required to
answer these questions served as a wake-up call to the entire industry, and current regulations are trying to address this by forcing financial institutions to get a grip on their data. However, it's not only regulators that are putting more emphasis on data.
Firms themselves have realized just how critical the ability to get an accurate, timely view of their exposures really is, and as a result are demanding a significant increase in the speed of data analysis, requiring that overnight calculation and reporting
cycles historically taking hours to aggregate and analyze, be reduced to minutes or less.
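To make that shift concrete, here is a minimal sketch of incremental aggregation: exposures are updated as each trade arrives, so a counterparty query reads a running total instead of waiting for an overnight batch. The `Trade` and `ExposureBook` names are purely illustrative, not drawn from any particular system.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Trade:
    counterparty: str
    notional: float  # signed exposure contribution

class ExposureBook:
    """Maintains running per-counterparty exposure so a 'what is our
    exposure to X?' query is answered instantly, not by a batch scan."""
    def __init__(self):
        self._exposure = defaultdict(float)

    def on_trade(self, trade: Trade) -> None:
        # Update the aggregate incrementally as each trade arrives.
        self._exposure[trade.counterparty] += trade.notional

    def exposure_to(self, counterparty: str) -> float:
        return self._exposure[counterparty]

book = ExposureBook()
book.on_trade(Trade("ACME", 1_000_000.0))
book.on_trade(Trade("ACME", -250_000.0))
book.on_trade(Trade("GLOBEX", 500_000.0))
print(book.exposure_to("ACME"))  # 750000.0
```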
Now let's shift our focus to value (which, unlike all the other “V's”, is low rather than high). Low value, or low information density, is common to Big Data use cases involving social media, where the signal-to-noise ratio is very low. In our industry, one
could argue that no data is unimportant – every piece of data is valuable to someone. This is certainly true, but within the context of any specific analysis one is still faced with a low signal-to-noise ratio. Consider looking for a correlation
within a large historical market data set, performing sentiment analysis, or searching for e-mails associated with a transaction facing legal scrutiny (all based on real-life use cases).
In each of these cases, the system has to sift through vast amounts of irrelevant data to find the few pieces that are important, and this is where value ties in to velocity. From a technology perspective, both are about high velocity analytics, and I would
argue that this aspect of Big Data – which is one of the most challenging – is also the one that, at the end of the day, is most visible to users.
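As a toy illustration of that signal-to-noise problem, the sketch below scans a synthetic "universe" of mostly random series for the few that actually correlate with a target series; all names, sizes, and thresholds are invented for the example.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

random.seed(42)
target = [random.gauss(0, 1) for _ in range(200)]

# A universe of candidate series that is almost entirely noise
# (low information density)...
universe = {f"series_{i}": [random.gauss(0, 1) for _ in range(200)]
            for i in range(300)}
# ...with one genuinely related series hidden among them.
universe["series_hidden"] = [t + random.gauss(0, 0.1) for t in target]

# The system must sift the whole universe to surface the rare signal.
hits = {name: r for name, series in universe.items()
        if abs(r := pearson(target, series)) > 0.8}
print(sorted(hits))  # only the hidden series clears the bar
```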
Some of the technologies and patterns discussed in previous posts are quite applicable to this challenge:
- Data grids, map-reduce frameworks, and data warehouses working in concert to move compute tasks closer to the data, rather than the other way around
- "Real-time data warehousing" based on change data capture rather than ETL
- In-database analytics, including OLAP, data mining, statistical analysis and semantics
- Engineered machines, combining hardware, software and networking to achieve extreme performance
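The first of those patterns – running the computation where the data lives and shipping back only small partial results – can be sketched as a toy map-reduce; the partitions and merge logic here are illustrative only.

```python
from functools import reduce

# Data lives in partitions (say, per-region trade stores). Rather than
# shipping every row to one node, each partition runs the map locally
# and only small partial aggregates travel "over the network".
partitions = [
    [("ACME", 100.0), ("GLOBEX", 50.0)],
    [("ACME", -30.0)],
    [("GLOBEX", 25.0), ("ACME", 10.0)],
]

def local_map(rows):
    """Runs where the data is: aggregate one partition into a small dict."""
    out = {}
    for cpty, amt in rows:
        out[cpty] = out.get(cpty, 0.0) + amt
    return out

def merge(a, b):
    """Reduce step: combine two partial aggregates."""
    merged = dict(a)
    for k, v in b.items():
        merged[k] = merged.get(k, 0.0) + v
    return merged

totals = reduce(merge, map(local_map, partitions), {})
print(totals)  # {'ACME': 80.0, 'GLOBEX': 75.0}
```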
In addition, I'd like to mention a couple of other technologies that are particularly relevant to high velocity analytics.
Some engineered machines available today go the extra mile by integrating in-memory databases and business intelligence (BI) tools. Since the latter have knowledge about data access patterns, they can capture the most relevant data in an in-memory database
and analyze it at very high speeds. This goes above and beyond typical database caching, which is dependent on query properties. In this case, the BI server has knowledge about the actual usage of the data, and is capable of making "educated guesses" based
on user interaction and the flow of information. Essentially, this puts the tool most visible to the user in charge of low-level data movement and caching mechanisms, dissolving the boundaries between categories of tools. This is yet another example of the
concept of the data management technology continuum, which is so important when it comes to Big Data problems.
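A rough sketch of what such usage-aware caching might look like, reduced to a few lines; the class and the frequency-based eviction policy are assumptions for illustration, not any vendor's actual mechanism.

```python
from collections import Counter

class UsageAwareCache:
    """Sketch of a BI-layer cache that decides what stays in memory from
    observed access patterns, not from individual query keys.
    `loader` stands in for the (slow) warehouse fetch."""
    def __init__(self, loader, capacity=2):
        self.loader = loader
        self.capacity = capacity
        self.hits = Counter()  # observed access pattern per dataset
        self.memory = {}       # the in-memory working set

    def get(self, dataset):
        self.hits[dataset] += 1
        if dataset not in self.memory:
            value = self.loader(dataset)
            self._admit(dataset, value)
            return value
        return self.memory[dataset]

    def _admit(self, dataset, value):
        self.memory[dataset] = value
        if len(self.memory) > self.capacity:
            # Evict the least-used dataset: an "educated guess" that the
            # hot datasets will be asked for again soon.
            coldest = min(self.memory, key=lambda d: self.hits[d])
            del self.memory[coldest]

fetches = []
def warehouse_fetch(name):
    fetches.append(name)  # track expensive trips to the warehouse
    return f"rows:{name}"

cache = UsageAwareCache(warehouse_fetch, capacity=2)
for name in ["risk", "risk", "pnl", "audit", "risk"]:
    cache.get(name)
print(fetches)  # the hot "risk" dataset is fetched only once
```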
Event-Driven Architecture
While the term event-driven architecture may hark back to the days when SOA was at the top of the hype curve, it is actually still quite relevant today, and there is much more to it than one may realize. It is fundamentally about a push model of computing,
and while it's been well-established in certain areas (most graphical user interfaces, including Rich Internet Applications, are based on this notion), in the context of data management it hasn't been getting a lot of attention. It is interesting to consider
that beyond object eventing and enterprise messaging (the most common form of EDA), there are other technologies that implement this architecture using very different mechanisms. A couple mentioned before actually fall into this category – certain data grids
provide observation mechanisms (using JMX, etc.) which can be used to trigger events based on changes in the underlying data objects. Similarly, change data capture mechanisms provided with certain data integration tools can implement EDA by triggering events
such as transformation and computation in response to data changes transferred from the source data stores by the tool (check out my earlier posts for more on that).
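The common thread is a push model: a change in the data itself triggers downstream work, rather than downstream consumers polling for it. A minimal sketch, with all names invented for the example:

```python
class ObservableStore:
    """Minimal push-model sketch: listeners are invoked when data changes,
    instead of consumers polling or re-querying for changes
    (illustrative, not any vendor's API)."""
    def __init__(self):
        self._data = {}
        self._listeners = []

    def subscribe(self, listener):
        self._listeners.append(listener)

    def put(self, key, value):
        old = self._data.get(key)
        self._data[key] = value
        for listener in self._listeners:
            listener(key, old, value)  # push the change event downstream

events = []
# A downstream "transformation" triggered by the change itself,
# analogous to change data capture firing a computation.
store = ObservableStore()
store.subscribe(lambda k, old, new: events.append((k, old, new)))
store.put("EURUSD", 1.0842)
store.put("EURUSD", 1.0851)
print(events)  # [('EURUSD', None, 1.0842), ('EURUSD', 1.0842, 1.0851)]
```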
Complex Event Processing (CEP) is another technology that hasn't been getting much attention in this space. In the context of data management, CEP presents a revolutionary notion of streaming data through algorithms rather than using algorithms to
query the data. These days it is commonly used in the front office for market data processing and algorithmic trading, yet it remains to be seen whether CEP can also take hold in the middle and back office and fulfill its potential for data management and analytics.
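To illustrate the "streaming data through the algorithm" idea, here is a toy standing query that reacts tick by tick as prices flow through it; the window size, threshold, and class name are all illustrative.

```python
from collections import deque

class MovingAverageAlert:
    """CEP-style standing query: instead of querying stored data, the data
    streams through the algorithm, which reacts to each tick (a sketch)."""
    def __init__(self, window, threshold):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def on_tick(self, price):
        self.window.append(price)
        avg = sum(self.window) / len(self.window)
        # Fire an event whenever the standing condition is met.
        if len(self.window) == self.window.maxlen and avg > self.threshold:
            return f"ALERT avg={avg:.2f}"
        return None

detector = MovingAverageAlert(window=3, threshold=100.0)
alerts = [a for p in [99.0, 100.0, 101.0, 103.0, 105.0]
          if (a := detector.on_tick(p))]
print(alerts)  # ['ALERT avg=101.33', 'ALERT avg=103.00']
```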
Clearly, there's a plethora of technologies available to tackle on-demand Big Data challenges. But there is one aspect that's more important than technology, which of course is the human one. Let's not forget that some of the data management challenges
that plague our industry have resulted directly from the fractured organizational structure common to most financial institutions, and the proprietary nature of their business. There has always been a reluctance to share information, even within the firm,
and a tendency toward ambiguity and complexity when facing the outside world. Without getting into a discussion about the nature of our industry, it is clear that some of this "nature" will have to change. The regulators seem to understand this, as does the
upper management of most institutions. The big question is whether the rest of the organization – including IT – can follow suit.