As we start the week at SIBOS, big data is a major focus. This year the conference takes a deep dive into big data in an Innotribe session, part of a larger initiative to enable collaboration in financial services. It seems fitting, then,
to share our thoughts on the subject here.
Systemic risk was at the heart of the financial crisis of 2008, and current regulatory and industry efforts are focusing on getting a more accurate view of risk exposures across asset classes, lines of business, and organizations in order to predict systemic
Managing large amounts of structured and unstructured data (including reference data, over-the-counter (OTC) contracts, positions data, etc.) is a key aspect of these efforts, and is one of the reasons data management is suddenly becoming top of mind in
After many years relegated to the back burner, data is finally considered a key contributor to business processes, with full awareness at the top executive ranks.
Big data is climbing up the Gartner Hype Cycle, and as such has many different definitions and associated opinions. However, the following three key aspects seem to emerge:
1. Large, distributed aggregations
After spending much time and effort building high-performance compute grids to handle complex valuations and analytics, we are now at the point where data is becoming a real bottleneck. Getting large amounts of high-quality data to feed these compute
tasks in a timely manner is a big challenge facing our industry.
Given that, most of us have realized we have to start bringing the compute to the data rather than the other way around, since it's usually much easier to ship compute tasks over the wire than it is to move large volumes of data. This applies in both the
relational and non-relational realms:
- Pre-engineered database machines can push query execution right down to the storage nodes, using their "brain power" to minimize the amount of data that has to be shipped over the network and to heavily compress whatever data is shipped.
- Data grids allow programmers to handle compute tasks and their associated data on the same highly distributed platform, moving the tasks rather than large sets of data back and forth.
- NoSQL technologies are getting most of the attention these days, and are taking a similar approach – using different mechanisms such as distributed file systems as their underlying storage platform, and mapping compute tasks to the nodes storing the data
– to minimize data movement.
It's worth noting that each of these three approaches suits different use cases (e.g., a database machine for analytical dashboards, a data grid for real-time position capture, or a NoSQL implementation for large-scale batch analytics).
They are by no means mutually exclusive, and could furthermore be combined to tackle more complex use cases as detailed below.
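To make the "bring the compute to the data" idea concrete, here is a minimal sketch in Python. It is not how any particular database machine, data grid, or NoSQL product works; the `Node` class, the record fields, and the counterparty names are all hypothetical, chosen only to show a task being shipped to each partition so that small partial results, rather than raw positions, travel back.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Node:
    """Hypothetical storage node holding one partition of position records."""
    name: str
    positions: List[dict] = field(default_factory=list)

    def run(self, task: Callable[[List[dict]], Dict[str, float]]) -> Dict[str, float]:
        # The task executes where the data lives; only its result leaves the node.
        return task(self.positions)


def exposure_by_counterparty(positions: List[dict]) -> Dict[str, float]:
    """Compute task shipped to each node: sum notional per counterparty."""
    totals: Dict[str, float] = {}
    for p in positions:
        totals[p["counterparty"]] = totals.get(p["counterparty"], 0.0) + p["notional"]
    return totals


def merge(partials: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Combine the small per-node results into one firm-wide view."""
    combined: Dict[str, float] = {}
    for partial in partials:
        for counterparty, amount in partial.items():
            combined[counterparty] = combined.get(counterparty, 0.0) + amount
    return combined


# Illustrative data only: two partitions of made-up positions.
nodes = [
    Node("node-a", [
        {"counterparty": "BankX", "notional": 100.0},
        {"counterparty": "BankY", "notional": 50.0},
    ]),
    Node("node-b", [
        {"counterparty": "BankX", "notional": 25.0},
    ]),
]

total_exposure = merge(node.run(exposure_by_counterparty) for node in nodes)
print(total_exposure)
```

Each node returns one number per counterparty regardless of how many positions it stores, which is the whole point: the volume crossing the wire scales with the size of the answer, not the size of the data.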
2. Loosely structured data
Most of the attention garnered by big data, up to this point, has focused on SQL vs. NoSQL, which in my opinion is a bit misguided. While there's a tremendous opportunity in utilizing NoSQL technologies for analytical use cases involving loosely structured
data, there are also many use cases where ad-hoc querying using tried-and-true SQL is essential. As evidence, consider all the efforts to develop a standard NoSQL query language. In fact, I don't see this as an "either/or" situation at all – the real power
comes from combining these technologies (e.g. providing business insight using BI tools from a high-performance data warehouse fed by a NoSQL technology) to tackle use cases that could not be addressed in the past.
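As a rough illustration of that combination, the sketch below takes loosely structured JSON documents (standing in for the output of a NoSQL store) and loads the fields they share into a relational table that can be queried with plain SQL. It uses Python's built-in `json` and `sqlite3` modules purely as stand-ins; the document fields, the `trades` table, and the values are assumptions made up for this example, not a real warehouse design.

```python
import json
import sqlite3

# Loosely structured documents: the fields vary from record to record.
raw_documents = [
    '{"trade_id": "T1", "asset_class": "rates", "notional": 100.0, "tags": ["otc"]}',
    '{"trade_id": "T2", "asset_class": "credit", "notional": 40.0}',
    '{"trade_id": "T3", "asset_class": "rates", "notional": 60.0, "desk": "NY"}',
]

# An in-memory SQLite database stands in for the data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trades (trade_id TEXT PRIMARY KEY, asset_class TEXT, notional REAL)"
)

for doc in raw_documents:
    record = json.loads(doc)
    # Keep only the fields the relational schema knows about; extras are ignored.
    conn.execute(
        "INSERT INTO trades (trade_id, asset_class, notional) VALUES (?, ?, ?)",
        (record["trade_id"], record.get("asset_class"), record.get("notional")),
    )

# Ad-hoc SQL of the kind a BI tool would issue.
rows = conn.execute(
    "SELECT asset_class, SUM(notional) FROM trades "
    "GROUP BY asset_class ORDER BY asset_class"
).fetchall()
print(rows)
```

The loosely structured side absorbs whatever shape the source data arrives in, while the SQL side gives analysts the ad-hoc querying they already know.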
Semantic technology is another set of technologies especially suited to handling and linking unstructured data, but it is not getting much exposure as part of this discussion.
Several firms have been using this technology for market sentiment analysis (gleaning it from unstructured media content), but only recently has its suitability for handling internal unstructured data, such as OTC contracts, been recognized. This is notable,
as handling OTC derivatives data is a key part of Dodd-Frank regulations, and organizations like the Enterprise Data Management Council have been making inroads to effectively model these instruments using semantic technology.
3. Incomplete, difficult to access data
This has been the dirty little secret of recent "paradigm shifts" in IT. Whether it's SOA or cloud computing, sooner or later one has to face the tough challenge of enterprise data management, especially the integration of difficult to access, incomplete
data. I recall a conversation with the CIO of a major bank after he finished a large SaaS implementation. “The project went brilliantly,” he said, “until we had to start integrating it with our internal systems.”
While SOA, REST, etc. provide good mechanisms for application integration, the time has finally come to focus our attention as an industry on data integration. Going back to the business motivation – requiring aggregations of vast amounts of disparate data
in a timely manner – it is clear that this is where we need to make the most progress. Overnight extract, transform, and load jobs just don't cut it anymore.
Fortunately, there are alternatives. Technologies based on change data capture are now being applied to the data integration problem, after being used mostly for disaster recovery replication in the past. The ability to use change data capture to replicate
transactional data into the staging schema of a data warehouse is a radical approach, but one that is finally making the vision of real-time data warehousing a reality. By combining it with the other technologies mentioned above, we can extract, harmonize,
and analyze vast amounts of disparate, loosely structured data in a continuous, on-demand manner, making it available to executives and regulators alike.
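The core idea of change data capture can be sketched in a few lines of Python: instead of reloading everything overnight, a stream of individual change events is applied to the staging copy as they occur. This is only an illustration under simplifying assumptions. The event format (`op`, `key`, `row`), the `apply_changes` helper, and the trade records are hypothetical, and real products typically read changes from the database transaction log rather than from a Python list.

```python
from typing import Dict, Iterable


def apply_changes(staging: Dict[str, dict], changes: Iterable[dict]) -> Dict[str, dict]:
    """Apply captured change events, in order, to a staging table keyed by trade ID."""
    for change in changes:
        op, key = change["op"], change["key"]
        if op in ("insert", "update"):
            # Merge the changed columns over whatever is already staged.
            staging[key] = {**staging.get(key, {}), **change["row"]}
        elif op == "delete":
            staging.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return staging


# Staging copy as of the last sync (illustrative data only).
staging = {
    "T1": {"instrument": "IRS", "notional": 10_000_000},
    "T3": {"instrument": "FX swap", "notional": 2_000_000},
}

# Changes captured from the transactional system since then.
change_log = [
    {"op": "update", "key": "T1", "row": {"notional": 12_000_000}},
    {"op": "insert", "key": "T2", "row": {"instrument": "CDS", "notional": 5_000_000}},
    {"op": "delete", "key": "T3"},
]

apply_changes(staging, change_log)
print(staging)
```

Only three small events cross the wire here, yet the staging copy ends up current, which is what makes continuous, intraday refresh feasible where a full extract would not be.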
In fact, some organizations are already getting close to realizing this vision, benefiting from continuous valuations and pricing to reduce market risk, exercising more stringent pre-trade risk management, reporting their liquidity risk intraday based on
fresh data, and reducing their counterparty risk by getting a complete, cross-asset view of their exposures. Hopefully, with more firms joining the trend to tackle big data head-on, and with continued collaboration between the industry and the regulators,
real progress can be made in the financial system's ability to handle future crises.
This blog was contributed by Amir Halfon, Senior Director of Technology, Capital Markets, Oracle Financial Services.