In my last post, I discussed the challenges, patterns, and approaches to dealing with the most obvious of the “four V’s” of big data – volume – briefly touching on the subject of unstructured data and schema-less repositories. In this installment, I'd like
to focus on this topic in a bit more detail, and look at the next aspect of Big Data – variety – from several angles.
Variety refers to varying degrees of structure (or lack thereof) within the source data. While much attention has been given to loosely structured Web data, whether sourced from the Web itself (social media, etc.) or from Web server logs, I'd like to turn
to the topic of unstructured data within the financial institution’s firewall and focus on the challenge of linking diverse data with various levels of structure, rather than discussing the storage and analysis of unstructured data as a standalone problem.
Let’s start with a couple of examples. Financial institutions are under pressure to retain all interaction records related to transactions, such as phone calls, emails, instant messages, etc. Recently, more attention has been given to the linkage between
these records and the corresponding transactions handled by trade capture systems, order management systems, and the like. There is growing realization that regulations such as Dodd-Frank will require this linkage to be not only established, but readily available
for on-demand reporting. Aside from regulatory compliance, interaction records can also be quite useful for detecting rogue trading and other fraud, once effectively linked to transactional data.
OTC derivatives are another interesting example. As bilateral contracts, they carry critical information within their legal text, which must be deciphered before it becomes usable for analytics. Much regulatory attention has been given to some of these derivative instruments (such as swaps), in an effort to increase standardization and make them more transparent. Even as we move toward a central counterparty model, many derivatives remain quite obscure in terms of their core data elements. This is especially
true when derivation relationships have to be traversed in order to get to a complete picture of risk exposure and other aggregate data.
These examples make the case for both the challenge and the importance of integrating structured and unstructured data. With that in mind, I'd like to discuss a few enabling technical strategies.
As mentioned in my previous post, I see these technologies as a continuum of tools that can work in concert. Many customers using MapReduce and other schema-less frameworks have been struggling to integrate their outputs with structured data and analytics coming
from the RDBMS side. It’s becoming clear that, rather than choosing one over the other, the integration of the relational and non-relational paradigms provides the most powerful analytics by bringing together the best of both worlds.
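To make this concrete, here is a minimal sketch of what combining the two paradigms can look like: a schema-less, MapReduce-style aggregation over free-text interaction records, joined against a small relational trades table. All table names, identifiers, and data are invented for illustration.

```python
import sqlite3
from collections import defaultdict

# Structured side: a small relational store of trades (hypothetical schema).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE trades (trade_id TEXT, trader TEXT, notional REAL)")
db.executemany("INSERT INTO trades VALUES (?, ?, ?)",
               [("T1", "alice", 5_000_000.0), ("T2", "bob", 250_000.0)])

# Unstructured side: free-text interaction records tagged with a trade id.
messages = [
    ("T1", "please confirm the swap terms before close"),
    ("T1", "urgent - counterparty pushing back on collateral"),
    ("T2", "standard confirm, no issues"),
]

# Map-reduce-style step: aggregate message counts per trade, no schema needed.
counts = defaultdict(int)
for trade_id, _text in messages:
    counts[trade_id] += 1

# Integration step: combine the schema-less aggregate with relational attributes.
combined = []
for trade_id, trader, notional in db.execute(
        "SELECT trade_id, trader, notional FROM trades ORDER BY trade_id"):
    combined.append((trade_id, trader, notional, counts.get(trade_id, 0)))

print(combined)  # [('T1', 'alice', 5000000.0, 2), ('T2', 'bob', 250000.0, 1)]
```

The point is not the specific tooling but the join itself: the aggregate computed on the schema-less side only becomes analytically useful once it lands next to the relational attributes.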
There are several technologies that enable this integration; some of them fall into the traditional ETL category, while others take advantage of the processing power of MapReduce frameworks like Hadoop to perform data transformation in-place, rather than
doing it in a separate middle tier. Some tools combine this capability with in-place transformation at the target database as well, taking advantage of the compute capabilities of engineered systems and using change data capture to synchronize source and target, again without the overhead of a middle tier. In both cases, the overarching principle is real-time data integration: reflecting data changes instantly in a data warehouse – whether they originate from a MapReduce job or from a transactional system – so
that downstream analytics have an accurate, timely view of reality.
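The change-data-capture pattern behind this can be sketched in a few lines: capture each change at the source as an ordered event, then replay those events against the target. The log format and table names below are hypothetical; a real CDC pipeline would read the database's transaction log rather than an in-memory list.

```python
# Hypothetical stream of captured changes, in commit order.
source_change_log = [
    {"op": "insert", "key": "T1", "row": {"trade_id": "T1", "status": "NEW"}},
    {"op": "update", "key": "T1", "row": {"trade_id": "T1", "status": "CONFIRMED"}},
    {"op": "insert", "key": "T2", "row": {"trade_id": "T2", "status": "NEW"}},
    {"op": "delete", "key": "T2", "row": None},
]

warehouse = {}  # target: a keyed copy kept in sync, no middle tier in between

def apply_change(event, target):
    """Apply one captured change event to the target store."""
    if event["op"] in ("insert", "update"):
        target[event["key"]] = event["row"]
    elif event["op"] == "delete":
        target.pop(event["key"], None)

# Replay the log; the target converges on the source's current state.
for event in source_change_log:
    apply_change(event, warehouse)

print(warehouse)  # {'T1': {'trade_id': 'T1', 'status': 'CONFIRMED'}}
```

Because events are applied in commit order, the warehouse reflects the source's latest state as soon as each change arrives, which is exactly the "accurate, timely view" the downstream analytics need.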
Linked Data and Semantics
Linking data sets via semantic technology has gained considerable traction within the biomedical industry and is gaining momentum within financial services, thanks to the Enterprise Data Management Council and uptake from developers.
The broader notion of pointing at external sources from within a data set has been around for quite a long time, and the ability to point to unstructured data (whether residing in the file system or some external source) is merely an extension of that. Moreover, the ability to store XML natively within an RDBMS and query it with XQuery enables combining different degrees of structure while searching and analyzing the underlying data.
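As a toy illustration of mixing structured fields with embedded free text in one document, consider an OTC confirmation that carries both. The XML shape below is invented (it is not a real FpML document), and the query uses Python's standard library rather than in-database XQuery, but the idea is the same: structured and unstructured elements interrogated side by side.

```python
import xml.etree.ElementTree as ET

# Invented confirmation document: structured fields plus a legal-text clause.
doc = """
<confirmation>
  <tradeId>T1</tradeId>
  <notional currency="USD">5000000</notional>
  <legalText>Party A shall deliver collateral within two business days.</legalText>
</confirmation>
"""

root = ET.fromstring(doc)

# Structured elements come out as typed fields...
trade_id = root.findtext("tradeId")
notional = float(root.findtext("notional"))
currency = root.find("notional").get("currency")

# ...while the legal text is searched as free text in the same pass.
clause = root.findtext("legalText")

print(trade_id, notional, currency)
print("collateral clause present:", "collateral" in clause)
```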
New semantic technology takes this a step further by providing a set of formalized XML-based standards for storage, querying, and manipulation of data. Because of its heritage as part of the Semantic Web vision, it is not typically associated with Big Data
discussions, which in my mind is a big miss. While most NoSQL technologies fall into the categories of key-value stores, graph databases, or document databases, the semantic Resource Description Framework (RDF) triple store offers a distinct alternative. It is not
relational in the traditional sense, but still maintains relationships between data elements – including external ones – and does so in a flexible, extensible fashion.
A record in an RDF store consists of a "triple": subject, predicate, and object. This does not impose a relational schema on the data, which supports the addition of new elements without structural modifications to the store. Additionally, the underlying system can resolve references by inferring new triples from the existing records using a rule set. This is a powerful alternative to joining relational tables to resolve references in a typical RDBMS, while at the same time offering a more expressive way to model data than a key-value store.
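To show the triple-plus-inference idea without pulling in a full RDF engine, here is a deliberately simple stand-in: triples as plain (subject, predicate, object) tuples, and one hand-written rule that rolls exposure up from a subsidiary to its parent. The entities, predicates, and the rule itself are all invented for illustration; a real triple store would express this as SPARQL plus RDFS/OWL entailment.

```python
# Toy triple store: each record is a (subject, predicate, object) tuple.
triples = {
    ("swap-123", "referencesEntity", "acme-corp"),
    ("acme-corp", "subsidiaryOf", "acme-holdings"),
    ("cds-456", "referencesEntity", "acme-holdings"),
}

def infer_parent_exposure(store):
    """Rule: if X referencesEntity E and E subsidiaryOf P,
    infer X referencesEntity P (exposure rolls up to the parent)."""
    inferred = set()
    for s, p, o in store:
        if p != "referencesEntity":
            continue
        for s2, p2, o2 in store:
            if s2 == o and p2 == "subsidiaryOf":
                inferred.add((s, "referencesEntity", o2))
    return inferred

# Materialize the inferred triples alongside the asserted ones.
triples |= infer_parent_exposure(triples)

# Query: everything exposed to acme-holdings, directly or by inference.
exposed = sorted(s for s, p, o in triples
                 if p == "referencesEntity" and o == "acme-holdings")
print(exposed)  # ['cds-456', 'swap-123']
```

Note that adding a new predicate (say, "guaranteedBy") requires no structural change to the store, only new triples and, optionally, a new rule: exactly the flexibility the relational schema lacks.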
Lastly, one of the most powerful aspects of semantic technology comes from the world of linguistics and Natural Language Processing (NLP): entity extraction. This capability provides a powerful mechanism to extract information from unstructured data and combine it with transactional data, enabling deeper analytics by bringing these worlds closer together.
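A minimal sketch of the idea: pull candidate entities out of a free-text interaction record so they can be matched against transactional fields. Real deployments would use a trained NLP toolkit; the regex patterns, message text, and entity names below are a deliberately simple, invented stand-in.

```python
import re

# Invented interaction record mentioning an instrument and a notional amount.
message = "Bob agreed to sell 5,000,000 USD notional of ACME swaps at close."

# Toy "entity extraction": an uppercase token followed by "swaps" is treated
# as the instrument; a number before a currency code as the notional.
ticker = re.search(r"\b[A-Z]{3,5}\b(?=\s+swaps)", message)
amount = re.search(r"([\d,]+)\s+([A-Z]{3})\s+notional", message)

entities = {
    "instrument": ticker.group(0) if ticker else None,
    "notional": int(amount.group(1).replace(",", "")) if amount else None,
    "currency": amount.group(2) if amount else None,
}
print(entities)  # {'instrument': 'ACME', 'notional': 5000000, 'currency': 'USD'}
```

Once the record is reduced to fields like these, linking it to a trade capture system becomes an ordinary join on instrument, amount, and time window.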
In my next posts, I’ll move beyond how institutions can best manage the vast amounts and variety of data, and focus on how to harness the true value of Big Data in a financial landscape where speed in analytics and insight is increasingly critical.