“There is so much data out there, but I spend more time on reconciling, verifying and mapping it, rather than actually analysing it to enrich my decision-making process.”
Sound familiar? Then you’re not alone. It’s one of the most common complaints I hear when I speak to customers.
According to MongoDB, a leading cloud-based database software provider, up to 90 per cent of data generated worldwide is ‘unstructured’ – making it difficult to analyse. While that may suggest that ‘structured’ data should therefore be easier to deal
with, that’s not necessarily true, either.
A former mentor of mine once told me: “You will see every day a new data vendor coming up on the market, but when you start looking under the hood, there’s barely anyone who can take pride in the technology that helps users find that data.”
Whether data are structured, semi-structured or unstructured, there is no reason why they should be difficult to work with – so long as data providers follow eight principles for sourcing, integration, governance and usability:
1. Data connectivity. A database is not the sum of multiple tables. Each data point is connected, directly or indirectly, by common attributes. By enriching raw data observations with an extensive set of metadata attributes, you allow users to connect
the dots from one region or concept to another.
2. True-to-source. Algorithm-driven strategies must rely on the data as it was originally published to the market. To validate true-to-source data, you need to apply multiple layers of consistency and control checks to time-series
3. Unbiased / point-in-time information. For data to be true-to-source, it must also expose what was known at the time of publication to avoid any ex-post adjustment. The lack of point-in-time data can lead to look-ahead bias in a backtest or backcast
process – the nemesis of any quant – potentially resulting in false-positive signals in an investment strategy.
4. Standard naming conventions. The industry is always seeking to standardise the taxonomies of the instruments issued on the market (e.g., ISIN, CUSIP, SEDOL, MIC or ISO) to facilitate communication between systems and create a common reference point
or language. This has yet to be fully applied to macroeconomic concepts. Using an aliasing system and applying ISO conventions wherever possible eases data navigation.
5. Hierarchical relationships. As the number of time series in a database grows almost exponentially, it becomes nearly impossible to navigate a data tree that has no hierarchical structure. The use of extensive metadata allows users to categorise
each concept and connect one region to another. Also bear in mind that the data will be used and presented via business intelligence tools that offer a wide range of visualisation possibilities, so data connectivity (principle #1) and hierarchical relationships
make it easier to tell a story.
6. Aggregation / Sum of the parts. Macroeconomic concepts can often be aggregated into a holistic or total representation, or drilled down into a more granular sector or type-of-goods classification. Adding various levels to a concept
enables a quick aggregation or drill-down into these concepts through the metadata described in the fifth principle.
7. Data pass-through. In my experience, there is no data provider that can cover the full spectrum of requirements to feed an investment strategy. For instance, everyone provides ESG data, but some highly specialised providers have a competitive advantage
over a more generalist provider. Interest in alternative data has been building for quite a few years now, as investors and their providers seek not only to refine their investment strategies but also to differentiate themselves from a given benchmark. There is strong
commitment to high-frequency (daily/weekly), granular (per city/state), sentiment (news feeds) and workflow-oriented (nowcast/forecast) initiatives. The ability to quickly integrate third-party data through add-ons significantly enlarges the scope of
data one can collect.
8. Data discovery. Offering extensive time series coverage is only beneficial if users can find that data easily and efficiently. Firms must make the data easily discoverable without concealing any part of the database.
They should also offer online access to a dynamic catalogue and coverage statistics so customers can integrate the numbers into their internal portals.
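To make principles 1, 5 and 6 more concrete, here is a minimal Python sketch of how metadata attributes attached to each observation allow the same data to be rolled up into a total or drilled down by region or sector. The attribute names (`region`, `concept`, `sector`), the `aggregate` helper and all the figures are assumptions made up for illustration; they do not describe any particular provider's schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    # Hypothetical metadata attributes attached to each raw data point.
    region: str    # e.g. an ISO 3166 alpha-2 country code
    concept: str   # e.g. "exports"
    sector: str    # a more granular level below the concept
    value: float


def aggregate(observations, *levels):
    """Sum values grouped by the chosen metadata attributes."""
    totals = defaultdict(float)
    for obs in observations:
        key = tuple(getattr(obs, level) for level in levels)
        totals[key] += obs.value
    return dict(totals)


# Made-up figures, for illustration only.
DATA = [
    Observation("DE", "exports", "machinery", 40.0),
    Observation("DE", "exports", "chemicals", 25.0),
    Observation("FR", "exports", "machinery", 15.0),
    Observation("FR", "exports", "chemicals", 20.0),
]

by_concept = aggregate(DATA, "concept")            # total view of the concept
by_region = aggregate(DATA, "region", "concept")   # drill down by region
by_sector = aggregate(DATA, "concept", "sector")   # drill down by sector
```

Because every observation carries the same attributes, the total is always the sum of its parts, whichever level the user chooses to view.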
Based on my experience, what users want is maximum flexibility: to create their own macro view of the world for their investment processes, to consume feature-rich data, including point-in-time revisions, and to pay only for what is used in production (as opposed
to unlimited search via the application). And of course, the delivery of the data is key, with Web APIs playing a crucial part in any data and tech architecture.
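For readers who want to see what point-in-time revisions (principle 3) look like in practice, the sketch below stores every published vintage of a figure and answers the question "what was known on a given date?". The tuple layout, the `value_as_of` helper and the numbers are all hypothetical and serve only to illustrate the idea.

```python
from datetime import date

# Each vintage: (reference_period, published_on, value).
# Made-up figures, for illustration only.
VINTAGES = [
    ("2023-Q1", date(2023, 4, 28), 1.1),   # first release
    ("2023-Q1", date(2023, 5, 30), 1.3),   # first revision
    ("2023-Q1", date(2023, 6, 29), 2.0),   # second revision
]


def value_as_of(vintages, period, as_of):
    """Return the latest value for `period` published on or before `as_of`.

    Returns None if nothing had been published yet on that date.
    """
    known = [
        (published, value)
        for p, published, value in vintages
        if p == period and published <= as_of
    ]
    if not known:
        return None
    # Pick the most recently published vintage among those already known.
    return max(known, key=lambda item: item[0])[1]


# A backtest dated 1 May 2023 should only see the first release.
seen_in_backtest = value_as_of(VINTAGES, "2023-Q1", date(2023, 5, 1))
# Reading the latest revised value instead would introduce look-ahead bias.
latest_revised = value_as_of(VINTAGES, "2023-Q1", date(2023, 12, 31))
```

A strategy backtested on `latest_revised` would be acting on information that was not available at the time, which is exactly the look-ahead bias described above.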
 What is unstructured data? | MongoDB