For my first post, I thought I'd look at some of the issues that exist around managing trading data. After all, market trading generates phenomenal amounts of data. The NYSE, for example, saw more than 3.17 billion trades on the 18th of December 2015, with
a value of over $110 billion. The NASDAQ has more than 2,000 equities that are being actively traded. As each market trade takes place, banks and financial services institutions have to ingest and track these transactions over time. The volume of trades for
each equity can reach a few million ticks in a single day, and far more for equity options; multiply this across multiple stocks and exchanges, and the volume of data created over time is vast.
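To get a feel for the scale, here is a rough back-of-envelope calculation. Only the equity count comes from the figures above; the ticks per equity, the record size and the number of trading days are illustrative assumptions, so treat the result as an order of magnitude rather than a measurement.

```python
# Back-of-envelope estimate. Every figure is an illustrative assumption
# except the equity count, which the text gives as "more than 2,000".
EQUITIES = 2_000
TICKS_PER_EQUITY_PER_DAY = 1_000_000   # assumed: conservative end of "a few million"
BYTES_PER_TICK = 50                    # assumed size of one stored tick record
TRADING_DAYS_PER_YEAR = 252            # assumed


def daily_ticks(equities: int, ticks_per_equity: int) -> int:
    """Total ticks generated across all equities in one day."""
    return equities * ticks_per_equity


def storage_bytes(ticks: int, bytes_per_tick: int, days: int = 1) -> int:
    """Raw storage needed for `ticks` ticks per day over `days` days."""
    return ticks * bytes_per_tick * days


ticks = daily_ticks(EQUITIES, TICKS_PER_EQUITY_PER_DAY)
per_day = storage_bytes(ticks, BYTES_PER_TICK)
per_year = storage_bytes(ticks, BYTES_PER_TICK, TRADING_DAYS_PER_YEAR)

print(f"{ticks:,} ticks/day, {per_day / 1e9:.0f} GB/day, {per_year / 1e12:.1f} TB/year")
# prints: 2,000,000,000 ticks/day, 100 GB/day, 25.2 TB/year
```

Even with these conservative inputs, and before counting options, replication or indexes, a single exchange's equities land in the tens of terabytes per year.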
Investment banks rely on having this complete historical intraday market data to analyse trends for their positions in stocks, futures, options and forex markets. In order to provide accurate analyses, no market data can be lost and everything must be captured.
By collecting and storing high fidelity tick data, investment banks can better empower data scientists, quants and analysts to identify the trends and insights hidden in the vast amounts of market data, optimise their predictive algorithms and models, and
take proactive or immediate actions.
However, traditional technologies used for storing data are not able to keep up with the huge volumes of data that high frequency trading can generate; at the same time, the increased importance of data for decision making means that each transaction itself
has more value to the teams involved. So how can banking IT teams keep up?
The traditional approach to managing market tick data
In the past, market tick data was often stored in isolated silos across different systems within a bank or financial organisation. These silos grew as each desk or section of the organisation saved data to meet its own needs.
Over time, they can grow beyond their original requirements to become critical to operations. Silos may also exist because each team has data requirements that differ from those of other teams; saving this data into one place was viewed as being beyond
the capability of a single database, leading to multiple datasets being created.
However, this also creates complexity that makes it extremely difficult to build a holistic view of the market data that exists within the bank or hedge fund. In turn, this often prevents the institution from getting a “single source of the truth” that
would simplify data management and improve use of this data in context. Without this single data source, it becomes extremely difficult to unlock market trend insights, build advanced predictive models and serve the right information to highly available trading applications.
Traditional database management systems have found it difficult to scale alongside the volume of trades that are now taking place. New approaches to capturing and storing this data over time have to be considered. As the data needs to be collected as it
is created, a high data ingest rate is critical, while the data itself has to be available with low latency.
This is particularly challenging when the data itself is needed in multiple locations at the same time. Traditional relational and in-memory databases are too structurally fragile to serve such demands of high availability because they have a single point
of failure, an inherent limitation of a master-slave architecture. In these environments, one server has to act as the lead node within any cluster and manage where data is stored within the overall database. Capturing data has to go through this initial point
of control, so any failure can lead to hundreds or thousands of missed trade data points and incomplete records.
New volumes of data require new data architectures
In order to keep up with the volume of market tick data that is being created every day, it’s important to look at alternative database architectures to the traditional master-slave approach. In a masterless or peer-to-peer architecture, there is no “lead node”:
all the nodes are equally capable of performing operations. Using this distributed computing architecture, trade data is always replicated to multiple machines in real time.
In the background, a distributed computing architecture manages its data and operations so that data is spread across multiple nodes. This provides better business continuity and availability of data for the business, as the loss of any one node would not
affect the ability to process transactions or use data for analysis.
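As a rough illustration of the idea, here is a toy in-memory model of a masterless cluster. It is a sketch under stated assumptions, not how any particular database implements replication: the `Cluster` class, the node names, the key format and the hash-based placement are all hypothetical.

```python
import hashlib


class Cluster:
    """Toy masterless cluster: any node can take a write, and each key is
    copied to `replication_factor` nodes chosen by hashing the key."""

    def __init__(self, node_names, replication_factor=3):
        self.nodes = {name: {} for name in node_names}   # node -> local store
        self.order = sorted(node_names)                   # fixed ring order
        self.down = set()                                 # nodes currently offline
        self.rf = replication_factor

    def replicas_for(self, key):
        # Hash the key to a starting position, then walk the ring.
        start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(self.order)
        return [self.order[(start + i) % len(self.order)] for i in range(self.rf)]

    def write(self, key, value):
        # Store on every replica that is currently reachable.
        written = [n for n in self.replicas_for(key) if n not in self.down]
        for n in written:
            self.nodes[n][key] = value
        return written

    def read(self, key):
        # Any live replica can answer; there is no lead node to consult.
        for n in self.replicas_for(key):
            if n not in self.down and key in self.nodes[n]:
                return self.nodes[n][key]
        return None


# Hypothetical tick keyed by symbol and timestamp.
cluster = Cluster(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
key = "ABC|2015-12-18T14:30:00.000123"
cluster.write(key, {"price": 10.0, "size": 200})

cluster.down.add(cluster.replicas_for(key)[0])   # lose one of the replicas
tick = cluster.read(key)                          # still served by another replica
```

The sketch leaves out everything a production system needs once a failed node returns, such as replaying the writes it missed, but it shows why losing one machine does not lose the tick.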
At the same time, nodes can be geographically dispersed as well: the masterless architecture enables data to be replicated across not just multiple machines within a data centre, but across multiple data centres in real time. This allows multiple locations
to be active and used to meet user requests. This active-active architecture allows banks to realise cost savings by distributing load across all their data centre locations, as opposed to the traditional active-passive model employed by legacy technologies.
By using multiple active-active data centres, it’s possible to direct users to the closest location to reduce latency. This can help improve performance over time, as the time-series data used within analytics is “fresher” compared to using data that has
had to come from the other side of the world. When companies base their competitive edge on issues like latency, this can provide a real benefit.
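A minimal sketch of that routing decision might look like the following. The `DataCentre` type, the location names and the latency figures are made up for illustration; in practice the latency would come from whatever measurements the client or load balancer already collects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCentre:
    name: str
    latency_ms: float      # round-trip time seen by the client (assumed input)
    healthy: bool = True


def closest_active(centres):
    """Return the healthy data centre with the lowest latency."""
    candidates = [dc for dc in centres if dc.healthy]
    if not candidates:
        raise RuntimeError("no healthy data centre available")
    return min(candidates, key=lambda dc: dc.latency_ms)


# Hypothetical view from a client sitting in London.
centres = [
    DataCentre("london", 4.0),
    DataCentre("new-york", 72.0),
    DataCentre("singapore", 168.0),
]
chosen = closest_active(centres)   # the London data centre

# If the nearest site is unavailable, the next closest active site takes over.
fallback = closest_active([DataCentre("london", 4.0, healthy=False), *centres[1:]])
```

Because every location is active, the fallback is simply another working site rather than a passive standby that has to be brought online first.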
Consolidating storage and databases
For investment banks, it is not uncommon to have market tick data stored across dozens of various systems that have been inherited over time. Whether it is due to changes in suppliers, mergers and acquisitions, or different departments taking their own approach
to procuring storage, there can be a patchwork quilt of data suppliers and storage underneath that market tick data.
These systems are often used to serve different department needs such as risk management, portfolio analysis, trade flow and regulatory compliance systems. These requirements can also be in place across different geographies; for example, a team in Asia-Pacific
may need the same data for its own decision-making as is stored in the US or the UK. However, while each team may have data to meet its own particular requirements, this duplication makes it difficult to create a clean data store, let alone to gain a holistic view of the
data and use it for wider analysis.
Because of this data duplication, it can be difficult for teams within investment banks to get that complete picture of performance. When carrying out analytics and reporting, teams run into information trust issues because the data is inconsistent depending
on where it is accessed from and who owns it.
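To make the trust problem concrete, here is a small sketch that compares the same ticks as held by two silos and flags the ones that disagree. The silo names, tick identifiers and prices are invented for the example, and a real reconciliation would compare more than the price field.

```python
from collections import defaultdict


def find_conflicts(silos):
    """silos: mapping of silo name -> {tick_id: price}.
    Returns the tick ids whose recorded price differs between silos."""
    seen = defaultdict(dict)             # tick_id -> {silo: price}
    for silo, ticks in silos.items():
        for tick_id, price in ticks.items():
            seen[tick_id][silo] = price
    return {
        tick_id: by_silo
        for tick_id, by_silo in seen.items()
        if len(set(by_silo.values())) > 1
    }


# Hypothetical copies of the same two ticks held by different departments.
silos = {
    "risk":       {"ABC-0001": 10.00, "ABC-0002": 10.05},
    "compliance": {"ABC-0001": 10.00, "ABC-0002": 10.50},
}
conflicts = find_conflicts(silos)
# conflicts == {"ABC-0002": {"risk": 10.05, "compliance": 10.5}}
```

With only two silos the check is trivial; with dozens of inherited systems, deciding which copy is correct becomes a project in its own right.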
Couple this with the brittle nature of traditional relational database management systems, and it’s not hard to spot why bank IT teams are reconsidering their approach to data. Building a single source of consolidated data can help reduce complexity
and cost for the overall operation, yet it can also be an opportunity to think ahead on how to accommodate the increasing demand that there is within banks around trading data.
Moving this data to a centralised platform does not, however, mean that all information has to exist in one place. Instead, implementing a distributed database platform can make more data available to everyone in a consistent way that reduces latency. At the same
time, moving data from existing legacy storage can help make that data available to new users and for new use cases that would previously have not been possible.
This migration process can be managed so that availability of information remains consistent. Implementing a distributed service layer to collate data from multiple systems into one place can make the migration process easier while leaving existing applications
in place. The silos can be brought together into one point, while the complex mainframe systems or enterprise applications can be replaced and decommissioned over time. Phasing the work in this way reduces the risk associated with migration.
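One way to picture such a service layer is as a function that reads from each legacy source and presents a single time-ordered feed. The sketch below assumes each source already yields its ticks in timestamp order; the `Tick` fields, the source names and the de-duplication rule are hypothetical choices for the example.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Tick(NamedTuple):
    timestamp: float   # seconds since epoch
    symbol: str
    price: float
    source: str        # which legacy system supplied the record


def unified_feed(*sources: Iterable[Tick]) -> Iterator[Tick]:
    """Merge several time-ordered sources into one time-ordered stream,
    dropping ticks that have already been seen from another source."""
    seen = set()
    for tick in heapq.merge(*sources, key=lambda t: t.timestamp):
        identity = (tick.timestamp, tick.symbol, tick.price)
        if identity not in seen:
            seen.add(identity)
            yield tick


# Two hypothetical legacy systems that overlap on the first tick.
legacy_mainframe = [
    Tick(1.0, "ABC", 10.00, "mainframe"),
    Tick(3.0, "ABC", 10.02, "mainframe"),
]
desk_database = [
    Tick(1.0, "ABC", 10.00, "desk-db"),
    Tick(2.0, "XYZ", 55.10, "desk-db"),
]
feed = list(unified_feed(legacy_mainframe, desk_database))
```

Applications read from `unified_feed` while the systems behind it are retired one at a time. Note that the `seen` set grows without bound here, so a real layer would need to limit de-duplication to a time window.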
For banks, looking at fully distributed database management systems can be an opportunity to reduce their spending on storage and database management systems. Taking out some of the redundant tools and storage platforms that exist at present can
also help bank IT teams improve their operations and provide a better quality of service back to the business.