I recently wrote a Finextra piece entitled 3 GenAI Use Cases for Capital Markets; The Power of the Vector. In it, I discussed the increasing importance of the so-called vector database and vectors more generally to a whole range of quantitative finance applications.
The term vector database, as I discussed in that piece, carries multiple, overloaded meanings, like the words bat, flat, duck or prompt. With GenAI and so-called Large Language Models (LLMs), the term has come to mean, specifically, a "memory store" centered around "vector embeddings": model-encoded outputs that follow prescribed mathematical vector formats and properties (dimension, magnitude, distance, etc.) to allow easy indexing and search when run live. However, I was brought up on "vectors as a stream of data to manipulate in one operation," which is how I, and any R, MATLAB, NumPy, q, or Julia programmer, would describe a vector-native application (or data type), so to me a vector database can mean something different. That said, something vector native, like any of the MATLAB-like applications referenced, can hold the same vector embeddings too, given sufficient memory.
But I'm not here to semantically deconstruct the term vector database, not this time at least. I do want to explore what happens within them when you prompt, though. Perhaps you ask, "describe my next cat in the style of Demis Roussos?" or "draw a picture of my future spouse standing by a bus stop." Your words are searched across those stored memories - vectors - and indexes of those vectors. It's like finding a book in a big library using the catalogs within. Such vectors are beautifully geared to "similarity searches," which are relatively simple and respond quickly with context (albeit with a lot of compute when carried out at scale). Then the output gets compiled: the cat description built, a lovely picture of your future spouse by the bus stop drawn, music created, code submitted, or whatever.
Thus vector embeddings aim to capture relevant semantic, contextual, or structural information, with embedding models employing techniques and algorithms appropriate to the type of data and to the key characteristics ultimately represented. Text embeddings, for example, capture the semantic meaning of words and their relationships within a language. They can encode semantic similarities between words, such as "king" being closer to "queen" than to "chicken," with "elvis" somewhere in between.
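To make the "king", "queen", "chicken" idea concrete, here is a minimal sketch of a brute-force similarity search in plain Python. The three-dimensional vectors are invented purely for illustration (real embedding models produce hundreds or thousands of dimensions), and the names `EMBEDDINGS` and `most_similar` are hypothetical. It ranks by cosine similarity, one of the measures described later in this piece.

```python
import math

# Hypothetical toy "embeddings", made up for illustration only.
# They are NOT the output of any real embedding model.
EMBEDDINGS = {
    "king":    [0.90, 0.80, 0.10],
    "queen":   [0.85, 0.90, 0.15],
    "elvis":   [0.60, 0.30, 0.50],
    "chicken": [0.10, 0.05, 0.90],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_word, embeddings, top_k=3):
    """Score every other stored vector against the query; best match first."""
    query = embeddings[query_word]
    scored = [
        (word, cosine_similarity(query, vector))
        for word, vector in embeddings.items()
        if word != query_word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


ranking = most_similar("king", EMBEDDINGS)
for word, score in ranking:
    print(f"{word}: {score:.3f}")
# With these toy vectors the order is queen, then elvis, then chicken.
```

This compares the query against every stored vector, which is fine for four words. At scale, the indexes mentioned above exist precisely to avoid that exhaustive comparison.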
Embedding technologies are not new, and mathematical vectors themselves are certainly pretty ancient. I will later refer to Euclid, a Greek gentleman from olden times. Thus the technologies on which Generative AI stands can be said to stand on the shoulders of giants, some of them ancient. A decade ago, prior to the Christmas 2022 ChatGPT LinkedIn Pokemon-like craze, I ran a Natural Language Processing (NLP) demo that determined sentiment from Twitter feeds to parameterize trade decisions. I also used NLP to scrutinize the Madoff reports, looking for unusual patterns that signified his fraudulence - over-use of adjectives, for example. Under the hood, we made use of the Word2Vec model, which stands for, you guessed it, Word to Vector. This creates dense vector representations that capture semantic relationships by training a neural network to predict words in context. Tools like Ravenpack News Analytics and MarketPsych were predicated on such methods and frequently used - others were and are available, but I recall these best - and were tested, though perhaps not always deployed in production (they didn't always work), on many trading desks. Good times were had by many amidst the NLP hype of a decade ago! But that's the same, or similar, vector thing that goes into your new GenAI-type vector database today, or into a vector-native processing environment, as I used with MATLAB a decade back.
Today, large language models (LLMs) offer "pre-trained" meaning, which you just run via a prompt, with no need to build locally. Then again, they are big, broad, generalized models, way way way bigger and managing many more dimensions than the teeny tiny models I ran a decade ago. You can use them directly, as you probably do with ChatGPT, or, if appropriately tokenized, take the model's output vector embeddings into a vector database. This gives you control to apply and augment with your own data, manage prompts, facilitate additional embeddings for new data, and, when managed well, apply "guardrails" against those hallucinations everyone warned you about on LinkedIn.
The embeddings and stores of meaning do matter, but for the remainder of this blog I want to focus on the searches that expedite meaning, create information, and, ideally, answer hard questions that add value to your organization. I equate such search and “similarity search”-type processes to neurons kicking in, infusing the vector database with proper on-the-fly intelligence. The interesting thing here is that the traditional search and similarity search techniques - or neurons, as I think of them - are not new to finance, or to anyone who has used a search engine, deployed a tool like ElasticSearch, Solr or the Lucene project that underpins them, or used any sort of recommendation engine - think Netflix, Spotify, Amazon.
So let's dive in. Some maths will follow, but hopefully it gets explained simply enough.
As noted, by understanding the similarity between vectors, we understand similarity across the data objects themselves. Similarity measures help to understand relationships, identify patterns, and make informed decisions, for example:
The similarity measure you choose depends on the nature of the data and the specific application at hand. Your data scientists can best advise. Below, I describe three commonly used measures, their strengths and weaknesses, and outline how I see them deployed in financial services. Given my experience, that normally means quantitative finance, capital markets, risk management, and fraud detection. I'm not in any way suggesting you pick up a vector database tomorrow and change all your workflows, but I am trying to illuminate and de-mystify some quite complicated mathematical names to show how, in plain terms, they're sensible, actually quite simple, and pretty commonplace already.
1) Euclidean distance assesses the similarity of two vectors by measuring the straight-line distance between the two vector points. Vectors that are more similar will have a shorter absolute distance between them, while dissimilar vectors have a larger distance between one another. It treats distance as a combination of relative magnitude and direction. When working with vector spaces higher than 2 or 3 dimensions (i.e. more than you can visualize on a regular 3-dimensional plot), the same calculation still applies: it is the "L2 norm" of the difference between the two vectors, and vectors are sometimes normalized to unit length first so that magnitude alone does not dominate the result.
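As a minimal sketch of the calculation described above, Euclidean distance is the square root of the sum of squared element-wise differences. The vectors `p` and `q` are hypothetical values chosen so the arithmetic is easy to check by hand; the function works for any number of dimensions.

```python
import math


def euclidean_distance(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical example vectors, chosen so the differences are 3, 4 and 0.
p = [1.0, 2.0, 3.0]
q = [4.0, 6.0, 3.0]

distance = euclidean_distance(p, q)
print(distance)  # sqrt(3**2 + 4**2 + 0**2) = 5.0
```

On Python 3.8 and later, the standard library's `math.dist(p, q)` computes the same quantity.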
Euclidean distance tends to apply to applications like:
2) The dot product is a simple measure used to see how aligned two vectors are with one another, a bit like a score. It tells us if the vectors point in the same direction, in opposite directions, or are perpendicular to each other. It is calculated by multiplying the corresponding elements of the vectors and adding up the results to get a single scalar number. It lends itself well to applications such as:
3) Cosine similarity measures the similarity of two vectors by using the angle between these two vectors. The magnitude of the vectors themselves does not matter and only the angle is considered in this calculation, so if one vector contains small values and the other contains large values, this will not affect the resulting similarity value.
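A short sketch, again in plain Python with made-up vectors, shows how the dot product and cosine similarity relate: cosine similarity is the dot product divided by the product of the two magnitudes, which is why scaling a vector changes the dot product but not the cosine. The vector names below are hypothetical.

```python
import math


def dot_product(a, b):
    """Multiply corresponding elements and add up the results."""
    return sum(x * y for x, y in zip(a, b))


def magnitude(a):
    """Length (L2 norm) of a vector."""
    return math.sqrt(dot_product(a, a))


def cosine_similarity(a, b):
    """Dot product with both magnitudes divided out; ranges from -1 to 1."""
    denominator = magnitude(a) * magnitude(b)
    if denominator == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_product(a, b) / denominator


# Hypothetical vectors chosen for easy arithmetic.
small = [1.0, 2.0, 2.0]
large = [10.0, 20.0, 20.0]        # same direction, ten times the magnitude
perpendicular = [2.0, -1.0, 0.0]  # at right angles to `small`
opposite = [-1.0, -2.0, -2.0]     # points the other way

print(dot_product(small, small))          # 9.0
print(dot_product(small, large))          # 90.0, magnitude inflates the score
print(cosine_similarity(small, large))    # 1.0, only the angle matters
print(dot_product(small, perpendicular))  # 0.0, perpendicular vectors
print(cosine_similarity(small, opposite)) # -1.0, opposite directions
```

If vectors are normalized to unit length beforehand, the dot product and cosine similarity give the same number, so the cheaper dot product can stand in for the cosine.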
Cosine similarity, with its "similar vectors will likely point in the same direction" view, therefore contrasts nicely with the Euclidean "as-the-crow-flies" distance. It thus applies well to use cases such as:
Now, there's much more to which I will return in a later blog - the role of indexes and index search, and the application of the other types of vectors I alluded to, the sequences of data, like time-series information, that can be operated on for speed, simplicity and efficiency. Some of this, I talk about in 3 GenAI Use Cases for Capital Markets; The Power of the Vector. But I shall return.
A final comment. It's okay to be confused by this stuff. I spoke with two exceptionally qualified quants this week. Both admitted to being completely overwhelmed by the changes taking place right now in our industry with GenAI. I totally feel the same way. On the flip side, the hype cycle obfuscates, and sometimes what lies beneath is shallower than it might appear. I hope my article helps simplify. Let me know.
With thanks to my colleagues Nathan Crone and Neil Kanungo. Their great article, How Vector Similarity Drives Contextual Search, inspired this one. If there are faults in my interpretation, those faults are mine alone, and any opinions expressed are mine alone and not those of my employer. Thanks also to PJ O’Kane for his thoughtful review.
This content is provided by an external author without editing by Finextra. It expresses the views and opinions of the author.