Leveraging Sparse and Dense Vectors for Real-Time Financial Market Analysis

Gayathri Saranathan
16 min read · Sep 2, 2024


In the financial markets domain, real-time data processing is crucial for making informed decisions. Investors, analysts, and trading algorithms depend on timely insights derived from a vast array of data sources, including stock prices, economic indicators, news articles, and social media feeds. As the volume of financial data continues to grow, traditional analysis techniques are being augmented by advanced machine learning methods, particularly those leveraging vector representations of data.

The role of vector search in analyzing financial data: Vector search has emerged as a powerful tool for analyzing financial data. It enables the retrieval of relevant information based on vector similarity rather than exact keyword matches. This approach is particularly effective when dealing with high-dimensional data, such as financial news embeddings or sentiment analysis results. Vector search can uncover hidden relationships in data that might be missed by traditional methods, providing a more nuanced understanding of market dynamics.
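To make "vector similarity" concrete, here is a minimal sketch in plain Python. The three headline vectors are made-up three-dimensional toy values (real embeddings have hundreds of dimensions), and `cosine_similarity` and `rank_by_similarity` are illustrative helper names, not Qdrant APIs.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, items):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy embeddings (hypothetical values chosen for illustration only)
headlines = {
    "earnings beat": [0.9, 0.1, 0.0],
    "product launch": [0.7, 0.6, 0.1],
    "weather report": [0.0, 0.1, 0.9],
}
query = [0.8, 0.3, 0.0]

ranking = rank_by_similarity(query, headlines)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A vector search engine performs the same kind of ranking, but with index structures that avoid comparing the query against every stored vector.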

Introduction to Qdrant’s capabilities for vector search: Qdrant is a high-performance vector search engine that supports both sparse and dense vector searches. It allows users to store, index, and query vectors efficiently, making it ideal for real-time financial market analysis. With Qdrant, you can combine different types of vector representations to perform complex queries that integrate various data sources, enhancing the overall analysis.

Sparse vs. Dense Vectors in Financial Data:

  • Sparse vectors represent data in a format where most of the elements are zero. In the context of financial data, sparse vectors can be created from keyword-based features or sector-specific indicators. For example, a sparse vector could represent the presence or absence of certain keywords in a news article, membership in market indices, or the performance of specific sectors relative to the overall market.
  • Dense vectors, on the other hand, are used to represent data in a continuous, high-dimensional space. They are often generated using machine learning models like word embeddings or sentence transformers. In financial data analysis, dense vectors can be used to encode the meaning of entire news articles, sentiment scores, or other complex features that are not easily captured by sparse vectors.
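As a rough illustration of the difference, the sketch below stores a keyword-count vector in sparse form (only the non-zero positions and their values) and computes a dot product on that form. The vocabulary, the counts, and the helper names `to_sparse` and `sparse_dot` are assumptions made for this example. Qdrant's sparse vectors use the same indices-plus-values layout.

```python
def to_sparse(dense_vector):
    """Keep only non-zero entries as parallel index and value lists."""
    indices = [i for i, v in enumerate(dense_vector) if v != 0]
    values = [dense_vector[i] for i in indices]
    return indices, values

def sparse_dot(a, b):
    """Dot product of two (indices, values) sparse vectors."""
    b_lookup = dict(zip(*b))
    return sum(v * b_lookup.get(i, 0.0) for i, v in zip(*a))

# Hypothetical keyword vocabulary; position i counts occurrences of vocabulary[i]
vocabulary = ["earnings", "launch", "lawsuit", "dividend", "recall", "merger"]
article_counts = [2, 0, 0, 1, 0, 0]
query_counts = [1, 0, 0, 0, 0, 1]

article_sparse = to_sparse(article_counts)
query_sparse = to_sparse(query_counts)
print(article_sparse)                            # non-zero positions and their counts
print(sparse_dot(article_sparse, query_sparse))  # overlap on shared keywords only
```

Only the shared keyword ("earnings") contributes to the score, which is why sparse vectors behave like exact-match features, while dense embeddings can relate texts that share no words at all.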
Workflow for Leveraging Hybrid Vectors through Qdrant for Market Analysis

Combining Sparse and Dense vectors to enhance financial market analysis: Combining sparse and dense vectors allows for a more comprehensive analysis of financial data. Sparse vectors can capture specific, targeted features, while dense vectors can provide a broader, more contextual understanding. By integrating both types of vectors, you can perform more sophisticated queries that take into account both the fine-grained details and the overall context of the data. In this article, we combine stock market data (sparse) with relevant news (dense) and run a hybrid search with Qdrant to retrieve the information most relevant to a query.

Prerequisites:

Tools and Libraries:

Before diving into the implementation, ensure you have the following tools and libraries installed:

  • Python installed on your machine.
  • Basic understanding of open-source LLM inferencing, prompting, and vector stores.
  • Qdrant: copy your instance URL and API key.
  • qdrant-client: Qdrant client for Python.
  • Transformers: Library for creating dense embeddings from text data.
  • Understanding of financial market data sources (e.g., stock prices, news feeds).

Setting the API Keys As Environment Variables:

API keys are unique identifiers that can pose security risks if not managed properly, so developers need to protect them. A common practice is to store these keys in environment variables. Here, we use a Python library called `python-dotenv` to manage the environment variables. It loads key-value pairs from an `.env` file, which keeps the API keys out of the source code, and we show how to access these values in Python programs.

Step 1: Installing the Library

pip install python-dotenv

Step 2: Creating an `.env` File

Create an `.env` file in the root directory of our project. It is a file that stores key-value pairs, representing an environment variable and its value.

GROQ_API_KEY=your-groq-api-key
QDRANT_API_KEY=your-qdrant-api-key
QDRANT_API_URL=your-qdrant-url
COHERE_API_KEY=your-cohere-api-key

Replace `your-groq-api-key`, `your-qdrant-api-key`, `your-qdrant-url`, `your-cohere-api-key` with your actual API keys.

NOTE: It is essential to add your `.env` file to your `.gitignore` file so that it is never committed to version control.

Step 3: Updating `.gitignore`

Open your `.gitignore` file (create one if it doesn’t exist), add `.env` on a new line, and save the file.

Step 4: Using the `.env` File in Python Code

Now that we have set up our `.env` file, we will start using it in our Python code.

from dotenv import load_dotenv
load_dotenv()

In the tutorial, we will see how to use these stored API keys in the `.env` file for our application.
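Once `load_dotenv()` has run, the keys are available through the standard `os` module. The sketch below shows one way to read them and fail early when a key is missing. `require_env` is a hypothetical helper name, and the URL is set inline only so that the snippet runs on its own.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}")
    return value

# In the real application, call load_dotenv() first so the .env values are loaded.
# The value is set inline here only to keep the sketch runnable on its own.
os.environ.setdefault("QDRANT_API_URL", "http://localhost:6333")

qdrant_url = require_env("QDRANT_API_URL")
print(qdrant_url)
```

Failing at startup with the variable name in the message is usually easier to debug than an authentication error raised later by a client library.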

Step 5: Creating a Conda Environment or Python venv, in command prompt/terminal

If Conda:

conda create --name adv_rag
conda activate adv_rag
conda install pip
pip install -r requirements.txt

If Python venv:

python3 -m venv .adv_rag
source .adv_rag/bin/activate
pip install -r requirements.txt

Dataset Preparation:

For a comprehensive analysis, it’s essential to select datasets that provide both numerical and textual data. For instance, we have chosen a combination of:

Historical Stock Prices: Daily or intraday data for various stocks, which can be obtained from financial databases like Yahoo Finance or Alpha Vantage.

Financial News Articles: News articles related to the stock market, which can be sourced from news aggregators like Google News, or specific financial news providers like Bloomberg or Reuters.

Once you have the raw data, the next step is to clean and preprocess it. This includes handling missing values, normalizing numerical data, and preparing text data for vectorization.

For Stock Prices:

  • Handle missing values by forward-filling or removing them.
  • Normalize prices or returns to bring them onto a common scale.
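The two steps above can be sketched in plain Python as follows. The prices are made-up values, and `forward_fill` and `daily_returns` are illustrative helper names. With pandas, `Series.ffill()` and `Series.pct_change()` do the same job.

```python
def forward_fill(values):
    """Replace each None with the most recent earlier value (leading gaps stay None)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def daily_returns(prices):
    """Fractional change between consecutive prices; puts assets on a common scale."""
    return [(curr - prev) / prev for prev, curr in zip(prices, prices[1:])]

# Made-up closing prices with one missing day
closes = [100.0, None, 102.0, 99.96]
filled = forward_fill(closes)
returns = daily_returns(filled)
print(filled)
print([round(r, 4) for r in returns])
```

Returns are often preferred over raw prices because a stock trading at 100 and one trading at 1,000 then live on the same scale.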

For Financial News:

  • Remove stop words and punctuation, and tokenize the text.
  • Optionally, filter out irrelevant articles by keyword, or use Named Entity Recognition (NER) to focus on specific companies or sectors.

After preprocessing, convert the financial news into dense embeddings and the market data into sparse vectors. These representations will later be used for vector search and analysis.
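A minimal text-cleaning sketch is shown here, assuming a tiny hand-written stop-word list (a real pipeline would use a fuller list, for example from NLTK). `preprocess` is an illustrative name. This kind of cleaning matters mostly for keyword-based features; embedding models generally apply their own tokenization to the raw text.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one
STOP_WORDS = {"a", "an", "the", "of", "on", "its", "and", "to", "in", "is"}

def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, and drop stop words."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [token for token in cleaned.split() if token not in STOP_WORDS]

tokens = preprocess("Apple on Monday announced new versions of its MacBook Air!")
print(tokens)
```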

Sparse Vectors for Market Data: Create sparse vectors using key financial indicators such as sector-specific performance, keyword frequency, or other categorical features.

Dense Embeddings for Financial News: Utilize a pre-trained model like BERT or Sentence Transformers to generate dense vectors for each news article.

Proper dataset preparation is crucial for effective analysis. By carefully selecting, cleaning, and transforming your financial data into both sparse and dense vectors, you set the foundation for advanced vector-based searches and analyses. These steps enable the integration of various data types, enhancing the robustness of your financial market insights.

The dataset has been provided in the /data directory in this repository. This dataset has been created for Apple and Microsoft financial data using this code. If you would like to create the dataset from other domains, please follow the code below and make the necessary changes as required.

Obtaining Market Data for Sparse Vectors:

To create financial indicators from your stock data (which includes columns like date, open, high, low, close, adj close, and volume), you can derive a range of commonly used technical indicators that help in analyzing stock performance. These indicators can then be used to generate sparse vectors.

1. Fetch historical stock prices from Yahoo Finance:

import yfinance as yf

2. Define the stock ticker and time period. Here we use data from January 2023 through April 2024:

ticker = 'AAPL'
apple_stock = yf.download(ticker, start='2023-01-01', end='2024-05-01')
print(apple_stock.head())

The data looks like this:

| Date | Open | High | Low | Close | Adj Close | Volume | Company |
| ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: |
| 2023-01-03 | 130.279999 | 130.899994 | 124.169998 | 125.070000 | 123.904617 | 112117500 | Apple |
| 2023-01-04 | 126.889999 | 128.660004 | 125.080002 | 126.360001 | 125.182610 | 89113600 | Apple |
| 2023-01-05 | 127.129997 | 127.769997 | 124.760002 | 125.019997 | 123.855103 | 80962700 | Apple |

3. Calculate Technical Indicators

  • Moving Averages: Simple Moving Average (SMA), Exponential Moving Average (EMA).
  • Relative Strength Index (RSI).
  • Bollinger Bands.
  • Moving Average Convergence Divergence (MACD).
  • On-Balance Volume (OBV).
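For reference, these indicators have the following standard definitions, written here with the commonly used window lengths (20, 12, 26, 9, and 14 periods). $C_t$ denotes the closing price and $V_t$ the volume on day $t$. The RSI below uses a simple rolling mean of gains and losses; other smoothing variants exist.

```latex
\begin{aligned}
\mathrm{SMA}_{n}(t) &= \frac{1}{n}\sum_{i=0}^{n-1} C_{t-i} \\
\mathrm{EMA}_{s}(t) &= \alpha\, C_t + (1-\alpha)\,\mathrm{EMA}_{s}(t-1), \qquad \alpha = \frac{2}{s+1} \\
\mathrm{RS}(t) &= \frac{\text{average gain over 14 periods}}{\text{average loss over 14 periods}}, \qquad
\mathrm{RSI}(t) = 100 - \frac{100}{1+\mathrm{RS}(t)} \\
\mathrm{MACD}(t) &= \mathrm{EMA}_{12}(t) - \mathrm{EMA}_{26}(t), \qquad
\mathrm{Signal}(t) = \mathrm{EMA}_{9}\big(\mathrm{MACD}\big)(t) \\
\mathrm{Band}_{\pm}(t) &= \mathrm{SMA}_{20}(t) \pm 2\,\sigma_{20}(t) \\
\mathrm{OBV}(t) &= \mathrm{OBV}(t-1) + \operatorname{sign}\!\big(C_t - C_{t-1}\big)\, V_t
\end{aligned}
```

Here $\sigma_{20}(t)$ is the standard deviation of the last 20 closing prices.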

Now that you have the indicators, you can create sparse vectors by selecting the most relevant indicators. Here’s an example of how to create sparse vectors:

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from copy import deepcopy

df = deepcopy(apple_stock)

# Simple Moving Average (SMA)
df['SMA_20'] = df['Close'].rolling(window=20).mean()

# Exponential Moving Average (EMA)
df['EMA_12'] = df['Close'].ewm(span=12, adjust=False).mean()
df['EMA_26'] = df['Close'].ewm(span=26, adjust=False).mean()

# Relative Strength Index (RSI)
delta = df['Close'].diff(1)
gain = np.where(delta > 0, delta, 0)
loss = np.where(delta < 0, -delta, 0)
# Reuse df's index so the result aligns with df when assigned below;
# without it, a date-indexed df would receive an all-NaN RSI column
avg_gain = pd.Series(gain, index=df.index).rolling(window=14).mean()
avg_loss = pd.Series(loss, index=df.index).rolling(window=14).mean()
rs = avg_gain / avg_loss
df['RSI'] = 100 - (100 / (1 + rs))

# MACD
df['MACD'] = df['EMA_12'] - df['EMA_26']
df['Signal_Line'] = df['MACD'].ewm(span=9, adjust=False).mean()

# Bollinger Bands
df['Middle_Band'] = df['Close'].rolling(window=20).mean()
df['Upper_Band'] = df['Middle_Band'] + 2*df['Close'].rolling(window=20).std()
df['Lower_Band'] = df['Middle_Band'] - 2*df['Close'].rolling(window=20).std()

# On-Balance Volume (OBV)
df['OBV'] = (np.sign(df['Close'].diff()) * df['Volume']).fillna(0).cumsum()

# Volume Moving Average
df['Volume_MA'] = df['Volume'].rolling(window=20).mean()

# Normalize selected indicators
df['SMA_20_norm'] = df['SMA_20'] / df['Close']
df['RSI_norm'] = df['RSI'] / 100
df['MACD_norm'] = df['MACD'] / df['Close']

# Create the indicator vectors (one row per trading day, four indicators each)
sparse_vectors = df[['SMA_20_norm', 'RSI_norm', 'MACD_norm', 'OBV']].fillna(0).values
print("Sparse Vectors:\n", sparse_vectors.shape)

# Store the same values in SciPy's compressed sparse row format
sparse_matrix = csr_matrix(sparse_vectors)
# Show sparse matrix
print(sparse_matrix)

Obtaining Financial News for Dense Vectors:

Creating dense vectors from your news articles involves several steps, primarily using a pre-trained language model to generate embeddings (dense vectors) from the text. Here’s how you can do it using the sentence-transformers library in Python, which provides an easy way to convert text into dense vectors. The data was collected by scraping news from Reuters, CNBC, Bloomberg, Forbes, and Business Today.

1. Here’s a script that uses newspaper3k to scrape recent news articles related to Apple:

import newspaper
from newspaper import Article
import pandas as pd
from datetime import datetime

# Define news sources to scrape
news_sources = [
    "https://www.reuters.com/technology/",
    "https://www.cnbc.com/technology/",
    "https://www.bloomberg.com/technology",
    "https://www.forbes.com/sites/technology/",
    "https://www.businesstoday.in/latest/economy/",
]

# Keywords to filter articles
keywords = ["Apple", "iPhone", "Apple Vision Pro", "AAPL", "MacBook", "iPad"]

# Date range for filtering
start_date = datetime(2022, 8, 1)
end_date = datetime(2024, 5, 1)

# Function to collect articles
def collect_articles(news_sources, keywords, start_date, end_date):
    articles = []

    for source in news_sources:
        paper = newspaper.build(source, memoize_articles=False)
        for article in paper.articles:
            try:
                article.download()
                article.parse()

                # Skip articles without a publication date
                if not article.publish_date:
                    continue
                # Drop timezone info so the date compares cleanly with the naive bounds
                published = article.publish_date.replace(tzinfo=None)

                # Check if the article's publication date is within the desired range
                if start_date <= published <= end_date:
                    # Check if the article contains any of the keywords
                    if any(keyword in article.text for keyword in keywords):
                        articles.append({
                            "title": article.title,
                            "date": published,
                            "text": article.text,
                            "source": source})
            except Exception as e:
                print(f"Failed to download article: {e}")
    return articles

# Collect articles
articles = collect_articles(news_sources, keywords, start_date, end_date)

# Convert to DataFrame for easier handling
df = pd.DataFrame(articles)
df.dropna(subset=['date'], inplace=True)  # Drop articles without a publish date
df.to_csv('data/apple_financial_news.csv', index=False)

After obtaining the data, it looks like this:

| title | date | text | source |
| ----: | ----: | ----: | ----: |
| Apple announces new MacBook Air laptops with i... | 2024-03-04 | Apple on Monday announced new versions of its ... | https://www.cnbc.com/technology/ |
| Here's what Meta CEO Mark Zuckerberg has to sa... | 2024-02-14 | Meta CEO Mark Zuckerberg demonstrates an Oculu... | https://www.cnbc.com/technology/ |
| Apple's Vision Pro virtual reality headset lau... | 2024-02-02 | The first customer walks out of the Apple Stor... | https://www.cnbc.com/technology/ |
| Apple Vision Pro review: This is the future of... | 2024-01-30 | In this article AAPL Follow your favorite stoc... | https://www.cnbc.com/technology/ |
| Apple $3,499 Vision Pro headset now available ... | 2024-01-19 | Preorders for Apple 's $3,499 Vision Pro heads... | https://www.cnbc.com/technology/ |

To create dense vectors, you can use a pre-trained model like paraphrase-MiniLM-L6-v2 from the sentence-transformers library:

## Creating Dense Vectors for News Articles
import pandas as pd
from sentence_transformers import SentenceTransformer
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

# Load the scraped news saved in the previous step
apple_news = pd.read_csv('data/apple_financial_news.csv')

# Load the pre-trained model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
news_articles = list(apple_news.text.values)

# Generate dense vectors for each article
dense_vectors = model.encode(news_articles)
print(dense_vectors.shape)

Here’s a step-wise implementation for real-time financial market analysis using the Qdrant Vector Database.

Step-by-Step Implementation

Step 1: Setting Up Qdrant for Financial Data

  • Installation and Configuration of Qdrant: Docker is the recommended method for setting up Qdrant. Run the following in bash:

docker pull qdrant/qdrant
docker run -p 6333:6333 qdrant/qdrant

  • With Qdrant running, connect to your local or remote instance using the qdrant-client library in Python:

from qdrant_client import QdrantClient

# Connect to the Qdrant service
client = QdrantClient("http://localhost:6333")

Step 2: Implementing Real-Time Search with Sparse and Dense Vectors

Configuring Qdrant for handling both sparse and dense vectors: Qdrant supports the creation of collections that can store both sparse and dense vectors. Configuring Qdrant involves defining these collections and setting up indexing strategies.

Creating a collection and storing the sparse vector:

from qdrant_client.models import Distance, PointStruct, VectorParams

# Create a collection named "financial_data"
client.recreate_collection(
    collection_name="financial_data",
    vectors_config=VectorParams(
        size=sparse_vectors.shape[1],  # Dimensionality: 4 (SMA_20_norm, RSI_norm, MACD_norm, OBV)
        distance=Distance.COSINE,      # Distance metric (can be COSINE, EUCLID, etc.)
    ),
)

# Note: df here is the stock indicator DataFrame, with 'Date' available as a column.
# For a DataFrame that comes straight from yfinance, call df = df.reset_index() first.
for i, vector in enumerate(sparse_vectors):
    client.upsert(
        collection_name="financial_data",
        points=[
            PointStruct(
                id=i + 1,
                vector=vector.tolist(),
                payload={
                    "date": str(df['Date'].iloc[i]),
                    "RSI": float(df["RSI_norm"].iloc[i]),
                },
            )
        ],
    )

Creating a collection and storing the dense vector:

from qdrant_client.models import Distance, PointStruct, VectorParams

# Create a collection named "news_sentiment"
client.recreate_collection(
    collection_name="news_sentiment",
    vectors_config=VectorParams(
        size=dense_vectors.shape[1],  # Dimensionality of the sentence embedding
        distance=Distance.COSINE,     # Distance metric (can be COSINE, EUCLID, etc.)
    ),
)

for i, vector in enumerate(dense_vectors):
    client.upsert(
        collection_name="news_sentiment",
        points=[
            PointStruct(
                id=i + 1,
                vector=vector.tolist(),
                payload={
                    "title": apple_news.title.iloc[i],
                    "date": str(apple_news.date.iloc[i]),
                },
            )
        ],
    )

Search, Retrieval and Result Interpretation

After storing the sparse and dense vectors, you can search and retrieve the information with the following code:

query_vector = np.array([1, 1, 0, 0])
results_sparse = client.search(
collection_name="financial_data",
query_vector=query_vector.tolist(),
limit=4, )
print("Market Indicators Results:", results_sparse)

The retrieved information from Sparse looks like this:

Market Indicators Results: [ScoredPoint(id=326, version=325, score=2.0630735e-09, payload={'date': '2024-04-19', 'RSI': 0.4106571959163894}, vector=None, shard_key=None, order_value=None), ScoredPoint(id=20, version=19, score=1.9640336e-09, payload={'date': '2023-01-31', 'RSI': 0.8029493541256731}, vector=None, shard_key=None, order_value=None), ScoredPoint(id=327, version=326, score=1.916854e-09, payload={'date': '2024-04-22', 'RSI': 0.44604314177506993}, vector=None, shard_key=None, order_value=None), ScoredPoint(id=325, version=324, score=1.8119324e-09, payload={'date': '2024-04-18', 'RSI': 0.4194484246240115}, vector=None, shard_key=None, order_value=None)]

Similarly, dense vectors can be queried to analyze financial news sentiment.

query_vector = model.encode("Apple stock rises due to new product launch")
results_dense = client.search(
collection_name="news_sentiment",
query_vector=query_vector.tolist(),
limit=4)
print("News Sentiment Results:", results_dense)

The retrieved information searched through the dense vector looks like:

News Sentiment Results: [ScoredPoint(id=1, version=0, score=0.5135529, payload={'title': 'Apple announces new MacBook Air laptops with its latest M3 chip', 'date': '2024-03-04'}, vector=None, shard_key=None, order_value=None), ScoredPoint(id=6, version=5, score=0.4692673, payload={'title': 'Apple reportedly plans big overhaul to iPad family to make it less confusing', 'date': '2023-12-11'}, vector=None, shard_key=None, order_value=None), ScoredPoint(id=8, version=7, score=0.3633366, payload={'title': 'Apple iPhone 14 gets another free year of satellite Emergency SOS', 'date': '2023-11-15'}, vector=None, shard_key=None, order_value=None), ScoredPoint(id=4, version=3, score=0.35279948, payload={'title': 'Apple Vision Pro review: This is the future of computing and entertainment', 'date': '2024-01-30'}, vector=None, shard_key=None, order_value=None)]

Merging Dense and Sparse Vectors

To combine insights from both sparse and dense vectors, you can merge or weigh the results based on relevance, score, or context.

# Combine results based on relevance or score
def combine_results(results_sparse, results_dense):
    combined_results = {
        "market_indicators": results_sparse,  # From sparse vector query
        "news_sentiment": results_dense       # From dense vector query
    }
    # print("Combined Results:", combined_results)
    return combined_results

def analyze_combined_results(combined_results):
    """
    Analyzes the combined results from sparse and dense vector queries.

    Parameters:
        combined_results (dict): Dictionary containing search results from sparse and dense vector queries.
            Example structure:
            {
                "market_indicators": [ … ],
                "news_sentiment": [ … ]
            }

    Returns:
        dict: A summary of the analysis, including key insights.
    """
    # Extract results
    market_results = combined_results.get('market_indicators', [])
    news_results = combined_results.get('news_sentiment', [])

    # Analyze market indicators
    market_insights = []
    for result in market_results:
        market_insights.append({
            "date": result.payload.get('date', "N/A"),
            "score": result.score,              # Relevance score
            "id": result.id,
            "indicator_vector": result.vector,  # The sparse vector itself
            "RSI": result.payload.get("RSI", "N/A")
        })

    # Analyze news sentiment
    news_insights = []
    for result in news_results:
        news_insights.append({
            "headline": result.payload.get('title', 'N/A'),
            "score": result.score,              # Relevance score
            "id": result.id,
            "date": result.payload.get('date', 'N/A'),
            "sentiment_vector": result.vector   # The dense vector itself
        })

    # Combine insights for a summary (assumes both result lists are non-empty)
    analysis_summary = {
        "market_insights": market_insights,
        "news_insights": news_insights,
        "combined_summary": (
            f"Top market indicator on {market_insights[0]['date']} with relevance score {market_insights[0]['score']} with an RSI of about {market_insights[0]['RSI']}."
            f" Associated news headline: '{news_insights[0]['headline']}' dated {news_insights[0]['date']} with sentiment score {news_insights[0]['score']}."
        ),
    }
    return analysis_summary
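If you want to weigh the two result sets rather than only place them side by side, one option is to rescale each set of scores and blend them. The sketch below is only an illustration under stated assumptions: the per-date scores are made up, both searches are assumed to return items that can be keyed by date, and `min_max`, `weighted_merge`, and the 0.4 weight are hypothetical choices. Rescaling matters because raw scores from different collections are not directly comparable (compare the roughly 1e-9 market scores with the roughly 0.5 news scores in the outputs above).

```python
def min_max(scores):
    """Rescale a {key: score} mapping to the 0..1 range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (s - lo) / (hi - lo) for key, s in scores.items()}

def weighted_merge(market_scores, news_scores, market_weight=0.4):
    """Blend two score mappings; a key missing from one side contributes 0 there."""
    market, news = min_max(market_scores), min_max(news_scores)
    keys = set(market) | set(news)
    blended = {
        key: market_weight * market.get(key, 0.0)
        + (1 - market_weight) * news.get(key, 0.0)
        for key in keys
    }
    return sorted(blended.items(), key=lambda pair: pair[1], reverse=True)

# Hypothetical per-date relevance scores from the two searches
market_scores = {"2024-03-04": 0.2, "2024-04-19": 0.9, "2024-04-22": 0.5}
news_scores = {"2024-03-04": 0.51, "2024-04-19": 0.35}

merged = weighted_merge(market_scores, news_scores)
print(merged)
```

Changing `market_weight` shifts how much the technical indicators count relative to the news signal.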

Using the following code, we can generate real-time trading signals by querying both sparse and dense vectors.

def generate_trading_signal(stock_query, sentiment_query):
    market_results = client.search(collection_name="financial_data", query_vector=stock_query.tolist(), limit=5)
    sentiment_results = client.search(collection_name="news_sentiment", query_vector=sentiment_query.tolist(), limit=5)
    # Combine or analyze results to generate a signal
    signal = combine_results(market_results, sentiment_results)
    analysis_summary = analyze_combined_results(signal)
    return analysis_summary

# Example use
stock_query = np.array([1, 1, 0, 0])
sentiment_query = model.encode("Apple stock rises due to new product launch")
signal = generate_trading_signal(stock_query, sentiment_query)
print("Generated Trading Signal:", signal.get("combined_summary"))

The combined results obtained from this is given below:

Generated Trading Signal: Top market indicator on 2024-04-19 with relevance score 2.0630735e-09 with an RSI of about 0.410. Associated news headline: 'Apple announces new MacBook Air laptops with its latest M3 chip' dated 2024-03-04 with sentiment score 0.5135529.

The top market match, dated 2024-04-19, has a very low relevance score, but it falls in a period when the RSI was neutral (about 0.41) and follows positive news about Apple’s product launch. This suggests that while the RSI alone might not indicate a strong trading opportunity, the combination with positive news sentiment could influence the stock’s future movement. This approach supports more informed trading decisions by showing how market indicators and external textual factors such as news interact. For example, the news about the M3-based MacBook Air, published in March 2024, precedes the April 2024 market data surfaced here, although this pairing alone does not establish that the news caused a price change.

Both kinds of vectors have now been stored in Qdrant, allowing you to perform complex queries that combine financial indicators with news sentiment and provide comprehensive insights for real-time financial market analysis.
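One simple way to relate the two result sets is by date. The sketch below pairs each news item with the market dates that fall within a fixed window after it. The 60-day window and the helper name `pair_news_with_market` are assumptions for illustration, and the dates are taken from the example outputs above. A temporal pairing like this only surfaces candidates for inspection; it does not show that the news caused a price move.

```python
from datetime import date

def pair_news_with_market(news, market, max_lag_days=60):
    """For each news item, list market dates that fall 0..max_lag_days after it."""
    pairs = []
    for headline, news_day in news:
        followers = [
            day for day in market
            if 0 <= (day - news_day).days <= max_lag_days
        ]
        pairs.append((headline, sorted(followers)))
    return pairs

news = [("Apple announces new MacBook Air laptops", date(2024, 3, 4))]
market = [date(2024, 4, 19), date(2023, 1, 31), date(2024, 4, 22)]

pairs = pair_news_with_market(news, market)
print(pairs)
```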

Step 3: Optimizing Performance for Real-Time Applications

Memory and Processing Efficiency:

  • When dealing with large datasets, it’s crucial to optimize memory usage. This can be done by reducing the dimensionality of vectors, batching queries, or using memory-efficient data structures.
  • Best practices for ensuring real-time performance:
    • Use Indexing: Proper indexing ensures fast retrieval.
    • Parallel Processing: Use parallel processing for handling multiple queries simultaneously.
    • Caching: Cache frequently accessed results to minimize redundant processing.
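Two of these ideas, batching and caching, can be sketched with the standard library alone. `batched` and `cached_search` are hypothetical helper names, and `cached_search` is a stand-in for a real search call. With Qdrant, each batch could be sent in a single `upsert` call instead of one call per point.

```python
from functools import lru_cache
from itertools import islice

def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch

call_log = []

@lru_cache(maxsize=1024)
def cached_search(query_text):
    """Stand-in for an expensive search call; repeated queries hit the cache."""
    call_log.append(query_text)  # records only real (uncached) executions
    return f"results for {query_text}"

batches = list(batched(range(7), 3))
cached_search("AAPL earnings")
cached_search("AAPL earnings")  # served from the cache, not executed again
print(batches)
print(len(call_log))
```

For real-time market data, cached results go stale quickly, so a cache like this would typically need a short expiry or explicit invalidation when new data is ingested.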

Scaling the System for Large-Scale Financial Data:

  • Leverage cloud-native features such as auto-scaling, load balancing, and distributed storage.

Step 4: Advanced Query Techniques

Hybrid Search in Financial Analytics:

Qdrant supports hybrid search, which combines both sparse and dense vectors for nuanced insights.

The provided code implements a hybrid search system using Qdrant, which combines both sparse and dense vector representations for enhanced search capabilities. The setup is particularly useful in scenarios like financial news analysis, where both the exact match (sparse vectors) and semantic similarity (dense vectors) are important for retrieving relevant information.

  • Model Initialization:

client.set_model("sentence-transformers/all-MiniLM-L6-v2"): Initializes the dense vector model using all-MiniLM-L6-v2, a lightweight model from the Sentence Transformers library that creates dense embeddings.

client.set_sparse_model("prithivida/Splade_PP_en_v1"): Initializes the sparse vector model using Splade_PP_en_v1, a model that provides sparse representations based on specific word importance.

  1. Collection Creation:
  • The code checks if the collection “hybrid_search” exists in Qdrant. If it doesn’t, it creates one using create_collection. Both dense and sparse vector configurations are specified here, enabling the hybrid search.

2. Data Ingestion:

  • The financial news data (from the apple_news dataset) is processed by extracting the text for vectorization and associated metadata (title, text, and date).
  • The client.add() method ingests the documents into the “hybrid_search” collection.

from tqdm import tqdm

client.set_model("sentence-transformers/all-MiniLM-L6-v2")
# comment this line to use dense vectors only
client.set_sparse_model("prithivida/Splade_PP_en_v1")

if not client.collection_exists("hybrid_search"):
    client.create_collection(
        collection_name="hybrid_search",
        vectors_config=client.get_fastembed_vector_params(),
        # comment this line to use dense vectors only
        sparse_vectors_config=client.get_fastembed_sparse_vector_params(),
    )

documents = list(apple_news["text"].values)
# One metadata dict per document, stored alongside its vectors
metadata = apple_news[["title", "text", "date"]].to_dict(orient="records")

client.add(
    collection_name="hybrid_search",
    documents=documents,
    metadata=metadata,
    parallel=0,  # Use all available CPU cores to encode data.
    ids=tqdm(range(len(documents))),
)

3. Hybrid Searcher Class:

  • The HybridSearcher class is defined to encapsulate the search logic, which can be found in hybrid_search.py file. It initializes the Qdrant client, sets up both the dense and sparse models, and defines a search method to query the collection.
  • The search method performs a query on the collection using the input text and returns the top 5 results with their metadata (e.g., title, text, date).

# hybrid_search.py
from qdrant_client import QdrantClient


class HybridSearcher:
    DENSE_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
    SPARSE_MODEL = "prithivida/Splade_PP_en_v1"

    def __init__(self, collection_name):
        self.collection_name = collection_name
        # initialize Qdrant client
        self.qdrant_client = QdrantClient("http://localhost:6333")
        self.qdrant_client.set_model(self.DENSE_MODEL)
        # comment this line to use dense vectors only
        self.qdrant_client.set_sparse_model(self.SPARSE_MODEL)

    def search(self, text: str):
        search_result = self.qdrant_client.query(
            collection_name=self.collection_name,
            query_text=text,
            query_filter=None,  # If you don't want any filters for now
            limit=5,            # Return the 5 closest results
        )
        # `search_result` contains found vector ids with similarity scores
        # along with the stored payload
        # Select and return metadata
        metadata = [hit.metadata for hit in search_result]
        return metadata

4. FastAPI Integration:

  • A FastAPI application is created to expose the search functionality through an API. The search_db function accepts a query string and returns the search results in JSON format.

from fastapi import FastAPI
# The file where HybridSearcher is stored
from hybrid_search import HybridSearcher

app = FastAPI()

# Create a hybrid searcher instance
hybrid_searcher = HybridSearcher(collection_name="hybrid_search")

# Register the function as a GET endpoint (the route path is an example choice)
@app.get("/api/search")
def search_db(q: str):
    return {"result": hybrid_searcher.search(text=q)}

# Direct call for a quick check without starting the server
search_db("Apple is performing well")

Returns 5 relevant documents based on the search query.

Reranking and Fusion Strategies:

  • Rerank search results based on custom criteria such as time decay, relevance, or financial impact.
  • Using Reciprocal Rank Fusion (RRF) for combining diverse data sources.
  • RRF is a technique to combine different ranking results into a single, more accurate ranking.
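In its usual formulation, RRF scores each item $d$ by summing the reciprocals of its ranks across the ranked lists $R$ being fused, with a smoothing constant $k$ (60 is a commonly used value):

```latex
\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + r(d)}
```

Here $r(d)$ is the 1-based position of $d$ in ranking $r$, and lists that do not contain $d$ contribute nothing. For example, with $k = 60$, an item ranked 1st in one list and 3rd in another scores $1/61 + 1/63 \approx 0.0323$.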

This could be implemented using the following code:


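from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse ranked lists: each item scores the sum of 1 / (k + rank) over the lists containing it."""
    fused = defaultdict(float)
    for results in result_lists:
        for rank, result in enumerate(results, start=1):
            fused[result.id] += 1 / (k + rank)
    # Highest fused score first
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Fusion is meaningful only when the lists rank the same items by id
# (for example, a sparse and a dense search over one collection)
rrf_ranking = reciprocal_rank_fusion([results_sparse, results_dense])
print(rrf_ranking)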
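Time decay, the first reranking criterion mentioned above, can be sketched as an exponential discount on the relevance score. The 30-day half-life, the reference date, and the helper names are assumptions for illustration. The titles, scores, and dates are shortened and rounded from the example news results earlier in this article.

```python
import math
from datetime import date

def time_decayed_score(score, published, today, half_life_days=30.0):
    """Halve a relevance score for every half_life_days of age."""
    age_days = (today - published).days
    return score * math.pow(0.5, age_days / half_life_days)

def rerank_by_recency(results, today, half_life_days=30.0):
    """results: list of (title, score, published_date) tuples."""
    rescored = [
        (title, time_decayed_score(score, published, today, half_life_days))
        for title, score, published in results
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

results = [
    ("MacBook Air with M3 chip", 0.51, date(2024, 3, 4)),
    ("iPad lineup overhaul", 0.47, date(2023, 12, 11)),
    ("Vision Pro review", 0.35, date(2024, 1, 30)),
]

reranked = rerank_by_recency(results, today=date(2024, 4, 3))
for title, score in reranked:
    print(f"{title}: {score:.4f}")
```

With this half-life, the more recent Vision Pro article moves ahead of the older iPad article even though its raw relevance score is lower.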

Conclusion

In this article, we explored how combining sparse and dense vectors can significantly enhance financial market analytics. Sparse vectors, often representing technical indicators (like moving averages or volume), allow for a detailed analysis of historical stock data, focusing on quantifiable metrics. On the other hand, dense vectors, created using embeddings (e.g., from financial news articles or sentiment analysis), capture semantic information that can provide deeper insights into market movements influenced by external factors like news and macroeconomic trends.

By leveraging both sparse and dense vectors, we can create a hybrid model that provides a more holistic view of market dynamics. This dual approach helps bridge the gap between purely technical and sentiment-based analysis, improving decision-making and market prediction accuracy.

At the core of our approach is Qdrant, a powerful vector search engine optimized for both sparse and dense vector queries. Qdrant efficiently indexes and searches vectors, making it an ideal choice for real-time applications like financial market analytics. Its hybrid search capabilities allow us to combine multiple sources of data into a single query, ensuring that relevant insights are drawn from both historical and semantic contexts.

Next Steps and Further Exploration:

There are several promising directions in which this research and application can be expanded:

  1. Incorporating Alternative Data Sources:
  • Expanding beyond stock data and news articles, you could introduce additional data sources, such as social media trends (e.g., Twitter sentiment), macroeconomic indicators (inflation, interest rates), and geopolitical events, to enhance the predictive power of your analytics.
  • For example, including environmental, social, and governance (ESG) factors can provide insights into long-term market stability and ethical investment strategies.

2. Advanced Vector Fusion Techniques:

  • As you experiment further with Qdrant’s features, explore advanced query strategies like reciprocal rank fusion (RRF) or custom fusion methods that dynamically adjust the importance of different data sources based on market conditions.
  • Investigating Qdrant’s upcoming or advanced features, such as distributed indexing or custom scoring methods, can also enhance your system’s scalability and performance for larger datasets.

By continuously experimenting with these advanced features and incorporating new techniques, you can further refine the system and explore untapped opportunities in financial market analytics.

Appendix

Code Samples:

  • Complete Python code for setting up Qdrant with financial data can be found in this GitHub repository.



Gayathri Saranathan

AI Researcher @ Hewlett Packard Labs | Foundation Model Research | Meta & Active Learning