How can we guide ChatGPT to generate financial factors, using volume-price divergence in FinRL as an example?
Factors are an important issue in financial markets. However, calculating these factors requires a significant amount of effort. ChatGPT is a large language model that can help us calculate factors based on our ideas, such as volume-price divergence. Additionally, ChatGPT can write comments for algorithms.
While it may not be possible to use ChatGPT to develop factors, we can assign more menial tasks such as writing code to ChatGPT. In this article, we will encourage ChatGPT to write code for the volume-price divergence factor and refine it based on our instructions. The key to guiding ChatGPT is : Let’s think step by step.
The outline of this article is as follows.
Part 1: Guide ChatGPT to write a program that utilizes multiprocessing and tqdm. This program should open multiple processes, with each process displaying a progress bar using tqdm. All of these progress bars should update simultaneously.
Part 2: Guide ChatGPT to write code for calculating the volume-price divergence factor. First, we’ll start by guiding ChatGPT to write a conventional volume-price divergence factor. Then, we’ll guide ChatGPT to write a modified version of the factor that utilizes non-linear calculations.
The conversation usually has three stages: question (warmup), question-again, and end.
Part 1: Guide ChatGPT to write a program that utilizes multiprocessing and tqdm.
The following is the conversation .
I: In Python, I want to use multiprocessing to download data simultaneously and display progress bar by ‘tqdm’. How to write the code?
ChatGPT:
Sure, here’s an example code that uses multiprocessing
to download data from a given URL in parallel and displays a progress bar using tqdm
:
import requests
from multiprocessing import Pool
from tqdm import tqdm
def download(url):
# Define block size for reading data in chunks
block_size = 1024
# Send a GET request to the URL and open a file with the same name
# as the last component of the URL to write the response content to
with requests.get(url, stream=True) as response, \
open(url.split('/')[-1], 'wb') as file:
# Retrieve the total file size from the response headers
file_size = int(response.headers.get('content-length', 0))
# Initialize a tqdm progress bar with the total file size and start downloading
with tqdm(total=file_size, unit='B', unit_scale=True, desc=url.split('/')[-1]) as progress_bar:
for chunk in response.iter_content(chunk_size=block_size):
if chunk:
# Write the chunk to the file and update the progress bar
file.write(chunk)
progress_bar.update(len(chunk))
if __name__ == '__main__':
# Define URL for data file
url = 'https://url.to/data_file'
# Create a pool of processes with two workers
with Pool(processes=2) as pool:
# Map download function to URL in parallel and use tqdm to track progress
for _ in tqdm(pool.imap_unordered(download, [url, url]), total=2, desc='Overall Progress'):
pass
In this code, the download
function is defined to download a data file from a given URL using the requests
module and save it to a file with the same name as the URL's last path component. The function also initializes a tqdm
progress bar with the total file size and updates the progress bar for each downloaded chunk.
The Pool
class from the multiprocessing
module is used to create a pool of two processes, and the imap_unordered
method is used to apply the download
function to the URL in parallel. The tqdm
function is used to track the overall progress of the download task by wrapping around the pool.imap_unordered
method. Since there are only two download tasks running in parallel, the total number of iterations for tqdm
is set to 2.
I: In the download function, please add “time.sleep(0.05)” to simulate the download process. Could you please revise the code?
ChatGPT:
Sure, here’s the modified code that adds a sleep time of 0.05 seconds in the download
function to simulate the download process:
import requests
import time
from multiprocessing import Pool
from tqdm import tqdm
def download(url):
# Define block size for reading data in chunks
block_size = 1024
# Send a GET request to the URL and open a file with the same name
# as the last component of the URL to write the response content to
with requests.get(url, stream=True) as response, \
open(url.split('/')[-1], 'wb') as file:
# Retrieve the total file size from the response headers
file_size = int(response.headers.get('content-length', 0))
# Initialize a tqdm progress bar with the total file size and start downloading
with tqdm(total=file_size, unit='B', unit_scale=True, desc=url.split('/')[-1]) as progress_bar:
for chunk in response.iter_content(chunk_size=block_size):
if chunk:
# Write the chunk to the file and update the progress bar
file.write(chunk)
progress_bar.update(len(chunk))
# Simulate download time by sleeping for 0.05 seconds
time.sleep(0.05)
if __name__ == '__main__':
# Define URL for data file
url = 'https://url.to/data_file'
# Create a pool of processes with two workers
with Pool(processes=2) as pool:
# Map download function to URL in parallel and use tqdm to track progress
for _ in tqdm(pool.imap_unordered(download, [url, url]), total=2, desc='Overall Progress'):
pass
In this code, the time
module is imported, and a sleep time of 0.05 seconds is added in the download
function after each chunk is downloaded to simulate the download process.
I run the code in PyCharm, and the fig shows the result.
I: You used two urls and they are the same. However, I want to use three different urls. Could you please revise the code?
ChatGPT:
Sure, here’s the modified code that downloads three different files from three different URLs simultaneously and displays progress bars using tqdm
:
import requests
import time
from multiprocessing import Pool
from tqdm import tqdm
def download(url):
# Define block size for reading data in chunks
block_size = 1024
# Send a GET request to the URL and open a file with the same name
# as the last component of the URL to write the response content to
with requests.get(url, stream=True) as response, \
open(url.split('/')[-1], 'wb') as file:
# Retrieve the total file size from the response headers
file_size = int(response.headers.get('content-length', 0))
# Initialize a tqdm progress bar with the total file size and start downloading
with tqdm(total=file_size, unit='B', unit_scale=True, desc=url.split('/')[-1]) as progress_bar:
for chunk in response.iter_content(chunk_size=block_size):
if chunk:
# Write the chunk to the file and update the progress bar
file.write(chunk)
progress_bar.update(len(chunk))
# Simulate download time by sleeping for 0.05 seconds
time.sleep(0.05)
if __name__ == '__main__':
# Define URLs for data files
urls = [
'https://url.to/data_file1',
'https://url.to/data_file2',
'https://url.to/data_file3'
]
# Create a pool of processes with one worker for each URL
with Pool(processes=len(urls)) as pool:
# Map download function to URLs in parallel and use tqdm to track progress
for _ in tqdm(pool.imap_unordered(download, urls), total=len(urls), desc='Overall Progress'):
pass
In this code, the urls
list is defined with three different URLs for three different data files. The Pool
is created with len(urls)
workers to process each download in parallel. The download
function is mapped to the URLs using pool.imap_unordered
to download the data files in parallel, and tqdm
is used to display progress bars for each download.
I run the code in PyCharm, and the fig shows the result.
ChatGPT uses auto-regression to generate models. As discussed in “Large Language Models are Zero-Shot Reasoners” , guiding ChatGPT step-by-step can improve the quality of its responses. Therefore, we’ll use the guiding principle of “Let’s think step by step” when using ChatGPT. This means we need to describe our thinking process when providing input to ChatGPT to achieve better results.
Part 2: Guide ChatGPT to write code for calculating the volume-price divergence factor.
First, warmup. We discuss financial factors, such as volume-price divergence, with ChatGPT. However, we don’t let ChatGPT write the code directly. Instead, we first discuss the topic with ChatGPT, and then allow it to generate the code based on our discussion.
The following is the conversation .
I: In finance, technical factors are important. Can you describe five factors, write their equations using Matlab, and write the Python codes? Please make sure that the equations are right.
ChatGPT:
Sure, here are five technical factors in finance, their equations in MATLAB, and example Python code for each:
- Moving Average Convergence Divergence (MACD)
MATLAB Equation:
[MACD,Signal] = macd(Close,12,26,9)
Python Code:
import pandas as pd
import talib
def calculate_macd(prices):
macd, signal, hist = talib.MACD(prices, fastperiod=12, slowperiod=26, signalperiod=9)
return macd, signal
2. Relative Strength Index (RSI)
MATLAB Equation:
RSI = 100 — (100 / (1 + RS))
RS = (average gain over N periods) / (average loss over N periods)
Python Code:
import numpy as np
import talib
def calculate_rsi(prices, n=14):
deltas = np.diff(prices)
seed = deltas[:n+1]
up = seed[seed >= 0].sum()/n
down = -seed[seed < 0].sum()/n
rs = up / down
rsi = talib.RSI(prices, timeperiod=n)
return rsi
3. Bollinger Bands
MATLAB Equation:
[upperBand,middleBand,lowerBand] = bollinger(Close,20,2)
Python Code:
import talib
def calculate_bollinger_bands(prices, n=20, k=2):
upper_band, middle_band, lower_band = talib.BBANDS(prices, timeperiod=n, nbdevup=k, nbdevdn=k)
return upper_band, middle_band, lower_band
4. On-Balance Volume (OBV)
MATLAB Equation:
OBV = obv(Close,Volume)
Python Code:
import pandas as pd
import talib
def calculate_obv(prices, volume):
obv = talib.OBV(prices, volume)
return obv
5. Stochastic Oscillator
MATLAB Equation:
[Stoch,StochMA] = stoch(High,Low,Close)
Python Code:
import talib
def calculate_stochastic(prices, high, low):
slowk, slowd = talib.STOCH(high, low, prices, fastk_period=5, slowk_period=3, slowd_period=3)
return slowk, slowd
Financial reinforcement learning library FinRL uses TAlib and StockStats to calculating factors. We can copy some factors to ChatGPT, and then guide it write new factors, which may be better. The new factors can be test in FinRL.
Second, encourage ChatGPT to be a professional financial factor developer, and guide it to design and write new factors.
The following is the conversation.
I: You are a professional developer for financial factors. Could you please develop a factor with high sharp ratio and good backtesting performance. Please write a function for the factor.
ChatGPT:
Sure, here’s an example of a factor that uses a combination of moving averages to identify trends in the price data and a momentum indicator to capture the strength of the trend:
import numpy as np
def moving_average(data, window):
weights = np.repeat(1.0, window) / window
return np.convolve(data, weights, mode='valid')
def momentum(data, period):
return np.divide(data, np.roll(data, period)) - 1
def trend_strength_factor(data, short_window=50, long_window=200, momentum_period=12):
# Calculate short and long moving averages
short_ma = moving_average(data, short_window)
long_ma = moving_average(data, long_window)
# Calculate momentum indicator
mom = momentum(data, momentum_period)
# Compute the difference between the short and long moving averages
diff = short_ma - long_ma
# Compute the standard deviation of the difference
std_diff = np.std(diff)
# Compute the mean momentum over the period
avg_mom = np.mean(mom)
# Compute the factor value as the ratio of the momentum to the standard deviation of the difference
factor_value = avg_mom / std_diff
return factor_value
This factor calculates the difference between a short-term moving average and a long-term moving average and computes the standard deviation of the difference as a measure of the trend. It also calculates the momentum of the price data and computes the ratio of the momentum to the standard deviation of the difference to capture the strength of the trend.
To backtest the factor, you could apply it to historical price data and simulate a trading strategy based on the factor signals. This would involve setting thresholds for buying and selling based on the factor values and testing the performance of the strategy using metrics such as the Sharpe ratio.
I: Could you please develop a factor for stocks which is related to large trading volume? Please write a function for the factor.
ChatGPT:
Certainly! Here’s a function that calculates the average volume of the last 5 trading days and returns a boolean value indicating whether the current day’s volume is greater than the average volume.
import pandas as pd
def large_volume_factor(df):
"""
Returns a boolean indicating whether the current day's volume is greater than the average volume of the last 5 trading days.
Parameters:
df (pandas.DataFrame): A DataFrame containing daily stock data including volume.
Returns:
bool: True if the current day's volume is greater than the average volume of the last 5 trading days, False otherwise.
"""
last_5_days = df.iloc[-5:] # Get the last 5 days of data
avg_volume = last_5_days['Volume'].mean() # Calculate the average volume of the last 5 days
current_volume = df.iloc[-1]['Volume'] # Get the volume for the current day
return current_volume > avg_volume
This function takes in a pandas DataFrame containing daily stock data, including volume. It then calculates the average volume of the last 5 trading days and compares it to the volume for the current day. If the current day’s volume is greater than the average volume of the last 5 days, it returns True, otherwise it returns False.
I: I hope that the factor is based on all tick data for a trading day, and aggregates the data into daily OHLC (open-high-low-close) bars and then apply the same calculation as before. Could you please upgrade codes?
ChatGPT:
Sure, here is an upgraded version of the code that aggregates tick data into daily OHLC bars before calculating the volume factor:
import pandas as pd
def daily_volume_factor(tick_data):
# Convert tick data to DataFrame with columns: ["timestamp", "price", "volume"]
tick_df = pd.DataFrame(tick_data, columns=["timestamp", "price", "volume"])
# Set timestamp column as index
tick_df.set_index("timestamp", inplace=True)
# Resample tick data to daily frequency, aggregating OHLC and summing volume
ohlc_df = tick_df.resample("D").agg({"price": ["first", "max", "min", "last"], "volume": "sum"})
ohlc_df.columns = ["open", "high", "low", "close", "volume"]
# Calculate the daily volume factor
vol = ohlc_df["volume"]
high_low_diff = ohlc_df["high"] - ohlc_df["low"]
volume_factor = (vol / vol.mean()) * (high_low_diff / high_low_diff.mean())
return volume_factor
This function takes in tick data as input, converts it into a DataFrame with columns for timestamp, price, and volume, sets the timestamp column as the index, resamples the data to daily OHLC bars while summing the volume, and calculates the volume factor using the same formula as before but using the daily OHLC data.
Third, encourage ChatGPT to write more complex factors.
I: We need a new financial factor that captures the correlation between volume and price. Could you please write a function for the new factor?
ChatGPT:
Certainly, here is a function that calculates a new financial factor that captures the correlation between volume and price using a rolling window:
import pandas as pd
def rolling_correlation(df, window_size):
close = df['Close']
volume = df['Volume']
rolling_corr = close.rolling(window_size).corr(volume)
return rolling_corr
The function takes a pandas DataFrame as input, where the DataFrame contains the daily OHLC (open-high-low-close) prices as well as trading volume for the stock. The function also takes a window size parameter that determines the size of the rolling window used to calculate the correlation.
The function first extracts the daily closing price and trading volume data from the DataFrame. It then calculates the rolling correlation between the two variables using the rolling()
method in pandas with the corr()
method applied to each rolling window.
This factor captures the correlation between price and volume over time, and may be useful for identifying trends and patterns in the relationship between the two variables.
I: Please upgrade the factor, make sure that it can capture the non-linear relationship between price and volume. Please do not use machine learning. Could you please write the codes?
ChatGPT:
To capture the non-linear relationship between price and volume, we can use a method called rank correlation. One common rank correlation measure is the Spearman’s rank correlation coefficient.
Here’s an updated function that calculates the rolling Spearman’s rank correlation coefficient between price and volume:
import pandas as pd
from scipy.stats import spearmanr
def rolling_spearmanr(df, window_size):
close = df['Close']
volume = df['Volume']
rolling_spearmanr, _ = spearmanr(close.rolling(window_size).mean().values,
volume.rolling(window_size).mean().values)
return rolling_spearmanr
The function takes a pandas DataFrame as input, where the DataFrame contains the daily OHLC (open-high-low-close) prices as well as trading volume for the stock. The function also takes a window size parameter that determines the size of the rolling window used to calculate the correlation.
The function first calculates the rolling mean of the daily closing price and trading volume data using the rolling()
method in pandas. It then calculates the Spearman's rank correlation coefficient between the two variables using the spearmanr()
function from the scipy.stats
module.
This factor captures the non-linear relationship between price and volume over time, and may be useful for identifying trends and patterns in the relationship between the two variables. Note that this method does not require the use of machine learning algorithms.
Third, improve the complexity. Encourage ChatGPT to write more complex factors. We provide ideas, and ChatGPT writes codes.
I: Please use an improved method that can replace spearmanr. Please do not use machine learning methods. Please upgrade the codes.
ChatGPT:
One possible method to improve the correlation calculation between price and volume is to use the Rank-Biserial Correlation Coefficient. This coefficient measures the strength of association between a binary variable (in our case, whether the price increased or decreased) and a continuous variable (volume).
Here is an updated version of the function that uses the Rank-Biserial Correlation Coefficient:
import numpy as np
import pandas as pd
from scipy.stats import rankdata
from typing import Tuple
def volume_price_correlation(data: pd.DataFrame, lookback: int = 20) -> Tuple[pd.Series, pd.Series]:
"""
Calculate the correlation between volume and price over a lookback period.
Parameters:
data (pd.DataFrame): a DataFrame containing 'close' and 'volume' columns
lookback (int): the number of days to look back
Returns:
A tuple of two pandas Series:
- The first series contains the rank-biserial correlation coefficients for each day
- The second series contains the p-values associated with each coefficient
"""
# Calculate the daily price changes
prices = data['close']
changes = prices.pct_change().shift(-1)
up_down = np.sign(changes)
# Calculate the rank-biserial correlation coefficients for each day
corr_list = []
pval_list = []
for i in range(lookback, len(data)):
volume = data['volume'][i-lookback:i]
rank_changes = rankdata(changes[i-lookback:i])
rank_changes = rank_changes / float(len(rank_changes))
corr, pval = stats.rankbiserialr(rank_changes, volume)
corr_list.append(corr)
pval_list.append(pval)
# Convert the lists to pandas Series and return them
corr_series = pd.Series(corr_list, index=data.index[lookback:])
pval_series = pd.Series(pval_list, index=data.index[lookback:])
return corr_series, pval_series
Note that the changes are still calculated using the percentage change between consecutive days, but now the binary variable is whether the price increased or decreased. We also calculate the rank of the changes within the lookback period and normalize them, then calculate the rank-biserial correlation coefficient using the rankbiserialr()
function from the scipy.stats
module.
This function returns a tuple of two pandas Series: the first one contains the rank-biserial correlation coefficients for each day, and the second one contains the p-values associated with each coefficient.
Summary
In this article, we first encourage ChatGPT to generate five common financial factors, then guide it to develop a new factor, and finally improve upon the volume-price divergence factor by taking it through a step-by-step process. ChatGPT provides two non-linear and non-machine learning methods for volume-price divergence: Spearman Rank and Rank-Biserial Correlation Coefficient. By guiding ChatGPT through the code-writing process step-by-step, it can be a useful tool in developing financial factors. In this way, we can obtain factors easily and incorporate the codes into FinRL.