How can we guide ChatGPT to generate financial factors, using volume-price divergence in FinRL as an example?

13 min readMar 3, 2023

Factors are an important issue in financial markets. However, calculating these factors requires a significant amount of effort. ChatGPT is a large language model that can help us calculate factors based on our ideas, such as volume-price divergence. Additionally, ChatGPT can write comments for algorithms.

While it may not be possible to use ChatGPT to develop factors, we can assign more menial tasks such as writing code to ChatGPT. In this article, we will encourage ChatGPT to write code for the volume-price divergence factor and refine it based on our instructions. The key to guiding ChatGPT is : Let’s think step by step.

The outline of this article is as follows.

Part 1: Guide ChatGPT to write a program that utilizes multiprocessing and tqdm. This program should open multiple processes, with each process displaying a progress bar using tqdm. All of these progress bars should update simultaneously.

Part 2: Guide ChatGPT to write code for calculating the volume-price divergence factor. First, we’ll start by guiding ChatGPT to write a conventional volume-price divergence factor. Then, we’ll guide ChatGPT to write a modified version of the factor that utilizes non-linear calculations.

The conversation usually has three stages: question (warmup), question-again, and end.

Part 1: Guide ChatGPT to write a program that utilizes multiprocessing and tqdm.

The following is the conversation .

I: In Python, I want to use multiprocessing to download data simultaneously and display progress bar by ‘tqdm’. How to write the code?

ChatGPT:

Sure, here’s an example code that uses multiprocessing to download data from a given URL in parallel and displays a progress bar using tqdm:

import requests
from multiprocessing import Pool
from tqdm import tqdm

def download(url):
    # Define block size for reading data in chunks
    block_size = 1024

    # Send a GET request to the URL and open a file with the same name
    # as the last component of the URL to write the response content to
    with requests.get(url, stream=True) as response, \
         open(url.split('/')[-1], 'wb') as file:

        # Retrieve the total file size from the response headers
        file_size = int(response.headers.get('content-length', 0))

        # Initialize a tqdm progress bar with the total file size and start downloading
        with tqdm(total=file_size, unit='B', unit_scale=True, desc=url.split('/')[-1]) as progress_bar:
            for chunk in response.iter_content(chunk_size=block_size):
                if chunk:
                    # Write the chunk to the file and update the progress bar
                    file.write(chunk)
                    progress_bar.update(len(chunk))

if __name__ == '__main__':
    # Define URL for data file
    url = 'https://url.to/data_file'

    # Create a pool of processes with two workers
    with Pool(processes=2) as pool:
        # Map download function to URL in parallel and use tqdm to track progress
        for _ in tqdm(pool.imap_unordered(download, [url, url]), total=2, desc='Overall Progress'):
            pass

In this code, the download function is defined to download a data file from a given URL using the requests module and save it to a file with the same name as the URL's last path component. The function also initializes a tqdm progress bar with the total file size and updates the progress bar for each downloaded chunk.

The Pool class from the multiprocessing module is used to create a pool of two processes, and the imap_unordered method is used to apply the download function to the URL in parallel. The tqdm function is used to track the overall progress of the download task by wrapping around the pool.imap_unordered method. Since there are only two download tasks running in parallel, the total number of iterations for tqdm is set to 2.

I: In the download function, please add “time.sleep(0.05)” to simulate the download process. Could you please revise the code?

ChatGPT:

Sure, here’s the modified code that adds a sleep time of 0.05 seconds in the download function to simulate the download process:

import requests
import time
from multiprocessing import Pool
from tqdm import tqdm

def download(url):
    # Define block size for reading data in chunks
    block_size = 1024

    # Send a GET request to the URL and open a file with the same name
    # as the last component of the URL to write the response content to
    with requests.get(url, stream=True) as response, \
         open(url.split('/')[-1], 'wb') as file:

        # Retrieve the total file size from the response headers
        file_size = int(response.headers.get('content-length', 0))

        # Initialize a tqdm progress bar with the total file size and start downloading
        with tqdm(total=file_size, unit='B', unit_scale=True, desc=url.split('/')[-1]) as progress_bar:
            for chunk in response.iter_content(chunk_size=block_size):
                if chunk:
                    # Write the chunk to the file and update the progress bar
                    file.write(chunk)
                    progress_bar.update(len(chunk))

                # Simulate download time by sleeping for 0.05 seconds
                time.sleep(0.05)

if __name__ == '__main__':
    # Define URL for data file
    url = 'https://url.to/data_file'

    # Create a pool of processes with two workers
    with Pool(processes=2) as pool:
        # Map download function to URL in parallel and use tqdm to track progress
        for _ in tqdm(pool.imap_unordered(download, [url, url]), total=2, desc='Overall Progress'):
            pass

In this code, the time module is imported, and a sleep time of 0.05 seconds is added in the download function after each chunk is downloaded to simulate the download process.

I run the code in PyCharm, and the fig shows the result.

I: You used two urls and they are the same. However, I want to use three different urls. Could you please revise the code?

ChatGPT:

Sure, here’s the modified code that downloads three different files from three different URLs simultaneously and displays progress bars using tqdm:

import requests
import time
from multiprocessing import Pool
from tqdm import tqdm

def download(url):
    # Define block size for reading data in chunks
    block_size = 1024

    # Send a GET request to the URL and open a file with the same name
    # as the last component of the URL to write the response content to
    with requests.get(url, stream=True) as response, \
         open(url.split('/')[-1], 'wb') as file:

        # Retrieve the total file size from the response headers
        file_size = int(response.headers.get('content-length', 0))

        # Initialize a tqdm progress bar with the total file size and start downloading
        with tqdm(total=file_size, unit='B', unit_scale=True, desc=url.split('/')[-1]) as progress_bar:
            for chunk in response.iter_content(chunk_size=block_size):
                if chunk:
                    # Write the chunk to the file and update the progress bar
                    file.write(chunk)
                    progress_bar.update(len(chunk))

                # Simulate download time by sleeping for 0.05 seconds
                time.sleep(0.05)

if __name__ == '__main__':
    # Define URLs for data files
    urls = [
        'https://url.to/data_file1',
        'https://url.to/data_file2',
        'https://url.to/data_file3'
    ]

    # Create a pool of processes with one worker for each URL
    with Pool(processes=len(urls)) as pool:
        # Map download function to URLs in parallel and use tqdm to track progress
        for _ in tqdm(pool.imap_unordered(download, urls), total=len(urls), desc='Overall Progress'):
            pass

In this code, the urls list is defined with three different URLs for three different data files. The Pool is created with len(urls) workers to process each download in parallel. The download function is mapped to the URLs using pool.imap_unordered to download the data files in parallel, and tqdm is used to display progress bars for each download.

I run the code in PyCharm, and the fig shows the result.

ChatGPT uses auto-regression to generate models. As discussed in “Large Language Models are Zero-Shot Reasoners” , guiding ChatGPT step-by-step can improve the quality of its responses. Therefore, we’ll use the guiding principle of “Let’s think step by step” when using ChatGPT. This means we need to describe our thinking process when providing input to ChatGPT to achieve better results.

Part 2: Guide ChatGPT to write code for calculating the volume-price divergence factor.

First, warmup. We discuss financial factors, such as volume-price divergence, with ChatGPT. However, we don’t let ChatGPT write the code directly. Instead, we first discuss the topic with ChatGPT, and then allow it to generate the code based on our discussion.

The following is the conversation .

I: In finance, technical factors are important. Can you describe five factors, write their equations using Matlab, and write the Python codes? Please make sure that the equations are right.

ChatGPT:

Sure, here are five technical factors in finance, their equations in MATLAB, and example Python code for each:

Moving Average Convergence Divergence (MACD)

MATLAB Equation:

[MACD,Signal] = macd(Close,12,26,9)

Python Code:

import pandas as pd
import talib

def calculate_macd(prices):
    macd, signal, hist = talib.MACD(prices, fastperiod=12, slowperiod=26, signalperiod=9)
    return macd, signal

2. Relative Strength Index (RSI)

MATLAB Equation:

RSI = 100 — (100 / (1 + RS))

RS = (average gain over N periods) / (average loss over N periods)

Python Code:

import numpy as np
import talib

def calculate_rsi(prices, n=14):
    deltas = np.diff(prices)
    seed = deltas[:n+1]
    up = seed[seed >= 0].sum()/n
    down = -seed[seed < 0].sum()/n
    rs = up / down
    rsi = talib.RSI(prices, timeperiod=n)
    return rsi

3. Bollinger Bands

MATLAB Equation:

[upperBand,middleBand,lowerBand] = bollinger(Close,20,2)

Python Code:

import talib

def calculate_bollinger_bands(prices, n=20, k=2):
    upper_band, middle_band, lower_band = talib.BBANDS(prices, timeperiod=n, nbdevup=k, nbdevdn=k)
    return upper_band, middle_band, lower_band

4. On-Balance Volume (OBV)

MATLAB Equation:

OBV = obv(Close,Volume)

Python Code:

import pandas as pd
import talib

def calculate_obv(prices, volume):
    obv = talib.OBV(prices, volume)
    return obv

5. Stochastic Oscillator

MATLAB Equation:

[Stoch,StochMA] = stoch(High,Low,Close)

Python Code:

import talib

def calculate_stochastic(prices, high, low):
    slowk, slowd = talib.STOCH(high, low, prices, fastk_period=5, slowk_period=3, slowd_period=3)
    return slowk, slowd

Financial reinforcement learning library FinRL uses TAlib and StockStats to calculating factors. We can copy some factors to ChatGPT, and then guide it write new factors, which may be better. The new factors can be test in FinRL.

Second, encourage ChatGPT to be a professional financial factor developer, and guide it to design and write new factors.

The following is the conversation.

I: You are a professional developer for financial factors. Could you please develop a factor with high sharp ratio and good backtesting performance. Please write a function for the factor.

ChatGPT:

Sure, here’s an example of a factor that uses a combination of moving averages to identify trends in the price data and a momentum indicator to capture the strength of the trend:

import numpy as np

def moving_average(data, window):
    weights = np.repeat(1.0, window) / window
    return np.convolve(data, weights, mode='valid')

def momentum(data, period):
    return np.divide(data, np.roll(data, period)) - 1

def trend_strength_factor(data, short_window=50, long_window=200, momentum_period=12):
    # Calculate short and long moving averages
    short_ma = moving_average(data, short_window)
    long_ma = moving_average(data, long_window)

    # Calculate momentum indicator
    mom = momentum(data, momentum_period)

    # Compute the difference between the short and long moving averages
    diff = short_ma - long_ma

    # Compute the standard deviation of the difference
    std_diff = np.std(diff)

    # Compute the mean momentum over the period
    avg_mom = np.mean(mom)

    # Compute the factor value as the ratio of the momentum to the standard deviation of the difference
    factor_value = avg_mom / std_diff

    return factor_value

This factor calculates the difference between a short-term moving average and a long-term moving average and computes the standard deviation of the difference as a measure of the trend. It also calculates the momentum of the price data and computes the ratio of the momentum to the standard deviation of the difference to capture the strength of the trend.

To backtest the factor, you could apply it to historical price data and simulate a trading strategy based on the factor signals. This would involve setting thresholds for buying and selling based on the factor values and testing the performance of the strategy using metrics such as the Sharpe ratio.

I: Could you please develop a factor for stocks which is related to large trading volume? Please write a function for the factor.

ChatGPT:

Certainly! Here’s a function that calculates the average volume of the last 5 trading days and returns a boolean value indicating whether the current day’s volume is greater than the average volume.

import pandas as pd

def large_volume_factor(df):
    """
    Returns a boolean indicating whether the current day's volume is greater than the average volume of the last 5 trading days.
    
    Parameters:
    df (pandas.DataFrame): A DataFrame containing daily stock data including volume.
    
    Returns:
    bool: True if the current day's volume is greater than the average volume of the last 5 trading days, False otherwise.
    """
    last_5_days = df.iloc[-5:] # Get the last 5 days of data
    avg_volume = last_5_days['Volume'].mean() # Calculate the average volume of the last 5 days
    current_volume = df.iloc[-1]['Volume'] # Get the volume for the current day
    
    return current_volume > avg_volume

This function takes in a pandas DataFrame containing daily stock data, including volume. It then calculates the average volume of the last 5 trading days and compares it to the volume for the current day. If the current day’s volume is greater than the average volume of the last 5 days, it returns True, otherwise it returns False.

I: I hope that the factor is based on all tick data for a trading day, and aggregates the data into daily OHLC (open-high-low-close) bars and then apply the same calculation as before. Could you please upgrade codes?

ChatGPT:

Sure, here is an upgraded version of the code that aggregates tick data into daily OHLC bars before calculating the volume factor:

import pandas as pd

def daily_volume_factor(tick_data):
    # Convert tick data to DataFrame with columns: ["timestamp", "price", "volume"]
    tick_df = pd.DataFrame(tick_data, columns=["timestamp", "price", "volume"])
    
    # Set timestamp column as index
    tick_df.set_index("timestamp", inplace=True)
    
    # Resample tick data to daily frequency, aggregating OHLC and summing volume
    ohlc_df = tick_df.resample("D").agg({"price": ["first", "max", "min", "last"], "volume": "sum"})
    ohlc_df.columns = ["open", "high", "low", "close", "volume"]
    
    # Calculate the daily volume factor
    vol = ohlc_df["volume"]
    high_low_diff = ohlc_df["high"] - ohlc_df["low"]
    volume_factor = (vol / vol.mean()) * (high_low_diff / high_low_diff.mean())
    
    return volume_factor

This function takes in tick data as input, converts it into a DataFrame with columns for timestamp, price, and volume, sets the timestamp column as the index, resamples the data to daily OHLC bars while summing the volume, and calculates the volume factor using the same formula as before but using the daily OHLC data.

Third, encourage ChatGPT to write more complex factors.

I: We need a new financial factor that captures the correlation between volume and price. Could you please write a function for the new factor?

ChatGPT:

Certainly, here is a function that calculates a new financial factor that captures the correlation between volume and price using a rolling window:

import pandas as pd

def rolling_correlation(df, window_size):
    close = df['Close']
    volume = df['Volume']
    rolling_corr = close.rolling(window_size).corr(volume)
    return rolling_corr

The function takes a pandas DataFrame as input, where the DataFrame contains the daily OHLC (open-high-low-close) prices as well as trading volume for the stock. The function also takes a window size parameter that determines the size of the rolling window used to calculate the correlation.

The function first extracts the daily closing price and trading volume data from the DataFrame. It then calculates the rolling correlation between the two variables using the rolling() method in pandas with the corr() method applied to each rolling window.

This factor captures the correlation between price and volume over time, and may be useful for identifying trends and patterns in the relationship between the two variables.

I: Please upgrade the factor, make sure that it can capture the non-linear relationship between price and volume. Please do not use machine learning. Could you please write the codes?

ChatGPT:

To capture the non-linear relationship between price and volume, we can use a method called rank correlation. One common rank correlation measure is the Spearman’s rank correlation coefficient.

Here’s an updated function that calculates the rolling Spearman’s rank correlation coefficient between price and volume:

import pandas as pd
from scipy.stats import spearmanr

def rolling_spearmanr(df, window_size):
    close = df['Close']
    volume = df['Volume']
    rolling_spearmanr, _ = spearmanr(close.rolling(window_size).mean().values, 
                                      volume.rolling(window_size).mean().values)
    return rolling_spearmanr

The function first calculates the rolling mean of the daily closing price and trading volume data using the rolling() method in pandas. It then calculates the Spearman's rank correlation coefficient between the two variables using the spearmanr() function from the scipy.stats module.

This factor captures the non-linear relationship between price and volume over time, and may be useful for identifying trends and patterns in the relationship between the two variables. Note that this method does not require the use of machine learning algorithms.

Third, improve the complexity. Encourage ChatGPT to write more complex factors. We provide ideas, and ChatGPT writes codes.

I: Please use an improved method that can replace spearmanr. Please do not use machine learning methods. Please upgrade the codes.

ChatGPT:

One possible method to improve the correlation calculation between price and volume is to use the Rank-Biserial Correlation Coefficient. This coefficient measures the strength of association between a binary variable (in our case, whether the price increased or decreased) and a continuous variable (volume).

Here is an updated version of the function that uses the Rank-Biserial Correlation Coefficient:

import numpy as np
import pandas as pd
from scipy.stats import rankdata
from typing import Tuple

def volume_price_correlation(data: pd.DataFrame, lookback: int = 20) -> Tuple[pd.Series, pd.Series]:
    """
    Calculate the correlation between volume and price over a lookback period.
    
    Parameters:
    data (pd.DataFrame): a DataFrame containing 'close' and 'volume' columns
    lookback (int): the number of days to look back
    
    Returns:
    A tuple of two pandas Series:
    - The first series contains the rank-biserial correlation coefficients for each day
    - The second series contains the p-values associated with each coefficient
    """
    # Calculate the daily price changes
    prices = data['close']
    changes = prices.pct_change().shift(-1)
    up_down = np.sign(changes)
    
    # Calculate the rank-biserial correlation coefficients for each day
    corr_list = []
    pval_list = []
    for i in range(lookback, len(data)):
        volume = data['volume'][i-lookback:i]
        rank_changes = rankdata(changes[i-lookback:i])
        rank_changes = rank_changes / float(len(rank_changes))
        corr, pval = stats.rankbiserialr(rank_changes, volume)
        corr_list.append(corr)
        pval_list.append(pval)
    
    # Convert the lists to pandas Series and return them
    corr_series = pd.Series(corr_list, index=data.index[lookback:])
    pval_series = pd.Series(pval_list, index=data.index[lookback:])
    return corr_series, pval_series

Note that the changes are still calculated using the percentage change between consecutive days, but now the binary variable is whether the price increased or decreased. We also calculate the rank of the changes within the lookback period and normalize them, then calculate the rank-biserial correlation coefficient using the rankbiserialr() function from the scipy.stats module.

This function returns a tuple of two pandas Series: the first one contains the rank-biserial correlation coefficients for each day, and the second one contains the p-values associated with each coefficient.

Summary

In this article, we first encourage ChatGPT to generate five common financial factors, then guide it to develop a new factor, and finally improve upon the volume-price divergence factor by taking it through a step-by-step process. ChatGPT provides two non-linear and non-machine learning methods for volume-price divergence: Spearman Rank and Rank-Biserial Correlation Coefficient. By guiding ChatGPT through the code-writing process step-by-step, it can be a useful tool in developing financial factors. In this way, we can obtain factors easily and incorporate the codes into FinRL.

How can we guide ChatGPT to generate financial factors, using volume-price divergence in FinRL as an example?

Written by Ming Zhu

Responses (1)