Portfolio Management Using Multi-Agent Reinforcement Learning
The work I am about to share was done by myself, Venkata Praneeth Donparthi, and Colleen Boarman under the guidance of Dr. Ozgur Ozturk. We are all ML/AI professionals who hike the ML/AI trail as both passion and profession.
There have been numerous efforts to use Artificial Intelligence for stock portfolio management, with encouraging results. Joining the bandwagon, we explored the domain as well, but using the dynamic nature of reinforcement learning. The research question was “How effective can a Multi-Agent Reinforcement Learning Algorithm be for Automated Stock Portfolio Management?”
The highlights of work include:
· Test and validate the model on periods covering both economic decline and growth
· Include lagging, leading, and coincident technical indicators in the reinforcement learning environment
· Add a sentiment score as one of the indicators in the environment
We selected three use cases:
1. 5 most volatile S&P 500 stocks
2. 5 least volatile S&P 500 stocks
3. A portfolio of the 5 most & 5 least volatile stocks (this is the use case I will refer to throughout this post)
Architecture
Now, let’s have a look at the architecture before diving into implementation:
Ten stocks are selected along with their leading, lagging, and coincident indicators, where a sentiment score calculated from news publications is one of the lagging indicators. The resulting environment is operated on by three deep neural network agents that test and trade during pre-defined periods. Testing is done to select the best agent for the next trading window. The results are then compared with a baseline index to represent the real value of the system.
Implementation
Time Period Selection
We explored GDP and historical stock data to build a comprehensive train/trade data set that includes recession, expansion, and stable economic periods. Based on our analysis, we decided to go with the time period 2005–2020 to cover the testing and trading periods.
Portfolio Selection
From the S&P 500 stocks, we used Yahoo Finance to fetch the closing prices for the selected period.
import yfinance as yf
sp500 = yf.download(tickers, start=start_date, end=end_date)  # Download historical prices for the selected tickers
The volatility of each stock is calculated as:
import numpy as np
import pandas as pd
# Calculate the daily volatility as the standard deviation of daily log returns.
daily_volatility = df_close.pct_change().apply(lambda x: np.log(1 + x)).std()
# Calculate the weekly volatility (considering 5 trading days per week).
weekly_volatility = daily_volatility.apply(lambda x: x * np.sqrt(5))
# Create a new dataframe and reset the index.
WV = pd.DataFrame(weekly_volatility).reset_index()
# Rename the columns to "Tick" and "Volatility".
WV.columns = ["Tick", "Volatility"]
# Sort the values by Volatility in descending order.
sorted_weekly_volatility = WV.sort_values(by="Volatility", ascending=False)
# The top 5, i.e. the most volatile stocks.
sorted_weekly_volatility.head()
The five most volatile stocks were Enphase Energy Inc (ENPH), KeyCorp (KEY), Delta Air Lines Inc (DAL), Lincoln National Corp (LNC), and Etsy Inc (ETSY), and the five least volatile stocks were Xcel Energy Inc (XEL), Procter & Gamble Co (PG), Alliant Energy Corporation (LNT), PepsiCo Inc (PEP), and Dominion Energy Inc (D). We found that the correlation between economic indicators and the less volatile stocks is very high.
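As a rough illustration of how that check can be done, here is a minimal sketch (not our exact code; it assumes the df_close prices from above and quarterly GDP fetched from FRED):
import pandas_datareader.data as web

# Quarterly GDP from FRED, compared with quarterly average closing prices of the low-volatility stocks.
gdp_q = web.DataReader("GDP", "fred", start_date, end_date)["GDP"]
low_vol_close_q = df_close[["XEL", "PG", "LNT", "PEP", "D"]].resample("QS").mean()
print(low_vol_close_q.corrwith(gdp_q))  # Correlation of each stock's quarterly average price with GDP.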
Sentiment Classification
Sentiment scores were calculated on news publications relevant to the selected portfolio for the test and trading time periods. These sentiment scores were then utilized as one of the lagging indicators in the RL environment.
The news publication data source is the Daily Financial News 2009–2020 dataset.
The text is pre-processed before passing it through the sentiment classification process.
import re
import string
from gensim.parsing.preprocessing import remove_stopwords  # Stop-word removal (gensim's helper assumed here).

def process_sentence(sentence):
    """
    Process the sentence: convert to lower case, remove digits, punctuation, and stopwords.
    """
    l_sentence = sentence.lower()  # Convert the sentence to lower case.
    rd_sentence = re.sub(r'\d+', '', l_sentence)  # Remove the digits from the sentence.
    # Remove punctuation from the sentence.
    plain_sentence = rd_sentence.translate(str.maketrans(dict.fromkeys(string.punctuation)))
    plain_sentence = plain_sentence.strip()  # Strip extra white space from the ends.
    return remove_stopwords(plain_sentence)  # Remove stop words.
This is followed by calculating the polarity of each relevant article for the appropriate time period.
from textblob import TextBlob

def get_sentiment_polarity(sentence):
    """
    Return the sentiment polarity of the sentence provided.
    """
    processed_sentence = process_sentence(sentence)
    return TextBlob(processed_sentence).sentiment.polarity  # Return the sentence sentiment polarity/score.
Here we used TextBlob, which provides a simple API for common NLP tasks such as sentiment analysis.
Gross Domestic Product (GDP) Index
Once we had the sentiment scores, another indicator was added to the environment: GDP, which provides a fair indication of U.S. economic activity.
import pandas_datareader.data as web

# Fetch quarterly GDP from FRED and forward-fill it onto a daily calendar.
gdp = web.DataReader("GDP", 'fred', 2010, 2020)
gdp = gdp.reset_index()
date = pd.date_range(start='2010-01-01', end='2020-12-30')
date_df = pd.DataFrame()
date_df['date'] = date
gdp['date'] = pd.to_datetime(gdp['DATE'])
gdp.rename(columns={'GDP': 'gdp'}, inplace=True)
gdp = gdp.drop(['DATE'], axis=1)
gdp_df = gdp.merge(date_df, on='date', how='right')
gdp_df = gdp_df.fillna(method='ffill')
# Repeat the GDP series for every ticker in the portfolio.
gdp_df = gdp_df.merge(pd.DataFrame({"tic": yf_df.tic.unique()}), how="cross")
Building Reinforcement Learning Environment
The FinRL library was utilized, which provides functions to create a custom multi-agent reinforcement learning model. Exploring the package, we ran the use cases provided and built on top of them.
After importing the necessary libraries, we defined the test/trade time periods and combined the stock values for the selected portfolio, the sentiment scores, and the GDP index.
df = pd.merge(pd.merge(yf_df, sentiments_df, how='left', on=['date', 'tic']),
              gdp_df, how='left', on=['date', 'tic'])
FinRL’s FeatureEngineer was used to merge other technical indicators into the dataset. The technical indicators we utilized are moving average convergence divergence (MACD), relative strength index (RSI), commodity channel index (CCI), and directional movement index (DX) on the historical stock data. Technical indicators are pattern-based cues generated by price, volume, and other factors that traders can use for a more systematic analysis. MACD, RSI, and CCI are lagging indicators, meaning they change following the state of the economy; DX is a leading indicator, meaning it changes prior to the economic state; and GDP is a coincident indicator, meaning it changes along with the status of the economy.
technical_indicators = ['macd', 'rsi_30', 'cci_30', 'dx_30']
fe_pipeline = FeatureEngineer(use_technical_indicator=True,
                              tech_indicator_list=technical_indicators,
                              use_turbulence=True,
                              user_defined_feature=True)
df_processed = fe_pipeline.preprocess_data(df)
Environment parameters are defined as:
udfs = 2 # Sentiment Scores, GDP
stock_dimension = len(df_processed.tic.unique())
state_space = 1 + 2 * stock_dimension + len(technical_indicators)*stock_dimension + udfs*stock_dimension
The observation/state dimension is calculated as: current cash amount + (current share count + stock price)*no. of stocks + FinRL technical indicators*no. of stocks + (GDP + sentiment score)*no. of stocks.
The action space ranges from -N to N for each stock, where a negative value is a sell, 0 is hold, and a positive value is a buy; N is the number of shares traded.
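To make the state dimension concrete, here is the arithmetic for our combined 10-stock portfolio (a worked example using the values defined above):
stock_dimension = 10   # 5 most volatile + 5 least volatile stocks
n_indicators = 4       # macd, rsi_30, cci_30, dx_30
udfs = 2               # sentiment score, GDP
state_space = 1 + 2 * stock_dimension + n_indicators * stock_dimension + udfs * stock_dimension
print(state_space)     # 1 + 20 + 40 + 20 = 81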
Environment variables are consolidated as:
env_variables = { "hmax": 100, # Maximum number of shares that can be traded at a time
"buy_cost_pct": 0.001, # Transaction cost for buying
"sell_cost_pct": 0.001, # Transaction cost for selling
"state_space": state_space, # State space for the environment
"tech_indicator_list": technical_indicators, # List of technical indicators used
"action_space": stock_dimension, # Action space for the environment
"reward_scaling": 1e-4, # Scaling factor for rewards
"print_verbosity": 5 # Level of detail for logging during training
}
The next step is to initialize the RL agents and define their hyperparameters. We used A2C, PPO, and DDPG as the RL agents; for example, the A2C hyperparameters were defined as:
A2C_model_kwargs = {'n_steps': 5, 'ent_coef': 0.005, 'learning_rate': 0.0007}
Similarly for PPO and DDPG.
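For reference, a configuration in the spirit of the FinRL ensemble examples looks like the following (illustrative values, not necessarily the exact ones we used):
# Illustrative hyperparameters for the other two agents (values assumed from typical FinRL ensemble examples).
PPO_model_kwargs = {'ent_coef': 0.01, 'n_steps': 2048, 'learning_rate': 0.00025, 'batch_size': 128}
DDPG_model_kwargs = {'buffer_size': 10_000, 'learning_rate': 0.0005, 'batch_size': 64}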
We selected Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), and Deep Deterministic Policy Gradient (DDPG). A2C and PPO are on-policy methods, meaning they improve the same policy that is being used to collect experience, whereas DDPG is an off-policy method: it learns the expected reward of actions from stored experience and can adjust its policy without acting it out. Another difference is that A2C and PPO learn a stochastic policy, which means there can be more than one possible action in any state, whereas DDPG learns a deterministic policy, meaning a single action is chosen in each state.
Ensemble Agent
Defining the ensemble agent is the next step in building our model. First, we defined the rebalance window and the validation window. In the world of finance, three months of stock market activity (taking weekends and holidays into account) equates to roughly 63 trading days, so 63 days is used for the rebalance window. Each agent is also validated in a 63-day window following its training period so that the algorithm can select the best-performing agent based on the Sharpe ratio it produces. We designate the initial investment amount to be 10,000 dollars. The stock dimension, processed data frame, train period, validation period, and environment variables are all passed into the ensemble agent function.
The Sharpe ratio quantifies the return on an investment relative to the amount of risk incurred. Since the Sharpe ratio relates return to volatility, the higher the Sharpe ratio, the more attractive the risk-adjusted return and, therefore, the better the model performed. Generally speaking, a Sharpe ratio greater than 1 is considered good; however, it is most meaningful to evaluate Sharpe ratios comparatively against others from a similar context.
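As a rough sketch of the calculation (annualized from daily returns, with a zero risk-free rate assumed by default; the ensemble agent computes its own version internally on the validation-window account values):
import numpy as np

def sharpe_ratio(daily_returns, periods_per_year=252, risk_free_rate=0.0):
    # Annualized Sharpe ratio: mean excess return divided by its standard deviation, scaled by sqrt(periods).
    excess = np.asarray(daily_returns) - risk_free_rate / periods_per_year
    return np.sqrt(periods_per_year) * excess.mean() / excess.std()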
timesteps_dict = { 'a2c': 10_000, 'ppo': 10_000, 'ddpg': 10_000 } #Defining time steps
rebalance_window = 63
validation_window = 63
test_dates = df_processed[(df_processed.date > TEST_START_DATE) & (df_processed.date <= TEST_END_DATE)].date.unique()  # Get the dates for which test/trade predictions are made.
trade_date_df = pd.DataFrame({'datadate': test_dates})  # Convert the dates to a dataframe.
ensemble_agent = DRLEnsembleAgent(initial_amount = 10000,
stock_dim = stock_dimension,
df = df_processed,
train_period=(TRAIN_START_DATE,TRAIN_END_DATE),
val_test_period=(TEST_START_DATE,TEST_END_DATE),
rebalance_window=rebalance_window,
validation_window=validation_window,
**env_variables)
df_summary = ensemble_agent.run_ensemble_strategy(A2C_model_kwargs, PPO_model_kwargs, DDPG_model_kwargs, timesteps_dict)
The table below shows a summary of the ensemble agent output for each validation period. The Sharpe ratio is calculated for each RL algorithm, and the “Model Used” column identifies which algorithm was determined, based on the Sharpe ratio, to be the most successful.
Using the ensemble results, we get the complete picture of the point-in-time account value throughout the trading period.
results_list = []
# For every cycle (rebalance_window + validation_window), read the csv files into a dataframe.
for i in range(rebalance_window + validation_window, len(test_dates) + 1, rebalance_window):
    result_df = pd.read_csv(f'results/account_value_trade_ensemble_{i}.csv')
    results_list.append(result_df)
df_ensemble_results = pd.concat(results_list, ignore_index=True)  # Combine the execution results into one dataframe.
Performance Evaluation
We measured the performance of our work by comparing it with the S&P 500 index, rescaled to the same initial amount of $10,000.
df_index_scaled = pd.DataFrame()
df_index_scaled["date"] = df_ensemble_results["date"]
# Rescale the index so that it also starts at the $10,000 initial amount.
df_index_scaled["spy"] = 10000 * df_index_masked['close'] / df_index_masked.iloc[0]['close']
The Quantopian pyfolio package, exposed through the FinRL library, was used to get the backtesting stats.
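A minimal sketch of how such stats can be produced from the account values (pyfolio's perf_stats helper is assumed here, along with the account_value/date columns the ensemble agent writes out):
from pyfolio import timeseries

strategy_returns = df_ensemble_results.copy()
strategy_returns['date'] = pd.to_datetime(strategy_returns['date'])
# Daily returns of the ensemble strategy, derived from the account values ('account_value' column assumed).
strategy_returns = strategy_returns.set_index('date')['account_value'].pct_change().dropna()
perf_stats_strategy = timeseries.perf_stats(returns=strategy_returns)
print(perf_stats_strategy)  # Annual return, Sharpe ratio, max drawdown, etc.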
The backtest stats also provide insight into the performance of the model. Backtesting involves applying a predictive model to historical data in order to assess the model’s accuracy without risking capital; it primarily sheds light on the overall profitability and risk level of the trading strategy (Corporate Finance Institute). The tables below show a side-by-side comparison of our reinforcement learning model’s backtest metrics (left) and the baseline backtest metrics (right). From the backtest results, we can see that our model outperforms the baseline in every metric.
From the graph below, it is clear that the model did not perform well in the beginning, as the account value is consistently below the baseline value. However, over time, the model’s performance improved. We can also see that at one point the baseline values drop substantially while the model’s account values do not, meaning our model made profitable decisions relative to the baseline.
Out of the three use cases, the portfolio of the five least volatile stocks (Use Case 2) showed the best performance.
Conclusion
Through this work, we can see that RL can be utilized for automated stock portfolio management, provided the environment is enriched with the right technical indicators and reasonably stable stock profiles.
Note
- Complete source code exists in the Github Repository
- This post is not a tutorial on Reinforcement Learning but a showcase of how Reinforcement Learning in a Deep Learning Multi-Agent Environment can be used for enhanced profits in Automated Stock Portfolio Management.
Useful Links:
The links below helped us build our foundation and grow from there. The FinRL library and the examples provided are also very well documented.
- https://github.com/AI4Finance-Foundation/FinRL
- https://towardsdatascience.com/predicting-stock-prices-using-a-keras-lstm-model-4225457f0233
- https://neptune.ai/blog/predicting-stock-prices-using-machine-learning
- https://towardsdatascience.com/how-to-create-a-fully-automated-ai-based-trading-system-with-python-708503c1a907
- https://towardsdatascience.com/deep-reinforcement-learning-for-automated-stock-trading-f1dad0126a02