“The First Trading Day of a Month” Effect... Is that real?

Yatshun Lee
The Modern Scientist
5 min read · Aug 6, 2023

One day my friend brought this observation to me… as a statistician, what would you do?

Photo by Yiorgos Ntrahas on Unsplash

Observation

My friend Victor is a die-hard VOO fan who uses dollar cost averaging to contribute to the ETF every month. One day, he showed me the following…

Open price data of VOO from 2019–01–01 to 2023–08–01. Image by Author.

Interestingly, the counts for the first three days are abnormally high. That led to his claim:

I believe that buying at the market open on the first trading day of a month can help me earn more when using the dollar cost averaging method.

Looking at the graph, it seems apparent that the first three days are better for buying, as prices are generally at their lowest in the month. However, is a plot strong enough evidence? If not, how can you assess the claim statistically?

Data

I used the yfinance package to retrieve daily OHLC prices from Yahoo Finance. The sample starts on Jan 1, 2019 and ends on Jul 31, 2023.

import yfinance as yf

ticker = 'VOO'
start = '2019-01-01'
end = '2023-08-01'  # the end date is exclusive, so the last row is Jul 31, 2023

df = yf.download(ticker, start=start, end=end)

# move the Date index into a column and split it into calendar parts
df = df.reset_index()
df['month'] = df.Date.dt.month
df['year'] = df.Date.dt.year
df['day'] = df.Date.dt.day

df = df[['Open', 'month', 'year', 'day']]
df.head()

Retrieved dataframe from Yahoo API. Image by Author.
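The construction of Victor's chart is not spelled out, so here is a rough sketch of one plausible reading: for each month, find the calendar day with the lowest open price, then count how often each day wins. Both that reading and the synthetic `demo` frame are my assumptions (the prices are random, not VOO data). The frame only mimics the columns shown above so that the snippet runs without a download.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the downloaded frame (hypothetical prices, not VOO data).
rng = np.random.default_rng(0)
dates = pd.bdate_range("2019-01-01", "2019-12-31")
demo = pd.DataFrame({
    "Date": dates,
    "Open": 100 + rng.normal(0, 1, len(dates)).cumsum(),
})
demo["year"] = demo["Date"].dt.year
demo["month"] = demo["Date"].dt.month
demo["day"] = demo["Date"].dt.day

# For each (year, month), find the row whose open price is the lowest ...
lowest_idx = demo.groupby(["year", "month"])["Open"].idxmin()
# ... and count how often each calendar day holds that monthly low.
lowest_day_counts = demo.loc[lowest_idx, "day"].value_counts().sort_index()
print(lowest_day_counts)
```

With the real `df`, the same two `groupby` lines should apply unchanged, since it carries the same `Open`, `year`, `month` and `day` columns.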

Hypotheses (H0 and H1)

  • H0: Victor’s strategy performs no differently from buying on a random trading day, i.e.,

μ(wealth accumulated when buying on a random trading day each month) = μ(wealth accumulated when buying only on the first trading day)

  • H1: Victor’s strategy performs differently from buying on a random trading day, i.e.,

μ(wealth accumulated when buying on a random trading day each month) ≠ μ(wealth accumulated when buying only on the first trading day)

We now have the null hypothesis, which lets us test whether the two means differ. However, two important questions remain:

  1. What is the underlying distribution of the test statistic?
  2. The data is a series of sequential open prices. How do you compare the two strategies?

To resolve this, I used Monte Carlo simulation to approximate the distribution of outcomes numerically.

Monte Carlo Simulation for the Dollar Cost Averaging Method

I used this method to generate random trajectories, each representing an investor who buys at the open price on a randomly chosen trading day every month, and compared them with the first-trading-day method. For each month of each year, I drew 10,000 open prices with replacement, which gives 10,000 possible random trajectories for dollar cost averaging.

Some random trajectories from Monte Carlo Simulation. Image by Author.

The same nested loop also collects the open prices for the “First Trading Day” strategy. The following code performs the resampling and builds both sets of open price trajectories.

Illustration of the code below. Image by Author.

import numpy as np

n_sim = 10000

years = sorted(df.year.unique())
months = sorted(df.month.unique())

trajectories = []
first_trading_day_trajectory = []

for yr in years:
    for mn in months:
        cond1 = df.year == yr
        cond2 = df.month == mn

        selection = df.loc[cond1 & cond2]
        # skip year/month pairs with no data (e.g. Aug-Dec 2023)
        if selection.shape[0] == 0:
            continue

        # random strategy: draw n_sim open prices from this month, with replacement
        resamples = selection['Open'].sample(n_sim, replace=True).values
        trajectories.append(resamples)

        # first-trading-day strategy: rows are in date order, so row 0 is the first trading day
        first_price = selection.iloc[0].Open
        first_trading_day_trajectory.append(first_price)

# these 2 are the "open price trajectories"
trajectories = np.array(trajectories)  # shape: (n_months, n_sim)
first_trading_day_trajectory = np.array(first_trading_day_trajectory)  # shape: (n_months,)
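The resampling above draws fresh random numbers on every run. A minimal sketch of a seeded, function-style variant follows, in case you want repeatable results. The helper name `sample_trajectories`, the `seed` argument and the tiny `toy` frame are all hypothetical and only there for illustration.

```python
import numpy as np
import pandas as pd

def sample_trajectories(prices, n_sim, seed=0):
    """Resample one open price per month, n_sim times, with replacement.

    `prices` needs 'Open', 'year' and 'month' columns, sorted by date.
    Returns (random_paths, first_day_path) with shapes
    (n_months, n_sim) and (n_months,).
    """
    rng = np.random.default_rng(seed)  # fixed seed: an assumption, for repeatability
    random_paths, first_day_path = [], []
    # groupby visits the (year, month) groups in chronological order
    for _, month_df in prices.groupby(["year", "month"]):
        opens = month_df["Open"].to_numpy()
        random_paths.append(rng.choice(opens, size=n_sim, replace=True))
        first_day_path.append(opens[0])  # first row = first trading day of the month
    return np.array(random_paths), np.array(first_day_path)

# Tiny hypothetical frame: two months, three trading days each.
toy = pd.DataFrame({
    "Open": [10.0, 11.0, 12.0, 20.0, 21.0, 22.0],
    "year": [2019] * 6,
    "month": [1, 1, 1, 2, 2, 2],
})
paths, first_day = sample_trajectories(toy, n_sim=5)
print(paths.shape, first_day)
```

Each column of `paths` is one random dollar cost averaging trajectory. Each row only ever contains prices from its own month.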

Once you have the arrays of open prices, set a monthly contribution (I used $1,000) and convert each trajectory into the final value of the accumulated holdings:

import matplotlib.pyplot as plt
import seaborn as sns

monthly_pay = 1000 # for the dollar cost averaging

last_price = df.Open.iloc[-1]  # open price on the last day of the sample

# simulation results
# 1 + (last_price - p) / p equals last_price / p: the growth of $1 invested at price p
return_trajectories = 1 + (last_price - trajectories) / trajectories
mc_simulation_results = np.sum(return_trajectories * monthly_pay, axis=0)

# the result of the first trading day effect
first_trading_day_return = (
    1 + (last_price - first_trading_day_trajectory) /
    first_trading_day_trajectory
)
first_trading_day_result = np.sum(first_trading_day_return * monthly_pay)

# visualize
plt.figure(figsize=(10, 4))
sns.histplot(mc_simulation_results)
plt.axvline(x=first_trading_day_result, c='red', linestyle='--')
plt.xlabel('Cumulative Sum ($)')
plt.show()

We now have 10000 simulation results and the result of the proposed method. Image by Author.
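To see what the summed quantity means, here is a small worked example with made-up prices (the three purchase prices and the final price are hypothetical). It checks that the growth factor used above is just the last price divided by the purchase price. The total is therefore the final value of all shares bought, principal included.

```python
import numpy as np

monthly_pay = 1000                             # dollars invested each month
buy_prices = np.array([100.0, 125.0, 200.0])   # hypothetical purchase prices
last_price = 250.0                             # hypothetical final open price

# Growth factor as written in the article's code ...
growth = 1 + (last_price - buy_prices) / buy_prices
# ... which is algebraically the same as last_price / buy_prices.
assert np.allclose(growth, last_price / buy_prices)

final_value = np.sum(growth * monthly_pay)

# Equivalent view: shares bought each month, valued at the last price.
shares = monthly_pay / buy_prices
assert np.isclose(final_value, shares.sum() * last_price)

print(final_value)  # 5750.0
```

Here $3,000 of contributions buys 10 + 8 + 5 = 23 shares, worth $5,750 at the final price.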

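The proposed method outperforms 97.649% of the results from random sampling. This corresponds to a one-sided empirical p-value of about 0.0235, or about 0.047 when doubled for the two-sided alternative H1. We can therefore barely reject the null hypothesis at the 5% significance level, but we fail to reject it at the 1% significance level.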
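As a rough sketch of how such a percentage turns into an empirical p-value, the helper below (the name `empirical_p_values` is hypothetical) counts how many simulated outcomes the observed result beats. Doubling the smaller tail for the two-sided value is one common convention and an assumption on my part. The toy numbers stand in for `mc_simulation_results` and `first_trading_day_result`.

```python
import numpy as np

def empirical_p_values(simulated, observed):
    """Share of simulated outcomes beaten by `observed`, plus empirical p-values."""
    simulated = np.asarray(simulated, dtype=float)
    beaten = np.mean(simulated < observed)      # fraction the strategy outperforms
    p_upper = np.mean(simulated >= observed)    # one-sided: random does at least as well
    p_lower = np.mean(simulated <= observed)    # one-sided: random does at most as well
    p_two_sided = min(1.0, 2 * min(p_upper, p_lower))
    return beaten, p_upper, p_two_sided

# Hypothetical stand-ins for the simulation results and the observed result.
simulated = np.arange(1, 101)    # outcomes 1, 2, ..., 100
observed = 98.5

beaten, p_upper, p_two_sided = empirical_p_values(simulated, observed)
print(beaten, p_upper, p_two_sided)
```

In this toy case the observed value beats 98 of the 100 outcomes, so the one-sided p-value is 0.02 and the doubled two-sided value is 0.04.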

Sensitivity Analysis

Because Monte Carlo results depend on random draws, it is essential to check how the estimated p-value changes as the number of simulations increases or decreases.

Pseudo code:

for i in {1, …, M}:
    do Monte Carlo simulation for N times
    compute the test statistic

Repeating the Monte Carlo simulation M times in the outer loop (I set M = 100) shows that the confidence interval for the true probability of outperformance narrows as N, the number of simulations per run, increases.

Simulation times and its estimated mean. Image by Author.
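The pseudo code can be turned into a small runnable sketch. To keep it self-contained, it uses synthetic prices with 12 months of 20 trading days each. Those prices, the helper name `outperformance_share` and the chosen values of N are all assumptions, so the numbers it prints say nothing about VOO.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical monthly open prices: 12 months x 20 trading days (not VOO data).
monthly_opens = 100 + rng.normal(0, 2, size=(12, 20)).cumsum(axis=1)
last_price = monthly_opens[-1, -1]
monthly_pay = 1000

# Final value when always buying at the first open of each month.
first_day_value = np.sum(monthly_pay * last_price / monthly_opens[:, 0])

def outperformance_share(n_sim):
    """One Monte Carlo run: share of n_sim random paths beaten by the first-day rule."""
    n_months, n_days = monthly_opens.shape
    # pick one random trading day per month for each simulated path
    days = rng.integers(0, n_days, size=(n_months, n_sim))
    sampled = np.take_along_axis(monthly_opens, days, axis=1)
    values = np.sum(monthly_pay * last_price / sampled, axis=0)
    return np.mean(values < first_day_value)

M = 100  # outer repetitions
summary = {}
for N in (100, 1_000, 10_000):
    estimates = np.array([outperformance_share(N) for _ in range(M)])
    # mean and spread of the M estimates for this N
    summary[N] = (estimates.mean(), estimates.std(ddof=1))
    print(N, summary[N])
```

The spread of the M estimates should shrink as N grows. That shrinking spread corresponds to the narrowing interval in the chart.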

In the end, the first-trading-day strategy outperforms an average of 97.66% of the random sampling results, so the null hypothesis is rejected at the 5% significance level.

Finally…

Although the difference is statistically significant, the first-trading-day result beats the mean of the random strategy by only 0.7% in this empirical analysis. Still, it is interesting to run the experiment and test the hypothesis.
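For reference, a relative difference like this can be computed as the gap between the observed result and the simulated mean, divided by that mean. This is my assumed reading of the figure. The helper name `relative_edge` and the numbers below are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

def relative_edge(simulated, observed):
    """Relative gap between the observed result and the simulated mean."""
    baseline = np.mean(simulated)
    return (observed - baseline) / baseline

# Hypothetical numbers, chosen only to illustrate the calculation.
simulated = np.array([99_000.0, 100_000.0, 101_000.0])
observed = 100_700.0
print(f"{relative_edge(simulated, observed):.2%}")  # 0.70%
```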

It may be different for another time frame or another stock. I chose this as my friend started investing in it in 2019.

I welcome your comments on the analysis and the conclusion I have drawn. There may be some flaws, and I am happy to discuss them :D. Disclaimer: this is just a personal write-up. Please read with caution.
