Feature engineering for predicting stock price movement

Won Seob Seo · Published in Predict · Jan 12, 2019

Recently I participated in a Kaggle competition hosted by Two Sigma. It is about using market and news data to predict stock price movement over the next 10 days. The detailed explanations of each column are well written by the host and are helpful to read in order to understand what kind of features I am generating. You can also read the whole source code and explanation in my Kaggle kernel.

The first thing to note about this competition is that it is a kernel-only competition, so there are various resource limitations such as 16 GB of RAM and no GPU. That makes it practically impossible to use all the data and pile a bunch of new features on top of it, so naturally you limit yourself to more recent data rather than using data that is too old. Also, after a feature importance analysis, columns that don’t help should go away.

First of all, as with most time series data, it helps to create window (rolling) statistics features. Even in real life, a lot of investors check things like the highest/lowest price over the past year, or whether the current price is above or below its 1-month moving average, and so on (to enumerate a few: ‘Moving Average’, ‘Exponential Moving Average’, ‘Bollinger Bands’, ‘Relative Strength Index’ and ‘Volume Moving Average’). So it makes sense to create window statistics over 1 week, 2 weeks, 1 month, 1 year and so on. Use common and simple statistics like mean, median, max, min and exponentially weighted mean, or be creative and come up with something different. Generate these window features from the numeric columns. In practice I generated these lag features only on the market data, not the news data (if I had had more computational resources, I would have tried creating them from the news data as well).

BASE_FEATURES = ['returnsOpenPrevMktres10', 'returnsOpenPrevRaw10',
                 'open', 'close']

Also, some of the numbers don’t mean much without the context of their usual range and their ratio to the market mean (which also changes as time passes). What I mean is: if a stock’s opening price is 300 dollars today, is it oversold or overbought? If I were talking about Amazon, this would be insanely cheap, because nowadays (as of the beginning of 2019) it trades around $1,600. But if I were talking about Tesla, it would be only moderately cheap, since Tesla currently trades roughly in the $310 to $350 range. The same principle applies to many other features, such as raw returns (gain/loss that is not adjusted against any benchmark) and trading volume. So it makes sense to generate the ratio between these features and the market mean.

def add_market_mean_col(market_df):
    daily_market_mean_df = market_df.groupby('time').mean()
    daily_market_mean_df = daily_market_mean_df[['volume', 'close']]
    merged_df = market_df.merge(daily_market_mean_df, left_on='time',
                                right_index=True, suffixes=("", '_market_mean'))
    merged_df['volume/volume_market_mean'] = merged_df['volume'] / merged_df['volume_market_mean']
    merged_df['close/close_market_mean'] = merged_df['close'] / merged_df['close_market_mean']
    return merged_df.reset_index(drop=True)
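The function returns a new DataFrame rather than modifying the input in place, so the assumed usage (the exact call site is in the kernel) is simply:

market_train_df = add_market_mean_col(market_train_df)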

BASE_FEATURES = BASE_FEATURES + ['volume', 'volume/volume_market_mean', 'close/close_market_mean']

In a similar vein, the ratio of the opening price to the closing price should tell us more than the raw opening/closing prices alone.

def generate_open_close_ratio(df):
    df['open/close'] = df['open'] / df['close']

BASE_FEATURES = BASE_FEATURES + ['open/close']

Likewise, we can generate the ratio of the raw return values to the current opening/closing prices. Note that this does not duplicate the residual return columns, which are the returns after the movement of the market as a whole has been accounted for, leaving only the movements inherent to the instrument. The generated ratio is not adjusted for market movements; it is simply the price delta divided by the price.

open_raw_cols = ['returnsOpenPrevRaw1', 'returnsOpenPrevRaw10']
close_raw_cols = ['returnsClosePrevRaw1', 'returnsClosePrevRaw10']

def raw_features_to_ratio_features(df):
    for col in open_raw_cols:
        df[col + '/open'] = df[col] / df['open']
    for col in close_raw_cols:
        df[col + '/close'] = df[col] / df['close']

BASE_FEATURES = BASE_FEATURES + ['returnsClosePrevRaw1/close', 'returnsClosePrevRaw10/close', 'returnsOpenPrevRaw1/open', 'returnsOpenPrevRaw10/open']

The previously mentioned window statistics features are generated from the BASE_FEATURES we have gathered, and then merged back onto the market data on time and assetCode (which acts like the id column of the market data).

new_df = generate_features(market_train_df)
market_train_df = pd.merge(market_train_df, new_df, how = 'left', on = ['time', 'assetCode'])
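The generate_features function itself lives in the kernel; below is just a minimal sketch of what it could look like, assuming rolling windows of 5, 10 and 20 days and mean/median/max/min statistics (consistent with feature names such as ‘open/close_window_20_max’ that show up in the importance listing later).

def generate_features(market_df, windows=(5, 10, 20), features=BASE_FEATURES):
    # Hypothetical sketch: per-asset rolling statistics over the base features.
    df = market_df[['time', 'assetCode'] + list(features)].copy()
    df = df.sort_values(['assetCode', 'time'])  # rolling stats need time order within each asset
    grouped = df.groupby('assetCode')
    for window in windows:
        for col in features:
            for stat in ('mean', 'median', 'max', 'min'):
                df['{}_window_{}_{}'.format(col, window, stat)] = grouped[col].transform(
                    lambda s: s.rolling(window, min_periods=1).agg(stat))
    # keep only the merge keys and the new window columns
    return df.drop(columns=list(features))

The exponentially weighted mean mentioned earlier could be added in the same loop with s.ewm(span=window).mean().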

Because of the memory limitation I dropped many columns from the news data. However, considering that this competition is called “Using News to Predict Stock Movements”, I thought I should make use of the news data in some way. Thinking about the real world, a lot of traders (and machines) react to positive or negative news, so it makes sense that news affects stock prices. I decided that the most interesting columns for me are ‘sentimentClass’, ‘sentimentNegative’, ‘sentimentNeutral’ and ‘sentimentPositive’. ‘sentimentClass’ is counted (how many news items were neutral, positive or negative), and the rest are averaged (the mean neutralness and so on). Note that the ‘assetName’ column acts as the id column for the news data, while ‘assetCodes’ holds a list of asset codes, sometimes a few dozen per news item. This makes sense if we examine some examples of ‘assetCodes’ and ‘assetName’.

news_train_df.head(100)[['assetCodes', 'assetName']]

From the results I can see that a news item related to the asset name ‘Microsoft Corp’ has the related ‘assetCodes’ {‘MSFT.O’, ‘MSFT.F’, ‘MSFT.DE’, ‘MSFT.OQ’}. These codes are just Microsoft stock listed on different exchanges, so it is clear that if a news item is related to an asset name, all of the related asset codes will be affected. Merging on ‘assetName’ is also practically much easier, because it is a single name in both the market and the news data (in contrast, the market data has ‘assetCode’, which holds a single code, while the news data has ‘assetCodes’, which holds more than one). So let’s transform the sentiment columns accordingly and merge the market and news data.

def merge_with_news_data(market_df, news_df):
    news_df['firstCreated'] = news_df.firstCreated.dt.hour  # reduce firstCreated to its hour of day
    news_df['assetCodesLen'] = news_df['assetCodes'].map(lambda x: len(eval(x)))  # how many asset codes a news item mentions
    news_df['asset_sentiment_count'] = news_df.groupby(['assetName', 'sentimentClass'])['firstCreated'].transform('count')  # news count per asset name and sentiment class
    kcol = ['time', 'assetName']
    news_df = news_df.groupby(kcol, as_index=False).mean()  # average the numeric news columns per time and assetName
    market_df = pd.merge(market_df, news_df, how='left', on=kcol, suffixes=("", "_news"))
    return market_df

market_train_df = merge_with_news_data(market_train_df, news_train_df)

Now that I have all the features I want to use, let’s train some models. I used 3 models from 3 libraries: LightGBM, CatBoost and XGBoost. I simply thought it would be interesting to compare how each library prioritises features differently. All that is left is to train the models and check the feature importances. It is useful to know how many features we have and what the contribution of each column is on average.
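The training code itself is in the kernel; as a rough, hypothetical sketch (the hyperparameters and the exact feature selection are placeholders, not the ones I actually used), the LightGBM booster referred to as gbm below and the feature list fcol could be set up like this:

import lightgbm as lgb

# Sketch only: binarise the target (did the asset go up over the next 10 days?)
# and train on every generated column except identifiers and the target itself.
fcol = [c for c in market_train_df.columns
        if c not in ['time', 'assetCode', 'assetName', 'universe',
                     'returnsOpenNextMktres10']]
y = (market_train_df['returnsOpenNextMktres10'] > 0).astype(int)

train_data = lgb.Dataset(market_train_df[fcol], label=y)
gbm = lgb.train({'objective': 'binary', 'learning_rate': 0.05, 'num_leaves': 63},
                train_data, num_boost_round=200)
# In the kernel the asset code is also label encoded and used as a feature
# (the 'assetCodeT' column in the importance listing below); omitted here for brevity.

The same data can then be fed to XGBoost and CatBoost classifiers to compare how each library ranks the features.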

print("total features:", len(fcol), ", average:", 100/len(fcol))
=> total features: 270 , average: 0.37037037037037035

On average each feature contributed 0.37%. Let’s examine all the features and their importances as percentages.

def show_feature_importances(feature_importances):
    total_feature_importances = sum(feature_importances)
    assert len(feature_importances) == len(fcol)  # sanity check
    for score, feature_name in sorted(zip(feature_importances, fcol), reverse=True):
        print('{}: {}'.format(feature_name, score / total_feature_importances * 100))

show_feature_importances(gbm.feature_importance(importance_type='split'))
=> assetCodeT: 2.359697858792684
close_market_mean: 2.3284849241525687
volume_market_mean: 1.879018665334915
open/close_window_20_max: 1.4332979586740746
open/close_window_20_min: 1.4114489044259941
open/close_window_10_max: 1.3533928459953806
returnsOpenPrevMktres10_window_20_min: 1.33965915475373
...

It’s good to see that the features I created rank nicely among the most important ones. But it also seems that many features have low importance (some even 0). Let’s pick a number below the average importance (0.1%, for example) and use it as a threshold to collect all the low-importance features, so we can get rid of them in the next iteration or in production.

def get_non_important_features(feature_importances, threshold):
    total_feature_importances = sum(feature_importances)
    assert len(feature_importances) == len(fcol)  # sanity check
    return [feature_name for score, feature_name in sorted(zip(feature_importances, fcol), reverse=True)
            if (score * 100) / total_feature_importances < threshold]

non_features = get_non_important_features(gbm.feature_importance(importance_type='split'), threshold = 0.1)
print(len(non_features))
non_features
=> 123
['open_window_5_min',
'close/close_market_mean_window_10_median',
'open_window_20_median',
'close_window_5_min',
...

I think this is quite a good workflow for feature generation. In summary:

  1. Experiment making new features
  2. Train models using the features
  3. Check feature importances
  4. Filter out unimportant features
  5. Repeat 1–4 until you are happy with the features you have
  6. Use those features for production / submitting

Thank you so much for reading, and if you find any issues in my process or have tips, tricks & suggestions, please let me know. Ah, also, my Kaggle handle is wontheone1, in case you want to follow :)
