【Data Analysis(12)】Lasso Regression Model

TEJ 台灣經濟新報
TEJ-API Financial Data Analysis
6 min read · Apr 21, 2022

Effective Explanatory Variables for Economic Growth

Photo by Luke Chesser on Unsplash

Highlights

  • Difficulty: ★★☆☆☆
  • Apply the Lasso model to find variables that effectively explain economic growth
  • Reminder: We first select the data and pre-process it, then fit the model. This article does not cover the underlying mathematical theory; it only describes what the model does and what its results mean. Basic knowledge of statistics is nevertheless required. The Lasso model itself is introduced in the Preface.

Preface

Least Absolute Shrinkage and Selection Operator, or Lasso for short, is mainly used for variable selection and regularization in regression. Its “Penalty” setting lets us adjust the complexity of the model, which helps alleviate “Overfitting”.

The penalty determines the relative weight given to “Error” and “Number of Variables”. In other words, the model does not only try to minimize error; it also tries to use fewer variables so that it reaches an “adequate” complexity. Technically, Lasso penalizes the sum of the absolute values of the coefficients, which pushes the coefficients of less useful variables to exactly zero. With a small penalty parameter, the model favors “reducing error”. With a large penalty parameter, the model emphasizes “reducing the number of variables”.

Note: The penalty parameter must be greater than 0 for the model to favor fewer variables. In scikit-learn, the Python package used below, this parameter is named “alpha”.
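To make the effect of the penalty concrete, here is a minimal sketch on synthetic data. The data, the helper name count_selected and the alpha values are assumptions chosen for illustration; they are not the settings used later in this article.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (an assumption for illustration): 10 candidate variables,
# but only the first two actually drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

def count_selected(alpha):
    """Fit Lasso with the given penalty and count the non-zero coefficients."""
    model = Lasso(alpha=alpha).fit(X, y)
    return int(np.sum(model.coef_ != 0))

# A larger alpha leaves fewer variables with a non-zero coefficient.
for alpha in (0.001, 0.1, 10.0):
    print(f"alpha={alpha}: {count_selected(alpha)} variables kept")
```

Suitable alpha values depend on the scale of the data, so the values above should not be compared directly with the ones used on the macroeconomic data.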

Editing Environment and Modules Required

macOS & Jupyter Notebook

# Basic
import numpy as np
import pandas as pd
# Graph
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
# TEJ API
import tejapi
tejapi.ApiConfig.api_key = 'Your Key'
tejapi.ApiConfig.ignoretz = True

Database

Macroeconomics Data Explanation Table: Describes each recorded macroeconomic series. The code is “GLOBAL/ABMAR”.

Macroeconomics Data Table: Macroeconomic data from official government sources. Sources: the IMF, the OECD, and related professional publications. The code is “GLOBAL/ANMAR”.

Data Selection

Step 1. Import Basic Information of Data

factor = tejapi.get('GLOBAL/ABMAR',
                    opts={'columns': ['coid', 'mdate', 'cname', 'freq']},
                    chinese_column_name=True,
                    paginate=True)

Step 2. Select Specific Data

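# Selection: codes ('總經代碼') of quarterly ('Q') indicators whose name contains '台灣' (Taiwan)
list1 = [factor['總經代碼'][i] for i in range(len(factor))
         if '台灣' in factor.iloc[i, 2] and factor['頻率代碼'][i] == 'Q']
# Table: keep only the selected indicators and drop the columns no longer needed
factor = factor[factor['總經代碼'].isin(list1)].reset_index().drop(columns =['None', '目前狀態', '頻率代碼'])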

Because the macro indicators are extremely numerous and diverse, it is impossible to fit all of them in the model. We therefore consider only “Quarterly Data of Taiwan”.
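The selection in Step 2 walks through the rows one by one. The same kind of filter can also be written as a vectorized boolean mask. Below is a minimal sketch on a toy table; the column names code, name and freq and all rows are hypothetical stand-ins, not the actual TEJ columns.

```python
import pandas as pd

# Toy stand-in for the indicator table; column names and rows are hypothetical.
toy = pd.DataFrame({
    'code': ['A01', 'A02', 'B01', 'B02'],
    'name': ['台灣-GDP', '台灣-CPI', '美國-GDP', '台灣-出口'],
    'freq': ['Q', 'M', 'Q', 'Q'],
})

# Vectorized filter: the name contains '台灣' (Taiwan) and the frequency is quarterly.
mask = toy['name'].str.contains('台灣', regex=False) & (toy['freq'] == 'Q')
selected_codes = toy.loc[mask, 'code'].tolist()
print(selected_codes)
```

A mask does not depend on a hard-coded row count, so it keeps working when the number of indicators in the table changes.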

Step 3. Import Numeric Data

data = tejapi.get('GLOBAL/ANMAR',
                  mdate={'gte': '2008-01-01', 'lte': '2021-12-31'},
                  opts={'columns': ['coid', 'mdate', 'val', 'pfr']},
                  coid=list1,  # indicators that match the selection
                  chinese_column_name=True,
                  paginate=True)

Data Pre-processing

Step 1. Remove Forecast Data

data = data[data['預估(F)'] != 'F']

Step 2. Rearrange Table

data = data.set_index('年月')
df = {}
for i in list1:
    p = data[data['代碼'] == i]
    p = p['數值']
    df.setdefault(i, p)
df = pd.concat(df, axis = 1)

We first set “Year-Month” (年月) as the table index, then read the series of each indicator in turn. Finally, we assemble a new table in which each column records a different macro indicator.
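This rearrangement turns a long table (one row per date and indicator) into a wide one (one column per indicator). As an alternative to the loop, pandas can do this in one step with pivot. A minimal sketch on toy data, where the column names date, code and value and the numbers are hypothetical:

```python
import pandas as pd

# Toy long-format table; the column names and values are hypothetical.
long = pd.DataFrame({
    'date': ['2021-03', '2021-06', '2021-03', '2021-06'],
    'code': ['X1', 'X1', 'X2', 'X2'],
    'value': [1.0, 2.0, 10.0, 20.0],
})

# One row per date, one column per indicator code.
wide = long.pivot(index='date', columns='code', values='value')
print(wide)
```

Note that pivot raises an error when the same date and code pair appears more than once, so it assumes each indicator has at most one value per period.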

Step 3. Select Economic Growth Rate, Y

# Display all economic growth rate ('經濟成長率') indicators
growth_reference = [factor['總經代碼'][i] for i in range(len(factor)) if '經濟成長率' in factor.iloc[i, 1]]
factor[factor['總經代碼'].isin(growth_reference)]
# Select 'NE0904' (seasonally adjusted annualized economic growth rate) as Y
growth = df['NE0904']

Because Taiwan is export-oriented, its economic performance is easily affected by the global consumption cycle. We therefore choose “NE0904 - Seasonally Adjusted Annualized Rate (SAAR)” as the measure of economic growth.

# Remove economic growth data in df 
df = df.drop(columns = growth_reference)
# Remove nan
df = df.dropna(axis = 1, how = 'any')

Step 4. Stationarity Test

from statsmodels.tsa.stattools import adfuller

for i in df.columns.values:
    p_value = adfuller(df[i])[1]
    if p_value > 0.05:
        df = df.drop(columns = i)

df = df.dropna(axis = 1, how = 'any')
print('Number of explanatory variables:', len(df.columns))
print('ADF test p-value of the economic growth rate:', '{:.5f}'.format(adfuller(growth)[1]))

We run a stationarity test (the augmented Dickey-Fuller test) on each variable in a for loop and remove the variables that are non-stationary, that is, those with a p-value above 0.05. We do not apply differencing. After this step, 148 explanatory variables remain. Lastly, we run the same test on the economic growth rate. Its p-value is 0.0000, so we treat the series as stationary.

Model Construction

Step 1. Import Packages & Split Data

from sklearn import metrics  # needed for metrics.mean_squared_error in Step 2
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
df_train = df.head(45)
df_valid = df.tail(10)
growth_train = growth.head(45)
growth_valid = growth.tail(10)

Step 2. Model Fitting

Only the code for the “Big Alpha” model is shown here. For the code of the medium and small alpha models, please see “Source Code”.

# big alpha model
Lasso_l = Pipeline(steps = [('poly', PolynomialFeatures(degree = 1)), ('Lasso', Lasso(alpha = 1000))])
large = Lasso_l.fit(df_train, growth_train)
growth_pred_l = large.predict(df_valid)
large_alpha = list(growth_pred_l)
print('Big Alpha MSE:', metrics.mean_squared_error(growth_valid, large_alpha))

With 148 explanatory variables, we only consider the effect of each variable on its own, so we set the polynomial degree to 1. In addition, to make the model stricter, we try three levels of alpha: 10, 100 and 1000.

The MSE of each model is as follows:

Big Alpha MSE: 207.82

Medium Alpha MSE: 526.29

Small Alpha MSE: 1399.59

According to this comparison, the big alpha model outperforms the others. Next, we visualize the predictions on the validation dataset and select the final model.
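The comparison across alpha values can also be written as a loop. The sketch below is not the article's Source Code: it reuses the same Pipeline structure on synthetic stand-in data, and the helper name validation_mse and the alpha grid are assumptions scaled to that synthetic data.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumption): 55 periods, 20 candidate variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(55, 20))
y = 2.0 * X[:, 0] + rng.normal(size=55)

# Chronological split mirroring head(45) / tail(10).
X_train, X_valid = X[:45], X[45:]
y_train, y_valid = y[:45], y[45:]

def validation_mse(alpha):
    """Fit the degree-1 pipeline with the given alpha and score it on the validation rows."""
    model = Pipeline(steps=[
        ('poly', PolynomialFeatures(degree=1)),
        ('Lasso', Lasso(alpha=alpha)),
    ])
    model.fit(X_train, y_train)
    return mean_squared_error(y_valid, model.predict(X_valid))

# Illustrative alpha grid; suitable values depend on the scale of the data.
results = {alpha: validation_mse(alpha) for alpha in (0.01, 0.1, 1.0)}
best_alpha = min(results, key=results.get)
print(results)
print('alpha with the lowest validation MSE:', best_alpha)
```

Keeping the fitting code in one function ensures that every alpha is evaluated on exactly the same split.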

Model Comparison & Finding Effective Explanatory Variables

Step 1. Rearrange Table

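# small_alpha and medium_alpha are the predictions of the other two models (see “Source Code”)
# Column labels: 小/中/大Alpha預測值 = small/medium/big alpha prediction
pred_data = {'小Alpha預測值': small_alpha, '中Alpha預測值': medium_alpha, '大Alpha預測值':large_alpha}
result = pd.DataFrame(pred_data, index = growth_valid.index)
final = pd.concat([growth_valid, result], axis = 1)
# 實際經濟成長率 = actual economic growth rate
final = final.rename(columns={'NE0904':'實際經濟成長率'})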

Step 2. Visualization

# Use a font that can render the Chinese labels
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
plt.figure(figsize=(15,8))
plt.plot(final['實際經濟成長率'])
plt.plot(final['小Alpha預測值'])
plt.plot(final['中Alpha預測值'])
plt.plot(final['大Alpha預測值'])
plt.legend(('實際成長率', '小Alpha預測', '中Alpha預測', '大Alpha預測'), fontsize=16)

Based on the graph above, we can clearly compare the three models, big (red), medium (green) and small (orange) alpha, with the actual values (blue). The predictions of the big alpha model are closer to the actual values than those of the other two models. Hence, we use the big alpha model to find the variables that effectively explain the economic growth rate.

Step 3. Effective Variables

# Re-fit the model
lasso = Lasso(alpha = 1000)
mdl = lasso.fit(df_train,growth_train)
# Keep the variables whose coefficient is greater than 0
lasso_coefs = pd.Series(dict(zip(list(df_valid), mdl.coef_)))
coefs = pd.DataFrame(dict(Coefficient=lasso_coefs))
coid = coefs[coefs['Coefficient'] > 0].index
# Match the codes of the selected variables to find their Chinese names
factor[factor['總經代碼'].isin(coid)]

According to the table above, most of the selected variables are related to international trade or finance, which fits Taiwan's position as an export-oriented economy. In addition, one of the selected variables is the GDP of the education industry. This suggests that better education across the population benefits economic growth, so we should keep investing in the next generation.
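The selection above keeps only variables with a positive coefficient. Lasso can also leave variables with a negative, non-zero coefficient, which are dropped by that rule. A minimal sketch on synthetic data shows the difference between the two rules; the data, the variable names x0 to x4 and the alpha value are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic data (assumption): x0 raises y, x1 lowers y, x2..x4 are irrelevant.
rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f'x{i}' for i in range(5)])
y = 3.0 * X['x0'] - 3.0 * X['x1'] + rng.normal(scale=0.5, size=200)

mdl = Lasso(alpha=0.5).fit(X, y)
coefs = pd.Series(mdl.coef_, index=X.columns)

positive = list(coefs[coefs > 0].index)   # same rule as in the article
nonzero = list(coefs[coefs != 0].index)   # also keeps negative effects
print('coefficient > 0 :', positive)
print('coefficient != 0:', nonzero)
```

Which rule is appropriate depends on the question: the first lists variables that move with growth, while the second lists every variable the model kept.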

Conclusion

In this article, we first showed data selection and pre-processing, then fitted and compared the models, and finally identified the explanatory variables that effectively explain economic growth. We deliberately skipped advanced data transformation and differencing to keep the article concise. You do not have to follow our steps exactly, and we encourage you to try your own parameter settings; you will learn a lot through practice. Last but not least, if you are interested in building models but are concerned about the data source, you are welcome to purchase the plans offered in TEJ E Shop and use its comprehensive database to build your own model.

Source Code


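TEJ is Taiwan's largest local financial information company. Founded in 1990, it provides the information needed for fundamental analysis of financial markets, as well as solutions and consulting services for credit risk, regulatory technology, asset valuation, quantitative analysis, and ESG. As the field of finance grows more diverse and complex, TEJ brings together top talent from industry and academia and is committed to developing new technologies such as machine learning, artificial intelligence (AI), and natural language processing (NLP) to keep providing innovative services.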