【Data Analysis(12)】Lasso Regression Model

TEJ 台灣經濟新報

Published in

TEJ-API Financial Data Analysis

6 min read · Apr 21, 2022

Effective Explanatory Variables for Economic Growth

Highlights

Difficulty：★★☆☆☆
Apply the Lasso model to find the variables that effectively explain economic growth
Reminder：We first select the data and pre-process it, then fit the model. This article does not cover the mathematical theory; it only describes what the model does and what it means. A basic knowledge of statistics is assumed. The Lasso model itself is introduced in the Preface.

Preface

Least Absolute Shrinkage and Selection Operator, Lasso for short, is mainly used for variable selection and regularization in regression. Its “penalty” setting lets us adjust the complexity of the model, so Lasso helps alleviate “overfitting”.

The penalty determines the relative weight given to “error” versus “number of variables”. In other words, the model not only tries to minimize the error but also tries to use fewer variables, so that it reaches an “adequate” complexity. With a small penalty parameter, the model favors “reducing error”; with a large one, it emphasizes “reducing the number of variables”.

Note: The penalty parameter must be greater than 0 for the model to favor fewer variables. In the Python package, this parameter is named “alpha”.
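To make the effect of the penalty concrete, here is a minimal sketch on synthetic data (the data, the alpha values and the variable names are assumptions for illustration, not part of the TEJ dataset). It fits scikit-learn's Lasso at several alpha values and counts how many coefficients remain non-zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 60 samples, 20 candidate
# variables, of which only the first 3 actually drive the target.
X = rng.normal(size=(60, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=60)

# Count how many coefficients survive at each penalty strength.
nonzero_counts = {}
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))

for alpha, count in nonzero_counts.items():
    print(f"alpha={alpha}: {count} non-zero coefficients")
```

With a large enough alpha every coefficient is shrunk to zero, which is the "reducing the number of variables" end of the trade-off described above.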

Editing Environment and Modules Required

MacOS & Jupyter Notebook

# Basic
import numpy as np
import pandas as pd

# Graph
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

# TEJ API
import tejapi
tejapi.ApiConfig.api_key = 'Your Key'
tejapi.ApiConfig.ignoretz = True

Database

Macroeconomics Data Explain Table: Describes each macroeconomic series recorded in the database. Code is “GLOBAL/ABMAR”.

Macroeconomics Data Table: Macroeconomic data published by official government bodies. Sources: IMF, OECD and related professional publications. Code is “GLOBAL/ANMAR”.

Data Selection

Step 1. Import Basic Information of Data

factor = tejapi.get('GLOBAL/ABMAR',
                opts={'columns': ['coid','mdate', 'cname', 'freq']},
                chinese_column_name=True,
                paginate=True)

Step 2. Select Specific Data

# Selection: keep the codes whose name contains '台灣' (Taiwan) and whose frequency is quarterly ('Q')
list1 = list(factor['總經代碼'][i] for i in range(len(factor)) if '台灣' in factor.iloc[i,2] and factor['頻率代碼'][i] == 'Q')

# Table
factor = factor[factor['總經代碼'].isin(list1)].reset_index().drop(columns =['None', '目前狀態', '頻率代碼'])

Because the number of macro indicators is extremely large and diverse, it is impossible to fit all of them in one model. We therefore consider only “Quarterly Data of Taiwan”.
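The same kind of selection can also be written with pandas boolean masks instead of looping over row positions. The sketch below uses a small made-up table; the column '名稱', the codes and the rows are hypothetical stand-ins, and only '總經代碼' and '頻率代碼' follow the column names used in this article.

```python
import pandas as pd

# Toy stand-in for the explain table (rows and the '名稱' column are made up).
toy_factor = pd.DataFrame({
    '總經代碼': ['A001', 'A002', 'A003', 'A004'],
    '名稱': ['台灣-出口總額', '美國-失業率', '台灣-消費者物價指數', '台灣-貨幣供給'],
    '頻率代碼': ['Q', 'Q', 'M', 'Q'],
})

# Keep rows whose name mentions Taiwan and whose frequency is quarterly.
is_taiwan = toy_factor['名稱'].str.contains('台灣')
is_quarterly = toy_factor['頻率代碼'] == 'Q'
selected_codes = toy_factor.loc[is_taiwan & is_quarterly, '總經代碼'].tolist()

print(selected_codes)
```

A mask-based filter does not depend on knowing the number of rows in advance, which may help when the table grows over time.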

Step 3. Import Numeric Data

data = tejapi.get('GLOBAL/ANMAR',
                  mdate={'gte': '2008-01-01', 'lte':'2021-12-31'},
                  opts={'columns': ['coid','mdate', 'val', 'pfr']},
                  coid = list1, # indicators that match the selection
                  chinese_column_name=True,
                  paginate=True)

Data Pre-processing

Step 1. Remove Forecasting Data

data = data[data['預估(F)'] != 'F']

Step 2. Rearrange Table

data = data.set_index('年月')

df = {}
for i in list1:
    p = data[data['代碼'] == i]
    p = p['數值']
    df.setdefault(i, p)

df = pd.concat(df, axis = 1)

We first set “Year-Month” (年月) as the table index, then read each series in turn, and finally build a new table in which each column holds a different macro indicator.
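As an alternative sketch of the same long-to-wide reshaping, pandas' pivot can do it in one call. The toy table below is made up for illustration; only the column names '年月', '代碼' and '數值' are taken from the code above.

```python
import pandas as pd

# Toy long-format table: one row per (date, indicator code) pair.
long_table = pd.DataFrame({
    '年月': pd.to_datetime(['2021-03-01', '2021-06-01', '2021-03-01', '2021-06-01']),
    '代碼': ['A001', 'A001', 'A004', 'A004'],
    '數值': [1.0, 2.0, 10.0, 20.0],
})

# One row per date, one column per indicator code.
wide_table = long_table.pivot(index='年月', columns='代碼', values='數值')

print(wide_table)
```

Note that pivot raises an error if the same (date, code) pair appears more than once, so it assumes the forecast rows and any duplicates have already been removed.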

Step 3. Select Economic Growth Rate, Y

# Display all economic growth rate indexes ('經濟成長率' means economic growth rate)
growth_reference = list(factor['總經代碼'][i] for i in range(len(factor)) if '經濟成長率' in factor.iloc[i,1])
factor[factor['總經代碼'].isin(growth_reference)]

# Select 'NE0904-季節調整後年化經濟成長率' (seasonally adjusted annualized economic growth rate) as Y
growth = df['NE0904']

Because Taiwan is export-oriented, its economic performance is easily affected by the global consumption cycle. We therefore choose “NE0904-Seasonally Adjusted Annualized Rate (SAAR)” as the measure of economic growth.

# Remove economic growth data in df
df = df.drop(columns = growth_reference)

# Remove nan
df = df.dropna(axis = 1, how = 'any')

Step 4. Stationarity Test

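from statsmodels.tsa.stattools import adfuller

# Drop every column whose ADF p-value exceeds 0.05 (treated as non-stationary)
for i in df.columns.values:
    p_value = adfuller(df[i])[1]
    if p_value > 0.05:
        df = df.drop(columns = i)

df = df.dropna(axis = 1, how = 'any')

print('Number of explanatory variables:', len(df.columns))
print('P-value of the stationarity test on economic growth rate:', '{:.5f}'.format(adfuller(growth)[1]))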

We run a stationarity test (the ADF test from statsmodels) on each variable in a for loop and remove the non-stationary ones; we do not apply differencing. The number of explanatory variables that remain is 148. Finally, we run the same test on the economic growth rate, and its p-value is 0.0000.

Model Construction

Step 1. Import Packages & Split Data

from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

df_train = df.head(45)
df_valid = df.tail(10)

growth_train = growth.head(45)
growth_valid = growth.tail(10)

Step 2. Model Fitting

Only the code for the “Big Alpha” model is shown here. For the medium and small alpha models, please check “Source Code”.

from sklearn import metrics

# big alpha model
Lasso_l = Pipeline(steps = [('poly', PolynomialFeatures(degree = 1)), ('Lasso', Lasso(alpha = 1000))])
large = Lasso_l.fit(df_train, growth_train)
growth_pred_l = large.predict(df_valid)
large_alpha = list(growth_pred_l)

print('Big Alpha MSE:', metrics.mean_squared_error(growth_valid, large_alpha))

Because there are 148 explanatory variables, we consider only the effect of each variable on its own and set degree to 1. In addition, to make the model stricter, we try three levels of alpha: 10, 100 and 1000.

The MSE of each model is as follows:

Big Alpha MSE: 207.82
Medium Alpha MSE: 526.29
Small Alpha MSE: 1399.59

From the comparison above, the big alpha model outperforms the others, since it has the lowest MSE. Next, we visualize the validation set and select the final model.
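The three-alpha comparison can be sketched as a single loop. This is not the article's source code: it runs on synthetic data, and the names X, y, mse_by_alpha and predictions are assumptions for illustration. The features are scaled by 1000 only so that alphas of 10, 100 and 1000 penalize the model to different degrees, so the ranking of the models here says nothing about the ranking reported above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)

# Synthetic stand-ins for df and growth: 55 quarters, 8 large-scale variables.
X = pd.DataFrame(1000.0 * rng.normal(size=(55, 8)),
                 columns=[f'x{i}' for i in range(8)])
y = pd.Series(0.002 * X['x0'] - 0.001 * X['x1'] + rng.normal(scale=0.3, size=55))

# Same split sizes as in the article: first 45 rows train, last 10 validate.
X_train, X_valid = X.head(45), X.tail(10)
y_train, y_valid = y.head(45), y.tail(10)

mse_by_alpha = {}
predictions = {}
for label, alpha in [('small', 10), ('medium', 100), ('large', 1000)]:
    pipe = Pipeline(steps=[
        ('poly', PolynomialFeatures(degree=1)),
        ('lasso', Lasso(alpha=alpha, max_iter=10_000)),
    ])
    pipe.fit(X_train, y_train)
    predictions[label] = pipe.predict(X_valid)
    mse_by_alpha[label] = mean_squared_error(y_valid, predictions[label])

for label, mse in mse_by_alpha.items():
    print(f"{label} alpha MSE: {mse:.4f}")
```

Keeping the predictions in a dictionary makes it straightforward to place them side by side with the actual values afterwards.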

Model Comparison & Finding Effective Explanatory Variables

Step 1. Rearrange Table

pred_data = {'小Alpha預測值': small_alpha, '中Alpha預測值': medium_alpha, '大Alpha預測值':large_alpha}
result = pd.DataFrame(pred_data, index = growth_valid.index)
final = pd.concat([growth_valid, result], axis = 1)
final = final.rename(columns={'NE0904':'實際經濟成長率'})

Step 2. Visualization

# Let matplotlib display Chinese characters
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']

plt.figure(figsize=(15,8))

plt.plot(final['實際經濟成長率'])
plt.plot(final['小Alpha預測值'])
plt.plot(final['中Alpha預測值'])
plt.plot(final['大Alpha預測值'])

plt.legend(('實際成長率', '小Alpha預測', '中Alpha預測', '大Alpha預測'), fontsize=16)

In the graph above, we can compare the three models, big (red), medium (green) and small (orange) alpha, with the actual values (blue). The predictions of the big alpha model are closer to the actual values than those of the other two models. Hence, we use the big alpha model to find the effective variables for explaining the economic growth rate.

Step 3. Effective Variables

# Re-fitting the model
lasso = Lasso(alpha = 1000)
mdl = lasso.fit(df_train,growth_train)

# Display variables whose coefficient is larger than 0
lasso_coefs = pd.Series(dict(zip(list(df_valid), mdl.coef_)))
coefs = pd.DataFrame(dict(Coefficient=lasso_coefs))
coid = coefs[coefs['Coefficient'] > 0].index

# Match the code of the selected variables to find their Chinese names
factor[factor['總經代碼'].isin(coid)]

According to the output above, most of the selected variables are related to international trade and finance, which matches Taiwan's position as an export-oriented economy. In addition, one of the variables is the GDP of the education industry. This suggests that improving education among the population may benefit economic growth, so continued investment in cultivating the next generation deserves attention.
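The filter in the step above keeps only coefficients greater than 0. Lasso can also keep variables with negative coefficients, so a reader who wants every selected variable may prefer to filter on non-zero values. The sketch below shows the difference on synthetic data; the column names, the alpha value and the data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)

# Synthetic data with hypothetical variable names: the target rises with
# 'exports' and falls with 'rate'; the other columns are irrelevant.
X = pd.DataFrame(rng.normal(size=(80, 6)),
                 columns=['exports', 'imports', 'm2', 'cpi', 'rate', 'noise'])
y = 2.0 * X['exports'] - 1.5 * X['rate'] + rng.normal(scale=0.2, size=80)

model = Lasso(alpha=0.5).fit(X, y)
coefs = pd.Series(model.coef_, index=X.columns)

selected = coefs[coefs != 0]      # every variable Lasso kept, with its sign
positive_only = coefs[coefs > 0]  # the stricter filter used in the step above

print(selected)
print(positive_only)
```

Reading the sign next to each name also shows whether a selected variable moves with or against the target.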

Conclusion

In this article, we first showed data selection and pre-processing, then fitted and compared the models, and finally found the effective explanatory variables for economic growth. We skipped advanced data transformation and differencing to keep the article concise. You do not have to follow our steps exactly, and we encourage you to try your own parameter settings; you will learn a lot through practice. Last but not least, if you are interested in model construction but are concerned about the data source, you are welcome to purchase the plans offered in TEJ E Shop and use its comprehensive database to build your own model.