Application of Mixed Integer Quadratic Programming (MIQP) in Feature Selection

Ping Zhang
Jan 15, 2022 · 11 min read


Linear regression is a supervised learning algorithm used to predict a quantitative response. It assumes a linear relationship between the feature vector xi ∈ R^d and the response yi ∈ R. One of the most common problems in predictive analytics is variable selection for regression. Since Ordinary Least Squares (OLS) rarely yields estimates that are exactly zero, it cannot by itself discard the features that are not associated with the response, so we need to resort to feature selection methods. Popular methods include:

· Subset selection, e.g. stepwise selection.
· Dimensionality reduction, e.g. principal component analysis.
· Shrinkage, e.g. the Lasso.

Direct variable selection using optimization was long dismissed by the statistics/analytics community because of computational difficulties; this computational issue was part of the motivation for the development of LASSO and ridge regression. In recent years, however, there have been tremendous advancements in optimization software such as Gurobi, and in particular in the ability to solve mixed integer quadratic programming (MIQP) problems.

In this article, we will pose the variable selection problem for regression as an MIQP, and compare it with LASSO. We will discuss the advantages and drawbacks of each method, and make recommendations on which method is better for which scenario.

Please kindly note that the modeling example in this article may require you to have some knowledge about building mathematical optimization models with the Gurobi Python API.

Approaches

In our experimentation, we use two data sets, each containing X and y data: a training set and a test set. We first run 10-fold cross-validation on the training set to pick k (for MIQP) or λ (for LASSO). With the optimal value of k or λ, we then fit the βs on the entire training set. Finally, we use those βs to predict the y values on the test set and compare our predictions to the true y values.

To make our results fully reproducible, all the relevant source code and datasets have been made public on our GitHub.
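For reference, the code snippets below share a common setup roughly along the following lines. This is only a sketch: the file names, the response column name ('y'), the big-M value, and the solver time limit are assumptions for illustration, not the exact values from our repository.

import numpy as np
import pandas as pd
import gurobipy as gp
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

# Load the training and test sets (file names are placeholders)
df_train = pd.read_csv('training_data.csv')
df_test = pd.read_csv('test_data.csv')
X_train, y_train = df_train.drop(columns='y').values, df_train['y'].values   # 'y' column name is an assumption
X_test, y_test = df_test.drop(columns='y').values, df_test['y'].values

# Shared constants used by the snippets below
m = X_train.shape[1]            # number of candidate features (50 here)
M = 100                         # big-M bound; assumed large enough that no optimal |beta_j| reaches it
time_limit = 3600               # Gurobi time limit per solve, in seconds (assumption)
k_all = list(range(5, 51, 5))   # candidate values of k: 5, 10, ..., 50

# 10-fold cross-validation splitter shared by both approaches
kf = KFold(n_splits=10, shuffle=True, random_state=1)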

Approach Ⅰ: Direct Variable Selection — MIQP

MIQP is an optimization method that minimizes the ordinary least squares loss function subject to additional constraints. A big-M constraint with binary variables decides whether each variable is included in or excluded from the model, and a budget constraint with the hyperparameter k caps the maximum number of variables that may be included.

We now present an MIQP formulation that finds the weight estimates for a linear regression problem, where at most k of those weights can be nonzero.

Parameters

(1) k: Maximum number of features to include in the model, ignoring the intercept. There are 50 X variables, so we will try k = 5, 10, 15, 20, 25, 30, 35, 40, 45, 50.

(2) Q matrix: Quadratic component of the objective function. Here it is a (2m+1) × (2m+1) matrix, where m is the number of features; its upper-left (m+1) × (m+1) block equals XᵀX (with X augmented by a column of ones for the intercept), and all other entries are zero.

(3) c: Linear component of the objective function. Here it is a (2m+1) × 1 vector whose first (m+1) components are −2yᵀX, and the rest are zeros.

Decision Variables

(1) βj: Continuous variables; the weight of feature Xj, representing the change in the response per unit-change of feature Xj (β0 is the intercept).

(2) zj: Binary variables; 1 if feature Xj is included in the model (βj is allowed to be nonzero), and 0 if βj is forced to exactly zero. Auxiliary variables used to manage the budget constraint.

Objective Function

· Training error: Minimize the Sum of Squared Errors (SSE):

SSE = Σi (yi − β0 − Σj βj xij)²

Using a few tricks from linear algebra, we can pose the objective in the quadratic form that Gurobi expects,

βᵀ(XᵀX)β − 2(yᵀX)β,

where X is augmented with a column of ones for the intercept; this is exactly the Q and c defined above (the constant yᵀy is dropped because it does not affect the optimal solution).

Constraints

(1) Big-M linking: If zj is zero, the corresponding βj is forced to zero; otherwise βj can take any value between −M and M, i.e. −M·zj ≤ βj ≤ M·zj. We choose M large enough that no optimal value of βj equals M or −M.

(2) Budget constraint: The number of nonzero features should not exceed k, i.e. z1 + z2 + … + zm ≤ k.
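Putting the objective and constraints together, the full MIQP can be written (in LaTeX, following the notation above) as:

\min_{\beta,\, z} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{m} \beta_j x_{ij} \Big)^2
\quad \text{s.t.} \quad -M z_j \le \beta_j \le M z_j, \;\; j = 1, \dots, m; \qquad \sum_{j=1}^{m} z_j \le k; \qquad z_j \in \{0, 1\}.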

Based on the above, we set up the MIQP model as below. This function returns the intercept and all m weights selected by the MIQP model; at most k of those weights are nonzero.

def beta(X, y, k):
    """Solve the MIQP for a given k and return the intercept plus the m feature weights."""
    m = X.shape[1]                                    # number of candidate features
    new_column = [1] * len(X)
    X_addConst = np.insert(X, 0, new_column, axis=1)  # prepend a column of ones for the intercept

    # define quadratic term Q: the upper-left (m+1)x(m+1) block is X'X, everything else is zero
    Q = np.zeros((2*m+1, 2*m+1))
    Q[0:m+1, 0:m+1] = np.transpose(X_addConst) @ X_addConst

    # define linear term c: the first (m+1) entries are -2y'X, the rest are zeros
    X_expand = np.zeros((len(X_addConst), 2*m+1))
    X_expand[0:len(X_addConst), 0:m+1] = X_addConst
    c = -2 * np.transpose(y) @ X_expand

    # define left-hand side of the constraint matrix A
    # rows 0..m-1:   beta_j + M*z_j >= 0   (i.e. beta_j >= -M*z_j)
    # rows m..2m-1:  beta_j - M*z_j <= 0   (i.e. beta_j <=  M*z_j)
    # last row:      z_1 + ... + z_m <= k  (budget constraint)
    A = np.zeros((2*m+1, 2*m+1))
    A[:, 0] = 0                                       # the intercept is left unconstrained
    A[0:m, 1:m+1] = np.diag(np.ones(m))
    A[0:m, m+1:2*m+1] = np.diag(np.ones(m)) * M
    A[m:2*m, 1:m+1] = np.diag(np.ones(m))
    A[m:2*m, m+1:2*m+1] = np.diag(np.ones(m)) * (-M)
    A[2*m:2*m+1, m+1:2*m+1] = [1] * m

    # define right-hand side vector b
    b = np.array([0] * (2*m+1))
    b[-1] = k

    # define lower bounds: free intercept, betas bounded below by -M
    lb = np.array([-np.inf] + [-M]*m + [-np.inf]*m)

    # define constraint senses: the first m rows are >=, the remaining m+1 rows are <=
    sense = np.array(['>']*m + ['<']*(m+1))

    # set up and solve the Gurobi model
    MIQPMod = gp.Model()
    MIQPMod_x = MIQPMod.addMVar(len(Q), lb=lb, vtype=['C']*(m+1) + ['B']*m)
    MIQPMod_con = MIQPMod.addMConstrs(A, MIQPMod_x, sense, b)
    MIQPMod.setMObjective(Q, c, 0, sense=gp.GRB.MINIMIZE)

    MIQPMod.Params.TimeLimit = time_limit             # cap the solve time (seconds)
    MIQPMod.optimize()

    coef = MIQPMod_x.x                                # optimal values of [beta_0..beta_m, z_1..z_m]

    # return the intercept and the m feature weights (the binary z's are dropped)
    return coef[0:m+1]
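As a quick usage sketch (a hypothetical call, assuming the setup above), the helper can be run once on the full training set:

# Fit the MIQP once on the entire training data with, say, k = 10
coef_k10 = beta(X_train, y_train, 10)
print('Intercept:', coef_k10[0])
print('Indices of nonzero feature weights:', np.flatnonzero(coef_k10[1:]))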

Notice that the training error decreases monotonically as more features are allowed (which amounts to relaxing the MIQP), but we also need to find the value of k that maximizes performance on unseen observations, so it is advisable to estimate the SSE via cross-validation. We randomly shuffle the data, split it into 10 folds, and run 10-fold cross-validation for every candidate value of k. The code below shows how we calculate the SSE and perform the cross-validation.

def least_squares(coef, X, y):
    """Sum of squared errors of the predictions X @ coef against y."""
    new_column = [1] * len(X)
    X = np.insert(X, 0, new_column, axis=1)   # prepend the intercept column
    residual = X @ coef - y
    return np.transpose(residual) @ residual

def cross_val(X, y, k):
    """Total 10-fold cross-validation SSE of the MIQP model for a given k."""
    accuracies = []
    df = pd.DataFrame(columns=['coef'])
    df['coef'] = list(df_train.columns)       # row labels for the per-fold coefficient table
    i = 1
    for train_index, holdout_index in kf.split(X):
        X_train, X_holdout = X[train_index], X[holdout_index]
        y_train, y_holdout = y[train_index], y[holdout_index]
        coef = beta(X_train, y_train, k)                    # fit the MIQP on nine folds
        sse = least_squares(coef, X_holdout, y_holdout)     # evaluate on the held-out fold
        df[str(i)] = coef
        i = i + 1
        accuracies.append(sse)
    sum_sse = sum(accuracies)
    df.set_index(['coef'], inplace=True)
    return sum_sse, df, accuracies

Next, we pick the value of k with the smallest cross-validation error. After running the MIQP model for several hours, the best k based on 10-fold cross-validation is 10, with a corresponding SSE of 724.7876.

keys = []
values = []
coefs = []                 # ten sets of coefficients for every k
accuracies = []            # ten per-fold SSEs for every k
results = {}               # k and its corresponding cross-validation SSE
for k in k_all:
    sse, coef, acc = cross_val(X_train, y_train, k)
    keys.append(k)
    values.append(sse)
    coefs.append(coef)
    accuracies.append(acc)
results['k'] = keys
results['sse'] = values
df = pd.DataFrame(results)                 # convert the dictionary into a dataframe
best_result = df.loc[df['sse'].idxmin()]   # best k based on minimum cross-validation SSE
print('The best chosen k based on 10-fold cross-validation is', best_result['k'])
print('Corresponding SSE is', best_result['sse'])

Approach Ⅱ: Indirect Variable Selection — LASSO

LASSO regression is a form of linear regression that uses shrinkage to prune variables from the model while minimizing the loss function. The hyperparameter lambda (λ) controls the amount of shrinkage: the larger λ grows, the more coefficients are forced to exactly zero. Here is the formula:

minimize Σi (yi − β0 − Σj βj xij)² + λ Σj |βj|

We first set λ to 100 values logarithmically spaced over the interval [0.001, 1000]. Next, we do 10-fold cross-validation on the training set to pick λ. Running LASSO regression on the same dataset takes only seconds. The best λ based on 10-fold cross-validation is 0.0756, with a corresponding SSE of 695.00097.

# Generate 100 values of alpha (lambda), log-spaced between 10^3 and 10^-3
alphas = 10**np.linspace(3, -3, 100)

# Compute the SSE over all 10 folds for LASSO at a given alpha
def cv_accuracy_score_lasso(X, y, alpha):
    model = Lasso(alpha=alpha)
    accuracies = []
    for train_index, holdout_index in kf.split(X):
        X_train, X_holdout = X[train_index], X[holdout_index]
        y_train, y_holdout = y[train_index], y[holdout_index]

        model.fit(X_train, y_train)                            # fit on nine folds

        sse = sum((y_holdout - model.predict(X_holdout))**2)   # SSE on the held-out fold
        accuracies.append(sse)

    return sum(accuracies)

keys = []
values = []
results = {}               # alpha and its corresponding cross-validation SSE
for alpha in alphas:
    keys.append(alpha)
    values.append(cv_accuracy_score_lasso(X_train, y_train, alpha))

results['alpha_value'] = keys
results['sse_value'] = values
df_lr = pd.DataFrame(results)                          # convert the dictionary into a dataframe
best_alpha = df_lr.loc[df_lr['sse_value'].idxmin()]    # best alpha based on minimum cross-validation SSE
print('The best chosen alpha based on 10-fold cross-validation is', best_alpha['alpha_value'])
print('Corresponding SSE is', best_alpha['sse_value'])

Evaluation on the Test Set

Direct Variable Selection — MIQP:

After finding the best k, we fit the MIQP model on the entire training set using k = 10 and obtain the intercept and coefficients shown below:

Finally, we use the above intercept and βj obtained from MIQP to make a prediction of the y values in the test set. MIQP’s SSE on the test set is 116.8272.
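A minimal sketch of this final step, reusing the beta() and least_squares() helpers defined earlier (and the assumed X_test/y_test from the setup):

# Refit the MIQP model on the full training set with the chosen k
best_k = 10
coef_miqp = beta(X_train, y_train, best_k)

# Evaluate on the held-out test set
test_sse_miqp = least_squares(coef_miqp, X_test, y_test)
print('MIQP test SSE:', test_sse_miqp)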

Indirect Variable Selection — LASSO:

After finding the best λ, we fit the LASSO regression on the entire training set using λ = 0.0756 and obtain the intercept and coefficients shown below. As we can see, LASSO selects 17 variables, which include the 10 variables selected by the MIQP model, though with different coefficients.

Finally, we use the above intercept and βj obtained from LASSO to make a prediction of the y values in the test set. LASSO’s SSE on the test set is 117.4688.
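The corresponding sketch for LASSO, under the same assumed setup:

# Refit LASSO on the full training set with the chosen lambda
best_lambda = 0.0756
lasso_final = Lasso(alpha=best_lambda).fit(X_train, y_train)

# Evaluate on the held-out test set and count the selected variables
test_sse_lasso = sum((y_test - lasso_final.predict(X_test))**2)
print('LASSO test SSE:', test_sse_lasso)
print('Number of selected variables:', int((lasso_final.coef_ != 0).sum()))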

In-Sample Comparison

Comparing the in-sample SSE and coefficients returned by the MIQP model and LASSO regression, we can see that:

(1) Generally, the MIQP model has a lower in-sample SSE than LASSO regression.

(2) Regarding the relationship between training error and the number of features, MIQP and LASSO show different trends. For MIQP, as k decreases toward 10, the training error (SSE) mostly decreases; for LASSO, as λ increases from 0.0756, the training error (SSE) increases.

(3) The process of feature selection is smoother with LASSO, because LASSO gradually shrinks a coefficient toward 0 instead of directly eliminating a variable.

(4) With this dataset, MIQP selects 10 variables with a training SSE of 724.7876, while LASSO selects 17 variables with a training SSE of 695.00097. Since these training errors are close, MIQP achieves comparable performance with fewer features.

Out-of-Sample Comparison

According to the out-of-sample SSE returned by the MIQP model and LASSO regression, MIQP is marginally better than LASSO. We believe this is because shrinking the βs adds bias to the estimates. Furthermore, MIQP achieves this performance with fewer features, which is convenient because it leads to a more interpretable model.

On the other hand, LASSO is not scale-invariant because its budget constraint is based on the L1-norm. Remember that βj is interpreted as the change in the response per unit-change of feature Xj. Since the L1-norm takes the sum of absolute values, how much of the budget βj consumes depends on the units of measurement of the feature associated with it.
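As a quick illustration of this scale sensitivity, the sketch below (synthetic data and an arbitrary rescaling factor, purely for illustration) refits LASSO after expressing one feature in different units and prints which variables survive:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, 0.0, 1.5, 0.0, 2.0]) + rng.normal(scale=0.5, size=200)

# Fit LASSO on the original features, then again after expressing
# feature 0 in different units (dividing its column by 1000)
lasso_raw = Lasso(alpha=0.5).fit(X_demo, y_demo)
X_rescaled = X_demo.copy()
X_rescaled[:, 0] /= 1000
lasso_rescaled = Lasso(alpha=0.5).fit(X_rescaled, y_demo)

print('Nonzero coefficients before rescaling:', np.flatnonzero(lasso_raw.coef_))
print('Nonzero coefficients after rescaling: ', np.flatnonzero(lasso_rescaled.coef_))

A cardinality constraint like the one in our MIQP only counts nonzero weights, so a unit change of this kind does not affect which variables it can select.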

However, MIQP is scale-invariant and does not add bias to the weight estimates. Furthermore, this approach is amenable to the specification of additional linear constraints [1], such as:

· Enforcing group sparsity among features.

· Limiting pairwise multicollinearity.

· Limiting global multicollinearity.

· Considering a fixed set of nonlinear transformations.

Conclusion

Based on the above, we have shown how optimization can be used to perform feature selection for linear regression problems. As the computational cost of direct variable selection has decreased with the advent of better solvers, it has become a genuine alternative to LASSO, given that MIQP is scale-invariant and does not introduce bias into the weight estimates. We would therefore recommend MIQP if a roughly 3-hour solve time for datasets of this size is acceptable, that is, if we have sufficient computing power or are willing to invest in it for better results.

MIQP handles higher-dimensional data better, tends not to suffer from collinearity, and delivers better performance, but it takes far longer to run than LASSO. Even if our usual tasks involve smaller, lower-dimensional datasets, we would recommend keeping MIQP in the toolbox in case we encounter higher-dimensional tasks or need stronger performance.

However, if we want prompt results, are willing to settle for decent solutions with limited computing resources, or typically work with lower-dimensional data, there may be no need to adopt MIQP, especially if the adoption is too costly or time-consuming. Since LASSO can solve such a task in minutes while MIQP takes hours, LASSO is also better suited to streaming applications.

To sum up, our advice echoes the adage that "there is no free lunch in statistics": no algorithm outperforms all others on all possible datasets, so we should consider multiple learning algorithms when analyzing a dataset.

Acknowledgments

We would like to give a special “Thank You” to Dr. Daniel Mitchell for his guidance throughout this analysis.

References

[1] Bertsimas, D., & King, A. (2015). OR forum — An algorithmic approach to linear regression. Operations Research, 64(1), 2–16.

Github Links
