Scikit-Learn: A silver bullet for basic machine learning

Manikandan Jeeva
Oct 6, 2018
[Figure: Scikit-Learn general process flow]

Most Scikit-Learn modules follow the same pattern: instantiate an estimator with its hyperparameters, fit it on training data, then predict on or transform new data.
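As a minimal sketch of that shared estimator API (the iris data and LogisticRegression here are illustrative stand-ins, not from the original article):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

clf = LogisticRegression(max_iter=1000)  # 1. instantiate with hyperparameters
clf.fit(X, y)                            # 2. fit on training data
predictions = clf.predict(X)             # 3. predict on (new) data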

What this package is not meant to be

Example of a prediction problem using the built-in breast cancer dataset

[Figure: Cancer dataset pairplot using seaborn]
[Figure: Output dataset generated by the make_blobs sample generator]
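The loading step itself is not shown above; the following sketch sets up the names used below, assuming the standard load_breast_cancer loader:

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer_data_dict = load_breast_cancer()  # Bunch exposing .data, .target, .feature_names
cancer_data_pd = pd.DataFrame(cancer_data_dict.data,
                              columns=cancer_data_dict.feature_names)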
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    cancer_data_pd[cancer_data_dict.feature_names],
    cancer_data_dict['target'],
    test_size=0.20,
    stratify=cancer_data_dict['target'],
    random_state=111,
    shuffle=True)
INFO - X_train.shape : (455, 30)
INFO - X_test.shape : (114, 30)
INFO - y_train.shape : (455,)
INFO - y_test.shape : (114,)
from sklearn.dummy import DummyClassifier
from sklearn.metrics import make_scorer

dummy_classifier = DummyClassifier(strategy="most_frequent")

def cost_accuracy(actual, prediction):
    """Custom accuracy cost function to be used in the scorer."""
    # accuracy = correct predictions / total predictions
    assert len(actual) == len(prediction)
    return round(np.sum(actual == prediction) / len(actual), 4)

accuracy_scorer = make_scorer(cost_accuracy, greater_is_better=True)
from sklearn import model_selection

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=111)  # random_state requires shuffle=True
results = model_selection.cross_val_score(dummy_classifier, X_train, y_train, cv=kfold, scoring=accuracy_scorer)
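To summarize the ten fold scores into a single baseline figure (a usage sketch, assuming the results array above):

print(f"Baseline accuracy: {results.mean():.4f} (+/- {results.std():.4f})")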
from sklearn import preprocessing

minmax_scaler = preprocessing.MinMaxScaler()  # scales each feature to the [0, 1] range
minmax_scaler = minmax_scaler.fit(X_train)
scaled_X_train = pd.DataFrame(minmax_scaler.transform(X_train), columns=X_train.columns)
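A quick sanity check that every feature now lies in the [0, 1] range (assuming scaled_X_train from above):

print(scaled_X_train.min().min(), scaled_X_train.max().max())  # expected: 0.0 1.0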
from sklearn.feature_selection import SelectKBest, f_classif

selectKbest_est = SelectKBest(f_classif, k=8)  # keep the 8 features with the highest ANOVA F-score
selectKbest_X_train = selectKbest_est.fit_transform(X_train, y_train)
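To recover which eight columns were kept, the fitted selector exposes a boolean mask (a usage sketch):

selected_columns = X_train.columns[selectKbest_est.get_support()]
print(list(selected_columns))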
poly = preprocessing.PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)

X_train_poly = poly.fit_transform(X_train)
X_train_p2 = pd.DataFrame(X_train_poly,
                          columns=poly.get_feature_names(X_train.columns))  # get_feature_names_out in scikit-learn >= 1.0
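With 30 input features, a degree-2 expansion produces 30 linear terms, 30 squares, and 435 pairwise interactions, so 495 columns in total (a quick check):

print(X_train_p2.shape)  # expected: (455, 495)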
from sklearn.decomposition import KernelPCA

kernel_param = ('rbf', 1)  # (kernel name, gamma)
kpca = KernelPCA(n_components=4,
                 kernel=kernel_param[0],
                 gamma=kernel_param[1],
                 fit_inverse_transform=True,
                 random_state=111)
kpca.fit(scaled_X_train)  # the input has to be scaled first
kpca_X_train = kpca.transform(scaled_X_train)
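Since fit_inverse_transform=True was requested, the four-component projection can be mapped back to the scaled feature space to gauge how much information the reduction lost (a usage sketch):

reconstructed = kpca.inverse_transform(kpca_X_train)
print(np.mean((scaled_X_train.values - reconstructed) ** 2))  # mean squared reconstruction error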
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

tuning_parameters = [{'n_estimators': [1, 10],
                      'max_depth': [10, 20],
                      'max_features': [0.80, 0.40],
                      'random_state': [111]}]
clf = GridSearchCV(RandomForestClassifier(),
                   tuning_parameters,
                   cv=5,
                   scoring=accuracy_scorer)
clf.fit(X_train, y_train)
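Once fitted, the search object reports the winning hyperparameters and their mean cross-validated score (a usage sketch):

print(clf.best_params_)
print(clf.best_score_)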
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnTypeFilter(BaseEstimator, TransformerMixin):
    """Custom transformer that keeps all columns of a given dtype in a pandas DataFrame."""

    def __init__(self, dtype):
        self.dtype = dtype

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X.select_dtypes(include=[self.dtype])

ctf = ColumnTypeFilter(np.number)
ctf.fit_transform(X_train).head()
from sklearn.pipeline import make_pipeline, FeatureUnion

custom_pipeline = make_pipeline(
    FeatureUnion(transformer_list=[
        ('StdScl', make_pipeline(
            ColumnTypeFilter(np.number),
            preprocessing.StandardScaler()
        )),
        ('MMScl', make_pipeline(
            ColumnTypeFilter(np.number),
            preprocessing.MinMaxScaler()
        ))
    ])
)
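The FeatureUnion stacks the two scaled copies of the numeric columns side by side, so the 30 input features come out as 60 (a quick check, assuming the variables above):

print(custom_pipeline.fit_transform(X_train).shape)  # expected: (455, 60)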
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=1000)  # assumption: 'lr' was defined earlier in the article as a logistic regression

ensemble_clf = VotingClassifier(estimators=[
    ('dummy', dummy_classifier),
    ('logistic', lr),
    ('rf', RandomForestClassifier())],
    voting='soft')
ensemble_clf.fit(X_train, y_train)
ensemble_clf_accuracy_ = cost_accuracy(y_test, ensemble_clf.predict(X_test))
baby_names = ['Ava', 'Lily', 'Noah', 'Jacob', 'Mia', 'Sophia']
X_train_list = [np.random.choice(baby_names) for i in range(40)]
X_test_list = [np.random.choice(baby_names) for i in range(6)]

bb_labelencoder = preprocessing.LabelEncoder()
bb_labelencoder.fit(X_train_list)
bb_encoded = bb_labelencoder.transform(X_test_list)

bb_onehotencoder = preprocessing.OneHotEncoder(sparse=False)  # sparse_output=False in scikit-learn >= 1.2
bb_encoded = bb_encoded.reshape(len(bb_encoded), 1)
bb_onehot = bb_onehotencoder.fit_transform(bb_encoded)
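The comparison below was presumably printed by looping over the test names alongside their encodings; a minimal sketch that reproduces that format:

for actual, label, onehot in zip(X_test_list, bb_encoded.ravel(), bb_onehot):
    print(f"Actual : {actual} | LabelEncoded : {label} | OneHot : {onehot}")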
Actual : Ava | LabelEncoded : 0 | OneHot : [ 1. 0. 0. 0.]
Actual : Ava | LabelEncoded : 0 | OneHot : [ 1. 0. 0. 0.]
Actual : Noah | LabelEncoded : 4 | OneHot : [ 0. 0. 0. 1.]
Actual : Mia | LabelEncoded : 3 | OneHot : [ 0. 0. 1. 0.]
Actual : Lily | LabelEncoded : 2 | OneHot : [ 0. 1. 0. 0.]
Actual : Lily | LabelEncoded : 2 | OneHot : [ 0. 1. 0. 0.]
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
cntvector_out = pd.DataFrame(X.toarray(),
                             columns=vectorizer.get_feature_names())  # get_feature_names_out in scikit-learn >= 1.0
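The output below pairs the first document with its row of cntvector_out; a sketch of the printing step:

print(f"Input text : {corpus[0]}")
print("Output count vector :")
print(cntvector_out.iloc[0])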
Input text : This is the first document.
Output count vector :
and         0
document    1
first       1
is          1
one         0
second      0
the         1
third       0
this        1

Happy coding and keep learning.
