A Practical Guide to SEMMA in Data Science

Saipraneethk18
Sep 30, 2023


In the vast realm of data science and analytics, SEMMA offers a structured, repeatable process for projects. Standing for Sample, Explore, Modify, Model, and Assess, SEMMA provides a robust foundation for handling data and deriving meaningful insights.

In this article, we’ll walk through a hands-on application of SEMMA.

1. Sample

The very first step in the SEMMA process is to collect, or sample, the data that will be analyzed. Depending on the project, this could mean gathering data from various sources such as databases, APIs, or even manual data entry.

For this walkthrough, we use a small synthetic dataset with the following columns (a sketch for generating it appears after the list):

Age (Range: 18–65)

Income (Range: $20,000–$100,000)

Purchase Amount (Range: $0–$10,000)

Gender (Male, Female)

Region (North, South, East, West)
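
Since the article does not show where the data comes from, here is a minimal sketch that generates a comparable synthetic dataset with NumPy and pandas. The column names and ranges match the list above, but the sample size, seed, and random values are assumptions for illustration:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # assumed seed, for reproducibility
n = 1000  # assumed number of samples

df = pd.DataFrame({
    "Age": rng.integers(18, 66, size=n),                # 18-65 inclusive
    "Income": rng.uniform(20_000, 100_000, size=n),     # $20,000-$100,000
    "Purchase Amount": rng.uniform(0, 10_000, size=n),  # $0-$10,000
    "Gender": rng.choice(["Male", "Female"], size=n),
    "Region": rng.choice(["North", "South", "East", "West"], size=n),
})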

2. Explore

Once we have our data, the next logical step is to explore it. This step allows us to understand the nature, distribution, and potential issues in the data.

import seaborn as sns
import matplotlib.pyplot as plt
# Summary statistics for the numeric columns
summary_stats = df.describe()
print(summary_stats)
# Visualize pairwise distributions, colored by gender
sns.pairplot(df, hue="Gender")
plt.show()

Summary Statistics:

Age: Ranges from 18 to 65 with an average age of around 41.7.

Income: Ranges from $20,138 to $99,973 with an average income of approximately $60,067.

Purchase Amount: Ranges from $2 to $9,993 with an average purchase amount of around $4,890.

Gender: The dataset has a nearly balanced gender distribution, with slightly more females than males.

Region: The samples are distributed across four regions, with the South region having the highest frequency.
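
Note that df.describe() summarizes only the numeric columns by default, so the Gender and Region frequencies above come from explicit value counts:

# Frequency counts for the categorical columns
print(df["Gender"].value_counts())
print(df["Region"].value_counts())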

Visual Distributions:

The pairplot shows the distributions of Age, Income, and Purchase Amount as histograms along its diagonal, with pairwise scatter plots colored by Gender. The Gender and Region counts are best shown as separate bar charts (e.g., with sns.countplot).

Missing Values: No missing values were found in any columns.
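
The missing-value check itself is a one-liner in pandas:

# Count missing values per column (all zeros for this dataset)
print(df.isnull().sum())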

3. Modify

After gaining a basic understanding of the data, it’s time to prepare it for modeling. This involves handling anomalies, encoding categorical variables, and potentially normalizing or transforming certain features.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
# Normalize "Income" and "Purchase Amount" to the [0, 1] range
scaler = MinMaxScaler()
df[['Income', 'Purchase Amount']] = scaler.fit_transform(df[['Income', 'Purchase Amount']])
# One-hot encode "Gender" and "Region", dropping the first category of each
# (sparse_output and get_feature_names_out are the current scikit-learn names
# for the deprecated sparse and get_feature_names)
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded_features = encoder.fit_transform(df[['Gender', 'Region']])
encoded_df = pd.DataFrame(encoded_features, columns=encoder.get_feature_names_out(['Gender', 'Region']))
# Replace the original categorical columns with the encoded ones
df = pd.concat([df.drop(['Gender', 'Region'], axis=1), encoded_df], axis=1)

Data Modifications Done:

Normalization: The “Income” and “Purchase Amount” columns have been normalized to fall within the range [0, 1].

One-Hot Encoding: The “Gender” and “Region” columns have been one-hot encoded. This means:

  1. For the “Gender” column, we now have a column called Gender_Male where 1 indicates “Male” and 0 indicates “Female”.
  2. For the “Region” column, we’ve created three new columns (Region_North, Region_South, and Region_West). If all these columns are 0 for a particular row, it means the region is “East”. Otherwise, a 1 in any of these columns indicates the corresponding region for that row.
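
A quick look at the transformed frame confirms the new layout; the exact values depend on the sampled data:

# Inspect the normalized and encoded columns
print(df.columns.tolist())
print(df.head())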

4. Model

With the data prepared, we now step into the modeling phase. Here, we select an appropriate algorithm to train a model on our data.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Split the data
X = df.drop("Purchase Amount", axis=1)
y = df["Purchase Amount"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

Modeling Actions Done:

The dataset was split into a training set (80%) and a testing set (20%), and a linear regression model was fit on the training portion. In the next step, the trained model is used to predict "Purchase Amount" on the held-out test set.
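
After fitting, it can also be instructive to inspect the learned parameters. This is a small optional sketch, not part of the original walkthrough:

# Examine the fitted coefficients and intercept
for name, coef in zip(X.columns, model.coef_):
    print(f"{name}: {coef:.4f}")
print(f"Intercept: {model.intercept_:.4f}")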

5. Assess

Finally, after training the model, we evaluate its performance on unseen data. This step is crucial to understand the efficacy of our model.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Predict on the testing set
y_pred = model.predict(X_test)
# Evaluate the model's performance
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Mean Absolute Error: {mae}")
print(f"Root Mean Squared Error: {rmse}")

Performance Evaluation:

Mean Absolute Error (MAE): This metric measures the average absolute difference between the actual and predicted values. Our model has an MAE of approximately 0.257. Given that our “Purchase Amount” values are normalized between 0 and 1, this MAE value indicates that on average, our predictions are off by about 25.7% of the range.

Root Mean Squared Error (RMSE): This metric also measures prediction error, but because it squares the errors before averaging, it penalizes large deviations more heavily than MAE. Our RMSE of approximately 0.294 is close to the MAE, suggesting the errors are fairly uniform in size rather than dominated by a few large outliers.
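
For reference, both metrics can also be computed by hand with NumPy, which makes their definitions explicit; this assumes y_test and y_pred from the code above:

# MAE: mean absolute error; RMSE: root of the mean squared error
errors = y_test - y_pred
mae_manual = np.mean(np.abs(errors))
rmse_manual = np.sqrt(np.mean(errors ** 2))
print(mae_manual, rmse_manual)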

Conclusion:

The SEMMA methodology offers a structured approach to data analytics. Through this article, we’ve demonstrated how to apply each step in practice using Python. Always remember that data science is iterative, and it’s essential to revisit and refine your models continually for optimal results.
