Sentiment Analysis on Student Feedback in Engineering Education

Abdulraqib Omotosho
19 min read · Jun 1, 2023


Photo by Zulmaury Saavedra on Unsplash

In the field of engineering education, student feedback plays a vital role in assessing the effectiveness of teaching methods, course materials, and overall learning experiences. Sentiment analysis, a key component of data science, offers a powerful approach to analyze and extract valuable insights from student feedback.

The objective of this project is to perform sentiment analysis on feedback provided by 300-level computer engineering students at the University of Ilorin (my course mates). Using natural language processing (NLP) techniques and machine learning algorithms, we aim to uncover the sentiments expressed in the feedback and gain a comprehensive understanding of student perceptions, satisfaction, and areas for improvement.

Through the analysis of student feedback, we can identify common themes, sentiment trends, and specific challenges faced by students. This valuable information can inform the department and its lecturers about the effectiveness of their teaching methodologies, course content, and student support systems. The insights derived from sentiment analysis of student feedback can drive evidence-based decision-making in engineering education, enabling the department to address concerns, make improvements, and create a positive learning environment that caters to the needs of its students.

Data Collection

To get the data for this project, I used Google Forms to collect feedback from students. The platform made it easy to gather diverse responses accurately and efficiently, providing a comprehensive dataset for the sentiment analysis. Afterwards, I exported the collected data to CSV for analysis.

Importing libraries and packages

# importing libraries and packages

import numpy as np  # used later for plot ticks (np.arange)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style='white')
from PIL import Image

import re
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.sentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

import xgboost as xgb
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.pipeline import Pipeline

import warnings
warnings.filterwarnings('ignore')
# load the dataset and show the first 5 rows
df = pd.read_csv('Sentiment Analysis on Student Feedback.csv')
df.head()
The first 5 rows of the dataset.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Timestamp               100 non-null    object
 1   Course Code             100 non-null    object
 2   Feedback                100 non-null    object
 3   Previous Experience     100 non-null    object
 4   Gender                  100 non-null    object
 5   Attendance              100 non-null    object
 6   Course Difficulty       100 non-null    object
 7   Study Hours (per week)  100 non-null    object
 8   Overall Satisfaction    100 non-null    int64 
 9   Department              100 non-null    object
 10  Unnamed: 11             3 non-null      object
dtypes: int64(1), object(10)
memory usage: 8.7+ KB

Data Cleaning

Here, I’m going to clean the dataset, since it clearly has some quality issues.

The initial columns vs the columns when leading and trailing characters are removed.
# drop unnecessary column
df = df.drop(['Unnamed: 11'], axis=1)

# Convert the column to datetime format
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

# Extract the date and time into separate columns
df['Date'] = df['Timestamp'].dt.date
df['Time'] = df['Timestamp'].dt.time

# drop Timestamp column
df = df.drop(['Timestamp'], axis=1)

# corrections to "Study Hours (per week) column"
df['Study Hours (per week)'] = df['Study Hours (per week)'].str.extract(r'(\d+)').fillna(0).astype(int)
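As a quick sanity check on the study-hours extraction above, here is a hedged example on made-up responses (not rows from the actual form):

# hypothetical raw answers, just to illustrate what the regex keeps
sample = pd.Series(['10-12 hours', 'about 5', 'none'])
print(sample.str.extract(r'(\d+)')[0].fillna(0).astype(int).tolist())
# [10, 5, 0] — the first number wins; non-numeric answers become 0

After these fixes, df.info() shows the cleaned schema: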
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Course Code             100 non-null    object
 1   Feedback                100 non-null    object
 2   Previous Experience     100 non-null    object
 3   Gender                  100 non-null    object
 4   Attendance              100 non-null    object
 5   Course Difficulty       100 non-null    object
 6   Study Hours (per week)  100 non-null    int32 
 7   Overall Satisfaction    100 non-null    int64 
 8   Department              100 non-null    object
 9   Date                    100 non-null    object
 10  Time                    100 non-null    object
dtypes: int32(1), int64(1), object(9)
memory usage: 8.3+ KB

Let’s now preview some random samples of the data.

Data Preprocessing

Cleaning and preprocessing the data: handling contractions, converting text to lowercase, removing stop words, punctuation, hashtags, numbers/digits and special characters, and then tokenizing and lemmatizing the text.

# Function to expand common contractions
def handle_contractions(text):
    contractions = {
        "n't": " not",
        "'s": " is",
        "'re": " are",
        "'ve": " have",
        "'d": " would",
        "'ll": " will",
        "'m": " am"
    }

    # expand the suffix-style contractions wherever they appear;
    # this must run before punctuation is stripped, or the apostrophes are already gone
    for suffix, expansion in contractions.items():
        text = text.replace(suffix, expansion)
    return text

# Function to preprocess text data
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()

    # Handle contractions (before punctuation removal, so the apostrophes survive)
    text = handle_contractions(text)

    # Remove URLs, hashtags, mentions, and special characters
    text = re.sub(r"http\S+|www\S+|@\w+|#\w+", "", text)
    text = re.sub(r"[^\w\s]", "", text)

    # Remove numbers/digits
    text = re.sub(r'\b[0-9]+\b\s*', '', text)

    # Remove any remaining punctuation
    text = ''.join([char for char in text if char not in string.punctuation])

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]

    # Lemmatize the words
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    # Join tokens back into a single string
    return ' '.join(tokens)

# Apply preprocessing to the 'Feedback' column
df['Processed_Feedback'] = df['Feedback'].apply(preprocess_text)
First 10 rows of the “Processed” feedback column.
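A quick check of the pipeline on a made-up sentence (a hedged example; the exact tokens depend on the NLTK data installed):

print(preprocess_text("The lecturer's teaching isn't bad, I'd say it's 100% worth attending!"))
# e.g. 'lecturer teaching bad would say worth attending'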

Language Detection

Detecting the type of language used in the feedback text.

from langdetect import detect

def detect_language(text):
    try:
        return detect(text)
    except Exception:  # langdetect raises on empty or ambiguous text
        return None

df['Language'] = df['Processed_Feedback'].apply(detect_language)
df['Language'].unique()
array(['en', 'cy', 'so', 'sk', 'af', 'fr', 'hr', 'id', 'pt', 'de', 'pl',
'es'], dtype=object)
# note: 'de' (German) was detected above but is missing from this mapping,
# which is why a NaN shows up below
language_mapping = {
    'en': 'English',
    'cy': 'Welsh',
    'so': 'Somali',
    'sk': 'Slovak',
    'af': 'Afrikaans',
    'fr': 'French',
    'hr': 'Croatian',
    'id': 'Indonesian',
    'pt': 'Portuguese',
    'it': 'Italian',
    'pl': 'Polish',
    'es': 'Spanish'
}

df['Language'] = df['Language'].map(language_mapping)
df['Language'].unique()
array(['English', 'Welsh', 'Somali', 'Slovak', 'Afrikaans', 'French',
'Croatian', 'Indonesian', 'Portuguese', nan, 'Polish', 'Spanish'],
dtype=object)

The detected languages aren’t meaningful here: langdetect is unreliable on short, preprocessed text, and the feedback is actually written in English (with a sprinkling of Nigerian Pidgin). So after adding character and word counts, I’ll drop the column.

df['Char_Count'] = df['Processed_Feedback'].apply(len) # can also use df['Processed_Feedback'].str.len()
df['Word_Count'] = df['Processed_Feedback'].apply(lambda x: len(x.split()))
df = df.drop(['Language'], axis=1)

Sentiment Scores and Labels

Calculating the sentiment scores and their corresponding labels.

Important Note

In the context of sentiment analysis, subjectivity scores can help distinguish between subjective statements that reflect personal opinions or emotions and objective statements that convey factual information. A high subjectivity score indicates a greater level of personal bias or opinion, while a low subjectivity score suggests a more objective or factual nature of the text.

Subjectivity is an important aspect to consider alongside polarity (sentiment) analysis, as it provides additional context and granularity in understanding the nature of the text and the subjective or objective nature of the statements being analyzed. The interpretation of subjectivity scores depends on the specific context and objective of your analysis. In general, a high subjectivity score indicates a greater degree of personal opinion or bias expressed in the text. This can be valuable if you are interested in capturing and analyzing subjective or emotional content, such as in sentiment analysis or opinion mining.

However, if the goal is to analyze and classify objective or factual information, a low subjectivity score would be more desirable. A low subjectivity score suggests that the text contains more objective statements that are based on facts or present information without personal opinion or bias.
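To make the two scores concrete, here is a minimal sketch on two made-up sentences (the values are whatever TextBlob returns and may vary slightly across versions):

# an opinion vs. a factual statement — hypothetical examples
for s in ["The lectures were absolutely wonderful!", "The exam is on Friday at 9am."]:
    sent = TextBlob(s).sentiment
    print(f"{s!r} -> polarity={sent.polarity:.2f}, subjectivity={sent.subjectivity:.2f}")
# the first (pure opinion) scores high on both; the second (a fact) scores near zero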

# Calculate sentiment scores
df['Sentiment_Score'] = df['Processed_Feedback'].apply(lambda x: TextBlob(x).sentiment.polarity)

# Calculate subjectivity scores
df['Subjectivity_Score'] = df['Processed_Feedback'].apply(lambda x: TextBlob(x).sentiment.subjectivity)

# Map sentiment scores to sentiment labels
df['Sentiment_Label'] = df.apply(
    lambda row: 'Positive' if row['Sentiment_Score'] > 0 and row['Subjectivity_Score'] > 0.5
    else 'Negative' if row['Sentiment_Score'] < 0 and row['Subjectivity_Score'] > 0.5
    else 'Neutral',
    axis=1
)
Random samples of the dataset.

Aspect-Based Sentiment Analysis Metrics

Summary Statistics and Metrics

# Sentiment Analysis Metrics
sentiment_counts = df['Sentiment_Label'].value_counts()
average_sentiment_score = df['Sentiment_Score'].mean()
average_subj_score = df['Subjectivity_Score'].mean()

# Descriptive Statistics
study_hours_stats = df['Study Hours (per week)'].describe()
overall_satisfaction_stats = df['Overall Satisfaction'].describe()

# Categorical Metrics
course_code_counts = df['Course Code'].value_counts()
department_counts = df['Department'].value_counts()
sentiment_distribution = df.groupby('Course Code')['Sentiment_Label'].value_counts(normalize=True)

# Print the calculated metrics
print("Sentiment Analysis Metrics:")
print(sentiment_counts)
print("Average Sentiment Score:", average_sentiment_score)
print("Average SUbjectivity Score:", average_subj_score)
print("\nDescriptive Statistics - Study Hours:")
print(study_hours_stats)
print("\nDescriptive Statistics - Overall Satisfaction:")
print(overall_satisfaction_stats)
print("\nCategorical Metrics - Course Code Counts:")
print(course_code_counts)
print("\nCategorical Metrics - Department Counts:")
print(department_counts)
print("\nSentiment Distribution by Course Code:")
print(sentiment_distribution)
Sentiment Analysis Metrics:
Neutral 42
Positive 34
Negative 24
Name: Sentiment_Label, dtype: int64
Average Sentiment Score: 0.04988879870129869
Average Subjectivity Score: 0.5088712121212121
Average Length of Feedback: 31.8

Descriptive Statistics - Study Hours:
count 100.000000
mean 8.310000
std 5.506094
min 0.000000
25% 4.000000
50% 8.000000
75% 12.000000
max 21.000000
Name: Study Hours (per week), dtype: float64

Descriptive Statistics - Overall Satisfaction:
count 100.000000
mean 5.100000
std 3.599944
min 0.000000
25% 1.000000
50% 5.000000
75% 9.000000
max 10.000000
Name: Overall Satisfaction, dtype: float64

Categorical Metrics - Course Code Counts:
CPE 321 31
CPE 311 13
CPE 341 13
CPE 381 12
CPE 331 11
MEE 361 10
GSE 301 10
Name: Course Code, dtype: int64

Categorical Metrics - Department Counts:
Yes 99
No 1
Name: Department, dtype: int64

Sentiment Distribution by Course Code:
Course Code Sentiment_Label
CPE 311 Positive 0.769231
Neutral 0.153846
Negative 0.076923
CPE 321 Negative 0.516129
Neutral 0.387097
Positive 0.096774
CPE 331 Positive 0.727273
Neutral 0.181818
Negative 0.090909
CPE 341 Neutral 0.538462
Negative 0.230769
Positive 0.230769
CPE 381 Neutral 0.666667
Negative 0.166667
Positive 0.166667
GSE 301 Neutral 0.600000
Positive 0.400000
MEE 361 Neutral 0.500000
Positive 0.400000
Negative 0.100000
Name: Sentiment_Label, dtype: float64

Analyzing the frequency of specific keywords or phrases in the feedback.

# analyze the frequency of specific keywords or phrases in the feedback
from collections import Counter

# The keywords or phrases of interest
keywords = ['shit', 'difficult', 'terrible', 'okay', 'best', 'worst', 'good', 'try']

# Concatenate all the preprocessed feedback into a single string
all_feedback = ' '.join(df['Processed_Feedback'])

# Tokenize the text into individual words
tokens = all_feedback.split()

# Count the frequency of each keyword in the feedback
keyword_frequency = Counter(tokens)

# Print the frequency of each keyword
for keyword in keywords:
    print(f"Frequency of '{keyword}': {keyword_frequency[keyword]}")
Frequency of 'shit': 1
Frequency of 'difficult': 4
Frequency of 'terrible': 5
Frequency of 'okay': 3
Frequency of 'best': 3
Frequency of 'worst': 2
Frequency of 'good': 8
Frequency of 'try': 1

Text clustering to group similar feedback together

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Apply TF-IDF vectorization to the processed feedback text
tfidf_matrix = vectorizer.fit_transform(df['Processed_Feedback'])

# Perform K-means clustering
num_clusters = 3  # Specify the desired number of clusters
kmeans = KMeans(n_clusters=num_clusters, random_state=42, n_init=10)  # explicit n_init avoids warnings on newer scikit-learn
kmeans.fit(tfidf_matrix)

# Assign cluster labels to the feedback data
df['Cluster'] = kmeans.labels_

# Apply dimensionality reduction using PCA
pca = PCA(n_components=2)
reduced_features = pca.fit_transform(tfidf_matrix.toarray())

# Plot the clusters
plt.figure(figsize=(8, 6))
plt.scatter(reduced_features[:, 0], reduced_features[:, 1], c=df['Cluster'], cmap='viridis')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('K-means Clustering Visualization')
plt.colorbar()
plt.show()

# Print the top terms for each cluster
print("Top terms per cluster:")
order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(num_clusters):
print(f"Cluster {i}:")
for ind in order_centroids[i, :10]:
print(f" {terms[ind]}")
print()
Top terms per cluster:
Cluster 0:
nice
course
teaching
scientist
lecturer
easy
method
job
way
go

Cluster 1:
lecturer
good
teaching
terrible
course
method
like
bad
akanni
class

Cluster 2:
course
stress
easy
well
hard
awful
unit
wahala
plus
dey
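The number of clusters was fixed at 3 above. One hedged way to sanity-check that choice is the silhouette score, sketched below (higher is better, though on a dataset of 100 short texts the scores will be noisy):

from sklearn.metrics import silhouette_score

# compare a few candidate values of k on the same TF-IDF matrix
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(tfidf_matrix)
    print(f"k={k}: silhouette={silhouette_score(tfidf_matrix, labels):.3f}")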

Topic Modeling

Implementing Latent Dirichlet Allocation (LDA), a topic modeling technique, to identify underlying topics or themes in the feedback data. This can provide deeper insights into the content and help analyze sentiment within specific topics.

# Create a CountVectorizer
vectorizer = CountVectorizer(max_features=1000, lowercase=True, stop_words='english', ngram_range=(1, 2))

# Apply CountVectorizer to the processed feedback text
dtm = vectorizer.fit_transform(df['Processed_Feedback'])

# Perform LDA topic modeling
num_topics = 10 # Specify the desired number of topics
lda = LatentDirichletAllocation(n_components=num_topics, random_state=42)
lda.fit(dtm)

# Get the top words for each topic
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in newer scikit-learn
top_words = 10  # Specify the number of top words to retrieve for each topic
for topic_idx, topic in enumerate(lda.components_):
    print(f"Topic {topic_idx}:")
    print(" ".join([feature_names[i] for i in topic.argsort()[:-top_words - 1:-1]]))
    print()
Topic 0:
course lecturer difficult lecturer difficult hard time isnt teaching course difficult man

Topic 1:
course good lecturer nice terrible nice course easy course easy awful lecturer good

Topic 2:
course nice teaching lecturer method teaching method stress method lecturer nice teaching cool

Topic 3:
teaching code terrible course method teaching method bad love love code course taught

Topic 4:
time bad okay teaching teaching mode mode taught lecturer revision revision whats

Topic 5:
course make class dey especially dey make experience easy man pas

Topic 6:
akanni cool way god teach know really dry class dry know teach

Topic 7:
course easy awful wahala course hard hard good lecturer unit course unit

Topic 8:
lecturer course class course lecturer hate man wey plus lecturer good dey

Topic 9:
course lecturer god class student sha teaching coding like understand
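To tie the topics back to sentiment, as mentioned above, a minimal sketch is to assign each feedback its dominant topic and average the sentiment scores per topic (Dominant_Topic is a hypothetical helper column, not part of the original analysis):

# dominant topic per document = highest topic probability from the fitted LDA
df['Dominant_Topic'] = lda.transform(dtm).argmax(axis=1)

# average polarity and feedback count per topic
print(df.groupby('Dominant_Topic')['Sentiment_Score'].agg(['mean', 'count']))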

Emotion Detection

Identifying emotions in student feedback. The sentiment property of a TextBlob object returns two scores: polarity (a value between -1 and 1 indicating the sentiment) and subjectivity (a value between 0 and 1 indicating how subjective the text is).
Emotion Polarity: Emotion polarity measures the sentiment or emotional tone of a text. It indicates whether the text expresses a positive, negative, or neutral emotion. In the code below, the polarity scores are obtained with TextBlob. The sentiment polarity can help identify the overall sentiment or emotional tone of the feedback text.
Emotion Subjectivity: Emotion subjectivity measures the degree of subjectivity or objectivity in the expression of emotions in a text. It indicates how much the text relies on personal opinions, beliefs, or experiences rather than factual or objective information. A higher subjectivity score suggests that the text is more influenced by personal perspectives or experiences.

def calculate_emotions(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity, blob.sentiment.subjectivity

# Apply emotion analysis to the feedback text
df['Emotion_Scores'] = df['Processed_Feedback'].apply(calculate_emotions)

# Extract emotion scores for each emotion category
df['Emotion_Polarity'] = df['Emotion_Scores'].apply(lambda x: x[0])

# assign emotion labels based on polarity values
df['Emotion_Label'] = df['Emotion_Polarity'].apply(lambda x: 'Positive' if x > 0 else 'Negative' if x < 0 else 'Neutral')

# the resulting dataframe with emotion scores and labels
df[['Processed_Feedback', 'Emotion_Polarity', 'Emotion_Label']].head()
Image by Author
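SentimentIntensityAnalyzer was imported at the top, but the scores above come from TextBlob. For comparison, here is a minimal sketch of how NLTK's VADER could score the same column (it needs the vader_lexicon resource; Vader_Compound is a hypothetical extra column):

import nltk
nltk.download('vader_lexicon', quiet=True)  # one-time download

sia = SentimentIntensityAnalyzer()

# VADER returns neg/neu/pos scores plus a 'compound' score in [-1, 1]
df['Vader_Compound'] = df['Processed_Feedback'].apply(lambda t: sia.polarity_scores(t)['compound'])
df[['Processed_Feedback', 'Emotion_Polarity', 'Vader_Compound']].head()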

Exploratory Data Analysis

Creating meaningful visualizations to gain insights and communicate findings effectively: exploring different types of plots, charts, and graphs to showcase various aspects of the data, and analyzing the distribution of sentiment labels to understand the overall sentiment polarity.

Correlation Analysis

Exploring the correlation between sentiment and other variables in the dataset to identify potential relationships.

correlation_matrix = df[['Study Hours (per week)', 'Overall Satisfaction']].corr()

cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap=cmap, vmin=-1, vmax=1, linewidths=0.5)
plt.title('Correlation between Study Hours and Overall Satisfaction')

for i in range(correlation_matrix.shape[0]):
    for j in range(correlation_matrix.shape[1]):
        if i != j:
            text = '{:.2f}'.format(correlation_matrix.iloc[i, j])
            plt.text(j + 0.5, i + 0.5, text, ha='center', va='center', color='black')

colorbar = plt.gca().collections[0].colorbar
colorbar.set_ticks([-1, -0.5, 0, 0.5, 1])
colorbar.set_ticklabels(['Strong Negative', 'Negative', 'Neutral', 'Positive', 'Strong Positive'])

plt.xlabel('Features')
plt.ylabel('Features')
plt.show()
Correlation between study hours and overall satisfaction.
correlation_matrix = df[['Sentiment_Score', 'Overall Satisfaction']].corr()

cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap=cmap, vmin=-1, vmax=1, linewidths=0.5)
plt.title('Correlation between Sentiment Score and Satisfaction')

for i in range(correlation_matrix.shape[0]):
    for j in range(correlation_matrix.shape[1]):
        if i != j:
            text = '{:.2f}'.format(correlation_matrix.iloc[i, j])
            plt.text(j + 0.5, i + 0.5, text, ha='center', va='center', color='black')

colorbar = plt.gca().collections[0].colorbar
colorbar.set_ticks([-1, -0.5, 0, 0.5, 1])
colorbar.set_ticklabels(['Strong Negative', 'Negative', 'Neutral', 'Positive', 'Strong Positive'])

plt.xlabel('Features')
plt.ylabel('Features')
plt.show()
Correlation Matrix
# Bar plot for Course Code
plt.figure(figsize=(10, 6))
color = sns.color_palette()[0]
order = df['Course Code'].value_counts().index
ax = sns.countplot(data=df, x='Course Code', color=color, order=order)
plt.xlabel('Course Code')
plt.ylabel('Count of Feedback')
plt.title('Feedback Count by Course Code')
plt.xticks(rotation=45)
ax.bar_label(ax.containers[0], fmt='%.0f', label_type='edge')
plt.show()
# Word cloud for Overall Feedback: Combine all feedback into a single string
all_feedback = ' '.join(df['Processed_Feedback'])

plt.figure(figsize=(10, 6))
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(all_feedback)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Overall Feedback')
plt.show()
# Word cloud for Positive Feedback
# (filter on Sentiment_Label and generate from the filtered text,
# otherwise all three clouds would be identical)
data = ' '.join(df[df['Sentiment_Label'] == 'Positive']['Processed_Feedback'])

plt.figure(figsize=(10, 6))
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(data)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Positive Feedback')
plt.show()
# Word cloud for Negative Feedback
data = ' '.join(df[df['Sentiment_Label'] == 'Negative']['Processed_Feedback'])

plt.figure(figsize=(10, 6))
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(data)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Negative Feedback')
plt.show()
# Word cloud for Neutral Feedback
data = ' '.join(df[df['Sentiment_Label'] == 'Neutral']['Processed_Feedback'])

plt.figure(figsize=(10, 6))
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(data)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Neutral Feedback')
plt.show()
# Bar plot for Sentiment
plt.figure(figsize=(8, 6))
color = sns.color_palette()[0]
order = df['Sentiment_Label'].value_counts().index
ax = sns.countplot(data=df, x='Sentiment_Label', color=color, order=order)
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.title('Distribution of Sentiments')
ax.bar_label(ax.containers[0], fmt='%.0f', label_type='edge')
plt.show()
# Bar plot for Previous Experience
plt.figure(figsize=(8, 6))
color = sns.color_palette()[0]
ax = sns.countplot(data=df, x='Previous Experience', color=color)
plt.xlabel('Previous Experience')
plt.ylabel('Count')
plt.title('Feedback Count by Previous Experience')
ax.bar_label(ax.containers[0], fmt='%.0f', label_type='edge')
plt.show()
# Pie chart for Gender distribution
counts = df['Gender'].value_counts()
labels = [f"{gender}\n{count / len(df) * 100:.1f}%" for gender, count in counts.items()]

fig, ax = plt.subplots()
ax.pie(counts, labels=labels, startangle=50, counterclock=False, pctdistance=0.8, labeldistance=1.2)
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig.gca().add_artist(centre_circle)
ax.set_title('Gender Distribution', fontsize=16, loc='left', pad=30)
ax.axis('equal')
plt.show()
# Pie chart for Attendance
counts = df['Attendance'].value_counts()
labels = [f"{attendance}\n{count / len(df) * 100:.1f}%" for attendance, count in counts.items()]

fig, ax = plt.subplots()
ax.pie(counts, labels=labels, startangle=50, counterclock=False, pctdistance=0.8, labeldistance=1.2)
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig.gca().add_artist(centre_circle)
ax.set_title('Distribution of Attendance', fontsize=16, loc='left', pad=30)
ax.axis('equal')
plt.show()
# Bar plot for Course Difficulty
plt.figure(figsize=(10, 6))
color = sns.color_palette()[0]
order = ['Easy', 'Moderate', 'Challenging', 'Difficult']
ax = sns.countplot(data=df, x='Course Difficulty', color=color, order=order)
plt.xlabel('Course Difficulty')
plt.ylabel('Count of Feedback')
plt.title('Feedback Count by Course Difficulty')
ax.bar_label(ax.containers[0], fmt='%.0f', label_type='edge')
plt.show()
# Histogram for Study Hours (per week)
plt.figure(figsize=(10, 6))
color = sns.color_palette()[0]
ax = sns.histplot(data=df, x='Study Hours (per week)', bins=20, color=color)
plt.xlabel('Study Hours (per week)')
plt.ylabel('Count of Students')
plt.title('Distribution of Study Hours')
ax.bar_label(ax.containers[0], fmt='%.0f', label_type='edge')
plt.show()
# Histogram for Overall Satisfaction
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='Overall Satisfaction', bins=30)
plt.xlabel('Overall Satisfaction')
plt.ylabel('Count of Students')
plt.title('Distribution of Overall Satisfaction')
plt.show()
# Word Frequency Analysis
from collections import Counter
word_frequency = Counter(" ".join(df['Processed_Feedback']).split()).most_common(30)
plt.figure(figsize=(20, 10))
color = sns.color_palette()[0]
ax = sns.barplot(x=[word[1] for word in word_frequency], y=[word[0] for word in word_frequency], color=color)
ax.bar_label(ax.containers[0], fmt='%.0f', label_type='edge')
plt.xlabel('Frequency')
plt.ylabel('Word')
plt.title('Top 30 Most Frequent Words')
plt.show()
# Sentiment Box Plots
plt.figure(figsize=(10, 6))
color = sns.color_palette()[0]
sns.boxplot(data=df, x='Course Code', y='Sentiment_Score', color=color)
plt.xlabel('Course Code')
plt.ylabel('Sentiment Score')
plt.title('Sentiment Distribution by Course Code')
plt.xticks(rotation=45)
plt.show()
# Bar plot for Course Code by Course Difficulty
plt.figure(figsize=(10, 6))
hue_order = ['Easy', 'Moderate', 'Challenging', 'Difficult']
sns.countplot(data=df, x='Course Code', hue='Course Difficulty', palette='Blues_r', hue_order=hue_order)
plt.xlabel('Course Code')
plt.ylabel('Count of Feedback')
plt.title('Feedback Count by Course Code and Course Difficulty')
plt.legend(loc=1)
plt.show()
# Bar plot for Course Code distribution by Sentiment distribution
plt.figure(figsize=(10, 6))
hue_order = ['Positive', 'Neutral', 'Negative']
sns.countplot(data=df, x='Course Code', hue='Sentiment_Label', palette='Blues_r', hue_order=hue_order)
plt.xlabel('Course Code')
plt.ylabel('Count of Feedback')
plt.title('Course Code distribution by Sentiment distribution')
plt.legend(loc=1)
plt.show()
# Sentiment Distribution by Course Difficulty
plt.figure(figsize=(10, 6))
hue_order = ['Positive', 'Neutral', 'Negative']
order = ['Easy', 'Moderate', 'Challenging', 'Difficult']
sns.countplot(data=df, x='Course Difficulty', hue='Sentiment_Label',
palette='Blues_r', hue_order=hue_order, order=order)
plt.xlabel('Course Difficulty')
plt.ylabel('Count of Feedback')
plt.title('Sentiment Distribution by Course Difficulty')
plt.show()
# Sentiment Distribution by Gender
plt.figure(figsize=(10, 6))
hue_order = ['Positive', 'Neutral', 'Negative']
sns.countplot(data=df, x='Gender', hue='Sentiment_Label', hue_order=hue_order, palette='Blues_r')
plt.xlabel('Gender')
plt.ylabel('Count of Feedback')
plt.title('Sentiment Distribution by Gender')
plt.show()
# Word Count distribution by course difficulty
plt.figure(figsize=(10, 6))
order = ['Easy', 'Moderate', 'Challenging', 'Difficult']
color = sns.color_palette()[0]
sns.boxplot(data=df, x='Course Difficulty', y='Word_Count', color=color, order=order)
plt.xlabel('Course Difficulty')
plt.ylabel('Word Count')
plt.title('Distribution of Word Count for different levels of Course Difficulty')
plt.show()
# Distribution of Study Hours (per week) and Overall Satisfaction
plt.figure(figsize=(10, 6))
color = sns.color_palette()[0]
sns.lineplot(data=df, x='Study Hours (per week)', y='Overall Satisfaction', color=color, errorbar=None)  # 'ci=None' is deprecated in newer seaborn
plt.xlabel('Study Hours (per week)')
plt.ylabel('Overall Satisfaction')
plt.title('Distribution of Study Hours (per week) and Overall Satisfaction')
plt.show()
# Sentiment vs. Overall Satisfaction
plt.figure(figsize=(10, 6))
color = sns.color_palette()[0]
sns.scatterplot(x='Sentiment_Score', y='Overall Satisfaction', data=df, color=color)
plt.xlabel('Sentiment Score')
plt.ylabel('Overall Satisfaction')
plt.title('Sentiment score vs. Overall Satisfaction')
plt.xticks(np.arange(-1, 1.1, 0.5))
plt.yticks(np.arange(0, 11))
plt.grid(True)
plt.show()
# Sentiment score Distribution by Course code
plt.figure(figsize=(10, 6))
color = sns.color_palette()[0]
# scatterplot (unlike relplot) draws on the current figure instead of creating a new one
sns.scatterplot(data=df, x='Course Code', y='Sentiment_Score', color=color)
plt.xlabel('Course code')
plt.ylabel('Sentiment Score')
plt.title('Sentiment score vs Course code distribution')
plt.xticks(rotation=45)
plt.show()
# correlation matrix of the numeric variables in the data
correlation_matrix = df.corr(numeric_only=True)  # newer pandas requires numeric_only when object columns are present

plt.figure(figsize=[20, 6])
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap=cmap, vmin=-1, vmax=1, linewidths=0.5)
plt.title('Correlation between variables in the dataset')

for i in range(correlation_matrix.shape[0]):
    for j in range(correlation_matrix.shape[1]):
        if i != j:
            text = '{:.2f}'.format(correlation_matrix.iloc[i, j])
            plt.text(j + 0.5, i + 0.5, text, ha='center', va='center', color='black')

colorbar = plt.gca().collections[0].colorbar
colorbar.set_ticks([-1, -0.5, 0, 0.5, 1])
colorbar.set_ticklabels(['Strong Negative', 'Negative', 'Neutral', 'Positive', 'Strong Positive'])

plt.xlabel('Features')
plt.ylabel('Features')
plt.show()
sns.set(style='ticks')
sns.pairplot(data=df, vars=['Study Hours (per week)', 'Overall Satisfaction', 'Sentiment_Score'],
             hue='Previous Experience', markers='o')
plt.suptitle('Study Hours (per week), Overall Satisfaction and Sentiment Score Distributions by Previous Experience',
             y=1.08)
plt.show()

Machine Learning Model

I then built an ML model using the XGBoost classifier to predict sentiment labels.

XGBoost Classifier

X = df['Processed_Feedback']
y = df['Sentiment_Label']

print(X.shape, y.shape)

#(100,) (100,)

le = LabelEncoder()
y_encoded = le.fit_transform(y)
print('Encoded Target Labels:')
print(y_encoded, '\n')

# get mapping for each label
le_name_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
print('Label Mappings:')
print(le_name_mapping)
Encoded Target Labels:
[1 1 0 1 1 1 2 1 1 2 1 0 1 0 1 2 0 0 0 1 1 2 0 0 1 1 2 1 1 0 2 1 1 1 1 2 0
2 0 1 2 1 0 2 2 0 2 0 1 0 1 0 1 0 2 1 2 1 1 1 1 2 0 2 1 1 1 2 2 0 2 1 1 2
2 0 2 1 0 2 2 0 2 2 0 2 1 1 1 2 2 1 2 1 2 2 0 1 2 2]

Label Mappings:
{'Negative': 0, 'Neutral': 1, 'Positive': 2}

Using random train and test subsets.

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)

# Preprocessor
preprocessor = Pipeline([
    ('bow', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
])

# XGBoost Classifier
xgb_classifier = xgb.XGBClassifier(
    learning_rate=0.1,
    max_depth=6,
    n_estimators=80,
    use_label_encoder=False,
    objective='multi:softmax',
    eval_metric='merror',
    num_class=3
)

# Pipeline
pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('model', xgb_classifier),
])

# Hyperparameter Tuning
param_grid = {
    'model__learning_rate': [0.1, 0.01, 0.001],
    'model__max_depth': [6, 8, 10],
    'model__n_estimators': [80, 100, 120],
}
grid_search = GridSearchCV(pipe, param_grid, cv=5)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
print('Best parameters:', best_params)

# Fit and Evaluate on Testing Set
pipe.set_params(**best_params)
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print('Testing Accuracy:', acc)
Best parameters: {'model__learning_rate': 0.01, 'model__max_depth': 6, 'model__n_estimators': 80}
Testing Accuracy: 0.75
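roc_auc_score was imported earlier but never used. As a hedged follow-up, a multiclass (one-vs-rest) AUC can be computed on the same test split from predicted probabilities:

# multiclass ROC AUC needs class probabilities, not hard labels
y_proba = pipe.predict_proba(X_test)
print('Testing ROC AUC (OvR):', roc_auc_score(y_test, y_proba, multi_class='ovr'))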

Using Cross Validation.

# Preprocessor
preprocessor = Pipeline([
    ('bow', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
])

# XGBoost Classifier
xgb_classifier = xgb.XGBClassifier(
    learning_rate=0.1,
    max_depth=6,
    n_estimators=80,
    use_label_encoder=False,
    objective='multi:softmax',
    eval_metric='merror',
    num_class=3
)

# Pipeline
pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('model', xgb_classifier),
])

# Hyperparameter Tuning with Cross-Validation
param_grid = {
    'model__learning_rate': [0.1, 0.01, 0.001],
    'model__max_depth': [6, 8, 10],
    'model__n_estimators': [80, 100, 120],
}
grid_search = GridSearchCV(pipe, param_grid, cv=5)
grid_search.fit(X, y_encoded)
best_params = grid_search.best_params_
print('Best parameters:', best_params)

# Fit and Evaluate using Cross-Validation
pipe.set_params(**best_params)
cv_scores = cross_val_score(pipe, X, y_encoded, cv=5)
mean_cv_score = cv_scores.mean()
print('Cross-validation accuracy:', mean_cv_score)
Best parameters: {'model__learning_rate': 0.01, 'model__max_depth': 8, 'model__n_estimators': 100}
Cross-validation accuracy: 0.6799999999999999

I also created a Power BI report to communicate my findings which can be seen in the image below.

Power BI report to communicate insights.

Conclusion

In conclusion, the sentiment analysis of student feedback in engineering education has yielded valuable insights and recommendations for improvement. The sentiment distribution indicates a majority of Neutral feedback (suggesting a balanced perspective or lack of strong sentiment towards their educational experience) followed by Positive and Negative sentiments. The prevalence of Neutral sentiments in the student feedback sentiment analysis may also indicate that students are providing objective observations or factual statements without expressing a clear positive or negative sentiment. Furthermore, it was observed that a majority of students had no previous experience, adding to the context of the analysis.

The gender breakdown shows both male and female students expressing negative sentiments, with females expressing more positive sentiments and males more neutral ones. This highlights the importance of considering gender as a factor when interpreting sentiment dynamics in student feedback.

Variation across courses highlights specific strengths and areas for improvement, with CPE 321 standing out as the most challenging: it and CPE 341 received the lowest sentiment scores, while CPE 311 had the highest. It was also noted that the easier courses attracted the most positive sentiments. Correlations show sentiment score aligning with overall satisfaction, and a perfect correlation with emotion polarity (expected, since both are the same TextBlob polarity score).

Additionally, the high correlation between study hours and overall satisfaction implies that the amount of time students dedicate to studying may positively influence their overall satisfaction with the courses they take.

Topic modeling uncovers key themes discussed by students. The sentiment analysis serves as a foundation for continuous improvement in engineering education, with targeted interventions required for courses drawing more negative sentiments, particularly CPE 321, to ensure fulfilling and satisfactory learning conditions for students.

Lastly, I built a model which achieved impressive accuracy rates of 68% in cross-validation and 75% in testing. This project has deepened my understanding of ML and fueled my passion for using technology to understand and predict human emotions.

Thanks for reading🤓. Check out my GitHub repo where you can find the code and resources related to this project and explore other projects I have worked on. You can also interact with the Power BI Dashboard here. If you found this analysis insightful and informative, please show your support by liking, commenting, and following. Your feedback and suggestions are valuable and will help me improve and deliver more meaningful content in the future. Cheers!
