Insights into the Advanced Machine Learning course in the M.S. in Data Science program

Kevin Sun
Published in USF-Data Science
8 min read · May 30, 2023

Meet Professor Cody Carroll and discover advanced machine learning.

Professor Cody Carroll

Cody Carroll, Assistant Professor at the University of San Francisco

Meet Professor Cody Carroll, one of the newest additions to the USF M.S. in Data Science (MSDS) program faculty. Cody taught three different courses to MSDS Cohort 11: Communication for Data Science, Machine Learning Lab, and Advanced Machine Learning. He holds a PhD and a Master’s degree in statistics from the University of California, Davis, and has conducted methodological research across multiple fields, including neuroscience, mountaineering, epidemiology, and veterinary medicine. Cody also has prior teaching experience as an ESL teacher in Hyogo, Japan, which helped shape his teaching style and made him one of the program’s favorite professors.

Each of Professor Cody’s three courses builds progressively in difficulty. Communication for Data Science is a fun, soft-skills-focused class covering personal websites, LinkedIn pages, resume check-ups, and workplace communication. Machine Learning Lab is a companion course to Intro to Machine Learning, in which we reinforce the concepts from that class through various projects and exercises. Advanced Machine Learning is the most challenging of the three, where we build an understanding of more difficult algorithms and learn to work with them hands-on.

In addition to being a dedicated academic, Professor Cody has a fun, adventurous, and approachable personality. He is also a professional DJ who plays in venues ranging from wineries to clubs and parties, and his music style is a mix of tribal house, melodic techno, funk, and disco. He even creates Spotify playlists for each assignment to help students stay motivated while completing their tasks. Professor Cody’s friendly personality, combined with his diverse interests, has made him a popular instructor among students.

Professor Cody is not only an excellent teacher, but he also has a warm heart and deeply cares for his students. He goes above and beyond, preparing study guides and practice quizzes and holding office hours, paying extra attention to those who need additional help. Another interesting fact about him is that he has adopted rabbits and cats. One of his cats, Red, was found in a storm shelter in Texas after a strong storm passed through. Red is an outdoor cat during the day but an indoor cat whenever Cody calls him in. Red is incredibly adorable, as you can see in the picture.

Red, Cody’s cat

Advanced Machine Learning Course

Machine Learning Open License — Image Credits: IoT World Today

The Advanced Machine Learning course is a highly sought-after and challenging class within the Data Science Program. If you’re an incoming student curious about what to expect, here’s a breakdown of the structure, content, and benefits.

The course spans nine weeks, with two meetings per week and a compact curriculum packed with technical concepts and projects. You will dive deep into topics such as:

  • Dimension reduction using SVD and PCA
  • Recommendation algorithms such as collaborative filtering and matrix factorization
  • Boosting with Adaboost and gradient boosting
  • Neural networks with PyTorch
  • Applications of neural networks

Let’s take the first topic, SVD and PCA, as an example of how the class is structured. We start with the definitions and an intuitive understanding of these techniques, along with their applications in machine learning and in real-life scenarios. Then the instructor dives into the statistical and mathematical foundations of the algorithms, such as proofs of how SVD and PCA achieve dimension reduction. Finally, we work through a practical example, such as using SVD in a recommendation system to reduce the feature dimensions. Once we have a good understanding, we implement the process in Python.

Notes from the SVD lecture
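To make this concrete, here is a minimal sketch of the kind of Python exercise the SVD unit builds toward: reducing a small ratings matrix to a low-rank latent space. The matrix and variable names are illustrative stand-ins, not the course’s actual materials.

```python
import numpy as np

# A tiny illustrative user-item ratings matrix (rows: users, columns: items).
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
])

# Full SVD: ratings = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)

# Keep only the top-k singular values to reduce the feature dimension.
k = 2
users_latent = U[:, :k] @ np.diag(s[:k])  # users in a k-dimensional latent space

# The rank-k reconstruction approximates the original matrix and can be used
# to score unseen user-item pairs in a recommendation setting.
ratings_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(ratings_approx, 2))
```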

The knowledge gained in this course applies directly to real-world situations. Each topic is taught with a high-level explanation first, followed by the deeper mathematical understanding and proofs. Seeing both sides prepares students for success in technical interviews and for applying the concepts in professional work. Deep learning is a crucial component of the data science industry today, as tools like ChatGPT demonstrate, and PyTorch is the dominant package for implementing neural networks in Python. The curriculum has therefore been updated to introduce PyTorch and its applications, and it is refreshed annually to reflect the most relevant and critical topics, ensuring that students receive the most up-to-date knowledge.
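As a taste of what the PyTorch unit covers, below is a minimal sketch of a small feed-forward network trained on toy regression data. It is an illustrative example under assumed data, not the course’s actual code.

```python
import torch
import torch.nn as nn

# Toy regression data: 100 samples, 8 features (illustrative only).
X = torch.randn(100, 8)
y = X.sum(dim=1, keepdim=True) + 0.1 * torch.randn(100, 1)

# A small feed-forward network of the kind introduced in the course.
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
)

loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# Standard training loop: forward pass, loss, backward pass, parameter update.
for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

print(f"final training loss: {loss.item():.4f}")
```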

Data Competition Project

In the Advanced Machine Learning course, Professor Cody challenges students with a fun and engaging final project: a data competition hosted on Kaggle. The project involves working with a dataset whose columns are masked, leaving students with limited domain knowledge. Throughout the nine-week course, students work in teams of two and have three checkpoints at which they upload their predictions to Kaggle and are ranked by their prediction error on the test set. With each checkpoint the bar rises, and students are encouraged to develop better models using their own ideas, skills, and implementation techniques.

The Data Competition Project is an excellent simulation of real-life problem-solving processes in data science. It requires students to work with limited information and to test different models and methods to gradually improve performance, much as data scientists tackle real-world problems. The patient-monitor data used in the project was collected from a real hospital and contains anonymized numerical and categorical columns along with timestamp information. The goal was to predict two target variables, Y_1 and Y_2. The dependent variables were not initially disclosed, but after the project concluded, Professor Cody revealed that Y_1 represents blood pressure and Y_2 represents heart rate.

The Data Competition Project allowed students to showcase their thought processes and their ability to work effectively in a team. The winning teams’ presentations highlighted the strategies, techniques, and models they used to achieve high accuracy on the test set. The project demonstrated how the skills and knowledge gained throughout the course apply to real-world scenarios. Here is a picture of all six winning teams. Congratulations!

Now let’s dive into some of the winning teams’ presentations and see how they solved this problem.

Guru Gopalakrishnan & Ensun Park (#TODO)

Guru Gopalakrishnan and Ensun Park’s winning approach to this project involved several key steps. They started with exploratory data analysis (EDA), feature engineering, and a transformation of the target variable. For the model, they tried linear regression (LR), support vector regression (SVR), and linear SVR. At checkpoint one, an aggregate linear regression revealed that the data contains a lot of noise. They then averaged the last five values of the num 0–2 and t 0–4 features and dropped cat 0–4 based on OLS p-values, resulting in 92 total features for lasso and eight features for elastic net and DLR. Using five-fold cross-validation, they achieved a score of 3.4148. At checkpoint two, they used SVR and learned that it is essential to choose the right model and that EDA is crucial. They also researched domain knowledge, finding that Google returns a list of sources while ChatGPT gives a single consolidated answer. They concluded that data competitions are not easy and emphasized the importance of diving into the details. Finally, they highlighted the significance of the time data, which is especially important given the noise and variation in the dataset.
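As a rough illustration of the validation setup they described, here is a sketch of five-fold cross-validation with a lasso model in scikit-learn, scored by mean absolute error. The feature matrix is a synthetic stand-in, since the real competition features are masked.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Stand-in feature matrix and target (the real competition data is masked).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 92))  # e.g. 92 engineered features, as the team described
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=500)

# Five-fold cross-validation scored with mean absolute error, mirroring the
# kind of validation a team would run before each Kaggle checkpoint.
model = Lasso(alpha=0.1)
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
print(f"CV MAE: {-scores.mean():.4f} (+/- {scores.std():.4f})")
```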

Matthew Wheeler & Varun Hande (Punishers)

In this project, Matt Wheeler and Varun Hande, also known as the Punishers, explored some interesting feature engineering methods. They began by analyzing the time features and found that T_1 was right-skewed while T_5 was normally distributed. They then extracted the time data into a separate data frame and restructured it so that each feature value became its own column. That data frame was reduced to a lower dimension with PCA, with two embeddings containing most of the information. They also engineered numerical features by chunking them into bins and one-hot encoding them. The final dataset was constructed through groupby aggregation joined with the time PCA embeddings. They split the dataset 80:20 and tested different models such as linear regression, ridge (L2) regression, Huber regression, XGBoost, and Random Forest. They concluded that learning the story of the dataset, even without domain knowledge, and collaborating with others on feature engineering were crucial for success in machine learning. The final model achieved an error of 3.77464 after hyperparameter tuning.
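Their restructuring step can be sketched as follows: pivot long-format time measurements so each time step becomes its own column, then PCA-reduce the result. The column names here (patient_id, time_step, value) are assumptions for illustration, not the actual masked column names.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical long-format time data (column names are assumed, not from the dataset).
df = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "time_step":  [0, 1, 2, 0, 1, 2, 0, 1, 2],
    "value":      [97.0, 98.5, 99.1, 88.2, 87.9, 90.0, 92.3, 93.1, 91.8],
})

# Pivot so each time step becomes its own column, one row per patient.
wide = df.pivot(index="patient_id", columns="time_step", values="value")

# Reduce the time columns to two principal-component embeddings, similar to
# the team's approach of keeping the components that carry most of the information.
pca = PCA(n_components=2)
time_embeddings = pca.fit_transform(wide)
print(pca.explained_variance_ratio_)
```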

Ity Soni and Madhav Ponnudurai (LifeGaveUsLemons)

Ity and Madhav, also known as LifeGaveUsLemons, focused on exploring and analyzing the data to gain insights and improve their machine learning model. The team conducted exploratory data analysis (EDA) to identify null values, distributions, correlations, and outliers. They also examined feature importances using random forest permutation importance. They engineered features and transformed them using scaling, encoding, pivoting, and a cosine transformation. The team used PCA for feature reduction and tested different models such as linear regression, random forest, support vector regression (SVR), and Huber regression. They also explored different approaches to handling outliers and conducted hyperparameter tuning. The final model achieved an MAE of 3.77027 on the private leaderboard and 3.8591 on the public leaderboard. The team concluded that feature engineering is critical to improving machine learning models and that considering both in-class and out-of-class models helps with model selection. They also noted the importance of dealing with outliers in health data.
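The permutation-importance step they mentioned can be sketched with scikit-learn as below, using synthetic stand-in data rather than the masked competition features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic stand-in data (the real features are masked in the competition).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 6))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.3, size=300)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Permutation importance: shuffle one feature at a time and measure
# how much the model's score degrades.
result = permutation_importance(forest, X, y, n_repeats=10, random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"feature {i}: importance {imp:.3f}")
```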

Harsh Praharaj and Chenxi Li (XGBoostDonkey)

In this project, Harsh and Chenxi, also known as XGBoostDonkey, focused on understanding the data and preprocessing it to build better models. They handled redundant information by removing mean and median values. The team then conducted exhaustive modeling and tested different approaches to handling outliers. They achieved an MAE of 4.326 using linear regression with outliers removed at 2.5 standard deviations, 3.675 using SVR with a linear kernel and outliers removed, and 3.65 using SVR with the latest lag features added. They concluded that feature engineering is important and highlighted the value of documenting models. They also noted that simpler models can be better in some cases.
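Their outlier-removal approach can be sketched as follows on synthetic data: drop rows whose target lies more than 2.5 standard deviations from the mean, then fit a linear-kernel SVR. This illustrates the technique they named, not their actual code.

```python
import numpy as np
from sklearn.svm import SVR

# Stand-in data with a few injected outliers (illustrative only).
rng = np.random.default_rng(7)
X = rng.normal(size=(400, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.4, size=400)
y[:10] += 25.0  # inject outliers

# Drop rows whose target lies more than 2.5 standard deviations from the mean,
# mirroring the outlier threshold the team reported.
mask = np.abs(y - y.mean()) <= 2.5 * y.std()
X_clean, y_clean = X[mask], y[mask]

# SVR with a linear kernel, as in the team's best-performing setup.
model = SVR(kernel="linear").fit(X_clean, y_clean)
print(f"kept {mask.sum()} of {len(y)} rows")
```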

Takeaways

Overall, the Advanced Machine Learning course offers an excellent opportunity for students to develop a deep understanding of advanced machine learning topics and their practical applications. The Data Competition Project provides an engaging and interactive experience that simulates real-world problem-solving processes in data science. With a constantly updated curriculum and practical, real-world experience, students are well-equipped to excel in their careers as data scientists.

If you are interested in learning more about the Master’s in Data Science at the University of San Francisco, don’t hesitate to visit our website!
