Data Science Undergrad Journey Part 2

Part 2: School Courses I Find Helpful

Alison Yuhan Yao
Nerd For Tech
9 min read · Jul 26, 2021

Photo by Matt Ragland on Unsplash

Are you transitioning into Data Science? Or are you a Data Science undergrad like me? What would I do differently if I were to start over again?

Intro

In my last post, I shared my story of why I chose Data Science as my undergraduate major.

I realize that, unlike me, many current Data Scientists transitioned from other fields and job positions. They either slowly picked up the skills at work or spent time outside of work learning Data Science bit by bit. I feel lucky to be one of the first ‘orthodox’ Data Science undergrad students. Being in an undergrad program saves me the time and energy of designing a curriculum on my own.

In this post, I will list the Data Science and related classes I have taken. I will also briefly summarize the class projects I have done (code provided). Hopefully, this post can give you some idea of what a DS undergrad program is like.

DS & DS-related Courses

First of all, as mentioned in my last post, I am a rising senior majoring in Data Science. Our school asks students to choose a track or concentration for Data Science. My concentration is Artificial Intelligence, so the following is just an example of the relevant classes I have taken. Students in tracks such as Finance or Political Science need 2–3 domain classes of their choice instead of the AI classes I list below.

Prerequisite

  • Intro to Computer Programming (Python, no project)
    It is an introductory Python class intended for students with no prior knowledge of programming. The class covers basic data types such as string, int, float, list, tuple, and dictionary, basic control flow such as for loops and while loops, functions, and file input/output.
  • Calculus (Math)
    Our school does not split Calculus into 1, 2, and 3. Our Calculus class covers both 1 and 2 in one semester. Integrals and derivatives are used everywhere: in all other Math classes and in any Computer Science/Data Science class that involves Math.

Major Requirement

Programming classes:

  • Intro to Computer Science (Python, 1 final project)
    This class continues to teach Python, namely OOP and recursion with dynamic programming. The class covers a little bit of everything: Computer Architecture, Algorithms, Machine Learning, Deep Learning, and AI. We coded KNN from scratch and I heard about artificial neural networks for the first time.
    For the final project, we used Python to make a chat system with a GUI and an integrated Pygame game.
  • Data Structure (Python, no project)
    This class reinforces our knowledge of OOP, recursion, and time & space complexity. It also covers data structure building blocks such as stacks, queues, linked lists, trees, hash tables, maps, sets, and graphs. Algorithms covered include sorting algorithms, searching algorithms, graph explorations (DFS, BFS), shortest path algorithms, minimum spanning trees, selection algorithms, priority queues, etc. What I found tricky was figuring out how to combine different data structures to solve a problem more efficiently.
  • Databases (SQL & Python & frontend languages, 1 final project)
    This class teaches relational databases, SQL syntax, and the internals of database management systems, such as indexing techniques, query processing algorithms, and transaction management techniques. SQL is one of the essential skills for a data scientist, so learning SQL early probably increases one’s chance of getting an internship.
    For the final project, we built the entire frontend (HTML, CSS, JavaScript, and Bootstrap) and backend (Python Flask) of a web app for an airline’s customers, booking agents, and staff.
  • Intro to Optimization and Mathematical Programming (GAMS, 1 final project)
    I seldom see classes like this in other Data Science curricula. This class covers model building for infrastructure systems, nonlinear programming and optimality conditions, linear programming and duality theory, network optimization models, and integer programming. What was novel to me was the GAMS software, which is a specialized tool for solving optimization problems. Machine Learning and Deep Learning algorithms essentially solve optimization problems, so this class is helpful in terms of understanding the underlying mechanisms.
    For the final project, we optimized the shuttle bus schedule for our school to cut costs while satisfying the demands of the students.
  • Machine Learning (Python, 1 final project)
    The class was split into 7 weeks of Machine Learning and 7 weeks of Deep Learning. Models covered include Naive Bayes, Logistic Regression, Linear Regression, graphical models, EM, clustering, neural networks, CNNs, and RNNs. This class changes vastly every semester, and I feel like my semester did not cover as much Machine Learning as others did, but I made up for some of it in my Business Analytics course.
    In the final project, we compared several neural networks to design a traffic sign image classifier. It was right after this class that I got my internship in computer vision.
  • Regression and Multivariate Data Analysis (Minitab or R, 5 homework projects)
    This class is a graduate-level class that I really enjoyed. It covers simple and multiple linear regression, how to check assumptions, how to address the violation of assumptions, time series analysis, ANOVA and ANCOVA, and logistic regression. We could use any stats software or programming language we liked. I went with the default option and learned Minitab from scratch.
    We had 5 homework projects in total. We could decide the topics on our own and collect our own data. As a big Marvel fan and TV show fan, I had a lot of fun analyzing the Chinese box office performance of superhero movies and predicting whether my favorite shows are going to be renewed or canceled.
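
To give a flavor of what "coding KNN from scratch" (mentioned under Intro to Computer Science) can look like, here is a minimal sketch. It is not the code from the class: the function name `knn_predict` and the toy data are made up for illustration, and it assumes Euclidean distance with a simple majority vote.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up): two small clusters labeled "a" and "b"
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.1, 1.0)))  # query near the first cluster
print(knn_predict(points, labels, (5.1, 5.0)))  # query near the second cluster
```

Sorting all distances is the simplest approach and is fine for small data. This is also where the Data Structure class comes in: a heap or a spatial tree would be a reasonable choice once the dataset grows.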

AI Classes:

  • Reinforcement Learning (Python, 1 final project)
    The class is divided into 7 weeks of tabular RL and 7 weeks of Deep Reinforcement Learning. The first half covers Multi-armed Bandits, Markov Decision Processes, Dynamic Programming, Monte Carlo Methods, Temporal-Difference Learning, and Dyna-Q. The second half covers Policy Gradient/REINFORCE, Actor-Critic, DQN, DDPG, TD3, SAC, and interesting examples like AlphaGo and Atari games.
    The final project is a research project instead of a DRL application. I ran experiments to see whether a new method, the multi-step method, can replace the existing clipped double-Q method in terms of bias reduction. On top of that, I tested if combining clipped double-Q and the multi-step method can yield a better result.
  • Natural Language Processing (Python, 1 final project)
    This class is the most difficult one I’ve had so far. The class covers N-gram language models, word embeddings, neural network language models, sequence labeling, expectation maximization, neural sequence modeling, GPT, encoder-decoder models, Transformers, and BERT. This class covered a lot at a swift pace, but we eventually got to the state-of-the-art models that work wonders.
    In the final project, we used BERT-based models to convert questions in Chinese to SQL statements. Special thank you to Google Colab GPU!
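
The NLP topics above start with the N-gram language model, which is simple enough to sketch in a few lines. The following is an illustrative bigram model using plain maximum-likelihood counts with no smoothing. The function names (`train_bigram`, `bigram_prob`), the `<s>`/`</s>` boundary tokens, and the tiny corpus are my own assumptions for this example, not material from the class.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count how often each token follows each previous token."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Pad with start/end markers so sentence boundaries are modeled too
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, curr in zip(tokens, tokens[1:]):
            counts[prev][curr] += 1
    return counts

def bigram_prob(counts, prev, curr):
    """Maximum-likelihood estimate of P(curr | prev); 0.0 if prev was never seen."""
    total = sum(counts[prev].values())
    return counts[prev][curr] / total if total else 0.0

# Tiny made-up corpus
corpus = ["I like data science", "I like machine learning", "I love data"]
model = train_bigram(corpus)

print(bigram_prob(model, "i", "like"))     # "like" follows "i" in 2 of 3 sentences
print(bigram_prob(model, "like", "data"))  # "data" follows "like" in 1 of 2 cases
```

Unseen bigrams get probability zero here, which is exactly the weakness that smoothing, and later the neural language models on the list, try to address.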

Math classes:

  • Probability and Statistics (Math)
    This class covers basic probabilistic concepts, including sample space, random variable, probability distribution, moment, mean, variance, and correlation. It also introduces various discrete and continuous probability distributions, including Bernoulli, binomial, geometric, Poisson, uniform, normal and exponential, as well as transformations of random variables. It also teaches the law of large numbers, central limit theorem, statistical inference, and hypothesis testing. I find the concepts covered highly important. Statistics is ubiquitous in Data Science and probability is the cornerstone of, for instance, Reinforcement Learning.
  • Linear Algebra (Math)
    This class covers systems of linear equations, matrix operations, determinants, vector spaces, eigenvalues and eigenvectors, orthogonality and least squares, symmetric matrices, spectral decomposition, quadratic forms, etc. I don’t think the more advanced parts of linear algebra appear often in other classes, but the basics like matrix operations are used frequently in, for example, Deep Learning.
  • Multivariable Calculus (Math)
    Building on Calculus, Multi Calc teaches sequences and series, the geometry of space, vector functions, partial derivatives, multiple integrals, and vector calculus. It is useful in understanding the mechanism of Stochastic Gradient Descent, etc.
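
To make the link between partial derivatives and Stochastic Gradient Descent concrete, here is a small sketch that fits a line y = w·x + b with SGD. The function name `sgd_linear_regression`, the learning rate, the number of epochs, and the noiseless toy data are all assumptions chosen for illustration. For one sample, the squared error is (w·x + b − y)², so its partial derivatives are 2·error·x with respect to w and 2·error with respect to b.

```python
import random

def sgd_linear_regression(xs, ys, lr=0.1, epochs=500, seed=0):
    """Fit y = w*x + b by stochastic gradient descent on the squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a random order each epoch
        for i in indices:
            error = (w * xs[i] + b) - ys[i]
            # Step against the partial derivatives of error**2 w.r.t. w and b
            w -= lr * 2 * error * xs[i]
            b -= lr * 2 * error
    return w, b

# Noiseless toy data generated from y = 2x + 1
xs = [i / 10 for i in range(11)]
ys = [2 * x + 1 for x in xs]

w, b = sgd_linear_regression(xs, ys)
print(round(w, 3), round(b, 3))  # should land close to 2 and 1
```

Each update uses a single sample rather than the whole dataset, which is the "stochastic" part. Real ML libraries do the same thing with mini-batches and far more parameters.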

Related Classes

  • Thinking, Learning, and Consciousness of Humans and Machines
    I personally believe that ethical practice should be a mandatory component for any tech-related major. This class is not a major requirement, but topics like responsible Data Science, AI biases, and opportunities and limitations of technologies are enlightening.
  • Business Analytics (Python & R, 1 competition)
    This is a business class that applies Machine Learning models to real-life business cases. The class covers linear regression, logistic regression, KNN, regression trees, clustering, bias-variance tradeoff, CART trees, regularization, feature engineering, PCA, causal inference and the potential outcomes model, etc. The class felt like a Machine Learning class with less theory and more application.
    We had an in-class Kaggle competition at the end of the semester, in which our team won first place in predicting takeout delivery time.
  • Experiential Learning Seminar (I used Python, 1 final project)
    This class is a unique entry on my transcript. It triples the workload of a normal class because students need to have an internship as well. Outside of class, I had a Data Science internship that occupied 2 days per week (I will save everything about internships for my next blog post). In class, I did a Social Science project on the topic of Data Science. Since the question I was trying to answer was too new to be covered by existing literature or databases, I had to temporarily forget about my Data Science skills and resort to conducting interviews instead. That’s why it was formulated as a social science project. However, since I was studying NLP at the same time, my DS instinct kicked in and I managed to do some analysis on the interview transcripts (they are text data) in the end.
  • Discrete Mathematics (Math)
    This class is a prerequisite for multiple CS classes. I never needed it as a prerequisite, but I find topics like time complexity, big-O notation, recursion logic, matrices (Linear Algebra alert), and probability useful.

Future Plan

As an undergrad student, I am still exploring a little bit of everything. Now that I have finished all of my major requirements apart from the final capstone project, I can take a breath and explore more Arts and Media courses in my senior year. But at the same time, here is what I plan for Data Science:

  • Learn more about R
    You probably noticed that most of the programming courses I listed above use Python. Python has won the war against R, but during my DS internship, I noticed that R is a great tool that’s still widely used in many companies.
  • Data Visualization
    No one can resist a beautiful plot. Yes, Matplotlib and Seaborn have been in our arsenal all along, but as a frontend developer myself, I have always wanted to learn d3.js and create amazing interactive plots.
  • Learn more about Stats
    The classes I’ve taken do not cover as much Statistics application as I would like. Sometimes the good old statistical models can win over the business people and actually be deployed to create business impact. Black box models do not necessarily have an edge.
  • Capstone Project
    All DS students need to complete a capstone project to check the last box of graduation requirements. There is still plenty of time for me to learn new things, but I presume that my capstone project may lie somewhere between Computer Vision and Reinforcement Learning.

Conclusion & Reflection

I’ve enjoyed all of the DS & DS-related courses I’ve had so far. Project-based learning suits me well. My programming skills certainly improved over time. The algorithms in class are absolutely intriguing and intellectually stimulating. These courses draw on each other and some key ideas echo throughout the years. For example, the idea of approximate optimization is the foundation of Stochastic Gradient Descent, which appears everywhere in ML and DL. The tradeoff of exploration vs exploitation in Reinforcement Learning is also probably one of my most important revelations in life.
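
The exploration vs exploitation tradeoff is easiest to see in a multi-armed bandit. Below is a rough sketch of the epsilon-greedy strategy: with probability `epsilon` the agent tries a random arm (exploration), and otherwise it pulls the arm with the best estimated reward so far (exploitation). The function name `epsilon_greedy_bandit`, the Gaussian reward model, and the arm means are assumptions made up for this example.

```python
import random

def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=5000, seed=0):
    """Run epsilon-greedy on a bandit whose arms pay Gaussian rewards."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    estimates = [0.0] * n_arms  # running average reward per arm
    pulls = [0] * n_arms        # how many times each arm was pulled
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick a random arm
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = rng.gauss(true_means[arm], 1.0)
        pulls[arm] += 1
        # Incremental update of the sample mean for the chosen arm
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
    return estimates, pulls

# Hypothetical arms; the last one has the highest expected reward
estimates, pulls = epsilon_greedy_bandit([0.2, 0.5, 1.5])
print(pulls)      # the best arm should collect most of the pulls
print(estimates)  # estimates for well-explored arms approach the true means
```

With `epsilon=0` the agent can get stuck on whichever arm looked good first, and with `epsilon=1` it never uses what it has learned. The interesting behavior lives in between.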

Image by Author

If I were to plan my classes differently, I would probably fix the order of my classes. I would have:

  • taken Databases earlier
    I think it would have been wiser to learn SQL at an early stage, as it may better equip one for an internship. And internships are fun!
  • taken Regression and Multivariate Data Analysis earlier
    Regression models can come before other Machine Learning models. Regression models refresh one’s stats knowledge and serve as a gentle introduction to Machine Learning.

Other than that, things worked out pretty well for me :)

Acknowledgment

When planning this series, I came across Jaemin Lee’s article on a similar topic. He has a very different story to tell, but his article definitely became one of my inspirations. I was quite surprised to see that our course requirements differ a lot. None of my classes teaches Java or C++. And I have not taken as many Statistics classes. I certainly recommend checking out Jaemin’s blog.

Next Blog — Internship & Research

My internships and research opportunities in Data Science and AI have really opened my eyes to the tech world. I will dig deep into the details in the next blog post.

Thank you for reading my blog! I hope you find it helpful.

My Github: https://github.com/AlisonYao

My Kaggle: https://www.kaggle.com/alisonyao
