Data Science Project Based Learning

Timmy Chan
Published in Ed-Tech Talks
10 min read · Mar 1, 2022
Data and analytics-enabled industry and business transformation (Cao, 2018)

“How can I teach myself data science?” is a common question among those of us transitioning to data science. In essence, when self-teaching, we create a learning environment and embody it as both instructor and learner. When journaling or blogging about our learning process, data scientists share qualitative data on learning environments through narratives. This post adds to the vast torrent of posts on self-teaching data science, with guidance from learning science research: using the science of learning to inform how one can learn the science of Big Data.

With this in mind, after writing about my positionality and briefly pondering the ethical side of data science with respect to qualitative sensibilities, I set out to document the planning phase of my learning process for Data Science.

How People Learn: Data Science

Borrowing the How People Learn framework from Learning Sciences, I generated the following guiding questions for this project:

  1. Learner-centered: What do I bring to the table before starting? Since this article is on self-teaching, reflecting on one’s positionality is key to reflexivity.
  2. Knowledge-centered: What do I need to demonstrate to say I know Data Science? What are the technologies that need to be used? What are core constructs and skills in Data Science? What are the inquiry practices/strategies of reasoning in Data Science?
  3. Assessment-centered: How do I demonstrate mastery, in each particular case? What are ways to determine high quality, expert work; how do novices see their own growth?
  4. Community: How and where can novices engage in a community of practice with experts? Where are the common forums for data scientists to constructively critique each other’s work and to collaborate?
How People Learn (Bransford et al., 2000)

Who is Learning?

For introspection on my history as a learner, I practice reflective writing in the form of a blog. Because the learner is always changing, a habit of practicing reflexivity is key to quality research and to learning in general. And since this blog is published, perhaps one day a reader here might be trying to teach themselves data science too.

I earned a master’s in mathematics and my thesis was written using Python, so many data science fundamentals already exist in my toolkit. My background as a working-class, first-generation immigrant climbing through graduate school also gives me perspective during reflexivity exercises when approaching data.

Reflexivity refers to circular relationships between cause and effect, especially as embedded in human belief structures. A reflexive relationship is bidirectional with both the cause and the effect affecting one another in a relationship in which neither can be assigned as causes or effects. (Wikipedia)

Training in metacognition, pedagogy, and instructional planning helps inform quality documentation of the learning process and lesson planning. I practice qualitative sensibilities and methodologies first by working on Learning Sciences research; this lets me practice evaluating analytical techniques in depth as well as creating context-rich stories from data. This will certainly influence my approach, as well as what I choose to focus on in the learning process.

Data Science Core Competencies

A brief search yields several recent surveys of undergraduate and graduate data science curricula. This list of competencies is based on the 2021 ASEE Virtual Annual Conference Content Access Proceedings, which in turn is based on the ACM Data Science Task Force Report.

This gives us a working definition of the core competencies. Bolded are tools associated with the topic. I reordered the list to prioritize ethics and qualitative sensibilities as the very first question one should ponder before diving into Data Science.

  1. Data In Context: Teamwork, Economic Considerations, Ethics, Legality, Intellectual Property, and Communication.
    Note to Mathematicians: Those of us who have worked on projects in math education research, or transitioned laterally from academia to teaching positions, can be considered well equipped on these topics. Working with data privacy rules, compliance, and a university’s Institutional Review Board is experience that transfers.
  2. Mathematics and Statistics: calculus, discrete structures (number theory and abstract algebra), probability, statistics, linear algebra, real and complex analysis, topology and optimization. See my notes on these subjects.
  3. Computing Fundamentals: Operating system basics, File systems, Networks, Compilers/Interpreters, Data Structures, Algorithms & Complexity Analysis, Numerical Computing (R, NumPy, SciPy, etc.), Object-Oriented Programming (Python), Human-Computer Interaction
    Note to Mathematicians: SageMath is an academic mathematical computing system built on Python. If you’ve used it, put that on your resume, but do explain that it is Python.
  4. Data Management, Governance, Privacy: Acquisition (DAQ), Cleaning, Compression/Reduction (Lossy vs Lossless), Transformation and Integration (SQL), Integrity, and Security
  5. Data Visualization: Analysis, Presentation (Tableau), User-centered design, Interaction Design, Interface Design and development
  6. Machine Learning and AI: Logic-Based vs. Probability-Based knowledge representation and reasoning, planning and search strategies, Supervised Learning, Unsupervised Learning, Mixed Methods, Deep Learning
  7. Data Mining, Big Data: Problems of scale, big data computing architectures, parallel computing frameworks, distributed data storage, parallel programming, cloud computing, proximity measurement, data preparation, information extraction, cluster analysis, classification and regression, pattern mining, outlier detection, time series data, mining web data, information retrieval.

Assessment and Community

We stand on the shoulders of giants. Joining a community is key to learning. Photo Credit: Luke Tanis

Since self-teaching is informal, one must join communities to ensure critical evaluation of one’s work. While mathematical rigor can guide us to cogent analytical conclusions, one must also consider whether the conclusions make sense in context, which can be checked through peer review by other data scientists. Some key communities for Data Science are ODS (Open Data Science), Kaggle, and the IBM Data Science Community. R4DS is another great community, with an emphasis on R. More general platforms such as Reddit also host thriving programming and data science communities. And of course, it helps to follow good blogs such as TDS (Towards Data Science).

Over time, as one’s portfolio grows from code contributions to projects to eventually joining competitions, a novice may transition into an expert within those same communities. This process is exactly Legitimate Peripheral Participation in an informal learning context.

While knowledge acquisition and self-assessment can be done using textbooks, and one certainly should collect a library of textbooks on useful topics, my preferred approach is to complement surveying the literature with Inquiry-Based Learning paired with Project-Based Learning.

Centerpiece: Portfolio as Practicum

Data Science Trajectories (Martinez-Plumed et al., 2021) and Inquiry Based Learning (Pedaste et al., 2015)

“Data science is necessarily highly experiential; it is a practiced art and a developed skill. Students of Data Science must encounter frequent project-based, real-world applications with real data to complement the foundational algorithms and models.” (De Veaux et al., 2017)

Storytelling and communicating about data efficiently, with a narrative that is clear to all stakeholders, require practice. For this reason, creating a portfolio is a key component of successfully learning data science. At the core of the Data Science Trajectories is an inquiry process. The outer set of elements in the Data Science Trajectories figure maps neatly onto the orientation and conceptualization stages of inquiry-based learning. Project-based learning is an application of the inquiry-based learning framework, where each project models a small inquiry cycle or a sub-component of the cycle.

Components of a Data Science Portfolio:

Included work can be written, visual, or oral. Written work can include summaries of findings or thorough figure captions. Visual work can include static or interactive data visualizations. Oral work can include a recording of a presentation, a discussion among peers, or a think-aloud interview (Reinhart et al., 2020).

  1. Blog: This is a place for the learner to reflect and share learning by writing. Storytelling is an underrated skill (and writing, arguably, is the most difficult subject to teach).
  2. Important links to unique content: Professional social profiles such as LinkedIn, Twitter, and Facebook should be clearly displayed. Include a GitHub account to show version control and share code, and learn to use Jupyter notebooks.
  3. About: A positionality statement helps other researchers understand the unique lens one can bring to data science projects.
  4. Projects: What motivated the project? What was the outcome, both learning and product? This is a place to document growth as well.

Some Data Science Portfolio Project Ideas

Most of the following ideas are from a recent paper on the merits of portfolios as a pedagogical instrument in Data Science curriculum design (Nolan & Stoudt, 2021). I have added a couple here myself too.

  • Write a personal statement about oneself as a data scientist. Include a description of the researcher’s lenses, potential influences on the research, the research-project context and an explanation as to how, where, when and in what way these might, may, or have, influenced the research process (Darwin Holmes, 2020).
  • Write a survey paper on the process of inquiry and methodologies in Data Science.
  • Write a blog post about the process of learning and exploring in data science.
  • Collect data for a week and draw a Dear Data postcard.
  • Choose a data set. Describe the ideal data needed to answer a question of interest. Participate in Tidy Tuesday.
  • Exploratory data analysis (EDA). Write a blog post about the exploration process. Write and revise captions and/or alt-text for visualizations.
  • Preliminary analysis. Keep a data diary and write an accompanying blog post.
  • Get and give feedback on analysis. Create a single visualization that reveals a finding to the audience quickly. Participate in peer code review.
  • Outline report. Link visualizations together to tell one coherent story, that is, storyboard.
  • Read (and/or Write), analyze, and critique the structure of a formal, data-driven article.
  • Use a new tool to redo an analysis, and write about how it contrasts with your previous tool.
  • Write a blog post about a method.
  • Write a blog post about your findings.
  • Read (and/or Write), analyze, and critique the ending of a formal, data-driven article.
  • Practice writing an abstract. Explain your work in plain language using xkcd’s Simple Writer.
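To make the "collect data for a week" and exploratory data analysis ideas above a little more concrete, here is a minimal sketch using only the Python standard library. The `WEEK` data, the `summarize` and `text_bars` helpers, and the 15-minute bar unit are all hypothetical placeholders chosen for illustration; a real project would more likely use a notebook with a plotting library.

```python
import statistics

# Hypothetical week of self-collected data (minutes spent studying per day).
WEEK = {"Mon": 45, "Tue": 60, "Wed": 30, "Thu": 90, "Fri": 0, "Sat": 120, "Sun": 75}


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def text_bars(data, unit=15):
    """Draw one text bar per entry, with one '#' per `unit` of the value."""
    return [
        f"{label:>3} | {'#' * (value // unit)} {value}"
        for label, value in data.items()
    ]


if __name__ == "__main__":
    for name, stat in summarize(list(WEEK.values())).items():
        print(f"{name}: {stat}")
    print("\n".join(text_bars(WEEK)))
```

Even a tiny summary like this gives a data diary entry something to react to: which days stand out, and what question would be worth asking next.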

In this article I have described my plan for learning Data Science, along with research-informed common core concepts and project ideas. Springboard has a great guide on starting a data science portfolio as well. Each project is a component of the data science inquiry process, so one can choose a single data science question and generate many posts and artifacts from one project. Organizing all the projects in a way that is aesthetically pleasing and easy to navigate is yet another exercise in User Experience.

In the next article, I’ll explore dependencies among the data science core competencies, with the aim of visualizing prerequisites (different from professional progression, which is described in this article by Sequoia). I am tinkering with a web scraping script to gather data from college bulletins, with the goal of creating a visualization in the form of flowcharts.
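As a rough illustration of what a prerequisite flowchart could be built from, here is a minimal sketch, assuming the bulletin data has already been reduced to a mapping from each course to its prerequisites. The course names and the `PREREQS` mapping are hypothetical placeholders, not data from any real bulletin, and the scraping step itself is omitted. The sketch uses the standard-library `graphlib` module (Python 3.9+) to find a valid study order, and emits Graphviz DOT text that a tool such as `dot` could render as a flowchart.

```python
from graphlib import TopologicalSorter

# Hypothetical prerequisite data: course -> set of courses required first.
PREREQS = {
    "Calculus": set(),
    "Linear Algebra": {"Calculus"},
    "Probability": {"Calculus"},
    "Statistics": {"Probability"},
    "Machine Learning": {"Linear Algebra", "Statistics"},
}


def study_order(prereqs):
    """Return one valid order in which every course follows its prerequisites."""
    return list(TopologicalSorter(prereqs).static_order())


def to_dot(prereqs):
    """Render the graph as Graphviz DOT text, with edges from prerequisite to course."""
    lines = ["digraph prerequisites {"]
    for course in sorted(prereqs):
        lines.append(f'    "{course}";')
        for required in sorted(prereqs[course]):
            lines.append(f'    "{required}" -> "{course}";')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(study_order(PREREQS))
    print(to_dot(PREREQS))
```

A side benefit of the topological sort is that it raises `graphlib.CycleError` if the data contains circular prerequisites, which is a useful sanity check on scraped data.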

Timmy Chan is a mathematician actively seeking a data scientist role.


Works Cited

Aho, T., Sievi-Korte, O., Kilamo, T., Yaman, S., & Mikkonen, T. (2020). Demystifying Data Science Projects: A Look on the People and Process of Data Science Today. In M. Morisio, M. Torchiano, & A. Jedlitschka (Eds.), Product-Focused Software Process Improvement (Vol. 12562, pp. 153–167). Springer International Publishing. https://doi.org/10.1007/978-3-030-64148-1_10

Al Sarkhi, A., & Talburt, J. (2019). The Journal of Computing Sciences in Colleges Papers of the 17th Annual CCSC Mid-South Conference. https://doi.org/10.13140/RG.2.2.29810.12481

Brunsdon, C., & Comber, A. (2021). Opening practice: Supporting reproducibility and critical spatial data science. Journal of Geographical Systems, 23(4), 477–496. https://doi.org/10.1007/s10109-020-00334-2

Cao, L. (2018). Data Science: A Comprehensive Overview. ACM Computing Surveys, 50(3), 1–42. https://doi.org/10.1145/3076253

Darwin Holmes, A. G. (2020). Researcher Positionality — A Consideration of Its Influence and Place in Qualitative Research — A New Researcher Guide. Shanlax International Journal of Education, 8(4), 1–10. https://doi.org/10.34293/education.v8i4.3232

De Veaux, R. D., Agarwal, M., Averett, M., Baumer, B. S., Bray, A., Bressoud, T. C., Bryant, L., Cheng, L. Z., Francis, A., Gould, R., Kim, A. Y., Kretchmar, M., Lu, Q., Moskol, A., Nolan, D., Pelayo, R., Raleigh, S., Sethi, R. J., Sondjaja, M., … Ye, P. (2017). Curriculum Guidelines for Undergraduate Programs in Data Science. Annual Review of Statistics and Its Application, 4(1), 15–30. https://doi.org/10.1146/annurev-statistics-060116-053930

Demchenko, Y., Belloum, A., Los, W., Wiktorski, T., Manieri, A., Brocks, H., Becker, J., Heutelbeck, D., Hemmje, M., & Brewer, S. (2016). EDISON Data Science Framework: A Foundation for Building Data Science Profession for Research and Industry. 2016 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), 620–626. https://doi.org/10.1109/CloudCom.2016.0107

Developing a Master’s Degree Program in Data Science. (2021). Issues in Information Systems. https://doi.org/10.48009/3_iis_2021_58-68

Feelders, A., Daniels, H., & Holsheimer, M. (2000). Methodological and practical aspects of data mining. Information & Management, 37(5), 271–281. https://doi.org/10.1016/S0378-7206(99)00051-8

Kross, S., & Guo, P. J. (2019). Practitioners Teaching Data Science in Industry and Academia: Expectations, Workflows, and Challenges. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1–14. https://doi.org/10.1145/3290605.3300493

Kross, S., Peng, R. D., Caffo, B. S., Gooding, I., & Leek, J. T. (2017). The democratization of data science education [Preprint]. PeerJ Preprints. https://doi.org/10.7287/peerj.preprints.3195v1

Li, D., Milonas, E., & Zhang, Q. (2021). Content Analysis of Data Science Graduate Programs in the U.S. 2021 ASEE Virtual Annual Conference Content Access Proceedings, 36841. https://doi.org/10.18260/1-2--36841

Martinez, I., Viles, E., & Olaizola, I. G. (2021). Data Science Methodologies: Current Challenges and Future Approaches. Big Data Research, 24, 100183. https://doi.org/10.1016/j.bdr.2020.100183

Martinez-Plumed, F., Contreras-Ochando, L., Ferri, C., Hernandez-Orallo, J., Kull, M., Lachiche, N., Ramirez-Quintana, M. J., & Flach, P. (2021). CRISP-DM Twenty Years Later: From Data Mining Processes to Data Science Trajectories. IEEE Transactions on Knowledge and Data Engineering, 33(8), 3048–3061. https://doi.org/10.1109/TKDE.2019.2962680

Milonas, E., Li, D., & Zhang, Q. (2021). Content Analysis of Two-year and Four-year Data Science Programs in the United States. 2021 ASEE Virtual Annual Conference Content Access Proceedings, 36842. https://doi.org/10.18260/1-2--36842

National Research Council. (2000). How People Learn: Brain, Mind, Experience, and School: Expanded Edition. The National Academies Press. https://doi.org/10.17226/9853

Nolan, D., & Stoudt, S. (2021). The Promise of Portfolios: Training Modern Data Scientists. Harvard Data Science Review. https://doi.org/10.1162/99608f92.3c097160

Pedaste, M., Mäeots, M., Siiman, L., Jong, T., Riesen, S., Kamp, E., Manoli, C., Zacharia, Z., & Tsourlidaki, E. (2015). Phases of inquiry-based learning: Definitions and the inquiry cycle. Educational Research Review, 14. https://doi.org/10.1016/j.edurev.2015.02.003

Tang, R., & Sae-Lim, W. (2016). Data science programs in U.S. higher education: An exploratory content analysis of program description, curriculum structure, and course focus. Education for Information, 32(3), 269–290. https://doi.org/10.3233/EFI-160977

Timmy Chan, Ed-Tech Talks: Professional Software Engineer, Master Mathematician interested in learning and implementing multidisciplinary approaches to complex questions