Becoming a Data Scientist at 67 years old, Part 1
A while back, I wrote a story about becoming a web developer at 66 years old (for my nonprofit).
I am a decent web developer now, and the flagship application I wrote for my nonprofit runs on AWS (https://sibsforever.org). Sibs Forever, my nonprofit company, gets $1000 of cloud credits annually from AWS, which covers all hosting expenses. The architecture of the Sibs Forever application is described in this Medium article.
When I completed phase 1 of the application, I reached out to many NPOs attempting to get my service listed on one of their online resources pages. In the process, I discovered disorganized pages of seemingly random links, many of which were stale, broken, or downright misleading. This discovery led to my idea of writing an application that would curate, rank, and offer crowdsourced reviews for online resources (e.g., for grief, cancer, and eating disorders). The idea is quite simple: when someone is searching for an online resource, this application will return only up-to-date, relevant, and appropriate links. Easier said than done.
After considering this further, I realized this project would have a significant machine learning (ML) piece: scraped search results and the corresponding web pages would provide the data needed to feed the ML model, supplemented by clickstream and crowdsourced review data. I purchased the domain onlineresources.org, created a Trello board, and then turned my attention to how I would become a data scientist for this project and where I would get the voluminous data needed to train the ML model. Before founding my nonprofit, I hadn’t written code in many years, having spent that time consulting and holding various leadership positions, and in the 40+ years before that I had never worked in this space at all.
Are there organizations providing pro bono data?
And the answer is… Yes! I wanted unfettered access to Google and Bing search results and the ability to scrape individual web pages for analysis using natural language processing (NLP) techniques.
Many companies offer web scraping and parsing services, and I reached out to more than a dozen of them. Most extended a discount ranging from 10% to 40%. Two offered their services pro bono (i.e., for free), with the caveat that they receive credit somewhere on the onlineresources.org website. I’m excited to tell you about them:
1. The Bright Initiative: Here is a description from their website:
The Bright Initiative is a global program and organization that uses public web data to drive positive change. Powered by Bright Data, one of the world’s most powerful web data platforms, the Initiative provides public bodies, non-profit organizations, and academic institutions around the world with data and expertise to tackle the most pressing global issues of our time. To date, the Initiative includes over 350 organizations and universities, such as Princeton University, Oxford University, Virginia Tech, and many more.
2. SerpAPI: SerpAPI offers a fast and easy-to-use API to more than 20 search engines, including Google and Bing. I tested the API and found it to be reliable and complete, and SerpAPI granted my NPO the 100K free monthly searches I requested (a short usage sketch follows this list). Here is a snippet from their website:
Leverage our infrastructure (IPs across the globe, full browser cluster, and CAPTCHA solving technology), and exploit our structured SERP data in the way you want.
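To give a flavor of what this looks like in practice, here is a minimal sketch of pulling Google results through SerpAPI’s official Python package (google-search-results). The query string, result count, and printed fields are illustrative choices, not the project’s actual code:

```python
# A minimal sketch of querying Google through SerpAPI's official Python
# package (pip install google-search-results). The query string, result
# count, and printed fields are illustrative, not the project's real code.
from serpapi import GoogleSearch

params = {
    "engine": "google",   # SerpAPI also supports Bing and many other engines
    "q": "online grief support resources",
    "num": 20,            # ask for 20 organic results
    "api_key": "YOUR_SERPAPI_KEY",
}

results = GoogleSearch(params).get_dict()

# Each organic result is structured JSON with the fields a curation
# model would start from: position, title, link, and snippet.
for item in results.get("organic_results", []):
    print(item["position"], item["title"], item["link"])
```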
The Journey to Become a Data Scientist
This is a definition of Machine Learning from https://www.geeksforgeeks.org/machine-learning/: Machine Learning is the field of study that gives computers the capability to learn without being explicitly programmed. ML is one of the most exciting technologies one could come across. As is evident from the name, it gives the computer something that makes it more similar to humans: the ability to learn. Machine learning is actively being used today, perhaps in many more places than one would expect.
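To make that definition concrete, here is a toy scikit-learn example of a model inferring a labeling rule from examples rather than from hand-written rules. The snippets and labels below are fabricated purely for illustration:

```python
# A toy example of "learning without being explicitly programmed": no rule
# for spotting junk links is coded anywhere; the model infers it from
# labeled examples. The snippets and labels are fabricated for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

snippets = [
    "weekly grief support group for siblings",
    "buy cheap watches online free shipping",
    "counseling resources for bereaved families",
    "limited time offer click here now",
]
labels = [1, 0, 1, 0]  # 1 = relevant resource, 0 = junk

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(snippets, labels)

# The fitted model generalizes to text it has never seen.
print(model.predict(["online support group for loss of a sibling"]))
```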
The title of this article includes the words “Part 1” for two reasons:
- Gaining the necessary ML background is a long process. Besides learning Python, there are numerous tools to master, such as Conda, Jupyter Notebooks, pandas, NumPy, and spaCy. There is also the associated math theory and statistics, which I’m excited to explore. I was a mathematics major in college and loved using applied math in professional projects; it’s only happened a handful of times in more than 40 years, so I’m stoked about this.
- The project itself is enormous, more extensive than designing and building https://sibsforever.org. Creating an end-to-end data-driven pipeline is challenging and demanding. Writing the user application to support crowdsourced reviews and ranked data delivery will be difficult and time-consuming. And so much fun.
I will describe what I have done so far and cover the rest in future write-ups (e.g., Part 2, Part 3).
High-Level Objectives
1. I will strive to be as cloud-agnostic as possible. The various clouds offer credit and grant opportunities for NPOs, and since this project will be very resource-intensive, it should be able to run on any cloud or be multi-cloud. For this reason, I’ve chosen Databricks (managed Apache Spark in the cloud) as the foundation, since it is available on all three major clouds, including as Azure’s first-party Azure Databricks service. I will handle cloud-specific configuration parameters appropriately (a minimal sketch follows this list).
2. As a result of objective 1, I will only use widely available libraries (such as Scikit-learn and spaCy) and minimize my use of cloud-specific services and libraries.
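As a hypothetical sketch of what “handling cloud-specific configuration appropriately” might look like, the snippet below isolates per-cloud storage roots behind a single lookup. The storage URIs are placeholders, not the project’s real buckets or containers:

```python
# A hypothetical sketch of keeping cloud-specific settings behind a single
# lookup so notebooks stay portable across clouds. The storage URIs are
# placeholders, not the project's real buckets or containers.
CLOUD_CONFIG = {
    "aws":   {"storage_root": "s3://onlineresources-data"},
    "azure": {"storage_root": "abfss://data@onlineresources.dfs.core.windows.net"},
    "gcp":   {"storage_root": "gs://onlineresources-data"},
}

def storage_path(cloud: str, relative_path: str) -> str:
    """Build a full storage URI for whichever cloud the job is running on."""
    return f"{CLOUD_CONFIG[cloud]['storage_root']}/{relative_path}"

# The same notebook code works unchanged on any of the three clouds.
print(storage_path("azure", "scraped/serp/2022-06"))
```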
Learning Databricks
I completed the following Udemy course to come up to speed on Databricks:
- Azure Databricks and Spark core for Data Engineers: Azure Databricks is Azure’s first-party integration with Apache Spark, and my ML notebooks will execute in this environment (a representative notebook cell follows).
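For illustration, here is the kind of PySpark cell such a notebook might contain: loading a day’s scraped search results and counting which domains appear. In a Databricks notebook the spark session is provided automatically; the builder line below just makes the snippet runnable locally. The input path and the link column are hypothetical:

```python
# A representative PySpark cell. In a Databricks notebook the `spark`
# session is provided automatically; the builder line below just makes the
# snippet runnable locally. The input path and `link` column are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("serp-exploration").getOrCreate()

# Load one day of scraped search results (one JSON object per line).
serp = spark.read.json("/data/serp/2022-06-01.jsonl")

# Count distinct domains among the results -- a first, crude ranking signal.
(serp.withColumn("domain", F.regexp_extract("link", r"https?://([^/]+)", 1))
     .groupBy("domain")
     .count()
     .orderBy(F.desc("count"))
     .show(10))
```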
Machine Learning and Natural Language Processing Training
- I’ve chosen Python for this project because its libraries extensively support data science and machine learning; examples include NumPy, Matplotlib, spaCy, and Scikit-learn (see the spaCy sketch after this list). I hadn’t used Python in decades, so I enrolled in this course: Complete Python Developer in 2022. Since Python is an easy language, and Andrei Neagoie is a fabulous instructor (which I already knew from taking his Advanced JavaScript class), the course breezed by.
- I completed Complete Machine Learning and Data Science Bootcamp in 2022, a comprehensive project-based course that covers various topics using widely available tools. It was time-consuming as there was a lot to learn.
- I have just started this course: Natural Language Processing in Python, which covers the relevant math, statistics, and modern neural network architectures.
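As a small taste of the spaCy work ahead, the sketch below turns a made-up resource-page snippet into simple features (content-word lemmas and named entities) that a ranking model could consume. It assumes the en_core_web_sm model has been downloaded:

```python
# A small spaCy sketch: turning a made-up resource-page snippet into simple
# features (content-word lemmas and named entities) for a ranking model.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

text = ("Our weekly grief support group meets online every Tuesday "
        "and is facilitated by licensed counselors in Boston.")
doc = nlp(text)

# Content-word lemmas: a crude bag-of-words feature set.
lemmas = [t.lemma_.lower() for t in doc if t.is_alpha and not t.is_stop]
print(lemmas)

# Named entities (dates, places, organizations) as relevance signals.
print([(ent.text, ent.label_) for ent in doc.ents])
```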
Till Next Time
As you can see, this is a big project. So far, it has been a great learning experience and a lot of fun. Stay tuned — I will publish follow-up articles documenting my progress and lessons learned.