Not So Common Data Sources For Data Scientists

Brandon Cosley
Published in Thinking Fast
4 min read · Jul 21, 2022

Keep Your Skills Fresh with Fresh Data Sets

Photo by Possessed Photography on Unsplash

As data scientists, we need to keep our developer skills up to date, even if we are not developers. I lead quite a few data scientists in my day-to-day work, and they depend on me to steer the strategic direction of our efforts on behalf of the businesses we support. To contribute to how our solutions are shaped, I need to understand how to apply data science at a technical level. I need to be able to speak the speak, so to speak 😊

One great way to keep up to date with your skillset is to develop your own data science projects. But there is nothing that annoys me more than working with canned data sets that have been pre-engineered to nicely fit into machine learning models.

Why?

Because these pre-engineered data sets let us skip some of the most important data science skills, namely data engineering. Instead, I like to find my own data, engineer it to be machine readable, and then apply the machine learning algorithms I am attempting to learn.

The consequence of building projects like this is that I keep my data engineering skills fresh, maybe even expand them a bit by grabbing data from new sources (e.g., web scraping, API calls, etc.), and end up with projects that are unique.

Uniqueness is important for two reasons when it comes to projects. First, as I add them to my portfolio, they are more likely to grab attention as they are simply not the same as some of the more watered-down tutorials we often find. And second, they have the potential to be more useful because they are built in service of using real, often live, data. Net effect, these portfolio projects continue to improve my visibility in the data-driven community.

But finding open and freely available data is not always easy. That brings me to the next question I want to help answer.

Are you struggling to figure out what data might help your business?

Data scientists looking for free but unique data is one thing. Businesses looking for free and open data that may help their business is a closely related but more specific problem.

Whether you are developing your career or your business, it is important to understand what data are available and how timely they are.

Availability & Timeliness:

To say that there is a ton of data available for free online would be an understatement. Indeed, the availability of free data is massive. The problem, however, is that most of that data is either too aggregated to be relevant to a specific business concern or not updated often enough to be currently useful, and sometimes both.

Thus, we must go through the hard work of curating our own data pipelines from existing sources that do give us access to up-to-date information.
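One way to make the timeliness concern concrete is to record when each source was last refreshed and flag the ones that have gone stale. The sketch below is only an illustration: the source names, the dates, and the 7-day threshold are all hypothetical, and the right threshold depends on the business question.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_updated: datetime, max_age: timedelta, now: datetime) -> bool:
    """Return True when a source has not been refreshed within max_age."""
    return now - last_updated > max_age


def fresh_sources(sources: dict[str, datetime], max_age: timedelta, now: datetime) -> list[str]:
    """Names of the sources refreshed recently enough to be useful, sorted."""
    return sorted(name for name, ts in sources.items() if not is_stale(ts, max_age, now))


# Hypothetical catalog: source name -> last refresh time (UTC).
catalog = {
    "annual_survey": datetime(2021, 12, 31, tzinfo=timezone.utc),
    "daily_listings": datetime(2022, 7, 20, tzinfo=timezone.utc),
}
as_of = datetime(2022, 7, 21, tzinfo=timezone.utc)
usable = fresh_sources(catalog, timedelta(days=7), as_of)
print(usable)
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check easy to test and to rerun for past dates.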

UNIQUE OPEN DATA SOURCES

Here are just a few ideas for finding data that are free, live, and may be useful when considering your next data science portfolio project or considering data that may have immediate relevance to your business needs:

- Your email address for collecting newsletters from competitors, Google News Alerts on specific topics, or any other information you choose to sign up for and have forwarded to your email account

- Yelp Fusion API (access business data with 5,000 API calls for free every day)

- USPTO Patent API (access patent data from the USPTO for free)

- Census.gov API (not necessarily timely but freely accessible and still useful for certain business needs)

- Spotify API (yep, that Spotify! who doesn’t like music, right? the API has a rate limit but can be useful for acquiring data on the latest

- Reddit API (the front door of the internet, right?)

- Zillow API (real estate data)

- Weather.com API (weather data)
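To give a feel for the data engineering step these sources involve, here is a minimal sketch of pulling from the Census.gov API and reshaping the result into records. The dataset path (`2020/dec/pl`) and the variable name `P1_001N` are assumptions to check against the Census documentation, and the sample payload is made up; only its shape (a header row followed by data rows) is meant to mirror what the API returns.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: verify the dataset path against the Census API docs.
BASE_URL = "https://api.census.gov/data/2020/dec/pl"


def build_url(variables: list[str], geography: str) -> str:
    """Build a query URL from the variables to fetch and a geography filter."""
    query = urlencode({"get": ",".join(variables), "for": geography}, safe=",:*")
    return f"{BASE_URL}?{query}"


def rows_to_records(payload: list[list[str]]) -> list[dict[str, str]]:
    """Turn a header-row-plus-data-rows payload into a list of dicts."""
    header, *rows = payload
    return [dict(zip(header, row)) for row in rows]


def fetch_records(url: str) -> list[dict[str, str]]:
    """Download and parse one response (needs network access)."""
    with urlopen(url, timeout=30) as response:
        return rows_to_records(json.load(response))


# Offline illustration with a made-up payload in the assumed shape.
sample = [["NAME", "P1_001N", "state"], ["Example State", "12345", "99"]]
records = rows_to_records(sample)
url = build_url(["NAME", "P1_001N"], "state:*")
print(url)
print(records)
```

Keeping the reshaping logic in `rows_to_records`, separate from the network call, means it can be tested without hitting the API. The other sources in the list generally require an API key or OAuth token, so their request code would need an extra authentication step.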

So, in wrapping up, live data are often the most useful business data: they can help you build more useful data science portfolios and contribute to powerful business intelligence. Because most of these sources rely on APIs, it is important to note that APIs have some disadvantages. The biggest is that the APIs are maintained by someone else, so your ability to keep using them depends on the company’s choice to allow you access.
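One hedge against losing access is to keep your own copy of every raw response you pull, so the history you have already collected survives even if the API goes away. The following is a minimal sketch under that assumption; the function names, the file naming scheme, and the `example_api` payloads are hypothetical. Check each provider's terms of use before storing its data.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_snapshot(payload: object, directory: Path, source: str, fetched_at: datetime) -> Path:
    """Write one raw API response to disk, named by source and fetch time."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{source}_{fetched_at:%Y%m%dT%H%M%SZ}.json"
    path.write_text(json.dumps(payload), encoding="utf-8")
    return path


def load_latest(directory: Path, source: str) -> object:
    """Return the most recent snapshot for a source, or None if there is none."""
    # Timestamped names sort chronologically, so the last one is the newest.
    snapshots = sorted(directory.glob(f"{source}_*.json"))
    if not snapshots:
        return None
    return json.loads(snapshots[-1].read_text(encoding="utf-8"))


# Hypothetical usage with a throwaway directory and made-up payloads.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    save_snapshot({"count": 1}, store, "example_api", datetime(2022, 7, 20, tzinfo=timezone.utc))
    save_snapshot({"count": 2}, store, "example_api", datetime(2022, 7, 21, tzinfo=timezone.utc))
    latest = load_latest(store, "example_api")
    missing = load_latest(store, "other_api")
print(latest)
```

Storing the untouched response, rather than only the engineered features, also lets you redo the data engineering later if your approach changes.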

Like engaging to learn about data science, career growth, life, or poor business decisions? Sign up for my newsletter here and get a link to my free ebook.
