Where Can I Find Data For My Analytics or Data Science Project?
There are Different Ways to Get Datasets and Get Creative
If you want to become a Data Analyst, Data Engineer or Data Scientist, you will almost certainly need to create projects of your own and showcase your work on GitHub (it’s free). A difficult part of planning this type of project is finding your dataset(s). Think about the topic that interests you most, then start searching! If you are a job seeker, ideally you should come up with something associated with the industry you want to get into.
Data is stored in different formats, so datasets will probably come as spreadsheets, CSV documents or plain text files. Please note that the ways of obtaining and preparing data vary in difficulty and require some technical and basic data management knowledge. The following bullet points describe the ways in which you may obtain data:
- Download it from the Internet. This is the simplest way to get your data. A few websites, such as Kaggle or Data.gov, offer free datasets. Be aware, though, that some files will be incomplete or incorrectly formatted, which means you may have to ‘clean’ or fix the set. Don’t waste your time trying to fix poor-quality data if the ‘damage is beyond repair’. Alternatively, there are other sites where you will probably have to pay for the data, but I recommend avoiding these if your goal is to develop a personal project.
- Create it yourself. An old-fashioned technique for obtaining data is through surveys. Plan your project’s questions and record the survey’s responses in a spreadsheet. When doing this, make sure your data is formatted appropriately. In addition, you don’t want to waste time launching a survey that leads to useless information unrelated to your project. A less common way to get data (for instance, temperature) is through sensors! The downside of this method is that you’ll probably need knowledge of circuits, electronics and a general-purpose programming language. A similar alternative would be to manually write down readings from other devices and compile them. From personal experience, I once did a data processing project by recording live-stream temperature data from an Arduino board connected to my laptop. If your goal is to have dummy (fake) data to play with and gain experience with data analytics tools such as Excel, SQL or Python, you may want to consider using Python’s Faker. This library generates random sets of data (names, dates, addresses, etc.); however, be aware that the values are random, so the data’s quality might not be the best.
- Code a web scraper. If you decide to use this method, please be aware that some websites prohibit scraping in their terms of use, and in some cases scraping their content or data may be illegal (and even if you try, it may be very difficult to do so!). A scraper is basically a bundle of code that reads a webpage’s HTML tags and extracts whatever content is inside those tags. Using your own code to fetch data from someone else’s webpage is generally considered okay as long as the site allows it, the data is not personal information and you are not planning to sell it. For instance, you may scrape data from a website that provides information regarding ‘YouTube’s Top Channels’ or whatnot. The second downside of this method is that you need some knowledge of web development and a programming language like Python in order to extract the data and store it in a spreadsheet.
- Request from an API. An API (Application Programming Interface) lets you request data from a source and receive it directly, usually in a structured format. The result is similar to what a web scraper gives you; the only difference is that someone else went through the hassle of doing that work for you… In fact, some people make money by providing that type of service and keeping their APIs fully functional; hence, not all of them are free. An API will typically be presented as a URL. I have personally used a public API for a simple personal front-end web application, but you may also connect one to Microsoft Excel. Obtaining data through this method may or may not require coding skills. As mentioned previously, do your research on what you want to base your project on; the data you want or need may not be free.
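To make the ‘cleaning’ step from the download option more concrete, here is a minimal sketch in Python that uses only the standard library. The sample text, the column names (`city`, `temperature_c`) and the `clean_rows` helper are all made up for illustration; a real file will have its own columns and its own kinds of damage.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded CSV file; the column
# names and values are invented for this example.
RAW_CSV = """city,temperature_c
Lima, 19.5
Quito,
Bogota,not available
La Paz,8.0
"""


def clean_rows(csv_text):
    """Keep rows whose temperature parses as a number; drop the rest."""
    cleaned = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        city = (row["city"] or "").strip()
        raw_temp = (row["temperature_c"] or "").strip()
        try:
            temp = float(raw_temp)
        except ValueError:
            continue  # missing or non-numeric value: skip this row
        cleaned.append({"city": city, "temperature_c": temp})
    return cleaned


rows = clean_rows(RAW_CSV)
print(f"kept {len(rows)} rows")
for row in rows:
    print(row["city"], row["temperature_c"])
```

Dropping bad rows is only one possible choice. Depending on the project, you might prefer to fill in missing values or to fix them by hand.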
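The scraper idea (read the HTML tags and pull out what is inside them) can be sketched with Python’s built-in `html.parser` module. The HTML snippet, the `channel` class name and the `ChannelParser` class are hypothetical and only there so the example runs by itself. A real scraper would first download the page, and only from a site that allows it.

```python
from html.parser import HTMLParser

# Hypothetical page snippet; the tags and class names are assumptions
# made for this example, not taken from any real website.
SAMPLE_HTML = """
<html><body>
  <h1>Top Channels</h1>
  <ul>
    <li class="channel">Channel A</li>
    <li class="channel">Channel B</li>
    <li class="ad">Sponsored</li>
  </ul>
</body></html>
"""


class ChannelParser(HTMLParser):
    """Collect the text inside every <li class="channel"> tag."""

    def __init__(self):
        super().__init__()
        self.in_channel = False
        self.channels = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == "li" and ("class", "channel") in attrs:
            self.in_channel = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_channel = False

    def handle_data(self, data):
        # Only keep text found while inside a matching <li> tag
        if self.in_channel and data.strip():
            self.channels.append(data.strip())


parser = ChannelParser()
parser.feed(SAMPLE_HTML)
print(parser.channels)
```

In practice many people use third-party libraries for this, but the standard-library version shows the mechanism without installing anything.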
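For the API option, the rough shape of the code is: request a URL, decode the response, then reshape it into rows you could put in a spreadsheet. The sketch below assumes an API that returns JSON with a `results` list. That payload layout, the field names and both helper functions are assumptions for illustration, so check the documentation of whichever API you actually use. The sample body lets the example run offline.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Request a URL and decode the JSON body it returns."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def to_rows(payload):
    """Flatten a hypothetical {"results": [...]} payload into table rows."""
    return [(item["name"], item["subscribers"]) for item in payload["results"]]


# Hypothetical response body, used here so the sketch runs without a network.
SAMPLE_BODY = (
    '{"results": ['
    '{"name": "Channel A", "subscribers": 1200}, '
    '{"name": "Channel B", "subscribers": 950}'
    "]}"
)

rows = to_rows(json.loads(SAMPLE_BODY))
print(rows)
```

With a real API you would call `fetch_json` with the URL from its documentation and pass the result to a function like `to_rows`, adjusted to the fields that API actually returns.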
Conclusion:
The first step in a data analytics (or data science) project is to figure out how to obtain your data. There are different methods of obtaining information, which may involve third parties or using your own sources to compile raw data. Make sure you know what type of project you want to develop before applying any of the four methods mentioned. Depending on what you are doing, some methods might be more efficient than others. Planning should be your first step.