5 Ways to Open and Read Your Dataset Using Python
What is the first step in the data analysis process? I believe it is opening the dataset to acquire the data. If you work with Tableau, it does most of the work of detecting file types, data types, delimiters, and encodings for you. But how should you proceed if you work directly in Jupyter?
Here are a few ways to open a dataset, depending on the purpose of the analysis and the type of file.
1. Custom File for Custom Analysis
Working with raw or unprepared data is a common situation; preparing a dataset for further analysis or modeling is one of the stages of a data scientist's job. No friendly CSV format, no structure, custom delimiters, and so on. That's why it's important to know how to work with native Python file functions.
For example, say we have a dataset of text messages for a spam-detection algorithm. But all the messages contain punctuation, mixed word cases, and a few other problems. So our task is to convert every message into a set of tokens, or at least into separate words. In this case, we'll read each line with the help of Python file functions and split it manually:
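A minimal sketch of that manual approach. The file name messages.txt and the sample messages are placeholders, since the article's actual dataset isn't shown here:

```python
import string

# Hypothetical sample file standing in for the spam-message dataset;
# in practice messages.txt would already exist on disk.
with open("messages.txt", "w") as f:
    f.write("WINNER!! You have won a prize, call now!\n")
    f.write("Hey, are we still meeting for lunch today?\n")

tokens = []
with open("messages.txt") as f:
    for line in f:
        # lowercase, strip punctuation, then split on whitespace
        cleaned = line.lower().translate(str.maketrans("", "", string.punctuation))
        tokens.append(cleaned.split())

print(tokens[0])
# ['winner', 'you', 'have', 'won', 'a', 'prize', 'call', 'now']
```

Each message becomes a list of lowercase words with punctuation removed, which is a reasonable starting point for tokenization.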
Now we have some kind of order:
Looks like this approach works, but what if we have a file with non-standard encoding?
2. File With Custom Encoding
As with almost all things in Python, there is already a library (codecs) that provides access to different encodings. Apart from the additional wrapper, the code is almost the same as in the previous example:
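A sketch of the same tokenization using codecs. The encoding (cp1252) and the file contents are assumptions for illustration:

```python
import codecs
import string

# Hypothetical sample file written in cp1252 encoding for illustration
with codecs.open("messages_cp1252.txt", "w", encoding="cp1252") as f:
    f.write("Félicitations! You have won a prize, call now!\n")

tokens = []
# codecs.open decodes each line with the given encoding as we read it
with codecs.open("messages_cp1252.txt", "r", encoding="cp1252") as f:
    for line in f:
        cleaned = line.lower().translate(str.maketrans("", "", string.punctuation))
        tokens.append(cleaned.split())

print(tokens[0][0])
# félicitations
```

The only change is swapping the built-in open() for codecs.open() with an explicit encoding; the parsing logic is untouched.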
As predicted, the output is the same.
So, this is the most flexible way to work with datasets. But why do we need to reinvent the wheel if we have existing solutions for defined data types?
3. CSV Files With Native Library
I think that CSV is the most common and convenient format, at least in my experience. It's no surprise that Python has a separate library for it. Even more, it can work with almost any organized dataset that uses a single delimiter character. Let's apply it to our spam messages:
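A sketch with the csv module, showing how the delimiter choice changes the parsed rows. The file spam.csv and its tab-separated label/message layout are assumptions:

```python
import csv

# Hypothetical file where each row is: label <TAB> message text
with open("spam.csv", "w", newline="") as f:
    f.write("ham\tHey, are we still meeting for lunch?\n")
    f.write("spam\tWINNER!! Call now to claim your prize\n")

# Parsing with a comma delimiter splits on commas inside the text...
with open("spam.csv", newline="") as f:
    rows_comma = list(csv.reader(f, delimiter=","))

# ...while the tab delimiter separates label and message cleanly.
with open("spam.csv", newline="") as f:
    rows_tab = list(csv.reader(f, delimiter="\t"))

print(rows_comma[0])  # wrong split: comma inside the message
print(rows_tab[0])    # ['ham', 'Hey, are we still meeting for lunch?']
```

Picking the right delimiter argument is the whole trick here; csv.reader otherwise handles quoting and line endings for you.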
And here we see that parsing with different delimiters produces different outputs:
It may not be a big deal, since we can correct it in a few lines of code or with additional parsing. But I believe you will prefer a more convenient way to parse your dataset.
4. Pandas Library
Yes, I know that the pandas library is overused, but I need to mention it, since its file-reading functions are the most convenient of all I have encountered. They allow you to read files with several delimiters, skip lines, choose specific columns, and more. Dive into the documentation for details. Also, since read_csv() and the other standard readers are built on top of read_table(), a single example will be enough. We will use a numeric dataset with heat-cycle parameters:
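A sketch with pd.read_table(). The file heat_cycle.csv, its column names, and its values are placeholders mimicking the article's heat-cycle dataset:

```python
import pandas as pd

# Hypothetical numeric file standing in for the heat-cycle dataset
with open("heat_cycle.csv", "w") as f:
    f.write("# sensor dump\n")
    f.write("temp,pressure,flow\n")
    f.write("310.5,1.2,0.8\n")
    f.write("295.0,1.1,0.9\n")

# read_table handles the delimiter, header, and type inference;
# skiprows drops the comment line, usecols keeps only selected columns
df = pd.read_table("heat_cycle.csv", sep=",", skiprows=1,
                   usecols=["temp", "pressure"])
print(df.shape)  # (2, 2)
```

All the manual work from the earlier examples (splitting, type conversion, column selection) collapses into keyword arguments of a single call.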
As you can see, this is the shortest variant, since it has already absorbed all the wrappers we wrote manually. As output, we get a well-structured table:
But what if an even neater way to read a dataset exists, with a narrower functional focus?
5. Read Numeric Dataset
The NumPy library has file-reading functions as well, but they are underrated and overshadowed by their pandas analogs. np.loadtxt() is not as general as pd.read_table(), but it is perfect for numeric datasets like the one in our previous example.
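A sketch with np.loadtxt(), reusing the same hypothetical numeric file layout as in the pandas example:

```python
import numpy as np

# Hypothetical numeric file standing in for the heat-cycle dataset
with open("heat_cycle.csv", "w") as f:
    f.write("temp,pressure,flow\n")
    f.write("310.5,1.2,0.8\n")
    f.write("295.0,1.1,0.9\n")

# loadtxt returns a plain ndarray: skip the header row, split on commas
data = np.loadtxt("heat_cycle.csv", delimiter=",", skiprows=1)
print(data.shape)          # (2, 3)
print(data.mean(axis=0))   # column means, ready for NumPy math
```

Unlike a DataFrame, the result has no column labels: it is a bare float array, which is exactly what you want when feeding data straight into numerical code.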
As a result, we have a plain NumPy array ready for analysis and calculations:
Almost any programming task can be done in several ways, especially in Python. That's why there is more than one approach even to opening a file. That said, every function has its own purpose and strengths.
You can find the Jupyter notebook with a working example on my GitHub:
Also, feel free to share your own approach to working with data.