5 Ways to Open and Read Your Dataset Using Python

Different approaches for different purposes

Photo by Markus Spiske on Unsplash.

What is the first step in the data analysis process? I believe it is opening the dataset to acquire the data. If you work with Tableau, it does most of the work of detecting file types, data types, delimiters, and encoding for you. But how should you proceed if you work directly in Jupyter?

Here are a few ways to open a dataset depending on the purpose of the analysis and the type of the document.

1. Custom File for Custom Analysis

Working with raw or unprepared data is a common situation. Preparing a dataset for further analysis or modeling is one of the stages of a data scientist’s job. Often there is no friendly CSV format, no structure, and the delimiters are custom. That’s why it’s important to be comfortable with Python’s native file functions.

For example, we have a dataset of text messages for a spam-detection algorithm. But the messages contain punctuation, mixed letter case, and a few more problems. So our task is to convert every message into a set of tokens, or at least into separate words. In this case, we’ll read each line with Python’s file functions and split it manually.
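A minimal sketch of this manual approach could look like the following. The function names and the cleaning rules (lowercasing and stripping ASCII punctuation) are assumptions made for illustration, not the original notebook code.

```python
import string


def tokenize_line(line):
    """Lowercase a line, drop ASCII punctuation, and split it into words."""
    table = str.maketrans("", "", string.punctuation)
    return line.lower().translate(table).split()


def read_messages(path, encoding="utf-8"):
    """Read a text file line by line and return one token list per message."""
    messages = []
    with open(path, encoding=encoding) as f:
        for line in f:
            tokens = tokenize_line(line)
            if tokens:  # skip lines that are empty after cleaning
                messages.append(tokens)
    return messages


# Usage ("spam.txt" is a hypothetical file name):
# messages = read_messages("spam.txt")
```

Note that stripping punctuation this way also merges contractions ("don't" becomes "dont"), which may or may not be what a particular model needs.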

Now we have some kind of order: each message is split into separate words.

This approach works, but what if we have a file with a non-standard encoding?

2. File With Custom Encoding

As with almost everything in Python, a library already exists for this: the standard codecs module provides access to different encodings. Apart from the additional wrapper, the code is almost the same as in the previous example.
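As a rough sketch, the codecs wrapper might be used like this. The function name and the default encoding (cp1251) are assumptions chosen for illustration; substitute whatever encoding your file actually uses.

```python
import codecs


def read_encoded_messages(path, encoding="cp1251"):
    """Read a file in the given encoding and split each line into words.

    The default encoding here is only an illustrative assumption.
    """
    messages = []
    # codecs.open decodes the bytes of the file using the named encoding
    with codecs.open(path, "r", encoding=encoding) as f:
        for line in f:
            words = line.strip().lower().split()
            if words:  # skip blank lines
                messages.append(words)
    return messages
```

In Python 3, the built-in open() accepts the same encoding argument, so the wrapper is optional there.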

As predicted, the output is the same.

So, this is the most flexible way to work with datasets. But why reinvent the wheel when solutions already exist for well-defined file formats?

3. CSV Files With Native Library

I think that the CSV format is the most common and convenient, at least in my experience. It’s no surprise that Python has a separate standard library module (csv) for it. Moreover, it can work with almost any structured dataset that uses a single delimiter character. Let’s apply it to our spam messages.
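A small sketch of how the csv module could be applied with a configurable delimiter is shown below. The function name and the sample layout (a label, a tab, then the message text) are assumptions for illustration.

```python
import csv


def read_delimited(path, delimiter=",", encoding="utf-8"):
    """Parse a delimited text file into a list of rows (lists of strings)."""
    # newline="" lets the csv module handle line endings itself
    with open(path, newline="", encoding=encoding) as f:
        return list(csv.reader(f, delimiter=delimiter))


# Usage ("spam.txt" is a hypothetical tab-separated file):
# rows_by_tab = read_delimited("spam.txt", delimiter="\t")
# rows_by_comma = read_delimited("spam.txt", delimiter=",")
```

Parsing the same file with a tab and with a comma gives different rows, because any comma inside the message text is treated as a field boundary in the second case.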

Here we see that parsing with different delimiters produces different outputs.

It may not be a big deal, since we can correct it in a few lines of code or with additional parsing. But I believe that you will prefer a more convenient way to parse your dataset.

4. pandas

Yes, I know that the pandas library is overused, but I need to mention it since its file-reading functions are the most convenient of all I have encountered. They allow you to read files with several delimiters, skip some lines, select specific columns, and more. Dive into the documentation for more details. Also, since read_csv() and read_table() share the same parser and differ mainly in their default delimiter, a single example will be enough. We will use a numeric dataset with heat cycle variables.
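The following is a minimal sketch of such a call, assuming a whitespace-separated numeric file without a header row. The function name and the column names in the usage comment are hypothetical, since the actual layout of the heat cycle dataset is not given here.

```python
import pandas as pd


def read_numeric_table(source, columns, skiprows=0, usecols=None):
    """Read a whitespace-separated numeric file into a DataFrame.

    `source` may be a path or a file-like object; `columns` names every
    column in the file, and `usecols` optionally keeps only some of them.
    """
    return pd.read_table(
        source,
        sep=r"\s+",        # one or more whitespace characters as the delimiter
        header=None,       # the file is assumed to have no header row
        names=columns,
        skiprows=skiprows,
        usecols=usecols,
    )


# Usage (file name and column names are hypothetical):
# df = read_numeric_table("heat_cycle.txt", ["temperature", "pressure", "flow"])
```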

As you can see, it is the shortest variant, since it already includes all the wrappers we wrote manually. As output, we get a well-structured table.

But what if there is an even neater way to read a dataset, one with a narrower functional focus?

5. Read Numeric Dataset

The NumPy library has file-reading functions as well, but they are underrated and overshadowed by their pandas counterparts. np.loadtxt() is not as general as pd.read_table(), but it is perfect for numeric datasets like the one in our previous example.
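A short sketch of np.loadtxt() on a whitespace-separated numeric file might look like this. The function name and the file name in the usage comment are assumptions for illustration.

```python
import numpy as np


def load_numeric(source, delimiter=None, skiprows=0, usecols=None):
    """Load a purely numeric text file into a NumPy array of floats.

    With delimiter=None, any run of whitespace separates the values.
    Lines starting with "#" are treated as comments by default.
    """
    return np.loadtxt(source, delimiter=delimiter, skiprows=skiprows, usecols=usecols)


# Usage ("heat_cycle.txt" is a hypothetical file name):
# data = load_numeric("heat_cycle.txt")
# column_means = data.mean(axis=0)
```

Unlike the pandas reader, this fails on non-numeric cells, which is the price of its narrower focus.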

As a result, we have a plain NumPy array ready for analysis and calculations.

Conclusion

Almost any task in programming can be done in several ways, especially if we’re talking about Python. That’s why there is more than one approach even to opening a file. That said, each function has its own purpose and capabilities.

You can find the Jupyter notebook with a working example on my GitHub:

Also, feel free to share your own approach to working with data.

Better Programming

Advice for programmers.

Thanks to Zack Shapiro

Pavel Horbonos (Midvel Corp)

Written by

Stochastic programmer | Art & Code | https://github.com/Midvel 💻| https://www.instagram.com/midvel.corp 🎨⠀| Blockchain developer in https://blaize.tech/
