Import Your Data from Different Sources Using Python

Ayşe Bat
My Data Science Journey
6 min read · Mar 15, 2019

Data can come from many different sources, and we need the ability to import and manage these files successfully. So, in this post, we will look at how to import data from different file types. We will learn to import flat files (.txt, .csv), files from other software (Excel, Stata, SAS, Matlab), and other formats such as pickle and HDF5.

Importing Data from Flat Files

To check out any plain text file, you can use Python’s open() function to open a connection to the file.

# assign the file name to a string variable
file_name = 'file_name.txt'
# pass the file name to the open() function using mode 'r' (read-only)
file = open(file_name, mode='r')
# read the file's contents through the connection with the read() method
text = file.read()
# close the connection
file.close()

By using a context manager, the ‘with’ statement, we create a context in which we can execute commands with the file open; once the block ends, the file is closed automatically, so we avoid having to close the connection ourselves. For large files, we may want to print only a few lines. You can use ‘file.readline()’ to read the first line of a text file; if you call it again, the second line is returned, and so on.

# by using with open() we don't need to close the connection
with open(file_name, 'r') as file:
    print(file.read())

# print the file line by line with file.readline()
with open(file_name, 'r') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())

Importing Flat Files Using NumPy

Flat files are basic text files containing records, that is, tabular data without structured relationships.

We can use the NumPy function loadtxt():

  • The first argument is the filename.
  • The second is the delimiter, which can be ',' for a comma-separated file or '\t' for a tab-delimited file.
  • skiprows lets you specify how many rows you wish to skip.
  • usecols takes a list of the indices of the columns you wish to keep.
  • dtype='str' ensures that all entries are imported as strings.

import numpy as np
file_name = 'data/mnist_kaggle_some_rows.csv'
# the default delimiter is whitespace, so we need to specify the delimiter parameter explicitly
data = np.loadtxt(file_name, delimiter=',')
# the file is read in as a NumPy array
print(type(data))
print(data)
<class 'numpy.ndarray'>
[[1. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[1. 0. 0. ... 0. 0. 0.]
...
[2. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[5. 0. 0. ... 0. 0. 0.]]
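
The skiprows and usecols arguments from the list above can be combined with the call we just made. As a quick sketch (the column indices here are only illustrative, not from the original exercise):

# keep only the first and third columns (indices 0 and 2) and skip the header row
data_subset = np.loadtxt(file_name, delimiter=',', skiprows=1, usecols=[0, 2])
print(data_subset.shape)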

Importing different datatypes:

Let’s import a text file which has:

  • a text header, consisting of strings
  • tab-delimited values

Because the file contains a string header, we need to handle it. If we import the file using the loadtxt() function without handling the header, we get a ValueError.

  • Alternative 1: we can set dtype=str to avoid the ValueError.
  • Alternative 2: we can skip the first row using the skiprows argument!

# Alternative 1:
file_sea_slug = 'data/seaslug.txt'
data_alt_1 = np.loadtxt(file_sea_slug, delimiter='\t', dtype=str)
print(type(data_alt_1))
print(data_alt_1[0:5])
<class 'numpy.ndarray'>
[['Time' 'Percent']
['99' '0.067']
['99' '0.133']
['99' '0.067']
['99' '0']]
# Alternative 2:
data_alt_2 = np.loadtxt(file_sea_slug, delimiter='\t', skiprows=1)
print(type(data_alt_2))
print(data_alt_2[0:9])
<class 'numpy.ndarray'>
[[9.90e+01 6.70e-02]
[9.90e+01 1.33e-01]
[9.90e+01 6.70e-02]
[9.90e+01 0.00e+00]
[9.90e+01 0.00e+00]
[0.00e+00 5.00e-01]
[0.00e+00 4.67e-01]
[0.00e+00 8.57e-01]
[0.00e+00 5.00e-01]]

Some datasets have different datatypes in different columns; for example, one column may contain strings while another contains floats. The function np.loadtxt() will choke on this. There is another function, np.genfromtxt(), which can handle such structures. If we pass dtype=None to it, it will figure out what type each column should be.

data = np.genfromtxt(file_name, delimiter=',', names=True, dtype=None)
  • The first argument is the filename.
  • The second specifies the delimiter.
  • names=True tells the function that the file has a header.
  • dtype=None lets genfromtxt infer each column's datatype.

There is also another function, np.recfromcsv(), that behaves similarly to np.genfromtxt(), except that its defaults are delimiter=',', names=True, and dtype=None!

# Let's look at titanic.csv:
# comma separated
# there is a header
# there are both string and numeric columns
titanic = np.genfromtxt('data/titanic_sub.csv', delimiter=',', names=True, dtype=None)
print(titanic[0:2])
[(1, 0, 3, b'male', 22., 1, 0, b'A/5 21171', 7.25 , b'', b'S')
(2, 1, 1, b'female', 38., 1, 0, b'PC 17599', 71.2833, b'C85', b'C')
]
# using np.recfromcsv()
titanic_2 = np.recfromcsv('data/titanic_sub.csv')
print(titanic_2[0:2])
[(1, 0, 3, b'male', 22., 1, 0, b'A/5 21171', 7.25 , b'', b'S')
(2, 1, 1, b'female', 38., 1, 0, b'PC 17599', 71.2833, b'C85', b'C') ]

Importing Data Using pandas

So far we have imported a bunch of different types of flat files into Python as NumPy arrays. Although NumPy arrays are incredibly useful for many purposes, they cannot handle data as a two-dimensional labeled data structure.

pandas offers us the DataFrame, which has observations (rows) and variables (columns).

# import pandas
import pandas as pd
# import titanic.csv as a DataFrame
df_titanic = pd.read_csv('data/titanic_sub.csv')
print(type(df_titanic))
# let's look at the first 5 rows
df_titanic.head()

Handling the read_csv() function under different circumstances

  • sep: stands for separator; the default is ',' as in .csv (comma-separated values). We can specify a different one if needed.
  • header=None: load a CSV that has no header row.
  • names=['column_name1','column_name2']: load a .csv while specifying the column names.
  • index_col='date': load a .csv while setting the index to a named column, for example if we would like to use a date column as the index.
  • na_values=['NA']: load a .csv while specifying "NA" as a missing value.
  • skiprows=3: load a .csv while skipping the top 3 rows.
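
As a sketch of how several of these options can combine (the file name and column names here are hypothetical, not from the datasets used above):

# hypothetical file: no header row, 3 junk rows at the top, 'NA' marks missing values
df = pd.read_csv('data/example.csv',
                 sep=',',                  # explicit separator
                 header=None,              # the file has no header row
                 names=['date', 'value'],  # supply our own column names
                 index_col='date',         # use the date column as the index
                 na_values=['NA'],         # treat 'NA' strings as missing
                 skiprows=3)               # skip the top 3 rows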

Importing Other File Types into Python

  • Excel spreadsheets
  • Matlab files
  • SAS files
  • STATA files
  • HDF5 files

Pickle files: Pickling means converting an object into a sequence of bytes, or byte stream.
Warning: loading pickled data received from untrusted sources can be unsafe. First, we are going to create our pickle file from the Titanic DataFrame.

# import pickle
import pickle
# let's save the Titanic DataFrame into a pickle file
pd.to_pickle(df_titanic, 'data/titanic_pickle.pkl')
# first, read it back with open(); mode 'rb' means read binary
with open('data/titanic_pickle.pkl', 'rb') as file:
    data = pickle.load(file)
# the data comes back as a DataFrame
print(type(data))
print(data[0:3])
<class 'pandas.core.frame.DataFrame'>
PassengerId Survived Pclass Sex Age SibSp Parch \
0 1 0 3 male 22.0 1 0
1 2 1 1 female 38.0 1 0
2 3 1 3 female 26.0 0 0
Ticket Fare Cabin Embarked
0 A/5 21171 7.2500 NaN S
1 PC 17599 71.2833 C85 C
2 STON/O2. 3101282 7.9250 NaN S
# Let's read it with pandas
unpickled_df = pd.read_pickle('data/titanic_pickle.pkl')
#unpickled_df.head()

Excel File: You may need to deal with Excel files at some point. Here, you’ll learn how to use pandas to import Excel spreadsheets.

  • Use pandas’ ExcelFile() to load the .xlsx file.
  • Figure out the sheet names using .sheet_names.
  • We can load a particular sheet into a DataFrame by applying the parse() method with a single argument: either the name of the sheet as a string or its index as an integer.

# load battledeath.xlsx into the excel_data variable
excel_data = pd.ExcelFile('data/battledeath.xlsx')
# print the sheet names
print(excel_data.sheet_names)
['2002', '2004']
# load each sheet into a DataFrame
df1 = excel_data.parse('2004')
df2 = excel_data.parse('2002')
print(df1.head())
War(country) 2004
0 Afghanistan 9.451028
1 Albania 0.130354
2 Algeria 3.407277
3 Andorra 0.000000
4 Angola 2.597931

The same file can also be read using the pd.read_excel() function.

# we can specify the sheet_name
excel_df = pd.read_excel('data/battledeath.xlsx', index_col=0, sheet_name='2004')
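
If you pass sheet_name=None, pd.read_excel() returns a dictionary of DataFrames keyed by sheet name; a quick sketch:

# read every sheet at once; the result maps sheet name -> DataFrame
all_sheets = pd.read_excel('data/battledeath.xlsx', sheet_name=None)
print(all_sheets.keys())  # dict_keys(['2002', '2004'])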

SAS File: Statistical Analysis System

  • Business analysis and biostatistics
  • Advanced analytics, multivariate analysis
  • Data management
  • Predictive analytics
  • .sas7bdat: dataset files & .sas7bcat: catalog files

# if the sas7bdat package isn't installed in your environment, install it first with the following line
pip install sas7bdat
# import SAS7BDAT
from sas7bdat import SAS7BDAT
with SAS7BDAT('data/sales.sas7bdat') as file:
    df_sas = file.to_data_frame()
print(df_sas.head())
YEAR P S
0 1950.0 12.9 181.899994
1 1951.0 11.9 245.000000
2 1952.0 10.7 250.199997
3 1953.0 11.3 265.899994
4 1954.0 11.2 248.500000
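
pandas can also read SAS files directly with pd.read_sas(); a minimal sketch using the same file:

# pandas' built-in reader for .sas7bdat files
df_sas_pd = pd.read_sas('data/sales.sas7bdat')
print(df_sas_pd.head())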

Stata File: “Statistics” + “data”

  • Import Stata files as DataFrames using pd.read_stata() from pandas.
  • Common in academic and social sciences research.

df_stata = pd.read_stata('data/disarea.dta')
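
Matlab File: Matlab’s .mat files, mentioned in the introduction, store workspace variables. They can be imported with scipy.io.loadmat(), which returns a dictionary mapping variable names to NumPy arrays. A minimal sketch (the file name here is hypothetical):

# import SciPy's I/O module
import scipy.io

mat = scipy.io.loadmat('data/example.mat')  # hypothetical file name
print(type(mat))   # <class 'dict'>
print(mat.keys())  # Matlab variable names plus metadata keys such as '__header__'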

HDF5: Hierarchical Data Format version 5

  • Standard for storing large quantities of numerical data
  • Datasets can be hundreds of gigabytes or terabytes

# h5py library
import h5py
df_h5py = h5py.File('data/L-L1_LOSC_4_V1-1126259446-32.hdf5', 'r')
print(type(df_h5py))
print(df_h5py)
<class 'h5py._hl.files.File'>
# we can explore the hierarchical structure
for key in df_h5py.keys():
    print(key)
# each of these is an HDF group; we can think of the groups as directories
meta
quality
strain
print(type(df_h5py['meta']))
<class 'h5py._hl.group.Group'>
for key in df_h5py['meta'].keys():
    print(key)
Description
DescriptionURL
Detector
Duration
GPSstart
Observatory
Type
UTCstart
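
Once you have drilled down to a dataset, you can read its values into a NumPy array. A minimal sketch, assuming the strain group of this LIGO file contains a dataset named 'Strain', as in the LIGO Open Science Center files:

# pull the full strain time series into memory as a NumPy array
strain = df_h5py['strain']['Strain'][()]
print(type(strain))  # <class 'numpy.ndarray'>
print(strain.shape)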

Final Thoughts

You may find this study in my GitHub account as part of my Datacamp repository.

I have written this article to improve my data analytics skills, so I am still a learner. Please let me know if you have any additional information or comments on this article.

Follow me on Twitter, LinkedIn, or Medium.
