Data Science Cheatsheet — Part 1: Data Collection

Asif Anwar · 4 min read · Nov 25, 2019

Overview:

Machine learning is such a vast field that it's really easy to lose track of what you have learned. And the next thing you know, you are searching Google, StackOverflow and GitHub again for the same content, code or instructions. So my intention here is to write up all the machine learning topics I have learned in a series of blog posts, so that when necessary, anyone (especially myself) can easily find them.

Data Collection:

Any data science project starts with data collection. In the first part of my Data Science Cheatsheet series, I will cover different ways to collect and store data for a data science project.

I try to do most of my machine learning projects in a cloud environment. Currently, my favourites are Google Colab and IBM Cloud. Here are a few basics on importing data from Google Drive, Kaggle, HTTP links, files and databases.

Google Drive:

If you are using Google Colab, as I do most of the time, then importing data from Google Drive is an easy and quick method. It is also a good place to back up your output data files for future use.

#Map Drive 
from google.colab import drive
drive.mount('/content/drive')
#Copy from the Drive
!cp drive/My\ Drive/<<folder name>>/<<file name>> .
#Backup to Drive
!cp <<file name>> drive/My\ Drive/<<folder name>>

Note: The above method works seamlessly with Google Colab; however, for other environments you may need to install the Google Drive client and mount the folder based on the folder structure of that environment. I will try to cover this in a later update.
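Once the Drive is mounted, you can also read a file directly from the Drive path instead of copying it first. Here is a minimal sketch using pandas, with a hypothetical CSV path that you should adjust to your own folder structure:

import pandas as pd
# Hypothetical path: replace the folder and file names with your own
df = pd.read_csv('/content/drive/My Drive/<<folder name>>/<<file name>>.csv')
df.head()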

Kaggle:

It's hard to imagine data science projects without looking for some data on Kaggle. Kaggle provides a simple API to connect and download the necessary files. I use the commands below to download files from Kaggle.

#Step 1: install and import libraries
!pip install -q kaggle --upgrade
import kaggle
import os
#Step 2: Define the Kaggle keys - can be taken from your Kaggle account page
os.environ['KAGGLE_USERNAME'] = "<<User ID>>"
os.environ['KAGGLE_KEY'] = "<<API Key>>" #use your own key
#Step 3: update the json file with the keys
!mkdir -p ~/.kaggle
!echo '{"username":"<<User ID>>","key":"<<API Key>>"}' > ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json
#Step 4: Download file - API command from kaggle dataset
!kaggle datasets download -d gustavomodelli/forest-fires-in-brazil
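The same CLI also works for competition data. A hedged example, assuming you have already accepted the competition rules on the Kaggle website ('titanic' is just an example competition name):

#Download competition files
!kaggle competitions download -c titanic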

HTTP sites:

Wget is a simple way to download data from the web. We can also use the urllib or requests libraries to fetch the page content and collect data from it.

#wget method
!wget -O <Desired file name> <url>
#urllib
from urllib.request import Request, urlopen
req_rent = Request('url', headers={'User-Agent': 'Mozilla/5.0'})
web_rent = urlopen(req_rent).read()
#Requests
import requests
response = requests.get('url')
response.content
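To persist what you fetched, you can write the response bytes to a local file. A minimal sketch (the file name here is hypothetical):

#Save the downloaded content to a local file
with open('downloaded_file.csv', 'wb') as f:
    f.write(response.content)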

Unzip files:

Many times we end up downloading zip files. The code below is an easy way to unzip files that you have downloaded.

#unzip by using ZipFile module
from zipfile import ZipFile
# Create a ZipFile Object and extract
with ZipFile('<.zip file>', 'r') as zipObj:
    zipObj.extractall()
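If you only need to inspect an archive or pull out a single file, ZipFile also provides namelist() and extract(). A small sketch:

#List archive contents and extract a single file
with ZipFile('<.zip file>', 'r') as zipObj:
    print(zipObj.namelist())
    zipObj.extract('<file inside the zip>')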

File Handling:

Reading and writing files are among the most common activities in a data science project. The Python code below can be used for reading and writing files.

#Read the whole file
with open("<location>/filename", "r") as file1:
    FileContent = file1.read()
    print(FileContent)
#Read line by line
with open("<location>/filename", "r") as file1:
    print("first line: " + file1.readline())
    # Iterate through the remaining lines
    for i, line in enumerate(file1):
        print("Iteration", str(i), ": ", line)
#Read all lines into a list
with open("<location>/filename", "r") as file1:
    FileasList = file1.readlines()
    print(FileasList[0])
#Write files
Lines = ["This is line A\n", "This is line B\n", "This is line C\n"]
with open('/resources/data/Example2.txt', 'w') as writefile:
    for line in Lines:
        print(line)
        writefile.write(line)
# Copy one file to another
with open('Example2.txt', 'r') as readfile:
    with open('Example3.txt', 'w') as writefile:
        for line in readfile:
            writefile.write(line)
# Verify that the copy was successful
with open('Example3.txt', 'r') as testwritefile:
    print(testwritefile.read())
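One more mode worth knowing: opening a file with 'a' appends to it instead of overwriting. A quick sketch:

#Append to an existing file
with open('Example2.txt', 'a') as appendfile:
    appendfile.write("This is line D\n")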

JSON Using an API: (Example: Foursquare)

Based on the API documentation, it's important to define all the credentials first and then use the requests library to get the JSON response.

import json 
import requests
#Define Foursquare credentials
client_id = "CLIENTID"
client_secret = "CLIENTSECRET"
version = "20180604"
limit = 100
radius = 500
latitude = 52.098712014323
longitude = 5.1195902720534
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(client_id, client_secret, latitude, longitude, version, radius, limit)
results = requests.get(url).json()
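To work with the result, you can flatten the list of venues into a dataframe. This is only a sketch based on my assumption about the explore endpoint's response structure (response → groups → items); adjust the keys if your response differs.

import pandas as pd
#Assumed response layout; check the actual JSON before relying on these keys
items = results['response']['groups'][0]['items']
df = pd.json_normalize(items)
df.head()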

Database:

Now it's time to access a database using Python. Here is a guide to set up a DB2 database on IBM Cloud for testing purposes: link

Database connection method:

!pip install -q ibm_db
import ibm_db
#Credential setup for database access - use your own service credentials
dsn_hostname = "<<hostname>>"
dsn_uid = "<<user id>>" # e.g. "abc12345"
dsn_pwd = "<<password>>" # e.g. "7dBZ3wWt9XN6$o0J"
dsn_driver = "{IBM DB2 ODBC DRIVER}"
dsn_database = "BLUDB" # e.g. "BLUDB"
dsn_port = "50000" # e.g. "50000"
dsn_protocol = "TCPIP" # i.e. "TCPIP"
dsn = ("DRIVER={0};" "DATABASE={1};" "HOSTNAME={2};" "PORT={3};" "PROTOCOL={4};" "UID={5};" "PWD={6};").format(dsn_driver, dsn_database, dsn_hostname, dsn_port, dsn_protocol, dsn_uid, dsn_pwd)
try:
    conn = ibm_db.connect(dsn, "", "")
    print("Connected to database:", dsn_database, "as user:", dsn_uid, "on host:", dsn_hostname)
except:
    print("Unable to connect:", ibm_db.conn_errormsg())

Database Queries:

#Create table
createQuery = "create table INSTRUCTOR(ID INTEGER PRIMARY KEY NOT NULL, FNAME VARCHAR(20), LNAME VARCHAR(20), CITY VARCHAR(20), CCODE CHAR(2))"
createStmt = ibm_db.exec_immediate(conn, createQuery)
#Insert command
insertQuery = "insert into INSTRUCTOR values (1, 'Rav', 'Ahuja', 'TORONTO', 'CA'), (2, 'Raul', 'Chong', 'Markham', 'CA'), (3, 'Hima', 'Vasudevan', 'Chicago', 'US')"
insertStmt = ibm_db.exec_immediate(conn, insertQuery)
#Select command
selectQuery = "select * from INSTRUCTOR"
selectStmt = ibm_db.exec_immediate(conn, selectQuery)
ibm_db.fetch_both(selectStmt) #fetch the first row
while ibm_db.fetch_row(selectStmt) != False:
    print(" ID:", ibm_db.result(selectStmt, 0), " FNAME:", ibm_db.result(selectStmt, "FNAME"))
#Update command
updateQuery = "update INSTRUCTOR set CITY='MOOSETOWN' where FNAME='Rav'"
updateStmt = ibm_db.exec_immediate(conn, updateQuery)
#Import the result into a Pandas dataframe
import pandas
import ibm_db_dbi
pconn = ibm_db_dbi.Connection(conn)
selectQuery = "select * from INSTRUCTOR"
df = pandas.read_sql(selectQuery, pconn)
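Finally, it is good practice to close the connection when you are done, so the database can free up the resources:

#Close the database connection
ibm_db.close(conn)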

Conclusion:

Above are a few examples of common data-importing methods. In the future, I will update this page with SQL Magic and BeautifulSoup methods. Please feel free to provide feedback and recommendations.
