Data Wrangling with MongoDB (Lesson 2)
Study notes and mind-map for Udacity's free course Data Wrangling with MongoDB.
Hello fellows! Today we continue extracting data from different sources, moving on from CSV and JSON to XML and to scraping data from HTML.

XML
- Basics of XML
Elements are the basic building blocks
Elements are composed of opening and closing tags
Elements can be nested inside other elements
- Two types of XML
(1) Document-oriented type
For example, a research article XML document
(2) Non-document-oriented (data-oriented) type: attributes in the elements are heavily used (HTML uses attributes this way)
For example: OpenStreetMap data
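As a quick illustration (a made-up miniature, not course data), the two styles can be contrasted with the standard library's ElementTree:

```python
import xml.etree.ElementTree as ET

# Document-oriented XML: the content lives in element text, like a research article
doc = ET.fromstring("<article><title>Data Wrangling</title></article>")

# Data-oriented XML: the content lives in attributes, like OpenStreetMap data
osm = ET.fromstring('<node id="42" lat="51.5" lon="-0.1"/>')

print(doc.find('title').text)  # -> Data Wrangling
print(osm.get('id'))           # -> 42
```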
If you’re new to XML or would like some extra resources, there are plenty of useful tutorials online.
- Parse XML
One way to parse an XML document is to load the entire document into a tree in memory (good for document-oriented XML).
Using the research article XML document as an example:
```python
import xml.etree.ElementTree as ET

tree = ET.parse(filename)
root = tree.getroot()

# Iterate over the children of the root element
print("Children of root:")
for child in root:
    print(child.tag)  # print the tag name

# ElementTree supports XPath expressions
title = root.find('./fm/bibl/title')

# Print the title of the article (joining the text of its children)
title_text = ''
for p in title:
    title_text += p.text
print("\nTitle:\n", title_text)

# Print the author email addresses
print("\nAuthor email addresses:")
for a in root.findall('./fm/bibl/aug/au'):
    email = a.find('email')
    if email is not None:
        print(email.text)

# Attributes are accessed with element.attrib[name] or element.get(name)
```
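Since the article file itself isn't included in these notes, the same calls can be tried on an inline miniature; the `fm/bibl/aug/au` layout below mirrors the course file, but the content is invented:

```python
import xml.etree.ElementTree as ET

snippet = """
<article>
  <fm>
    <bibl>
      <title><p>Example </p><p>Title</p></title>
      <aug>
        <au><email>alice@example.org</email></au>
        <au><snm>NoEmail</snm></au>
      </aug>
    </bibl>
  </fm>
</article>
"""
root = ET.fromstring(snippet)

# XPath-style search starting from the root element
title = root.find('./fm/bibl/title')
title_text = ''.join(p.text for p in title)
print(title_text)  # -> Example Title

# findall returns every matching element, not just the first
for au in root.findall('./fm/bibl/aug/au'):
    email = au.find('email')
    if email is not None:
        print(email.text)  # -> alice@example.org
```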
HTML (Web Scraping)
We are going to use an example to demonstrate this process. We will collect all the carrier and airport values, then make a request for each carrier-airport pair, and finally gather all the departure and arrival info for each pair. We use the BeautifulSoup module to parse the HTML tree, which works much like parsing an XML tree.
Web page: the BTS TranStats air travel data site (transtats.bts.gov)
Procedure
1. Build lists of carrier values and airport values
```python
import requests
from bs4 import BeautifulSoup as bs

# Collect carriers, skipping the 'All' entries
carrierList = []
with open('VX_and_BOS.html', 'r') as html:
    sp = bs(html, 'lxml')
    carrierLayer = sp.find(id='CarrierList')
    carriers = carrierLayer.find_all('option')
    for carrier in carriers:
        if 'All' in carrier['value']:
            continue
        carrierList.append(carrier['value'])

# Collect airports, again skipping the 'All' entries
airportList = []
with open('VX_and_BOS.html', 'r') as html:
    sp = bs(html, 'lxml')
    airportLayer = sp.find(id='AirportList')
    for airport in airportLayer.find_all('option'):
        if 'All' in airport['value']:
            continue
        airportList.append(airport['value'])
```
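BeautifulSoup is a third-party package; for a self-contained illustration of the same idea, here is a sketch using the standard library's `html.parser` on an inline snippet (the id and option values are made up to mirror the page):

```python
from html.parser import HTMLParser

class OptionCollector(HTMLParser):
    """Collect the value of every <option> inside a <select> with a given id."""
    def __init__(self, select_id):
        super().__init__()
        self.select_id = select_id
        self.inside = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'select' and attrs.get('id') == self.select_id:
            self.inside = True
        elif tag == 'option' and self.inside:
            value = attrs.get('value', '')
            if 'All' not in value:  # skip the 'All' entries, as above
                self.values.append(value)

    def handle_endtag(self, tag):
        if tag == 'select':
            self.inside = False

html_snippet = """
<select id="CarrierList">
  <option value="All">All U.S. Carriers</option>
  <option value="VX">Virgin America</option>
  <option value="AA">American Airlines</option>
</select>
"""
parser = OptionCollector('CarrierList')
parser.feed(html_snippet)
print(parser.values)  # -> ['VX', 'AA']
```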
2. Make an HTTP request to download all the data (saving each HTML page to disk makes it easier to debug your parser)
Making a correct HTTP request requires knowing:
(1) Which HTTP method is used (POST or GET?)
(2) Which fields must be included in the request?
The best way to build a correct request is to watch how the web browser makes it:
→ Inspect element on the web page
→ Open the Network tab
→ Find the method and all the fields that need to be included in the request
→ Replay the freshly made request and check whether it works
→ If it is not valid
→ Use requests.Session() to include all the cookies in your HTTP request
```python
# Find the cookies and carry them across requests via a session
def parse_web(carrier, airport):
    s = requests.Session()
    r = s.get('https://www.transtats.bts.gov/Data_Elements.aspx?Data=2')
    soup = bs(r.text, 'lxml')

    # The ASP.NET page requires these hidden form fields to be echoed back
    viewstate = soup.find(id="__VIEWSTATE")['value']
    eventvalidation = soup.find(id="__EVENTVALIDATION")['value']
    viewstategenerator = soup.find(id="__VIEWSTATEGENERATOR")['value']

    r = s.post('https://www.transtats.bts.gov/Data_Elements.aspx?Data=2',
               data=(
                   ("__EVENTTARGET", ""),
                   ("__EVENTARGUMENT", ""),
                   ("__VIEWSTATE", viewstate),
                   ("__VIEWSTATEGENERATOR", viewstategenerator),
                   ("__EVENTVALIDATION", eventvalidation),
                   ("CarrierList", carrier),
                   ("AirportList", airport),
                   ("Submit", "Submit")
               ))

    with open("{0}_and_{1}.html".format(carrier, airport), "w") as f:
        f.write(r.text)
```
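Putting steps 1 and 2 together means calling `parse_web` once per carrier-airport pair. A minimal sketch of that fan-out, with made-up sample lists standing in for the scraped values:

```python
# Made-up sample lists standing in for the scraped carrierList/airportList
carrierList = ['VX', 'AA']
airportList = ['BOS', 'SFO']

# One request (and one saved HTML file) per carrier-airport pair
pairs = [(c, a) for c in carrierList for a in airportList]
filenames = ["{0}_and_{1}.html".format(c, a) for c, a in pairs]
print(filenames)
```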
3. Parse the data files and collect the data
```python
import os
import re
from bs4 import BeautifulSoup as bs

def get_files(datadir):
    return os.listdir(datadir)

datafiles = get_files(os.getcwd())
data = []
for datafile in datafiles:
    with open(datafile, 'r') as html:
        page = bs(html, 'lxml')
        rows = page.find_all('tr', 'dataTDRight')
        for row in rows:
            cells = row.find_all('td')
            values = [cell.text for cell in cells]
            if values[1] == 'TOTAL':  # skip the yearly TOTAL rows
                continue
            entry = {}
            # Filenames look like VX_and_BOS.html -> [carrier, 'and', airport, 'html']
            info = re.split("[_.]", datafile)
            entry['carrier'], entry['airport'] = info[0], info[2]
            entry['year'] = values[0]
            entry['month'] = values[1]
            entry['flights'] = {'domestic': values[2],
                                'international': values[3]}
            data.append(entry)
```
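Everything scraped from the page is a string; before loading the entries into MongoDB you would typically cast the numeric fields. Assuming the flight counts use comma thousands separators (an assumption about the page, not something shown above), a small cleaning helper might look like:

```python
def clean_entry(entry):
    """Cast the scraped string fields to numbers.

    Assumes flight counts may contain comma thousands separators
    (an assumption about the page format).
    """
    entry['year'] = int(entry['year'])
    entry['flights'] = {k: int(v.replace(',', ''))
                        for k, v in entry['flights'].items()}
    return entry

# Hypothetical sample entry in the shape produced by step 3
sample = {'carrier': 'VX', 'airport': 'BOS', 'year': '2016', 'month': 'March',
          'flights': {'domestic': '1,234', 'international': '56'}}
print(clean_entry(sample))
```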
Mind-map
Structuring the key notes in a mind-map gives us the whole picture of Lesson 2. If you want to see a branch in detail, you can go back up and check the corresponding section.

As always, feel free to like and repost if you learned something from this post. And please don’t hesitate to ask questions or give advice.
