Data Wrangling with MongoDB (Lesson 2)

Yang Wang
Jul 21, 2017 · 4 min read

Study notes and a mind map for Udacity's free course Data Wrangling with MongoDB

Hello, everyone! Today we continue extracting data from different sources, moving on from CSV and JSON to XML and to scraping data from HTML.

XML

  • Basics of XML

Elements are the basic building blocks
Elements are composed of opening and closing tags
Elements can be nested inside other elements

  • Two types of XML

(1) Document-oriented type

For example, a research article XML document

(2) Non-document-oriented type: attributes on the elements are
heavily used (HTML works this way)

For example: OpenStreetMap data
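
To make the attribute-heavy style concrete, here is a minimal sketch that reads a small fragment written in the style of OpenStreetMap XML. The fragment and all of its values are made up for illustration; only the general shape (node elements carrying id, lat, lon, and tag children with k/v attributes) follows the OSM format.

```python
import xml.etree.ElementTree as ET

# Made-up fragment in the style of OpenStreetMap XML.
# Almost all of the data sits in attributes, not in element text.
OSM_SAMPLE = """
<osm>
  <node id="1" lat="41.88" lon="-87.63" user="alice">
    <tag k="amenity" v="cafe"/>
  </node>
  <node id="2" lat="41.89" lon="-87.62" user="bob"/>
</osm>
""".strip()

root = ET.fromstring(OSM_SAMPLE)

nodes = []
for node in root.iter("node"):
    # Element.get() returns the attribute value as a string (or None if absent).
    record = {
        "id": node.get("id"),
        "lat": float(node.get("lat")),
        "lon": float(node.get("lon")),
        "user": node.get("user"),
        # Each <tag k="..." v="..."/> child becomes one key/value pair.
        "tags": {tag.get("k"): tag.get("v") for tag in node.findall("tag")},
    }
    nodes.append(record)

for record in nodes:
    print(record)
```

Note that nothing here reads element text at all, which is the main difference from the document-oriented style.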

If you’re new to XML or would like some extra resources, here are a couple of useful tutorials:

  • Parse XML

One way to parse an XML document is to load the entire document into a tree in memory (a good fit for document-oriented XML)

Take the research article XML document as an example
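
As a rough sketch of the tree approach, the snippet below loads a tiny article-like document with Python's built-in xml.etree.ElementTree and pulls out the title and the authors. The XML and its tag names (article, front, authors, author, first, last) are invented for this example and are not the course's actual file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified research-article document (tag names are assumptions).
ARTICLE_XML = """
<article>
  <front>
    <title>An example title</title>
    <authors>
      <author><first>Ada</first><last>Lovelace</last></author>
      <author><first>Alan</first><last>Turing</last></author>
    </authors>
  </front>
  <body>
    <p>First paragraph of the article.</p>
  </body>
</article>
""".strip()

# fromstring() parses the whole document into an in-memory tree
# and returns its root element.
root = ET.fromstring(ARTICLE_XML)

# findtext() returns the text of the first element matching the path.
title = root.findtext("./front/title")

# findall() returns every element matching the path, in document order.
authors = [
    {"first": author.findtext("first"), "last": author.findtext("last")}
    for author in root.findall("./front/authors/author")
]

print(title)
print(authors)
```

For a file on disk, ET.parse(filename).getroot() gives the same kind of root element. Because the whole tree lives in memory, this suits documents of modest size.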

HTML (Web Scraping)

We will use an example to demonstrate this process. We will collect all the carrier and airport values, then make a request for each carrier-airport pair, and finally get all the departure and arrival information for each pair. The module we use to parse the HTML tree, Beautiful Soup, works much like parsing an XML tree.

Web page: AirTrain website

Procedure

1. Build lists of carrier values and airport values

2. Make HTTP requests to download all the data (save each HTML page locally, which makes it easier to debug your parser)

Making a correct HTTP request requires knowing:

(1) Which HTTP method is used (POST or GET?)
(2) Which fields must be included in the request?

The best way to build a correct HTTP request is to see how the web browser makes it:

→ Inspect an element on the web page

→ Open the Network tab

→ Find the method and all the fields that need to be included in the request

→ Send the newly built request and check whether the response is right

→ If it is not valid

→ Use requests.Session() so that the cookies are included in your HTTP request
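
Here is a minimal sketch of step 2, assuming the form is submitted with POST. The field names CarrierList and AirportList and the idea of passing in hidden form fields are assumptions for illustration; the real names are whatever the Network tab shows. The session argument is meant to be a requests.Session(), but any object with a compatible post() method works, which keeps the helper easy to test offline.

```python
def build_form_data(hidden_fields, carrier, airport):
    """Combine hidden fields copied from the page with the values we choose.

    "CarrierList" and "AirportList" are hypothetical field names.
    """
    data = dict(hidden_fields)  # copy, so the caller's dict is not modified
    data["CarrierList"] = carrier
    data["AirportList"] = airport
    return data


def download_page(session, url, hidden_fields, carrier, airport):
    """POST the form through `session` and return the response HTML as text.

    Going through a session means cookies set by earlier responses are sent
    along with this request automatically.
    """
    response = session.post(
        url, data=build_form_data(hidden_fields, carrier, airport)
    )
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    return response.text


def save_page(html, filename):
    """Write the downloaded HTML to disk so the parser can be debugged offline."""
    with open(filename, "w", encoding="utf-8") as f:
        f.write(html)
```

With the requests library installed, one would create session = requests.Session(), call session.get(url) once so the server's cookies are stored on the session, and then call download_page(session, url, ...) for each carrier-airport pair.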

3. Parse the downloaded data files and collect the data
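
Steps 1 and 3 both come down to walking the HTML tree and picking out values. The course does this with Beautiful Soup; to keep this sketch free of third-party dependencies, it swaps in the standard library's html.parser instead, so treat it as an illustration of the idea and not as the course's code. The sample HTML, the element ids, and the rule of skipping values that start with "All" are all assumptions.

```python
from html.parser import HTMLParser


class OptionCollector(HTMLParser):
    """Collect the value attribute of every <option> inside one <select>."""

    def __init__(self, select_id):
        super().__init__()
        self.select_id = select_id
        self.inside = False  # True while we are inside the target <select>
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select" and attrs.get("id") == self.select_id:
            self.inside = True
        elif tag == "option" and self.inside and "value" in attrs:
            self.values.append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self.inside = False


def option_values(html, select_id):
    """Return the option values of one drop-down, skipping aggregate entries."""
    parser = OptionCollector(select_id)
    parser.feed(html)
    # Assumption: entries such as "All" or "AllMajors" are groupings, not real codes.
    return [value for value in parser.values if not value.startswith("All")]


# Made-up page fragment with two drop-down lists.
SAMPLE_HTML = """
<select id="CarrierList">
  <option value="All">All carriers</option>
  <option value="AA">American</option>
  <option value="DL">Delta</option>
</select>
<select id="AirportList">
  <option value="AllMajors">All major airports</option>
  <option value="ATL">Atlanta</option>
  <option value="BOS">Boston</option>
</select>
"""

carriers = option_values(SAMPLE_HTML, "CarrierList")
airports = option_values(SAMPLE_HTML, "AirportList")
print(carriers, airports)
```

The two lists can then be combined (for example with itertools.product) to get every carrier-airport pair to request.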

Mind-map

Organizing the key notes in a mind map gives the whole picture of Lesson 2. If you want to see a branch in detail, go back up to the corresponding section.

As always, feel free to like and repost if you learned something from this post. And please don’t hesitate to ask questions or give advice.

