How to Do Web or API Scraping: The Minimal Things a Data Extraction Developer Must Know

Akhi Syabab Ahmad
5 min read · Apr 17, 2020


Hello, my name is Akhi (@akhisyabab). I am a Python developer with 2+ years of experience, focused on data extraction jobs, although I also sometimes do web app / API work. Besides that, I am one of the Python scraping presenters at remoteworker.id. In this article I focus on the Python programming language.

If you want to dive into the world of data extraction, there are a number of things you will encounter on the job. Before that, I stress that you must be strong in the basics, such as converting data types, looping, lists, dictionaries, and so on, because in data extraction you are required to process the data into what the client wants. There are several sources from which to generate data, namely:

API:
If you get a task whose data source is an API, then you can simply use the requests library, or another Python library that the data source has published, to retrieve the data, generally in JSON form. Then we process the data and store what the client wants. Of course, sometimes the API needs a key, but this is usually provided by the client.
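As a minimal sketch, here is what fetching JSON from an API with requests might look like. The endpoint, the api_key parameter, and the field names are all hypothetical placeholders; a real API will document its own URL, authentication, and response shape.

import requests

# Hypothetical endpoint and key; replace with what your client or the API docs provide
url = 'https://api.example.com/v1/items'
params = {'api_key': 'YOUR_KEY', 'page': 1}

res = requests.get(url, params=params)
data = res.json()  # most APIs respond with JSON

# Keep only the fields the client asked for (field names are illustrative)
results = [{'id': item.get('id'), 'name': item.get('name')}
           for item in data.get('items', [])]
print(results)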

Web page:
For this source, we can start with Python requests. We will then get variation in the response data: it can be a JavaScript response or an HTML response, depending on the source. To extract data from HTML, for example a table, we can use the very popular library Beautifulsoup4, or you can use Scrapy.
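A quick way to see which kind of response you are dealing with is to check the status code and the Content-Type header. This is just a sketch, using the Wikipedia page that appears later in this article:

import requests

res = requests.get('https://en.wikipedia.org/wiki/Main_Page')
print(res.status_code)                  # 200 means the request succeeded
print(res.headers.get('Content-Type'))  # e.g. 'text/html; charset=UTF-8'
print(res.text[:200])                   # first part of the raw response body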

In addition to a strong grasp of the basics, you must be able to read network activity. For Chrome users: right-click the page, choose Inspect, then open the Network tab. Like this:

So I will summarize some common cases that you will use often, and I will discuss them one by one.
1. Python basics
2. Network activity
3. HTTP requests
4. Parsing data from HTML (example using the bs4 Python library)
5. Writing a JSON file
6. Reading JSON data from a file
7. Writing data to a CSV or Excel file
8. Reading JSON files from a folder

1. Python basics.
There are many sources for learning basic Python: w3schools, python.org, and others, as you wish.

2. Network activity.

The screenshot above shows how to view network activity in Chrome: the far-left part of the Network tab is the activity list. Every time you load a page, you will see the site's network activity there.

3. HTTP requests.
There are several methods in HTTP, namely GET, POST, PUT, HEAD, DELETE, PATCH, and OPTIONS, but I will give examples of the two used most often: GET and POST. See the following example:

In the picture above, we access the Wikipedia main page at https://en.wikipedia.org/wiki/Main_Page using the GET method, then move to the Response tab to see the response from that URL.
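In code, the same GET request, plus a POST for comparison, looks roughly like this. The POST endpoint httpbin.org/post is a public echo service used here purely for illustration; it simply returns whatever you send it.

import requests

# GET: retrieve the Wikipedia main page
res = requests.get('https://en.wikipedia.org/wiki/Main_Page')
print(res.status_code)

# POST: send form data; httpbin echoes back what it receives as JSON
payload = {'username': 'akhi', 'query': 'canada'}
res = requests.post('https://httpbin.org/post', data=payload)
print(res.json())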

4. Parsing data from HTML.
In Python there are several libraries available for this task, the most popular being beautifulsoup4 and Scrapy. This time I will show a bit of BeautifulSoup. For complete information, you can visit https://www.crummy.com/software/BeautifulSoup/bs4/doc/

import requests
from bs4 import BeautifulSoup

# Download the page and parse the HTML into a navigable tree
url = 'https://en.wikipedia.org/wiki/Canada'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
print(soup)

Suppose we want to grab the geography infobox table on the right of the page; it can be done like this:

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Canada'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')

# Locate the infobox table by its CSS classes and print it nicely indented
geography_area = soup.find('table', attrs={'class': 'infobox geography vcard'})
print(geography_area.prettify())
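From here, a possible next step is to walk the table rows and pull out label/value pairs. This is only a sketch: Wikipedia's infobox markup can change, so the th/td structure assumed below may need adjusting.

# Rows that contain both a header cell (th) and a data cell (td)
# hold label/value pairs such as 'Capital: Ottawa'
for row in geography_area.find_all('tr'):
    header = row.find('th')
    value = row.find('td')
    if header and value:
        print(header.get_text(strip=True), ':', value.get_text(strip=True))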

5. Writing to a JSON file.
In some circumstances you might need to create a JSON file from your scraping. Here is a simple example of a Python dictionary that we will write out as a JSON file.

import json

data = {
    'name': 'Akhi',
    'age': '23',
    'country': 'Indonesia'
}

with open('results.json', 'w') as outfile:
    json.dump(data, outfile)

6. Reading from a JSON file.

import json

with open('results.json') as json_file:
    data = json.load(json_file)
print(data)

7. Writing data to CSV or Excel.
For this case I like the pandas library to help me. Here is an example:

import pandas as pd

data = [
    {'name': 'Akhi', 'age': '23', 'country': 'Indonesia'},
    {'name': 'Nick', 'age': '33', 'country': 'Canada'}
]

df = pd.DataFrame(data)
df.to_csv('results.csv', index=False)
df.to_excel('results.xlsx', index=False)  # writing .xlsx requires the openpyxl package

8. Reading JSON files from a folder.

import glob
import json
import re

# Sort the files numerically by the first number that appears in each filename
files = sorted(glob.glob('./folder_name/*.json'),
               key=lambda x: float(re.findall(r"(\d+)", x)[0]))

all_datas = []
for file in files:
    print(file)
    with open(file) as json_file:
        datas = json.load(json_file)
        all_datas.append(datas)
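If the goal is a single output file, the collected all_datas list can be handed straight to pandas as in step 7. This assumes each JSON file held a flat dictionary of fields; nested JSON would need extra flattening first.

import pandas as pd

df = pd.DataFrame(all_datas)
df.to_csv('combined_results.csv', index=False)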

Of course there is still a lot of knowledge I have not yet conveyed, and you may well develop your own techniques to handle certain cases. Hopefully this article can be useful. See you in my next post :)
