Let’s Parse the Web

Nil Madhab
Nov 28, 2020 · 12 min read

Introduction

Why Scrapy? (Optional, you can skip)

Overview

The Project

Setup

pip install --user flask scrapy

The Web App

from datetime import datetime
from functools import wraps

from flask import Flask, request, redirect, session, url_for, render_template

app = Flask(__name__, template_folder="templates")
app.config["SECRET_KEY"] = "secret secret key"

# Hard-coded data served by the app; repeated foods are intentional.
dummy_data = [
    {
        "country": "Canada",
        "gdp": "High",
        "happiness": "High",
        "food": [
            "Elk",
            "Mushrooms",
            "Peanut butter",
            "Ham",
            "Croissants",
        ]
    },
    {
        "country": "America",
        "gdp": "High",
        "happiness": "Medium",
        "food": [
            "Beef",
            "Beef",
            "Beef",
            "Ham",
            "Sugar",
            "Sugar"
        ]
    },
    {
        "country": "Uganda",
        "gdp": "Low",
        "happiness": "High",
        "food": [
            "Rice",
            "Beef",
            "Bananas",
            "Lion meat"
        ]
    },
    {
        "country": "India",
        "gdp": "Medium",
        "happiness": "Medium",
        "food": [
            "Rice",
            "Daal",
            "Potatoes",
            "Chicken",
            "Beef",
            "Spinach",
            "Fish",
            "Fish",
            "Milk",
            "Milk",
            "Milk",
            "Spices"
        ]
    },
    {
        "country": "Russia",
        "gdp": "High",
        "happiness": "High",
        "food": [
            "Ham",
            "Mayonnaise",
            "Fish",
            "Ice",
            "Bread",
            "Vodka",
            "Vodka"
        ]
    }
]


def login_required(viewfunc):
    # Redirect anonymous visitors to the login page before running the view.
    @wraps(viewfunc)
    def decorate(*args, **kwargs):
        if "logged_in" not in session:
            return redirect(url_for("login"))
        return viewfunc(*args, **kwargs)
    return decorate


@app.route("/")
@app.route("/index")
@login_required
def index():
    return render_template('index.html')


@app.route("/login", methods=['GET', 'POST'])
def login():
    if request.method == 'GET':
        return render_template('login.html')
    if request.form.get("username") == "admin" and request.form.get("password") == "admin":
        # Store the login time; today() has to be called to get a timestamp.
        session["logged_in"] = str(datetime.today())
        return redirect(url_for("index"))
    return """
Login failed. Go <a href="{}">back</a>?
""".format(url_for("login"))


@app.route("/food_by_country")
@login_required
def food_by_country():
    return render_template("food_by_country.html", data=dummy_data)


if __name__ == '__main__':
    app.run(debug=True)

python __init__.py
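
The login_required decorator in the app uses functools.wraps so that every wrapped view keeps its own name, which matters because Flask uses the function name as the default endpoint for url_for. Here is a minimal, Flask-free sketch of the same pattern. The session dictionary, the returned strings, and the secret_page function are stand-ins made up for illustration; they are not part of the app.

```python
from functools import wraps

# Stand-in for Flask's session object (an assumption for this sketch).
session = {}


def login_required(viewfunc):
    @wraps(viewfunc)  # copy the name and docstring of viewfunc onto the wrapper
    def decorate(*args, **kwargs):
        if "logged_in" not in session:
            # The real app returns redirect(url_for("login")) here.
            return "redirect to /login"
        return viewfunc(*args, **kwargs)
    return decorate


@login_required
def secret_page():
    # Hypothetical view used only to exercise the decorator.
    return "secret content"


before_login = secret_page()
session["logged_in"] = "yes"
after_login = secret_page()
```

Before the session is marked as logged in, the call is short-circuited by the wrapper; afterwards it reaches the view. Thanks to wraps, secret_page.__name__ is still "secret_page" rather than "decorate".
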
<!DOCTYPE html>
<html lang="en">
<head>
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap@4.5.3/dist/css/bootstrap.min.css" integrity="sha384-TX8t27EcRE3e/ihU7zmQxVncDAy5uIKz4rEkgIXeMed4M0jlfIDPvg6uqKI2xXr2" crossorigin="anonymous">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta charset="UTF-8">
<title>{% block title %}{% endblock %}</title>
</head>
<body>
<div class="container">
{% block content %}{% endblock %}
</div>
<script src="https://code.jquery.com/jquery-3.5.1.slim.min.js" integrity="sha384-DfXdz2htPH0lsSSs5nCTpuj/zy4C+OGpamoFVy38MVBnE+IbbVYUew+OrCXaRkfj" crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/npm/bootstrap@4.5.3/dist/js/bootstrap.bundle.min.js" integrity="sha384-ho+j7jyWK8fNQe+A12Hb8AhRq26LrZ/JpcUGGOn+Y7RsweNrtN/tE3MoK7ZeZDyx" crossorigin="anonymous"></script>
</body>
</html>
{% extends 'base.html' %}
{% block title %}Welcome to the food server{% endblock %}
{% block content %}
<p>To see favourite food by country, please click on <a href="{{ url_for('food_by_country') }}">this</a> link</p>
{% endblock %}

{% extends 'base.html' %}
{% block title %}Login{% endblock %}
{% block content %}
<div class="align-items-center" style="margin-top: 25%">
  <form method="POST" action="/login">
    <div class="form-group row d-flex justify-content-center">
      <label for="username" class="col-sm-2 col-form-label">Username</label>
      <div class="col-sm-5">
        <input type="text" name="username" placeholder="Username" value="" class="form-control" id="username">
      </div>
    </div>
    <div class="form-group row d-flex justify-content-center">
      <label for="password" class="col-sm-2 col-form-label">Password</label>
      <div class="col-sm-5">
        <input type="password" name="password" placeholder="Password" value="" class="form-control" id="password">
      </div>
    </div>
    <div class="d-flex justify-content-center">
      <button type="submit" class="btn btn-primary">Log In</button>
    </div>
  </form>
</div>
{% endblock %}

{% extends 'base.html' %}
{% block title %}Who likes what{% endblock %}
{% block content %}
<table class="food-table table table-bordered table-sm text-center">
  <thead>
    <tr>
      <th>Country</th>
      <th>Income</th>
      <th>Happiness</th>
      <th>Food</th>
    </tr>
  </thead>
  <tbody>
    {% for item in data %}
    {% set ll = item['food'] | length %}
    <tr>
      <td rowspan="{{ ll }}">{{ item['country'] }}</td>
      <td rowspan="{{ ll }}">{{ item['gdp'] }}</td>
      <td rowspan="{{ ll }}">{{ item['happiness'] }}</td>
      <td>
        {{ item['food'] | first }}
      </td>
    </tr>
    {% for food in item['food'][1:] %}
    <tr>
      <td>
        {{ food }}
      </td>
    </tr>
    {% endfor %}
    {% endfor %}
  </tbody>
</table>
{% endblock %}
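
Because the country, income, and happiness cells use rowspan, only the first row of each country has four cells; every following row holds a single food cell. The plain-Python sketch below mirrors that layout so the row shapes are easy to see. The table_rows helper and the shortened sample record are made up for illustration, and the helper assumes a non-empty food list.

```python
def table_rows(item):
    """Return the cells of each <tr> the template emits for one country."""
    foods = item["food"]
    # First row: the three rowspan cells plus the first food.
    first_row = [item["country"], item["gdp"], item["happiness"], foods[0]]
    # Remaining rows: one food cell each, sitting under the rowspan cells.
    return [first_row] + [[food] for food in foods[1:]]


# Shortened sample record (hypothetical, modelled on dummy_data).
sample = {
    "country": "Uganda",
    "gdp": "Low",
    "happiness": "High",
    "food": ["Rice", "Beef", "Bananas"],
}

rows = table_rows(sample)
```

Anything that reads this table back, a scraper included, has to treat the single-cell rows as continuations of the last four-cell row.
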
webapp/
├── __init__.py
└── templates
    ├── base.html
    ├── food_by_country.html
    ├── index.html
    └── login.html

The Spider

scrapy startproject scraper
scrapy genspider countryfood localhost:5000
scraper/
├── scraper
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── countryfood.py
│       └── __init__.py
└── scrapy.cfg

from urllib.parse import urlencode, urlsplit

import scrapy


class CountryfoodSpider(scrapy.Spider):
    name = 'countryfood'
    allowed_domains = ['localhost']
    start_urls = ['http://localhost:5000/']

    def parse(self, response):
        # Check if we need to log in.
        # After logging in, the server redirects to index.
        if urlsplit(response.request.url).path == '/login':
            yield response.follow(
                response.css("form::attr(action)").get(),
                method='POST',
                body=urlencode({
                    "username": "admin",
                    "password": "admin",
                }),
                callback=self.parse_index,
                headers={
                    "Content-Type": "application/x-www-form-urlencoded",
                },
            )
        else:
            # If we don't have to log in, then we are already in the index page.
            yield from self.parse_index(response)

    def parse_index(self, response):
        yield response.follow(
            response.css("p > a::attr(href)").get(),
            callback=self.parse_data,
        )

    def parse_data(self, response):
        # The table uses rowspan: a four-cell row starts a country and each
        # following single-cell row adds one more food to it. Indexing
        # tds[1] on a single-cell row would raise IndexError.
        record = None
        for tr in response.css("table.food-table tbody tr"):
            cells = ["".join(td.css("::text").getall()).strip() for td in tr.css("td")]
            if len(cells) == 4:
                if record is not None:
                    yield record
                record = {
                    "country": cells[0],
                    "income": cells[1],
                    "happiness": cells[2],
                    "food": [cells[3]],
                }
            elif len(cells) == 1 and record is not None:
                record["food"].append(cells[0])
        if record is not None:
            yield record

Some Attributes

parse Method
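
The parse method leans on two standard-library helpers: urlsplit to check which page the spider actually landed on, and urlencode to build the login form body. The sketch below shows what each one returns, using only the standard library. The two URLs are examples; the query string on the second one is made up to show that it does not affect the path.

```python
from urllib.parse import urlencode, urlsplit

login_url = "http://localhost:5000/login"
index_url = "http://localhost:5000/index?ref=home"  # hypothetical query string

# urlsplit breaks a URL into parts; .path ignores the host, port, and query.
needs_login = urlsplit(login_url).path == "/login"
index_path = urlsplit(index_url).path

# urlencode turns a dict into an application/x-www-form-urlencoded body.
form_body = urlencode({"username": "admin", "password": "admin"})
```

Here needs_login is True, index_path is "/index", and form_body is "username=admin&password=admin", which is the same kind of body a browser sends when the login form is submitted.
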

Generators, yield and yield from
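
Here is a small standalone sketch of the two keywords, independent of Scrapy. A function that contains yield returns a generator when called, and yield from hands control to another generator until it is exhausted. All function names below are invented for the example.

```python
def inner():
    yield "a"
    yield "b"


def outer_with_loop():
    # Re-yield every value from inner() by hand.
    for value in inner():
        yield value
    yield "c"


def outer_with_yield_from():
    # Same result, but the delegation is a single statement.
    yield from inner()
    yield "c"


looped = list(outer_with_loop())
delegated = list(outer_with_yield_from())

# Calling a generator function runs nothing until a value is requested.
gen = inner()
first_value = next(gen)
```

Both looped and delegated end up as ["a", "b", "c"]. This is the same shape as a spider callback that writes yield from self.parse_index(response): whatever parse_index yields is passed through to the caller.
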

parse_index Method and Selectors
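
To make the selector "p > a::attr(href)" concrete, here is a rough standard-library approximation of what it matches: the href of every a element whose direct parent is a p. This is not how Scrapy evaluates selectors (it uses its own selector engine); the DirectChildLinks class and the sample HTML are written purely for illustration.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they must not be pushed onto the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DirectChildLinks(HTMLParser):
    """Collect href values of <a> elements whose parent is a <p>."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open elements
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.open_tags and self.open_tags[-1] == "p":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching opening tag, if there is one.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass


html = """
<div class="container">
  <p>To see favourite food by country, click <a href="/food_by_country">this</a> link</p>
  <p><span><a href="/nested">nested, so not a direct child</a></span></p>
</div>
"""

parser = DirectChildLinks()
parser.feed(html)
# Like .get() on a selector result: the first match, or None if nothing matched.
first_href = parser.hrefs[0] if parser.hrefs else None
```

Only "/food_by_country" is collected. The second link sits inside a span, so the child combinator > skips it, whereas a plain descendant selector "p a" would have matched both.
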


parse_data Method


Running the spider

scrapy list
scrapy crawl countryfood -L WARN -o -:jl
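
With -o -:jl the items are written to standard output as JSON Lines, that is, one JSON object per line. If you save that output, it can be read back with nothing but the json module. The sketch below assumes that format; the two sample records are invented and shortened, so they are not the spider's real output.

```python
import json

# Hypothetical sample in JSON Lines form: one JSON object per line.
sample_output = "\n".join([
    json.dumps({"country": "Canada", "food": ["Elk", "Ham"]}),
    json.dumps({"country": "Russia", "food": ["Fish", "Bread"]}),
])

# Parse each non-empty line on its own; there is no enclosing JSON array.
records = [json.loads(line) for line in sample_output.splitlines() if line.strip()]
countries = [record["country"] for record in records]
```

Since every line is a complete document, a large file can be processed one line at a time instead of being loaded whole.
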
scraper/
├── scraper
│   ├── scraper
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── middlewares.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   └── spiders
│   │       ├── countryfood.py
│   │       └── __init__.py
│   └── scrapy.cfg
└── webapp
    ├── __init__.py
    └── templates
        ├── base.html
        ├── food_by_country.html
        ├── index.html
        └── login.html

webtutsplus

Best web and mobile development tutorials

Nil Madhab

Written by

Developer @Booking.com | ex: Samsung, OYO | IIT Kharagpur | Entrepreneur, founder of simplecoding.dev | connect me https://twitter.com/Nilmadhabmondal

webtutsplus

Find the best tutorials and courses for web and mobile. Learn Java, Python, Spring Boot, Android, Node.js, SQL, AWS, Docker & more.
