Make a crawler with Django and Scrapy

Tiago Piovesan
5 min read · Aug 5, 2019


Hey everybody, how are you?

First of all, sorry if my English is not the best! But what I'll show here is a great and very useful thing.

I got a mission at my job: they asked me to build some crawlers to collect data that was useful at the time. The complicated part of this work was that I had to integrate the Django framework with the Scrapy framework.

That's because I couldn't find anything about it in the Scrapy docs.

Well, after searching and experimenting for a long time, I got it!!!

Versions:

Python 3.6.7

Django 2.1.3

Scrapy 1.5

Let's collect the best movies from 2019

In this example we'll get data from the Rotten Tomatoes website, which is an aggregator of movie reviews. The objective is to capture the 100 best movies of 2019 according to them, bringing information like title, rating, poster, approval percentage and more.

Create a Django project

Assuming you already have Python on your computer, do the following steps.

1. Create and activate a virtualenv

$ python3 -m venv .venv && source .venv/bin/activate

2. Install Django 2.1.3

$ pip install Django==2.1.3

3. Initialize the project

$ django-admin startproject best_movies .

4. Create app Movie

$ python manage.py startapp movie

Make a model…

best_movies/movie/models.py
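
The model itself is in the tutorial repository; a minimal version of it could look like the sketch below. The field names (title, rating, approval_percentage, poster) are my assumption of the data we want to capture, so adapt them as needed.

from django.db import models


class Movie(models.Model):
    # Fields are illustrative, based on the data we said we want to capture.
    title = models.CharField(max_length=255)
    rating = models.CharField(max_length=20, blank=True)  # e.g. "PG-13"
    approval_percentage = models.CharField(max_length=10, blank=True)  # e.g. "97%"
    poster = models.ImageField(blank=True)  # filled in later by the pipeline

    def __str__(self):
        return self.title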

5. Register our application in the Django Admin

In movie/admin.py

from django.contrib import admin
from .models import Movie

class MovieAdmin(admin.ModelAdmin):
    pass

admin.site.register(Movie, MovieAdmin)

6. And let's not forget to register our application in INSTALLED_APPS

In best_movies/settings.py

INSTALLED_APPS = [
    ...
    'django.contrib.staticfiles',
    'movie',
]

In settings.py we'll also need to add the following lines; this is to deal with our posters.

MEDIA_URL = '/media/'
MEDIA_ROOT = os.path.join(BASE_DIR, 'media')

7. We also need the Pillow lib to deal with our images

Since our image field (the poster) needs some specific handling, Django recommends using this library for the job. At the terminal:

$ pip install Pillow

8. Add the media route to the URL settings so we can see the images locally

In best_movies/urls.py

from django.contrib import admin
from django.conf import settings
from django.conf.urls import url
from django.conf.urls.static import static

urlpatterns = [
    url(r'^admin/', admin.site.urls),
] + static(settings.MEDIA_URL, document_root=settings.MEDIA_ROOT)

9. Last thing, make the migrations

$ python manage.py makemigrations && python manage.py migrate

10. Create a superuser for yourself

$ python manage.py createsuperuser

And the Django side is done!

At the terminal run python manage.py runserver and access: localhost:8000/admin

Installing Scrapy in the project

1. Install the libs

Now that we have the Django project, let's go into the best_movies folder and install the scrapy lib.

$ pip install scrapy==1.5

And scrapy-djangoitem to connect Scrapy with the Django models.

$ pip install scrapy-djangoitem==1.1.1

2. Initialize the project

Let's create a project with the name crawling

$ scrapy startproject crawling

Here is what your project structure should look like:

├── crawling
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── __pycache__
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg

3. Bring in the model we created in Django earlier

In crawling/items.py

import scrapy
from scrapy_djangoitem import DjangoItem
from movie.models import Movie

class MovieItem(DjangoItem):
    django_model = Movie
    image_urls = scrapy.Field()
    images = scrapy.Field()

4. Settings.py

In crawling/settings.py

At the beginning of the file, place the following lines…

import os
import sys
sys.path.append(os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), ".."))
os.environ['DJANGO_SETTINGS_MODULE'] = 'best_movies.settings'
import django
django.setup()

And let's not forget that we don't want to cause trouble for our friend Rotten Tomatoes with too many requests. Add the following setting in crawling/settings.py.

DOWNLOAD_DELAY = 3

Another important setting we need here is the path where we'll save the images. Remember that MEDIA_ROOT is the variable we set in the first part of the tutorial.

from best_movies.settings import MEDIA_ROOT

IMAGES_STORE = MEDIA_ROOT

In this file, we will add the following ITEM_PIPELINES, which are responsible for handling the information collected.

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 100,
    'crawling.pipelines.CrawlingPipeline': 100,
}

It looks like we already have everything set up to write our first spider. Let's go?!

5. Create the first spider

Inside the spiders directory we will create a file called rottentomatoes.py

This file tells Scrapy which path to take on the Rotten Tomatoes page so that the data is collected correctly. Page paths can be specified in two ways, CSS and/or XPath (see the Scrapy selectors documentation for more).

I chose to use XPath and CSS at the same time, so you can see in practice how they both work.

crawling/spiders/rottentomatoes.py
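
The full spider is in the tutorial repository; below is a minimal sketch of what it can look like. The start URL and the CSS selectors inside parse_item are assumptions about the Rotten Tomatoes markup at the time, so treat them as illustrative and adjust them if the page changes.

import scrapy

from crawling.items import MovieItem


class RottentomatoesSpider(scrapy.Spider):
    name = 'rottentomatoes'
    allowed_domains = ['rottentomatoes.com']
    # Assumed listing page with the top movies of 2019
    start_urls = ['https://www.rottentomatoes.com/top/bestofrt/?year=2019']

    def parse(self, response):
        # XPath: walk the rows of the ranking table and follow the link
        # found in the third column of each row.
        for link in response.xpath('//table[@class="table"]//tr/td[3]/a/@href').extract():
            yield response.follow(link, callback=self.parse_item)

    def parse_item(self, response):
        # CSS: capture each piece of information from the movie page.
        # These selectors are illustrative and may need updating.
        item = MovieItem()
        item['title'] = response.css('h1.mop-ratings-wrap__title::text').extract_first()
        item['approval_percentage'] = response.css('span.mop-ratings-wrap__percentage::text').extract_first()
        item['image_urls'] = response.css('img.posterImage::attr(src)').extract()
        yield item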

What we are doing here is basically going through the rows of the table that has the table class and taking all the links found at the path '/tr/td[3]/'. Then we loop over each of these links and access each one through yield, which makes a new request.

Each of these requests has a callback to a new function called parse_item, which is actually responsible for capturing each piece of information we need through selectors, this time written with CSS.
With all this done, and with an instance of the Movie item already filled in, we return it, and from here it follows its path to the pipelines.py file.

6. Pipelines

Responsible for cleaning the data, the pipelines file receives the information obtained by Scrapy and handles it as needed. In the simple example I made, it removes spaces, strips characters, merges items of a list, and so on.

crawling/pipelines.py
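
The real file is in the tutorial repository; here is a simplified sketch using the same assumed field names as the model above. The clean_ helper and the fields it touches are illustrative.

class CrawlingPipeline(object):

    def clean_spaces(self, value):
        # Collapse internal whitespace and strip the edges of a text value.
        return ' '.join(value.split()) if value else value

    def process_item(self, item, spider):
        item['title'] = self.clean_spaces(item.get('title'))
        item['approval_percentage'] = self.clean_spaces(item.get('approval_percentage'))

        # If the ImagesPipeline has already downloaded the poster, keep the
        # local path of the first downloaded image in the poster field.
        if item.get('images'):
            item['poster'] = item['images'][0]['path']

        # DjangoItem can persist itself straight into the Django database.
        item.save()
        return item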

I tried to keep all the cleaning methods prefixed with clean_ to standardize them; ideally the standardization functions would be separated into different files so the pipeline doesn't turn into a fruit salad of data-cleaning code.

In this same file I chose to put the logic that saves the item to the database. Keep in mind that this can be improved a lot!! But it is already functional.

7. Let’s put it to work

Ready??!! Just cross your fingers and run inside best_movies/crawling:

$ scrapy crawl rottentomatoes

In the console you can watch the data being captured in real time.

Well, if all went well, when you open the Django admin you will now see the collected movies listed there.

We have the information on the 100 best movies of 2019! 🎉

I hope you understood and liked the result, from here it’s up to you! Be creative, you can do a lot with it.

I'll be making the tutorial code available on GitHub, so if you have any questions you can check it there.

Tiago Ezequiel Piovesan
