Scraping Medium Posts using Scrapy

Aiswarya Ramachandran · Published in Job Automated · Apr 10, 2018

I wanted a way to look at what people are writing on Medium about Data Science and here’s how I did it.

Medium is a great tool for posting and discovering content on the latest topics, and being a data enthusiast, I wanted to understand what people are writing about Data Science and what kind of articles are well read. So I decided to build a crawler using Scrapy, a Python library.

To build any crawler, it is imperative to understand what requests are made to the server to fetch the data. To get this information, I used the “Network” tab in Chrome’s Developer Tools to see how the requests are made, and on that basis set the header and cookie information. The Network tab also lets you click on a request and inspect its response. This proved very useful: when a request is made to Medium, the response is a JSON object containing the information about the posts, so all I had to do was write that JSON output to a file, which can then be processed and stored in Excel or a database.
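Before wiring anything into Scrapy, it can help to replay the request once on its own to confirm that the header and cookie values you copied from the Network tab actually work. Here is a minimal sketch using the requests library; the dictionary contents below are placeholders, not Medium’s real keys, so copy the actual entries from your own browser session:

import requests

# Placeholder header/cookie values -- copy the real entries shown for the
# search request in Chrome's Network tab.
header = {'accept': 'application/json'}
cookie = {'uid': '<value from DevTools>'}

resp = requests.get('https://medium.com/search/posts?q=Data%20Science',
                    headers=header, cookies=cookie)

# Medium prefixes its JSON with a guard string ending in "</x>";
# printing the first few hundred characters shows the raw shape.
print(resp.text[:300])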

Note: In this post, I am covering only the steps required to build the crawler. I will cover how to process and analyze the extracted data in future posts.

Requirements

This code is written in Python 2.7 using the Scrapy library. I have installed Python 2.7 using the Anaconda distribution.

1. To install Scrapy, run the following command:

pip install scrapy
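If you want to confirm the installation worked, the Scrapy command-line tool will print its version:

scrapy version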

Cool! Let us now get started with writing the crawler.

Code

1. Create a folder for your project

mkdir medium_scrapper

2. Go to the folder you created and create a new Python file (medium_scrapper.py)

We will start with a very basic scraper class that uses scrapy.Spider, the simplest Spider class provided by Scrapy. It needs two things: a name for the spider and the start URLs (a list of URLs from which data has to be crawled). Add the following piece of code to our Python file (medium_scrapper.py). In it, I have enabled AutoThrottle (AUTOTHROTTLE_ENABLED set to True in the spider’s custom_settings), so that the crawler automatically adjusts the time delay between requests based on the web server’s load.

import scrapy


class MediumPost(scrapy.Spider):
    name = 'medium_scrapper'
    handle_httpstatus_list = [401, 400]

    # Enable AutoThrottle so the delay between requests adapts to
    # how loaded the server is.
    custom_settings = {'AUTOTHROTTLE_ENABLED': True}

    def start_requests(self):
        start_urls = ['https://www.medium.com/search/posts?q=Data%20Science']
        for url in start_urls:
            yield scrapy.Request(url, method='GET', callback=self.parse)

    def parse(self, response):
        pass

Let us understand the code. We have defined two functions, start_requests() and parse(). When we run the crawler, Scrapy first calls start_requests(). In start_requests() we issue a request for each URL in start_urls and register parse() as the callback that handles the response.

The parse() function is where the response is processed and the crawled data is extracted. Since we have left parse() empty, the crawler will exit without doing anything.

Start your crawler with the following command at the command prompt:

scrapy runspider medium_scrapper.py

Once you run it, you should see output like the one below:

Output:
2018-04-11 00:09:57 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-04-11 00:09:57 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 2.7.14 |Anaconda, Inc.| (default, Nov 8 2017, 13:40:45) [MSC v.1500 64 bit (AMD64)], pyOpenSSL 17.5.0 (OpenSSL 1.0.2n 7 Dec 2017), cryptography 2.1.4, Platform Windows-10-10.0.16299
2018-04-11 00:09:57 [scrapy.crawler] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2018-04-11 00:09:57 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2018-04-11 00:09:57 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-04-11 00:09:57 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-04-11 00:09:57 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-04-11 00:09:57 [scrapy.core.engine] INFO: Spider opened
2018-04-11 00:09:57 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-04-11 00:09:57 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-04-11 00:09:57 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://medium.com/search/posts?q=Data%20Science> from <GET https://www.medium.com/search/posts?q=Data%20Science>
2018-04-11 00:09:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://medium.com/search/posts?q=Data%20Science> (referer: None)
2018-04-11 00:09:58 [scrapy.core.engine] INFO: Closing spider (finished)
2018-04-11 00:09:58 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
...
2018-04-11 00:09:58 [scrapy.core.engine] INFO: Spider closed (finished)

Let us now write a simple parse() function to process the response. In parse(), we extract the JSON response, write it to a file, and check whether there is a next page; if there is, we send a POST request with the header, cookie and formdata (the request payload) via scrapy.Request(). You can get the header, cookie and payload details from the “Network” tab in Chrome’s Developer Tools.
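For reference, the slice of the JSON that the pagination logic relies on looks roughly like this. Only the keys read by the code below are shown, and the values are illustrative, so check your own response in the Network tab:

example_response = {
    'payload': {
        'paging': {
            'next': {
                'ignoredIds': [],   # ids the next request should skip
                'page': 2,          # next page number
                'pageSize': 10      # results per page
            }
        }
    }
}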

import codecs
import json

import scrapy

# header and cookie should hold the request headers and cookies copied
# from Chrome's "Network" tab; fill them in before running.
header = {}
cookie = {}


def writeTofile(filename, data):
    # Minimal version of the helper (not shown in the article): append each
    # page's JSON response as a single line so every crawled page is kept.
    with codecs.open(filename, 'a', 'utf-8') as outfile:
        outfile.write(data.strip() + u"\n")


class MediumPost(scrapy.Spider):
    name = 'medium_scrapper'
    handle_httpstatus_list = [401, 400]

    custom_settings = {'AUTOTHROTTLE_ENABLED': True}

    def start_requests(self):
        start_urls = ['https://www.medium.com/search/posts?q=Data%20Science']
        for url in start_urls:
            yield scrapy.Request(url, method='GET', callback=self.parse)

    def parse(self, response):
        # Medium prefixes its JSON with a guard string ending in
        # "while(1);</x>"; keep only the JSON that follows it.
        response_data = response.text
        response_split = response_data.split("while(1);</x>")
        response_data = response_split[1]

        filename = "medium.json"
        writeTofile(filename, response_data)

        # Parse the JSON we just received to look for the paging info.
        data = json.loads(response_data)

        # Check if there is a 'next' entry in the paging data
        if 'paging' in data['payload']:
            data = data['payload']['paging']
            if 'next' in data:
                # Make a POST request for the next page of results
                print("In Paging, Next Loop")
                data = data['next']
                formdata = {
                    'ignoredIds': data['ignoredIds'],
                    'page': data['page'],
                    'pageSize': data['pageSize']
                }

                yield scrapy.Request('https://www.medium.com/search/posts?q=Data%20Science',
                                     method='POST',
                                     body=json.dumps(formdata),
                                     headers=header,
                                     cookies=cookie,
                                     callback=self.parse)

Your crawler is now ready! Save the file and run the following command again:

scrapy runspider medium_scrapper.py

The crawler should now crawl the data and save it to medium.json, which can be processed later (a short sketch of that follows the log below). While it runs, you should see output like this:

2018-04-11 00:31:40 [scrapy.core.engine] INFO: Spider opened
2018-04-11 00:31:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-04-11 00:31:40 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-04-11 00:31:40 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <POST https://medium.com/search/posts?q=Data%20Science> from <POST https://www.medium.com/search/posts?q=Data%20Science>
2018-04-11 00:31:43 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://medium.com/search/posts?q=Data%20Science> (referer: https://medium.com/search?q=Data%20Science)
In Paging, Next Loop
2018-04-11 00:31:43 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <POST https://medium.com/search/posts?q=Data%20Science> from <POST https://www.medium.com/search/posts?q=Data%20Science>
2018-04-11 00:31:45 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://medium.com/search/posts?q=Data%20Science> (referer: https://medium.com/search?q=Data%20Science)
In Paging, Next Loop
2018-04-11 00:31:45 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <POST https://medium.com/search/posts?q=Data%20Science> from <POST https://www.medium.com/search/posts?q=Data%20Science>
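Although the full processing and analysis is left for future posts, here is a minimal sketch of how the saved file could be read back. It assumes the writeTofile() helper above, which appends one JSON document per line, and it guesses that the post objects sit under payload['value'] with a 'title' field, so inspect your own medium.json and adjust the keys:

import codecs
import json

with codecs.open("medium.json", 'r', 'utf-8') as infile:
    for line in infile:
        page = json.loads(line)
        payload = page.get('payload', {})
        # NOTE: 'value' and 'title' are assumptions about Medium's schema;
        # adjust after inspecting the file yourself.
        for post in payload.get('value', []):
            print(post.get('title'))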

You can check out the complete code here. Hope this article was helpful.
