Scraping Hub

Google Summer of Code 2020 Project Report

Scrapy | Python Software Foundation | Google Summer of Code

Code Links

H2ClientProtocol: #4769
HTTPNegotiateDownloadHandler: #4770

Introduction

This project aims to add HTTP/2 support to Scrapy by implementing an HTTP handler that gracefully upgrades to HTTP/2 where possible and takes advantage of its compression and efficiency gains.

Throughout the process, my mentors Adrián Chaves (Gallaecio), Eugenio Lacuesta (elacuesta) and Andrey Rahmatullin (wRAR) have been incredibly helpful. The regular reviews on the PR and weekly meetings helped a lot.

Key Advantages of HTTP/2

  • Single connection per origin: HTTP/2 uses one connection per origin, instead of one connection per file request. This means much less need for time-consuming connection setup, which is especially beneficial with TLS, because TLS connections are particularly time-consuming to create.
  • Faster TLS performance: HTTP/2 only needs one expensive TLS handshake, and multiplexing gets the most out of the single connection. HTTP/2 also compresses header data, and dropping HTTP/1.1 workarounds such as file concatenation lets caching work more efficiently.

Community Bonding Period

GSoC kicked off with the Community Bonding period. I started by looking into the HTTP/2 client implementation in hyper to get a better picture. Simultaneously, I read Scrapy’s codebase, specifically the implementation of HTTP11DownloadHandler, to familiarize myself with the basic requirements of the project. The major difficulty I had was getting the hang of the event-driven approach in Twisted.

My mentors guided me in designing a clean codebase architecture, and we settled on dividing the project into two main components:

  1. HTTP/2 Client Protocol
  2. Download Handlers

HTTP/2 Client Protocol

We used hyper-h2, which implements a complete HTTP/2 protocol stack built on a set of finite state machines. The code structure of H2ClientProtocol is inspired by H2Connection in Twisted. It consists of two main classes:

  • H2ClientProtocol
  • Stream

Download Handlers

Two main download handlers are introduced:

  • H2DownloadHandler
  • HTTPNegotiateDownloadHandler

Timeline

I dedicated the first 3 weeks entirely to working on H2ClientProtocol. During this time, reading RFC 7540 resolved many of my bugs and much of my confusion. Initially I was intimidated by Twisted, but I gradually became comfortable working with it.

After 3 long weeks and lots of bugs, the client was working for both GET and POST requests. The client can handle a large number of concurrent requests over a single connection instance. I spent the next week writing unit tests using the Twisted Trial framework.

After writing the tests, I moved on to the following:

  • H2ConnectionPool, which maintains a pool of all HTTP/2 connections. It works by creating a map from a key (derived from the request URI) to the H2ClientProtocol instance. For example, if we get N requests in total, each with its own remote URL, and there are M unique base URLs among them, then the pool maintains at most M connections, where M <= N always. For any request, we check whether an HTTP/2 connection is already established for its key; if so we reuse it, otherwise we create a new connection.
  • H2Agent, which is responsible for issuing the request. Internally it uses the H2ConnectionPool to establish a new connection if required or to reuse a cached one. The H2Agent also wraps the context factory passed to its constructor in H2WrappedContextFactory, which updates the ClientTLSOptions context to accept only h2 as a protocol during NPN or ALPN. The constructor signature of H2Agent is exactly the same as that of Twisted’s Agent class, so that it is easy to integrate into Scrapy.
  • H2DownloadHandler, which is Scrapy’s way of issuing a request. There are similar download handlers for HTTP/1.x and other protocols.
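The pooling idea described above can be illustrated with a minimal, standard-library-only sketch. It is not the code from the PR: FakeConnection, ConnectionPool and key_for are hypothetical names, and the key is assumed to be the (scheme, host, port) triple of the request URL.

```python
from urllib.parse import urlsplit


class FakeConnection:
    """Hypothetical stand-in for an established HTTP/2 connection."""

    def __init__(self, key):
        self.key = key
        self.requests = []

    def request(self, url):
        # A real client would open a new stream here; we only record the URL.
        self.requests.append(url)


class ConnectionPool:
    """Maps a key derived from the request URI to one shared connection."""

    def __init__(self):
        self._connections = {}

    @staticmethod
    def key_for(url):
        # Assumption: requests sharing scheme, host and port share a connection.
        parts = urlsplit(url)
        default_port = 443 if parts.scheme == "https" else 80
        return (parts.scheme, parts.hostname, parts.port or default_port)

    def get_connection(self, url):
        key = self.key_for(url)
        conn = self._connections.get(key)
        if conn is None:
            # No cached connection for this key yet, so create one.
            conn = FakeConnection(key)
            self._connections[key] = conn
        return conn

    def __len__(self):
        return len(self._connections)


urls = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.org/",
    "https://example.com:443/c",
    "https://example.org/d",
]
pool = ConnectionPool()
for url in urls:
    pool.get_connection(url).request(url)
```

Here N is 5 requests and M is 2 unique base URLs, so the pool ends up holding two connections.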

Apart from the above classes, I added an idle timeout to H2ClientProtocol using Twisted’s TimeoutMixin. If the connection stays idle for too long (~240 seconds), it closes itself and fires a Deferred that is handled by H2ConnectionPool, so that upcoming requests do not use a closed connection and instead create a new one if required.
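To make the interaction between the idle timeout and the pool concrete, here is a rough sketch that deliberately avoids Twisted: a plain callback stands in for the Deferred, and the current time is passed in by hand instead of being driven by the reactor. IdleConnection, Pool and check_idle are hypothetical names used only for illustration.

```python
class IdleConnection:
    """Hypothetical connection that closes itself after being idle too long."""

    def __init__(self, idle_timeout=240):
        self.idle_timeout = idle_timeout
        self.last_activity = 0.0
        self.closed = False
        self._close_callbacks = []

    def on_close(self, callback):
        # Stand-in for attaching a callback to a "connection lost" Deferred.
        self._close_callbacks.append(callback)

    def touch(self, now):
        # Any traffic on the connection resets the idle period.
        self.last_activity = now

    def check_idle(self, now):
        if not self.closed and now - self.last_activity >= self.idle_timeout:
            self.closed = True
            for callback in self._close_callbacks:
                callback(self)


class Pool:
    """Drops a connection from its map as soon as the connection closes."""

    def __init__(self):
        self.connections = {}

    def get(self, key, now):
        conn = self.connections.get(key)
        if conn is None:
            conn = IdleConnection()
            conn.touch(now)
            self.connections[key] = conn
            conn.on_close(lambda _conn: self.connections.pop(key, None))
        return conn


pool = Pool()
first = pool.get("example.com", now=0)
first.check_idle(now=100)  # still within the timeout, nothing happens
first.check_idle(now=240)  # idle for 240 seconds: closes and leaves the pool
second = pool.get("example.com", now=241)
```

Because the pool removes the entry when the close callback fires, the second lookup creates a fresh connection instead of handing out the closed one.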

While working on H2Agent, I remember that the _StandardEndpointFactory would not establish a proper HTTP/2 connection. The only error I got was “Connection was closed in an un-clean manner”, which did not really help, and the stack trace was not very helpful either. I had to dig deep into this, which gave me some amazing insights into how Twisted and the TLS handshake work internally. I found that the connection was actually established and that the problem was in the TLS handshake. For some reason, specifying h2 as the acceptable protocol in SSL.Context before the connection starts to be established works, but anything else, including updating the acceptable protocols list during the handshake, does not! To fix this, I created a wrapper class that wraps any context factory implementing IPolicyForHTTPS and updates its acceptable protocols list.
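The same principle, fixing the ALPN protocol list on the context before any connection exists, can be shown with Python’s standard ssl module. This is only an analogue and not the Twisted/pyOpenSSL wrapper from the PR; make_h2_only_context is a hypothetical helper name.

```python
import ssl


def make_h2_only_context():
    """Build a client TLS context that offers only "h2" through ALPN."""
    context = ssl.create_default_context()
    # The protocol list is set on the context up front, before any socket
    # is wrapped, so it is already in place when the handshake begins.
    context.set_alpn_protocols(["h2"])
    return context


context = make_h2_only_context()
```

After a handshake made with a socket wrapped by this context, SSLSocket.selected_alpn_protocol() reports which protocol the server picked, or None if the server did not agree on one.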

After this, I started on HTTPNegotiateDownloadHandler, which internally maintains a LocalCache that stores the negotiated protocols of earlier requests. NegotiateAgent uses ALPN or NPN (whichever is available) to negotiate a protocol (presently one of HTTP/1.1 or HTTP/2) with the remote server and issues the request to the respective download handler. Presently, all requests made via a proxy are issued directly using the HTTP11DownloadHandler.

Do check out my weekly blog on the Python Software Foundation website.
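The dispatch logic can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not the handler’s actual code: NegotiationCache, choose_protocol and fake_negotiate are hypothetical names, and the real negotiation happens during the TLS handshake instead of in a plain function call.

```python
class NegotiationCache:
    """Hypothetical cache remembering the protocol negotiated per origin."""

    def __init__(self):
        self._protocols = {}

    def get(self, origin):
        return self._protocols.get(origin)

    def store(self, origin, protocol):
        self._protocols[origin] = protocol


def choose_protocol(origin, cache, negotiate, via_proxy=False):
    """Return the protocol whose download handler should issue the request."""
    if via_proxy:
        # Proxied requests skip negotiation and always use HTTP/1.1.
        return "http/1.1"
    protocol = cache.get(origin)
    if protocol is None:
        # First request to this origin: negotiate once and remember the result.
        protocol = negotiate(origin)
        cache.store(origin, protocol)
    return protocol


calls = []


def fake_negotiate(origin):
    # Stand-in for ALPN/NPN: pretend only example.com speaks HTTP/2.
    calls.append(origin)
    return "h2" if origin == "example.com" else "http/1.1"


cache = NegotiationCache()
first = choose_protocol("example.com", cache, fake_negotiate)
second = choose_protocol("example.com", cache, fake_negotiate)
proxied = choose_protocol("example.com", cache, fake_negotiate, via_proxy=True)
```

The second request to the same origin is answered from the cache, so negotiation runs only once per origin.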

Conclusion

I am super happy to have worked with and contributed to Scrapy. Brainstorming to resolve errors whose solutions were not even available on StackOverflow gave me a new perspective: solving them by reverse engineering my way back to the root cause, which included reading the source code of the internal libraries used. The amount of knowledge and experience I have gained during this Google Summer of Code is incomparable.

References

  1. https://developers.google.com/web/fundamentals/performance/http2
  2. https://tools.ietf.org/html/rfc7540

Aditya Kumar