Scraping Nasdaq data with Scrapy + Selenium

Rain Wu
Random Life Journal
Nov 27, 2019 · 5 min read

The previous article was about three months ago, so it’s time to write some technical notes~ I kicked off my side-project for graduation these days, and the notes will be released here.

A Scraper?

Yep~ part of my side-project is a data collector that gathers financial data from some target websites automatically, and Nasdaq is the first website I selected to test whether my scraper works.

Scrapy

Scrapy is a powerful web scraping framework in Python that integrates lots of functionality, such as processing methods for requests and responses, customizable data export pipelines…etc. Below is the architecture of Scrapy.

Scrapy architecture, from https://docs.scrapy.org/en/latest/topics/architecture.html

Although Scrapy is moooooooooore complicated than other scraping tools (e.g. requests, bs4), it meets my requirements: multiple websites, a customized pipeline, and extensibility. If you only wanna get a list of main courses from a restaurant website, the requests module is enough, don’t torture yourself.

To avoid making the story too long, I don’t plan to give a full introduction here; go to the Scrapy documentation for more details~

Selenium

Some websites use anti-scraping techniques to block scrapers, well this is really annoying. Selenium is a tool that simulates human user behaviour in order to fool the robot-detection mechanism, but this alone is far from enough, we still need other tricks to scrape the Nasdaq website.

Here’s the Selenium documentation; study it yourself if you want to learn more.

Inside the Scraper

Spider

First take a look at the spider; below is the spider for one of the indexes, called “NDX”.
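The original gist isn’t reproduced here, so this is only a minimal sketch of what that spider looks like — the start URL, the xpath of the data table, and the field names are assumptions for illustration, not the exact selectors of the real Nasdaq DOM:

```python
import scrapy

from ..items import TimeFrame


class NdxSpider(scrapy.Spider):
    """Spider for the historical data table of the NDX index."""

    name = "ndx"
    # Assumed URL of the historical-data page for the NDX index
    start_urls = ["https://www.nasdaq.com/market-activity/index/ndx/historical"]

    def parse(self, response):
        # Each <tr> of the data table is one time frame of the index;
        # the xpath below is a placeholder for the real Nasdaq DOM structure.
        for row in response.xpath('//table[contains(@class, "historical-data")]//tbody/tr'):
            item = TimeFrame()
            item["date"] = row.xpath("./td[1]/text()").get()
            item["open"] = row.xpath("./td[2]/text()").get()
            item["high"] = row.xpath("./td[3]/text()").get()
            item["low"] = row.xpath("./td[4]/text()").get()
            item["close"] = row.xpath("./td[5]/text()").get()
            yield item
```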

The xpath selector is what I use to extract the elements of the data table, according to the DOM of the Nasdaq website.

And TimeFrame is the customized data storage structure I defined in items.py, corresponding to the table header.
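For reference, such an item in items.py would look roughly like this (the field names are assumed to mirror the table header):

```python
import scrapy


class TimeFrame(scrapy.Item):
    # One row of the historical data table
    date = scrapy.Field()
    open = scrapy.Field()
    high = scrapy.Field()
    low = scrapy.Field()
    close = scrapy.Field()
```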

Unfortunately, a simple scraper like this will be blocked by the Nasdaq website’s anti-scraping detection, so we need to do some post-processing on our requests via middlewares in Scrapy.

Middlewares

Scrapy provides two kinds of middleware classes, SpiderMiddleware and DownloaderMiddleware; the one we select to deal with requests is DownloaderMiddleware.

One anti-scraping technique is to check the User-Agent information in the request header. The requests from my script don’t carry a browser User-Agent by default, so I need to add one to let them pretend to be a browser; we can specify it directly on the request object:
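A minimal sketch of that idea as a downloader middleware — the UA string itself is just an example of a browser-like value:

```python
class FixedUserAgentMiddleware:
    """Attach a browser-like User-Agent to every outgoing request."""

    USER_AGENT = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
    )

    def process_request(self, request, spider):
        # Overwrite the header directly on the request object
        request.headers["User-Agent"] = self.USER_AGENT
        return None  # let the request continue through the download chain
```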

Some websites will also track the User-Agent information; if you request too many times with the same UA, you will get banned. So I implemented a hacky solution here that scrapes for a random User-Agent in real time, which means each request going through this middleware will carry a different User-Agent.
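My middleware fetches fresh User-Agent strings on the fly; as a simpler sketch of the same rotation idea, here’s a version that just picks randomly from a small pre-collected pool:

```python
import random


class RandomUserAgentMiddleware:
    """Rotate the User-Agent so consecutive requests don't share the same one."""

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Safari/605.1.15",
        "Mozilla/5.0 (X11; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None
```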

The principle of proxies is just like the User-Agent: the requests from our computer carry our own IP address by default, so we need a proxy between the website and our computer to help us send requests to the website from different IP addresses.

Scraping for proxies in real time, too:
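A sketch of the proxy middleware — the addresses below are placeholders, while the real middleware scrapes a free-proxy listing on the fly:

```python
import random


class RandomProxyMiddleware:
    """Route each request through a proxy taken from a pool of free proxies."""

    # Placeholder addresses; the real middleware refreshes this list at run time.
    PROXIES = [
        "http://203.0.113.10:8080",
        "http://203.0.113.11:3128",
    ]

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honours the "proxy" meta key
        request.meta["proxy"] = random.choice(self.PROXIES)
        return None
```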

But here’s another problem: I can only use free proxies because I’m poor, and sometimes the free proxy I get is already unavailable or not healthy enough. That causes all kinds of trouble, like falling into a redirect loop, getting a 404 from the proxy server, spending too much time connecting to the website…etc, so I won’t enable this middleware unless necessary; User-Agent rotating is enough in most cases.

Finally, it’s time to interact with the website to scrape the data; we use Selenium to simulate the user behaviour:
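The original gist isn’t shown here, so below is only a rough sketch of the idea: a downloader middleware that opens the page in Chrome, reuses the User-Agent chosen earlier, dismisses the popup, turns the pages, and hands the accumulated HTML back to Scrapy in a single response. The xpath of the popup button and the next-page button are placeholders, not the real Nasdaq selectors:

```python
import time

from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.common.by import By


class SeleniumDownloaderMiddleware:
    """Fetch the page with a real Chrome browser instead of Scrapy's downloader."""

    def process_request(self, request, spider):
        options = webdriver.ChromeOptions()
        # Reuse the User-Agent chosen by the earlier middleware
        user_agent = request.headers.get("User-Agent")
        if user_agent:
            options.add_argument("--user-agent=" + user_agent.decode())
        # Proxy rotation is disabled here; uncomment to route Chrome through it as well
        # options.add_argument("--proxy-server=" + request.meta["proxy"])

        driver = webdriver.Chrome(options=options)
        try:
            driver.set_window_size(1920, 1080)  # make sure the table is visible
            driver.get(request.url)
            time.sleep(3)

            # Dismiss the cookie-usage notification that covers the paging buttons
            try:
                driver.find_element(By.XPATH, '//button[contains(text(), "OK")]').click()
            except Exception:
                pass  # the popup does not always appear

            # Turn pages inside the middleware and accumulate every page's HTML
            # (crude, but it hands all ~70 pages back to the spider at once)
            pages = [driver.page_source]
            for _ in range(70):
                next_button = driver.find_element(By.XPATH, '//button[contains(@class, "pagination__next")]')
                if not next_button.is_enabled():
                    break
                next_button.click()
                time.sleep(1)
                pages.append(driver.page_source)

            return HtmlResponse(
                url=request.url, body="".join(pages), encoding="utf-8", request=request
            )
        finally:
            driver.quit()
```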

At the start of the middleware I extract the User-Agent and proxy from the original request and let the Chrome webdriver take over; sometimes we need to scale the window size to make the target element visible.

I already disabled the proxy server rotating here to avoid connection errors caused by an unhealthy proxy; it probably still needs error handling that retries with another proxy. But you can still enable it for testing if you want.

Some websites will pop up a message box or panel that occludes the target element we want to interact with, so we have to deal with it first. BTW, the popup on the Nasdaq website I scrape is a cookie-usage notification.

The data is separated into about 70 pages on the website; we can collect all of them in one request by turning pages inside the middleware to save time, since implementing page turning in the spider would cause high-frequency communication between the spider and the downloader.

Pipeline

We need to specify a pipeline that defines how the scraper exports the data; I use CsvItemExporter here:
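A minimal sketch of such a pipeline, writing each spider’s items into its own .csv file:

```python
from scrapy.exporters import CsvItemExporter


class CsvExportPipeline:
    """Export every scraped item into a per-spider .csv file."""

    def open_spider(self, spider):
        self.file = open(f"{spider.name}.csv", "wb")
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
```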

Settings

To enable the pipelines and middlewares we mentioned above, we still need to modify some settings in settings.py; I only paste the options that I have modified:
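Roughly, the modified options look like this — the module path nasdaq_scraper and the numeric values are illustrative, not the repository’s exact ones:

```python
# settings.py -- only the options changed from the defaults

LOG_LEVEL = "INFO"

ROBOTSTXT_OBEY = False

DOWNLOAD_DELAY = 3

COOKIES_ENABLED = False

DOWNLOADER_MIDDLEWARES = {
    "nasdaq_scraper.middlewares.RandomUserAgentMiddleware": 543,
    # "nasdaq_scraper.middlewares.RandomProxyMiddleware": 544,  # disabled, see above
    "nasdaq_scraper.middlewares.SeleniumDownloaderMiddleware": 560,
}

ITEM_PIPELINES = {
    "nasdaq_scraper.pipelines.CsvExportPipeline": 300,
}
```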

Scrapy will print a lot of logs by default, so I raise the LOG_LEVEL to INFO to avoid the clutter.

ROBOTSTXT_OBEY controls whether the scraper follows the site’s robots.txt rules to limit its behaviour; we are very happy to violate them.

DOWNLOAD_DELAY decreases the request frequency to avoid getting banned.

COOKIES_ENABLED defines whether cookies will be used; some websites also track cookies to detect scrapers, so disabling them is recommended.

DOWNLOADER_MIDDLEWARES specifies which middlewares we will use and their execution order.

ITEM_PIPELINES specifies which pipelines we will use to export the items held by the scrapers.

If you need different settings for each spider, you can refer to the following simple spider I use to test the proxy and User-Agent rotating:
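A sketch of such a test spider, using custom_settings to enable only the rotation middlewares and httpbin.org to echo back which User-Agent each request carried (the module path is again an assumption):

```python
import json

import scrapy


class RotationTestSpider(scrapy.Spider):
    """Request httpbin.org a few times and log the User-Agent each request carried."""

    name = "rotation_test"

    # Per-spider overrides: only the rotation middlewares are enabled here
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            "nasdaq_scraper.middlewares.RandomUserAgentMiddleware": 543,
            "nasdaq_scraper.middlewares.RandomProxyMiddleware": 544,
        },
        "DOWNLOAD_DELAY": 1,
    }

    def start_requests(self):
        for _ in range(5):
            yield scrapy.Request("https://httpbin.org/headers", dont_filter=True)

    def parse(self, response):
        headers = json.loads(response.text)["headers"]
        self.logger.info("User-Agent used: %s", headers.get("User-Agent"))
```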

Download data directly

The website provides a button to download the data as a .csv file; you can download it via Scrapy’s Files Pipeline, or manually if you feel uncomfortable with the scraping technique XD.
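For reference, a sketch of the Files Pipeline approach — enable the built-in pipeline, point FILES_STORE somewhere, and yield items whose file_urls field holds the .csv download link (the selector for that link is a placeholder here):

```python
import scrapy


class NdxCsvSpider(scrapy.Spider):
    """Grab the .csv download link and let Scrapy's FilesPipeline fetch the file."""

    name = "ndx_csv"
    start_urls = ["https://www.nasdaq.com/market-activity/index/ndx/historical"]

    custom_settings = {
        "ITEM_PIPELINES": {"scrapy.pipelines.files.FilesPipeline": 1},
        "FILES_STORE": "downloads",
    }

    def parse(self, response):
        # Placeholder selector; the real href is read from the "Download Data" button
        csv_url = response.xpath('//a[contains(@class, "download")]/@href').get()
        if csv_url:
            yield {"file_urls": [response.urljoin(csv_url)]}
```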

Repository

All the code above can be found here in my public repository~

Sometimes my scraper still gets banned by the Nasdaq website; I don’t have enough domain knowledge to locate the key problem yet, maybe some high-quality proxies would help.

The data collection part of my side-project has already reached a staged goal; more spiders will be added for different financial data websites in the future, but it’s almost time to move on to another part~

If you find this note helpful, please give me some claps and follow me to cheer me up~! The next note will be about data visualization, user interface, or some related topic~ thanks for reading!
