Parsing and Scraping: collecting information for machine learning purposes.
When studying machine learning, we mainly concentrate on algorithms for processing data rather than on collecting that data. This is natural: there are plenty of datasets available for download, of any type and size, for any ML algorithm. But in real life we are given particular goals, and of course any data science project starts with collecting or obtaining information.
Today our lives are directly connected with the internet and websites: almost any text information we might need is available online. So in this tutorial we’ll consider how to collect particular information from websites. First of all, we’ll look a little inside HTML code to better understand how to extract information from it.
(At the end of the article there is a link to the entire Jupyter notebook.)
HTML “tells” web browsers when, where, and what element to show on the page. We can imagine it as a map that specifies a route for a driver: when to start, where to turn left or right, and where to go. That’s why the HTML structure of web pages is so convenient for grabbing information. Here is a simple piece of HTML code:
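The embedded snippet did not survive conversion; a minimal example of the kind of page described below (with an ‘h1’ and ‘p’ tags; the text is illustrative) looks like this:

```html
<html>
  <body>
    <h1>Page title</h1>
    <p>First paragraph of text.</p>
    <p>Second paragraph with the information we need.</p>
  </body>
</html>
```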
The two tags ‘h1’ and ‘p’ tell the browser what to show on the page and how, so these markers are the keys that help us get exactly the information we need. There is a lot of material about HTML and its main tags (‘h1’, ‘p’, ‘html’, etc. are all tags), and you can study it more deeply; here we will focus on the parsing process. For this purpose we will use the BeautifulSoup Python library.
Before learning how to grab information directly online, let’s load a little HTML page as a text file from our folder (click this link and download the file to the folder with your Jupyter notebook), since sometimes we work with already downloaded files.
Our file consists of several tags, and the information we need is contained in the second ‘p’ tag. To get it, we have to “feed” the BeautifulSoup module the whole HTML code so that it can parse through it and find what we need.
The ‘.find_all’ function collects all ‘p’ blocks from our file for us. We simply choose the necessary p-element from the list by its index and keep only the text inside the tag.
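A minimal sketch of this step, assuming a toy page with three ‘p’ tags (the HTML string below stands in for the downloaded file):

```python
# pip install beautifulsoup4
from bs4 import BeautifulSoup

# A toy stand-in for the downloaded file; the real tutorial reads it from disk.
html = """
<html><body>
<h1>Header</h1>
<p>First paragraph.</p>
<p>The information we need.</p>
<p>Third paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.find_all("p")   # a list of all <p> blocks
second = paragraphs[1].text       # pick the needed element by index, keep only text
print(second)                     # The information we need.
```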
But when there are many uniform tags (like ‘p’) in the code, or when the index of the needed paragraph changes from page to page, this approach breaks down. Today almost any tag carries special attributes like ‘id’, ‘class’, ‘title’, etc. (to learn more about them, read up on HTML attributes and CSS). For us these attributes are additional anchors for pulling exactly the right paragraph (in our case) from the page source. Using the ‘.find’ function we get not a list but a single element (make sure such an element is unique on the page, otherwise you may miss some information).
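A short sketch of anchoring on an attribute; the class name below is hypothetical:

```python
# pip install beautifulsoup4
from bs4 import BeautifulSoup

# Hypothetical page: two paragraphs distinguished by their 'class' attribute.
html = """
<p class="intro">Some introduction.</p>
<p class="main-text">The paragraph we actually want.</p>
"""

soup = BeautifulSoup(html, "html.parser")
# .find returns the FIRST matching element (not a list) --
# make sure the attribute value is unique on the page, or you may miss data.
target = soup.find("p", {"class": "main-text"})
print(target.text)
```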
In the era of dynamic pages, which have different CSS styles for different types of devices, you will very often face the problem of tag attribute names changing, or varying slightly from page to page depending on the other content. When the names of the needed tag blocks are totally different, we have to set up a more complex scraping “architecture”. But usually the differing names share common words. In our toy case, both paragraphs have the word “main” in their ‘class’ attribute.
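One way to match on the shared word is a regular expression over the class attribute (a sketch; the class names are illustrative):

```python
# pip install beautifulsoup4
import re
from bs4 import BeautifulSoup

# Two paragraphs whose class names differ but share the word "main".
html = """
<p class="main-left">Desktop version of the text.</p>
<p class="main-mobile">Mobile version of the text.</p>
<p class="footer">Unrelated block.</p>
"""

soup = BeautifulSoup(html, "html.parser")
# A regex matches every class containing "main", whatever the suffix is.
blocks = soup.find_all("p", {"class": re.compile("main")})
print([b.text for b in blocks])
```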
You can also take a block that contains plenty of nested tags and strip them away, leaving only the text inside.
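In BeautifulSoup this stripping is one call, `.get_text()` (the sentence below is made up for illustration):

```python
# pip install beautifulsoup4
from bs4 import BeautifulSoup

# A block cluttered with nested tags.
html = '<div><p>Bitcoin <b>price</b> rose <a href="#">sharply</a> today.</p></div>'

soup = BeautifulSoup(html, "html.parser")
# .get_text() drops every nested tag and keeps only the text.
clean = soup.get_text()
print(clean)   # Bitcoin price rose sharply today.
```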
Let’s make things more complicated: the first scraping task.
Imagine we have the task of analyzing whether there is a correlation between the titles of major news and the price of Bitcoin. On the one hand we need to collect news about Bitcoin for a defined period, and on the other, its price. To do this we need to connect a few extra libraries, including Selenium. Then download chromedriver and put it into the folder with your Jupyter notebook. Selenium connects a Python script with the Chrome browser and lets us send commands to it and receive the HTML code of loaded pages.
One way to get the necessary news is to use Google Search. First, it grabs news headlines from many sites, so we don’t need to tune our script for every news portal. Second, we can browse news by date. All we have to do is understand how the link to the Google News section works:
“search?q=bitcoin” — the word or phrase we are searching for
“num=100” — number of headlines
“cd_min%3A12%2F11%2F2018” — start date
“cd_max%3A12%2F11%2F2018” — end date
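The fragments above can be assembled into a full search URL; a sketch, assuming Google’s current parameter names (they may change over time, and `tbm=nws` is the news-tab switch not shown in the fragments):

```python
from urllib.parse import quote

def google_news_url(query, date, num=100):
    """Build a Google News search URL for a single day (date as MM/DD/YYYY)."""
    # The date range is URL-encoded: ':' -> %3A, ',' -> %2C, '/' -> %2F
    tbs = quote("cdr:1,cd_min:{0},cd_max:{0}".format(date), safe="")
    return ("https://www.google.com/search?q={}&num={}&tbm=nws&tbs={}"
            .format(quote(query), num, tbs))

url = google_news_url("bitcoin", "01/15/2018")
print(url)
```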
Let’s try to load news with word “bitcoin” for January 15, 2018.
We are lucky): correct page, correct search word and correct date. To move on, we need to examine the HTML code and find the tags (anchors) that let us grab the necessary information. The most convenient way is to use the “Inspect” item in the right-click menu of the Google Chrome web browser (or its equivalent in other browsers). See the screenshot.
As we can see, the ‘h3’ tag is responsible for the blocks with news titles. This tag has the attribute class=”r dO0Ag”, but in this case we can use the ‘h3’ tag alone as an anchor, because it is used only to highlight titles.
There are a lot of additional tags inside the ‘h3’ blocks, which is why we use the loop below to clear them, keeping only the text.
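The loop can be sketched like this, on a made-up fragment resembling a results page (titles and markup are illustrative):

```python
# pip install beautifulsoup4
from bs4 import BeautifulSoup

# Hypothetical fragment of a results page: titles inside <h3> with nested tags.
html = """
<h3 class="r dO0Ag"><a href="#">Bitcoin <em>falls</em> below $14,000</a></h3>
<h3 class="r dO0Ag"><a href="#">Miners <b>move</b> to Iceland</a></h3>
"""

soup = BeautifulSoup(html, "html.parser")
titles = []
for h3 in soup.find_all("h3"):
    # .get_text() strips the nested <a>, <em>, <b>, ... tags in one call
    titles.append(h3.get_text())
print(titles)
```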
That’s all. We get 44 news titles dated January 15, 2018. We can also grab a few opening sentences from each news item and use them in future analysis.
But one day of history is nothing for correlation detection. That’s why we’ll create a list of 10 dates (for educational purposes) and set up a scraping loop to get news for all of them. Tip: if you want to change the language of the news, change it manually in the settings when the first page loads during script execution; in a few minutes you’ll learn how to do this programmatically.
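Generating the date list is simple; the loop body is only outlined, since it needs a running Selenium driver (the helper name `url_for_date` is a placeholder):

```python
from datetime import date, timedelta

# Ten consecutive dates starting from January 15, 2018,
# in the MM/DD/YYYY format the search URL expects.
start = date(2018, 1, 15)
dates = [(start + timedelta(days=i)).strftime("%m/%d/%Y") for i in range(10)]
print(dates[0], dates[-1])   # 01/15/2018 01/24/2018

# The scraping loop itself (sketch; requires a live selenium driver):
# for d in dates:
#     driver.get(url_for_date(d))      # url_for_date is a hypothetical helper
#     ...parse the titles, sleep a few seconds between requests...
```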
But if it were that simple, it wouldn’t be so interesting.
First, such scraping can be detected by a website’s algorithms as robot activity and get you banned. Second, some websites hide the content of their pages and reveal it only when you scroll down. Third, very often we need to type values into input boxes, click links to open the next or previous page, or click a download button. To solve these problems we can use special methods to control the browser.
Let’s open the example page at Yahoo! Finance: https://finance.yahoo.com/quote/SNA/history?p=SNA. If you scroll down the page you’ll see that the content loads in portions until it finally reaches the last row, “Dec 13, 2017”. But if you view the page source right after the page opens (Ctrl+U in Google Chrome), you won’t find “Dec 13, 2017” there. So to get the data for all dates for this symbol, we first have to scroll down to the end and only then parse the page. The code below solves this problem (for different ways of scrolling, look here: https://goo.gl/JdSvR4):
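A sketch of the scrolling routine, written as a function over any Selenium webdriver (not tested against the live Yahoo! page):

```python
import time

def scroll_to_bottom(driver, pause=2.0, max_scrolls=50):
    """Keep scrolling until the page height stops growing, so that
    lazily-loaded rows (like Yahoo's price history) are all rendered.
    `driver` is a selenium webdriver instance."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the new content time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break          # nothing new loaded: we reached the end
        last_height = new_height
```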
Many websites prefer to split one article into two or more parts, so you have to click ‘next’ or ‘previous’ buttons. Our task is to open all those pages and grab them). The same goes for multi-page catalogues. Here is an example: we will open several pages of the stackoverflow.com tag catalogue and collect the top tag words with their occurrence counts across the portal. To do this we will use the find_element_by_css_selector() method to locate a certain element on the page and click it with the click() method. To read more about locating elements, open this: https://goo.gl/PyzbBN
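The paging logic can be sketched as a function; the CSS selector is site-specific, and `find_element_by_css_selector` follows the Selenium 3 API used in this article (Selenium 4 replaced it with `find_element(By.CSS_SELECTOR, ...)`):

```python
def collect_pages(driver, next_button_css, max_pages=5):
    """Grab the page source of consecutive catalogue pages by clicking
    the 'next' button. Stops early if the button disappears."""
    pages = [driver.page_source]
    for _ in range(max_pages - 1):
        try:
            driver.find_element_by_css_selector(next_button_css).click()
        except Exception:   # no such element: the last page was reached
            break
        pages.append(driver.page_source)
    return pages
```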
Here is another example: medium.com hides part of the comments below articles. But if we need to analyze the “reasons” for a page’s popularity, comments can play a great role in that analysis, and it’s better to grab all of them. Open this page and scroll to the bottom: you’ll find a “Show all responses” button implemented as a “div” element. Let’s click it and open all the comments.
Authorization and input boxes
A lot of information is available only after authorization, so let’s learn how to log in to Facebook. The algorithm is the same: find the input boxes for login and password, insert text into them, and then submit. To send text to the inputs we will use the .send_keys() method, and to submit, the .submit() method.
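A sketch of the login routine; the CSS selectors below are hypothetical defaults and must be read off the actual login page with “Inspect” (again using the Selenium 3 locator API from this article):

```python
def log_in(driver, url, login, password,
           login_css="input[name='email']",   # hypothetical selector
           pass_css="input[name='pass']"):    # hypothetical selector
    """Open the login page, fill both input boxes and submit the form."""
    driver.get(url)
    driver.find_element_by_css_selector(login_css).send_keys(login)
    pass_box = driver.find_element_by_css_selector(pass_css)
    pass_box.send_keys(password)
    pass_box.submit()  # submits the form that contains the password box
```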
These methods are also very useful when we need to change dates or insert values into input boxes to get certain information. For example, here is a “one-page tool” for ETF fund flows information: Etf Fund Flows. There are no dedicated pages for each ETF (as Yahoo! has) where you could view or download the desired values. All you can do is enter an ETF symbol, a start and an end date, and click the “Submit” button. But if your boss sets the task of obtaining historical data for 500 ETFs over the last 10 years (120 months), you’ll have to click the “Submit” button 60,000 times. What a dull amusement… So let’s make an algorithm that can collect this information while you are raving somewhere at an Ibiza party.
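The 60,000 figure is just 500 symbols times 120 month-long ranges, which is exactly the task list to automate (the ticker names below are placeholders; the form-filling step is only outlined):

```python
symbols = ["ETF%03d" % i for i in range(500)]  # hypothetical tickers
# 10 years of month-long ranges, 2009-2018
months = [(y, m) for y in range(2009, 2019) for m in range(1, 13)]

tasks = [(s, y, m) for s in symbols for (y, m) in months]
print(len(tasks))   # 60000

# for symbol, year, month in tasks:
#     ...fill the symbol and date boxes with send_keys(),
#     click "Submit", parse the returned value...
```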
There is an enormous number of sites, and each has its own design, access to information, protection against robots, etc. That’s why this tutorial could grow into a small book. But let’s cover at least one more scraping approach: parsing dynamic graphs like the ones www.google.com/trends uses. Interestingly, Google’s programmers don’t let you parse the code of the trend graphs (the div tag containing the graph’s code is hidden), but they do let you download a CSV file with the data (so we can use one of the algorithms above to find this button, click it, and download the file).
Let’s take another site where we can parse similar graphs: Portfolio Visualizer. Scroll down the page and you’ll find the graph from the screenshot. The value of this graph is that historical prices for US Treasury Notes are not freely available: you have to buy them. But here we can grab them either manually (copying dates and values out by hand) or with code that “copies” the values for us, and not only from this page…
It’s important to note that parsing activity can easily be identified as robot activity, and you’ll be asked to pass an anti-robot captcha. You can look for solutions that answer it automatically, but a more natural approach, I think, is to design your algorithm to resemble human behavior on a website. You are lucky when a website has no protection against parsing, but in the case of Google News, after 10 or 20 page loads you’ll meet Google’s captcha. So try to make your algorithm more humanlike: scroll up and down, click links or buttons, stay on a page for at least 10–15 seconds or more, and, especially when you need to download several thousand pages, take breaks for an hour or overnight.
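One simple humanlike touch is randomizing the pause between page loads instead of hitting the site at a fixed rhythm; a minimal sketch:

```python
import random
import time

def human_pause(min_s=10.0, max_s=15.0):
    """Wait a random, human-looking interval between page loads;
    returns the delay actually used (handy for logging)."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

# between driver.get(...) calls:
# human_pause()              # 10-15 s, like a person reading the page
# human_pause(3600, 7200)    # an hour-plus break during long crawls
```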
Here you can download jupyter notebook.
Special thanks to the Open Data Science community for the free machine learning course held in fall 2018, within which this article was created.