Tutorial of "Beautiful Soup" and "requests"
Part 1 Tutorial of "Beautiful Soup" and "requests"
This tutorial is written entirely for beginners. If you are looking for more advanced material, this article will be of limited help.
First, confirm your coding environment; we will use Python to develop the crawler. The version I am using here is Python 3.7. Python 2.x can run most of this code with a few changes.
The module we will use in this tutorial is requests. If you do not have it installed, you need to install it first. On Windows, press "Win + R" to open the Run dialog, then type cmd to open the command line.
After opening the command line, type "pip install requests" to install the module.
Requests is a very handy module. You may see a lot of crawler code that uses urllib instead. The reason this tutorial does not use those modules is that they are more complicated than requests. As beginners, we don't need to aim that high. The way of thinking often matters more than the tool: what we are really learning is the thinking, not the tool itself. The real takeaway is what stays with you after you have forgotten the details.
Okay, enough preamble; let's get to the point!
Let's start simple: crawling Baidu's homepage.
Open python and enter the code below
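A minimal version of that code might look like this (the variable names here are my own choice):

```python
import requests

# Fetch Baidu's homepage; .text holds the HTML of the page.
r = requests.get("https://www.baidu.com")
r.encoding = "utf-8"  # the page declares utf-8; set it to avoid garbled characters

# Print only the first 200 characters so the output stays readable.
print(r.text[:200])
```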
Look, we have already got Baidu's homepage.
Some of you may get a timeout error when scraping this way. If that happens, the website suspects that a robot is accessing it and has put up some blocking. What should we do then? It doesn't matter; let's modify our code slightly and add a headers argument:
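The modified request might look like this; the User-Agent string below is just one common browser string, not the only valid choice:

```python
import requests

# Pretend to be a normal browser by sending a User-Agent header.
headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/70.0.3538.77 Safari/537.36")
}
r = requests.get("https://www.baidu.com", headers=headers)
print(r.status_code)  # 200 means the request succeeded
```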
Passing headers tells the website that we are a normal browser sending the request, so it should give us the real page.
Similarly, many websites can be accessed without logging in; Wikipedia, for example, can be scraped directly. How about that, do you feel a small sense of accomplishment?
You may ask: there is so much useless information in the web page, so how do we extract what we want? There are usually two methods: one is regular expressions, and the other is extracting content based on the structure of the web page. Because regular expressions are more complex than the latter and not very novice-friendly, this time our crawler will extract information directly from the page structure.
Beautiful Soup installation
Beautiful Soup is another Python module; we will use it to parse the structure of the web page and extract the content.
Similarly, Beautiful Soup is a third-party module. Run "pip install beautifulsoup4" in the command line to install it.
Let's first analyze the structure of the webpage to be crawled, so that we can extract its content more effectively.
Here we take an anecdotes (jokes) site as an example.
The first step is still to scrape its page, storing the web page's content in the content variable.
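In code, the fetch step looks like this; the URL below is a placeholder, not the real address of the site being scraped:

```python
import requests

# Placeholder URL; substitute the address of the page you are scraping.
url = "https://example.com/"
headers = {"User-Agent": "Mozilla/5.0"}

r = requests.get(url, headers=headers)
content = r.text  # the whole HTML of the page ends up in `content`
print(content[:200])
```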
Next, we will analyze the structure of the website.
You may have found that it is very difficult to analyze the raw printout directly. So we use a more efficient tool, the browser's developer tools, for analysis. Almost every browser has developer tools; here we take Chrome as an example.
As you can see, when we hover the mouse over an HTML tag, Chrome highlights the part of the page that the tag contains.
The content we want is very simple: the paragraph itself. So we right-click on the paragraph and click Inspect. Chrome automatically finds the tag that holds the content.
You can see that the paragraph content we want is stored in a tag called span.
Let's keep looking: the <span> tag belongs to a tag called <div class="content">. Moving up again, we can see a tag called <div class="article block untagged mb15" id=…>.
Click the small triangle next to it to collapse the tag's contents. We can see that there are many tags in this format, and each one corresponds to a paragraph.
So obviously, as long as we extract these tags, we can get all the paragraphs on the page.
So now we have a clear direction: find all the div tags whose class is article block untagged mb15 typs_hot, and then get the contents of the span tag inside each one. Those are the paragraphs we are looking for.
How do we write this in code?
First, we convert the content we need into Beautiful soup.
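A sketch of that conversion, using a short HTML string to stand in for the real scraped content, and the built-in html.parser in case lxml is not installed:

```python
from bs4 import BeautifulSoup

# Stand-in for the scraped page; the real `content` holds the fetched HTML.
content = '<div class="content"><span>a short joke</span></div>'

# The tutorial passes 'lxml' here; 'html.parser' ships with Python itself.
soup = BeautifulSoup(content, "html.parser")
print(soup.find("span").get_text())
```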
You might be wondering what 'lxml' means. It is the name of the parser Beautiful Soup should use: our content is just a string of data, not a web page file, so we are telling Beautiful Soup to treat it as an HTML document.
So far, we have loaded the content of the webpage into Beautiful Soup; next we use its magic to divide the webpage into pieces.
Do you remember? Our analysis showed that all the paragraphs live in the span tag inside the div tags with class article block untagged mb15. It looks a bit complicated, but no matter, we will do it step by step.
First, we pull out all the div tags whose class is article block untagged mb15 typs_hot:
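As a self-contained sketch (the HTML string here is a simplified stand-in for the real page):

```python
from bs4 import BeautifulSoup

content = """
<div class="article block untagged mb15 typs_hot">
  <div class="content"><span>first joke</span></div>
</div>
<div class="article block untagged mb15 typs_hot">
  <div class="content"><span>second joke</span></div>
</div>
"""
soup = BeautifulSoup(content, "html.parser")

# find_all collects every div with that exact class string into a list.
divs = soup.find_all("div", class_="article block untagged mb15 typs_hot")
print(len(divs))
```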
We can print the divs to see what they look like.
The next thing we have to do is to take out the spans in these divs.
Let's remove the last print line to avoid unnecessary output, then extract the span inside each div:
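A sketch of that loop, again over a simplified stand-in page:

```python
from bs4 import BeautifulSoup

content = """
<div class="article block untagged mb15 typs_hot">
  <div class="content"><span>first joke</span></div>
</div>
<div class="article block untagged mb15 typs_hot">
  <div class="content"><span>second joke</span></div>
</div>
"""
soup = BeautifulSoup(content, "html.parser")
divs = soup.find_all("div", class_="article block untagged mb15 typs_hot")

jokes = []
for div in divs:
    # Each div holds one paragraph inside its span tag.
    joke = div.find("span").get_text(strip=True)
    jokes.append(joke)
    print(joke)
```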
This code takes each div out of divs (remember, divs holds all the div tags whose class is article block untagged mb15 typs_hot). For each one, we ask for the text of the span tag inside it, store that paragraph in the joke variable, and print it out.
Run the program and you will see that you have successfully scraped all the paragraphs on the page!
Part 2 Exploratory Data Analysis of Data Scraped from Wikipedia Pages
Xi Jinping has been front and center of China’s push to cement its position as a superpower, while also launching crackdowns on corruption and dissent.
A consummate political chess player who has cultivated an enigmatic strongman image, the leader of the ruling Chinese Communist Party has rapidly consolidated power, having his ideas mentioned by name in the constitution, an honor that had been reserved only for Mao Zedong until now.
A seven-man leadership committee unveiled in October 2017 included no obvious heir, raising the prospect that Mr. Xi intended to govern beyond the next five years. The Communist Party has now confirmed that aim, with a proposal to remove a clause in the constitution that limits the presidency to two terms.
Research question: the relationship between Xi Jinping's political and life activities and impacts, and the dynamics of Wiki page activity in different languages.
Big events on the top-viewed days
- 2018–03–11 China's Legislature Blesses Xi's Indefinite Rule. (The national legislature lifted the presidential term limit and gave constitutional backing to expanding the reach of the Communist Party.)
- 2018–10–25 President Xi Jinping opened the country’s 19th Party Congress last week with a three-hour, 30,000-word political report before 2,000 party leaders. Held once every five years, the Party Congress is China’s most authoritative institution, and the president’s “Political Report” is always a significant event.
- 2017–04–17 The Reign of Xi Jinping
Assumptions about the reasons for the differences:
- Only the president of China attracts the attention of the American press and public.
- Xi Jinping's political and life activities and impact have a strong relationship with the dynamics of his Wiki page activity. In general, every international appearance of his reported by the American press brings a page-view peak on his Wiki pages, in both English and Chinese.