How to get the next page on Beautiful Soup

DavidMM
Published in Quick Code
6 min read · Aug 28, 2019

It is easy to scrape a simple page, but how do we get the next page on Beautiful Soup? What can we do to crawl all the pages until we reach the end?

Today, we are going to learn how to fetch all the items while web scraping by crawling through the next pages.

Video version of this tutorial

Getting Started

As the topic of this post is how to crawl the next pages, instead of coding a Beautiful Soup script from scratch, we are going to reuse the one we wrote previously.

If you are a beginner, please do the ‘Your first Web Scraping script with Python and Beautiful Soup’ tutorial first.

If you know how to use Beautiful Soup, use this starting code in repl.it.

This code fetches the albums of the band the user asks for. All of them? No, just the first 10 that are displayed on the first page. For now.

Open a new repl.it file or copy-paste the code into your code editor. Now it’s time to code!

Refactoring — Getting rid of the clutter

Before adding features, we need to clean the clutter by refactoring.

We are going to take blocks of code, place them in their own functions, and then call those functions where the code used to be.

Go to the end of the code and take the lines where we create the table.

Cut them and create a function, for example export_table_and_print, and put it after base_url and search_url:
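A minimal sketch of what that function could look like, assuming the first tutorial builds the table with pandas and keeps the user’s search term in a global ‘band’ variable (the column names are placeholders):

```python
import pandas as pd

def export_table_and_print(data):
    # "ThE BeAtLES" -> "the_beatles" for the filename
    clean_band_name = band.strip().replace(' ', '_').lower()

    # Placeholder columns: use the ones from the first tutorial
    table = pd.DataFrame(data, columns=['Title', 'Band', 'Year'])
    table.index = table.index + 1
    table.to_csv(f'{clean_band_name}_albums.csv', sep=',', encoding='utf-8')

    print(table)
```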

We also added a ‘clean_band_name’ so the filename where we store the data has no spaces and is all lowercase; this way, searching for “ThE BeAtLES” stores a ‘the_beatles_albums.csv’ file.

Now, where the old code was, at the end of the file, call the function:
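The last line of the file is now just the call:

```python
export_table_and_print(data)
```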

The first part is done. Run the code and check that it still works. Also, if you are a beginner in Python, a good Python tutorial can help you learn the basics.

Go to the ‘for loop’ at around line 45. Take everything involved in extracting values and adding them to ‘data’ (so, the whole body of the loop) and replace it with a call to ‘get_cd_attributes(cd)’.
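After the change, the loop body is just one call:

```python
for cd in list_all_cd:
    get_cd_attributes(cd)
```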

After the last function, create that function and paste the code:
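A sketch of that function; the selectors and fields are placeholders, as the real ones come from the first tutorial and depend on the site’s markup:

```python
def get_cd_attributes(cd):
    # Hypothetical selectors: replace them with the ones
    # from the first tutorial.
    title = cd.find('h3').text.strip()
    band_name = cd.find('p', class_='band').text.strip()
    year = cd.find('span', class_='year').text.strip()

    data.append({'Title': title, 'Band': band_name, 'Year': year})
```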

Again, run the code and check that it still works. If it is not, compare your code with the sketches above.

Is it working? Cool. Time to get ALL the albums!

Recursive function — The trick to get the next page

Ok, here’s the trick to get the job done: Recursiveness.

We are going to create a ‘parse_page’ function. That function will fetch the 10 albums each page holds.

After the function is done, it is going to call itself again with the next page, to parse it, over and over again until we have everything.

Let me simplify it for you:
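In rough pseudocode, the plan looks like this:

```python
def parse_page(url):
    # 1. Fetch and parse the page at 'url'.
    # 2. Extract the albums on this page into 'data'.
    # 3. If there is a 'Next' link, call parse_page(next_page_url).
    # 4. If not, export the table and stop.
    ...
```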

I hope it is clear: as long as we keep having a ‘next page’ to parse, we call the same function again and again to fetch all the data. When there are no more pages, we stop. As simple as that.

Step 1: Create the function

Grab this code, create another function called ‘parse_page(url)’ and call that function at the last line.

The ‘data’ object is going to be used in different places, so take it out of the function and put it after ‘search_url’.

We took the main code and created a ‘parse_page’ function, called it using ‘search_url’ as the parameter, and took the ‘data’ object out so we can use it globally.

In case you are dizzy, here’s what your code should look like now:
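A sketch of the overall shape, with hypothetical URLs and selectors standing in for the real ones from the first tutorial:

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical URLs: the real ones come from the first tutorial.
base_url = 'https://www.example-music-site.com'
band = input('Enter the band name: ')
search_url = f'{base_url}/search?q={band}'

data = []

def export_table_and_print(data):
    clean_band_name = band.strip().replace(' ', '_').lower()
    table = pd.DataFrame(data, columns=['Title', 'Band', 'Year'])
    table.index = table.index + 1
    table.to_csv(f'{clean_band_name}_albums.csv', sep=',', encoding='utf-8')
    print(table)

def get_cd_attributes(cd):
    # Hypothetical selectors.
    title = cd.find('h3').text.strip()
    band_name = cd.find('p', class_='band').text.strip()
    year = cd.find('span', class_='year').text.strip()
    data.append({'Title': title, 'Band': band_name, 'Year': year})

def parse_page(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    list_all_cd = soup.find_all('li', class_='cd')  # hypothetical selector
    for cd in list_all_cd:
        get_cd_attributes(cd)

    export_table_and_print(data)

parse_page(search_url)
```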

Please check this line (names as in the sketch above):
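```python
page = requests.get(url)  # fetch the url passed in, not search_url
```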

Now we are not fetching the ‘search_url’ (the first one) but the URL that we pass as an argument. This is very important.

Step 2: Add recursion

Run the code again. It should fetch the first 10 albums, as always.

That’s because we haven’t added recursion yet. Let’s write the code that will:

  • Get all the pagination links
  • From all the links, grab the last one
  • Check if the last one has a ‘Next’ text
  • If it has it, get the relative (partial) url
  • Build the next page url by adding base_url and the relative_url
  • Call parse_page again with the next page url
  • If it doesn’t have the ‘Next’ text, just export the table and print it

Once we have fetched all the CD attributes (that is, after the ‘for cd in list_all_cd’ loop), add this line:
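Something like this (‘pages’ is my name for the list; the class comes from the site’s pagination markup):

```python
pages = soup.find('ul', class_='SearchBreadcrumbs').find_all('li')
```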

We are getting all the ‘list item’ (or ‘li’) elements inside the ‘unordered list’ with the ‘SearchBreadcrumbs’ class. That’s the pagination list.

Then, we go to the last one and get its text. Add this after the last code:
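Reusing the ‘pages’ list from the previous step:

```python
next_page_text = pages[-1].text
```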

Now we check if ‘next_page_text’ is ‘Next’. If it is, we take the partial URL and add it to ‘base_url’ to build ‘next_page_url’. If it is not, there are no more pages, so we can create the file and print it.
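Put together, the end of ‘parse_page’ might look like this (the href lookup on the ‘a’ tag is an assumption about the markup):

```python
if next_page_text == 'Next':
    # Build the absolute url from the relative (partial) one
    next_page_partial = pages[-1].find('a')['href']
    next_page_url = base_url + next_page_partial
    parse_page(next_page_url)  # recursion: parse the next page
else:
    # No more pages: create the file and print the table
    export_table_and_print(data)
```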

That’s all we need. Run the code, and now you are fetching dozens, if not hundreds, of items!

Step 3: Fixing a small bug

But we can still improve the code. Add these four lines right after parsing the page with Beautiful Soup:
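A guard along these lines works, assuming ‘list_all_cd’ is the result list we already fetch (the selector is the hypothetical one from the sketches above):

```python
list_all_cd = soup.find_all('li', class_='cd')  # hypothetical selector
if not list_all_cd:
    export_table_and_print(data)
    return
```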

Sometimes there is a ‘Next’ page when the number of albums is a multiple of 10 (10, 20, 30, 40 and so on), but there are no albums on it. That makes the code end without creating the file.

With this code, it is fixed.

Your coding is done! Congratulations!

Conclusion

Let me summarize what we have done:

  • We moved blocks of code with the same functionality to functions
  • We put the scraping code inside a function and we call it passing the initial search_url
  • Inside the function, we scrape the page
  • After it is done, we check for the next URL
  • If there is a ‘next url’, we call the function with the next page URL
  • If not, we end the scraping and create the .csv file

Now it seems simpler, right?

I want to keep doing tutorials like this one, but I want to ask what you want to see:

  • Do you want more Web Scraping with Beautiful Soup or Scrapy?
  • Do you want me to teach you how to make a Flask web app or a Django one?
  • Or do you want to learn more Front-End things like Vue.js?

Please, leave me a comment telling me what you want to see in future posts.

And if this tutorial has been useful to you, share it with your friends on Twitter, Facebook, or wherever you can help others.

DavidMM
Valencian Full Stack | Python | Django | DRF | Javascript | Vue | Flutter | Creator of http://letslearnabout.net/