How to Export Thousands of Pages into Excel

This guide will show you how to efficiently extract thousands of pages for analysis.

One frequent request that comes from clients is, “How do you extract repetitive information on multiple pages into excel?” Even if you are a programmer who is good at web-scraping, you need to spend a long time (hours to days) on analysing the structure of all pages and write code accordingly. Listly can solve this problem in minutes with just a few clicks.

Identifying content to extract

Here, we will use a lawyer association’s directory, which is already open to the public: 
http://www.hklawsoc.org.hk/pub_e/memberlawlist

Each page (Fig 1.) has a person’s contact information. We will extract each URL link one by one using Listly and insert them into excel. Here are the steps to achieve this.

1) Extract the URL from each hyperlink associated with a lawyer name
2)
Add these extracted URLs into Listly’s “bulk URL feature”
3) Generate a spreadsheet that contains information fields on each lawyer

Most data extraction tools on the web have issues capturing the data when the content of each page is different from one another. In this case, we can see that some information on the lawyers is omitted. Building a tool that takes into account these different cases would be extremely tedious. Your main work should be data analysis, not data cleansing.

Fig 1. http://www.hklawsoc.org.hk/pub_e/memberlawlist/member.asp?id=938699
Fig 2. Compared to Fig 1., some fields are omitted.
Fig 3. No information sometimes.

1) Extract the URL from each hyperlink associated with a lawyer name

Scrolling to the bottom of the web page and clicking next multiple times will give you specific URL patterns as shown below:

http://www.hklawsoc.org.hk/pub_e/memberlawlist/mem_withcert.asp?name=&pg=1&sj=0
http://www.hklawsoc.org.hk/pub_e/memberlawlist/mem_withcert.asp?name=&pg=2&sj=0
http://www.hklawsoc.org.hk/pub_e/memberlawlist/mem_withcert.asp?name=&pg=3&sj=0
….etc
until final page http://www.hklawsoc.org.hk/pub_e/memberlawlist/mem_withcert.asp?name=&pg=186&sj=0

Notice how that the URL for each link after …?name=&pg=1&sj=0 changes from 1 to 2, and then 3.

We will add 186 of the URL’s into listly’s “bulk URL”. This will generate a spreadsheet that has all the lawyer names with hyperlinks attached to each name.

Once we have downloaded this excel sheet, we can write an excel macro to extract each URL one by one from each cell: Go to developer tools and insert the macro code:

Function GetURL(rng As Range) As String
On Error Resume Next
GetURL = rng.Hyperlinks(1).Address
End Function

Once saved, select the cell you want to extract each hyperlink from and type in =geturl(A1). This will give you the URL attached to the hyperlink on lawyer’s name. Repeat for all the cells in the column. In this case we have 3,306 URLs.

2) Add these extracted URLs into Listly’s “bulk URL feature”

Now that we have identified all of the URL links (3,306), we can use the “add bulk url” feature on Listly to extract all of the information fields on each lawyer.

Fig 4. Extracting data from bulk URLs at once.

Next copy and paste the list to Listly. (https://www.listly.io/data/add).

Fig 5. All URLs registered with successful results.
Fig 6. One of extraction results

You can get the result such as Fig 5. Go into the one of results and select the tab to export (Fig 6).

Fig 7. Select “URL Group” and write down strings to search.

Now, we are going to select the ‘URL Group’ option to export our links from the excel file in part 1. In this case we insert the, 3,306 URLs.

3) Generate a spreadsheet that contains information fields on each lawyer

TIP: You can change the ‘Export’ option to ‘Tab selected’ or ‘All cards except selected’. Both options enable user to select each card to include or exclude for exporting.

It’s time to download. Click the button, “Excel”.

Fig 8. Arranging type. 1-Card to 1-Row vs. Cards to 1-Row

Select the arranging type of data (Fig 8).

  • 1-Card to 1-Row (recommended): mostly used. The content in a single card will be written in a row.
  • Cards to 1-Row: The content in all cards of a tab will be written in a row. We can also call it 1-Tab to 1-Row.

In this case, you have to choose Cards to 1-Row. If you cannot understand what it is for, feel free to try both options. No credits are charged.

Fig 9. Note: More download wait time needed for larger URL sets

For your reference, it took 38 seconds to get Excel file from 3,306 URLs.

Fig 10. Thousands of contact information into a single spreadsheet.
Fig 11. Sorted by Column G using Excel’s data filter.

Finally, here’s the Excel file!

All of the data is arranged in each column in order of; English Name, Chinese Name, Addresses, etc. (Fig 10). You can sort the data however you want, in this case the data is sorted by column G in descending order using Excel’s sorting feature.

There was barely any coding at all. Only a few clicks to get all the data you need!

Mert Erkul has participated in editing this article. Thank you, Mert!