Crawling Dynamic Websites Using chromedp in Go

Julian Shen
Julian’s Playground
5 min read · Jun 29, 2024

(This article is translated from https://blog.jln.co/Shi-Yong-Chromedp-Pa-Qu-Dong-Tai-Wang-Zhan-Zi-Liao/ with ChatGPT)

JavaScript-driven dynamic websites are increasingly common, and they pose a problem for traditional HTML parsing tools: the page content only exists after the browser has executed the JavaScript, so fetching and parsing the raw HTML returns little or nothing useful. Scraping such sites also requires more front-end knowledge.

chromedp is a Go package that automates a (by default headless) Chrome browser through the Chrome DevTools Protocol. It lets developers control Chrome programmatically, which makes it practical to crawl and parse dynamically generated content. This article shows how to use chromedp to build a simple web crawler.

chromedp works by talking to Chrome over the DevTools Protocol. Through it, a program can simulate user actions such as navigating to pages, clicking buttons, filling out forms, and reading dynamically loaded content. By default these operations run in headless mode, meaning no browser window is shown, which saves resources. This lets chromedp handle cases that plain HTML parsers cannot, in particular pages whose content is generated at runtime.

Simple Example Code

Here is a simple example of using chromedp to crawl website data. This example demonstrates how to navigate to a website, select some elements, submit a form, and extract the required data:

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/chromedp/chromedp"
)

func main() {
	// Create the allocator first: it decides how the browser is launched.
	allocCtx, cancelAlloc := chromedp.NewExecAllocator(context.Background(), chromedp.DefaultExecAllocatorOptions[:]...)
	defer cancelAlloc()

	// Create the browser context on top of the allocator context.
	ctx, cancel := chromedp.NewContext(allocCtx)
	defer cancel()

	// Limit the whole run to 30 seconds.
	ctx, cancelTimeout := context.WithTimeout(ctx, 30*time.Second)
	defer cancelTimeout()

	// Execute tasks. example.com and the selectors below are placeholders;
	// replace them with a real page that has a matching search form.
	var res string
	err := chromedp.Run(ctx,
		chromedp.Navigate("https://example.com"),
		chromedp.WaitVisible(`body`, chromedp.ByQuery),
		chromedp.SendKeys(`input[name="q"]`, "chromedp", chromedp.ByQuery),
		chromedp.Click(`input[type="submit"]`, chromedp.ByQuery),
		chromedp.WaitVisible(`#result-stats`, chromedp.ByQuery),
		chromedp.Text(`#result-stats`, &res, chromedp.ByQuery),
	)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("Search Result Stats:", res)
}

In this example, we create a chromedp task that navigates to example.com, waits for the page to load, types "chromedp" into the search box, clicks the submit button, waits for the result-stats element to become visible, and extracts its text. Note that example.com is only a placeholder and has no search form, so substitute a real URL and matching selectors before running the code. You can extend this example as needed to build more complex crawlers.

Next, let's look at a more practical case. The following example uses chromedp to crawl the schedule of the Chinese Professional Baseball League (CPBL), whose schedule page is built with Vue.js.

Analyze the document

The URL of the CPBL schedule website is https://cpbl.com.tw/schedule. First, we use “View Page Source” to inspect the web page and find that there is no schedule data in it. Additionally, we can find the following piece of code:

var app = new Vue({
    el: "#Center",
    mixins: [mixin],

Another piece of code for obtaining schedule information is as follows:

$.ajax({
    url: '/schedule/getgamedatas',
    type: 'POST',
    data: filterData,
    headers: {
        RequestVerificationToken: 'PzmpuUOvS4z2zH_QhwgFQYTzVC82b0n2QH30wEOJ12kOWA6zeq0Yn7_6d2v_o-ZTWuNPe3HjrqsMqAHp9sL0F5KB4KM1:5jgubJ0tGDTK3cLm2JU7_bCw9JqLOG8j8yeNiWDhR4nnTACLXerDqmzB5chZv-iqY8m1ep6IirI3hAwRCPfNTU6jO_E1'
    },
    success: function(result) {
        if (result.Success) {
            _this.gameDatas = JSON.parse(result.GameDatas);
            _this.getGames();
        }
    },
    error: function(res) {
        console.log(res);
    },
    complete: function () {
        $("body").unblock()
    }
});

It is clear that this page is built with Vue.js. We could try calling its API directly, but the RequestVerificationToken header suggests the endpoint has some protection in place, so driving the page with chromedp is likely simpler.

How do we start parsing? Asking ChatGPT can be a good first step. Open Chrome's developer tools, go to the Elements tab (which shows the final rendered page), save the markup to a file, and ask ChatGPT to analyze it:

[Screenshot: asking ChatGPT to analyze the document (in Chinese)]

You can also ask it to design the data structure:

[Screenshot: generating the data structure with ChatGPT (in Chinese)]

Treat the answers as an initial reference only. You can also ask it to write the code for you. When I tried, it produced an example that did not give the correct result, but it was useful as a starting point for modification.

First, define the requirements. We need a function that takes the year, month, and game type as input and returns the schedule information.
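One way to pin these inputs down before writing any chromedp code is a small query type. This is only a sketch: the type and method names are hypothetical, and the game type is treated as an opaque code because the valid values are not listed here. The zero-based month index reflects how the code later in this article converts between calendar months and the page's month value.

```go
package main

import (
	"errors"
	"fmt"
)

// ScheduleQuery is a hypothetical container for the three inputs.
type ScheduleQuery struct {
	Year     int
	Month    int    // calendar month, 1-12
	GameType string // opaque code understood by the site
}

// Validate rejects queries that cannot be mapped onto the page's controls.
func (q ScheduleQuery) Validate() error {
	if q.Month < 1 || q.Month > 12 {
		return fmt.Errorf("month %d out of range 1-12", q.Month)
	}
	if q.GameType == "" {
		return errors.New("game type must not be empty")
	}
	return nil
}

// MonthIndex returns the zero-based month used by the page.
func (q ScheduleQuery) MonthIndex() int {
	return q.Month - 1
}

func main() {
	q := ScheduleQuery{Year: 2024, Month: 6, GameType: "A"} // "A" is a placeholder code
	if err := q.Validate(); err != nil {
		fmt.Println("invalid query:", err)
		return
	}
	fmt.Println(q.Year, q.MonthIndex(), q.GameType)
}
```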

Based on this information, let’s write the first rough version to experiment with:

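import (
	"context"
	"strconv"
	"strings"
	"time"

	"github.com/chromedp/cdproto/cdp"
	"github.com/chromedp/cdproto/dom"
	"github.com/chromedp/chromedp"
)

type Game struct {
	No       int    `json:"no"`
	Year     int    `json:"year"`
	Month    int    `json:"month"`
	Day      int    `json:"day"`
	Home     string `json:"home"`
	Away     string `json:"away"`
	Ballpark string `json:"ballpark"`
}

// getGameNodes collects every div.game node and asks Chrome to send its
// subtree (up to 6 levels deep) so the children can be read from Go.
func getGameNodes(nodes *[]*cdp.Node) chromedp.Action {
	return chromedp.ActionFunc(func(ctx context.Context) error {
		ctxWithTimeout, cancel := context.WithTimeout(ctx, 900*time.Millisecond)
		defer cancel()

		// Best effort: an error or timeout here is ignored and simply
		// leaves nodes empty or only partially populated.
		_ = chromedp.Nodes("div.game", nodes).Do(ctxWithTimeout)
		for _, n := range *nodes {
			_ = dom.RequestChildNodes(n.NodeID).WithDepth(6).Do(ctxWithTimeout)
		}

		return nil
	})
}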

// selectMonth sets the value of the month dropdown.
func selectMonth(month string) chromedp.QueryAction {
	return chromedp.SetValue("div.item.month select", month, chromedp.ByQueryAll)
}

// selectYear sets the value of the year dropdown.
func selectYear(year string) chromedp.QueryAction {
	return chromedp.SetValue("div.item.year select", year, chromedp.ByQueryAll)
}

// selectGameType sets the value of the game-type dropdown.
func selectGameType(gtype string) chromedp.QueryAction {
	return chromedp.SetValue("div.item.game_type select", gtype, chromedp.ByQueryAll)
}

func fetchGamesByMonth(ctx context.Context, year string, month string) ([]Game, error) {
	// Switch the month, then give the page time to re-render.
	if err := chromedp.Run(ctx, selectMonth(month),
		chromedp.WaitVisible("div.ScheduleGroup"),
		chromedp.Sleep(800*time.Millisecond),
	); err != nil {
		return nil, err
	}

	var nodes []*cdp.Node
	var mn string

	if err := chromedp.Run(ctx,
		chromedp.Text(".date_selected .date", &mn),
		getGameNodes(&nodes),
	); err != nil {
		return nil, err
	}

	// Conversion errors are ignored and leave the field at zero.
	yearInt, _ := strconv.Atoi(year)
	monthInt, _ := strconv.Atoi(month)

	games := make([]Game, len(nodes))
	for i, node := range nodes {
		// The index chains below mirror the page's current DOM layout.
		games[i].No, _ = strconv.Atoi(strings.TrimSpace(node.Children[0].Children[0].Children[0].Children[1].Children[0].NodeValue))
		games[i].Ballpark = node.Children[0].Children[0].Children[0].Children[0].Children[0].NodeValue
		games[i].Year = yearInt
		games[i].Month = monthInt + 1 // the dropdown value is zero-based
		dataDate := node.Parent.Children[0].AttributeValue("data-date")
		games[i].Day, _ = strconv.Atoi(dataDate)
		games[i].Away = node.Children[0].Children[0].Children[1].Children[0].Children[0].AttributeValue("title")
		games[i].Home = node.Children[0].Children[0].Children[1].Children[2].Children[0].AttributeValue("title")
	}

	return games, nil
}

This version is rough but usable, and it relies heavily on NodeValue and AttributeValue to read values. Its main problem is that every chromedp action is a separate request to Chrome over the Chrome DevTools Protocol, which runs on top of a WebSocket connection, so the many round-trips make it slow and less stable. The long Children[...] index chains are also brittle: they break as soon as the page markup changes.
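The long Children[0].Children[0]... chains in the rough version panic with an index-out-of-range error if a node has fewer children than expected. A minimal sketch of a bounds-checked helper follows. To keep it self-contained it uses a stand-in Node type with only the fields needed here; with chromedp the same loop would operate on *cdp.Node. The helper name childAt is hypothetical.

```go
package main

import "fmt"

// Node is a stand-in for *cdp.Node with only the fields this sketch needs.
type Node struct {
	NodeValue string
	Children  []*Node
}

// childAt follows the given child indexes down the tree and returns nil,
// instead of panicking, when any index along the path is out of range.
func childAt(n *Node, path ...int) *Node {
	for _, i := range path {
		if n == nil || i < 0 || i >= len(n.Children) {
			return nil
		}
		n = n.Children[i]
	}
	return n
}

func main() {
	leaf := &Node{NodeValue: "leaf"}
	root := &Node{Children: []*Node{{Children: []*Node{leaf}}}}

	if n := childAt(root, 0, 0); n != nil {
		fmt.Println("found:", n.NodeValue)
	}
	if childAt(root, 0, 3) == nil {
		fmt.Println("path 0,3 does not exist")
	}
}
```

A caller can then skip or log a game whose markup does not match, rather than crashing the whole crawl.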

Here is an optimized version based on methods learned from ChatGPT:

// FetchSchedule fetches the schedule from the CPBL website for the given
// year, month (1-12), and game type.
func FetchSchedule(year int, month int, gameType string) ([]GameSchedule, error) {
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	var schedules []GameSchedule

	// Define the URL
	url := "https://cpbl.com.tw/schedule"

	// Run chromedp tasks
	err := chromedp.Run(ctx,
		chromedp.Navigate(url),
		chromedp.WaitReady(`.ScheduleTableList`), // Wait for the schedule table to be ready
		// Set the filters directly on the Vue instance mounted at #Center.
		chromedp.Evaluate(fmt.Sprintf("document.querySelector('#Center').__vue__.filters.kindCode = '%s'", gameType), nil),
		chromedp.Evaluate(fmt.Sprintf("document.querySelector('#Center').__vue__.calendar.year = %d", year), nil),
		chromedp.WaitReady(`.ScheduleTableList`), // Wait for the schedule table again
		chromedp.Evaluate(fmt.Sprintf("document.querySelector('#Center').__vue__.calendar.month = %d", month-1), nil), // month is zero-based in the page
		chromedp.Evaluate(`document.querySelector('#Center').__vue__.getGameDatas()`, nil), // Trigger a reload of the game data
		chromedp.Sleep(2*time.Second), // Give the request time to finish and the table time to render
		// Collect every game from the rendered table in a single round-trip.
		chromedp.Evaluate(`
			(() => {
				let schedules = [];
				document.querySelectorAll('.ScheduleTable tbody .date').forEach(dateDiv => {
					let date = dateDiv.innerText.trim();
					let parent = dateDiv.parentNode;
					parent.querySelectorAll('.game').forEach(gameDiv => {
						let location = gameDiv.querySelector('.place') ? gameDiv.querySelector('.place').innerText.trim() : '';
						let game_no = gameDiv.querySelector('.game_no') ? gameDiv.querySelector('.game_no').innerText.trim() : '';
						let away_team = gameDiv.querySelector('.team.away span') ? gameDiv.querySelector('.team.away span').title.trim() : '';
						let home_team = gameDiv.querySelector('.team.home span') ? gameDiv.querySelector('.team.home span').title.trim() : '';
						let score = gameDiv.querySelector('.score') ? gameDiv.querySelector('.score').innerText.trim() : '';
						let remark = gameDiv.querySelector('.remark .note div') ? gameDiv.querySelector('.remark .note div').innerText.trim() : '';
						schedules.push({ date, location, game_no, away_team, home_team, score, remark });
					});
				});
				return schedules;
			})()
		`, &schedules),
	)

	if err != nil {
		return nil, err
	}

	return schedules, nil
}

This version makes extensive use of chromedp.Evaluate to run JavaScript directly in the page. The script changes the Vue instance's internal state (through the __vue__ reference that Vue 2 attaches to the root element) and returns the scraped results straight to the Go program. This improves readability and avoids excessive round-trips to Chrome.
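One caveat: the fmt.Sprintf calls above splice gameType into a single-quoted JavaScript string, so a value containing a quote would break the script. A small sketch of a safer way to build those assignments is to encode the value with encoding/json, whose output is also a valid JavaScript literal. The helper name setVueField is hypothetical, and the path argument is assumed to be a trusted constant such as filters.kindCode.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// setVueField returns a JavaScript statement that assigns value to the
// given property path on the Vue instance mounted at #Center.
// value is JSON-encoded, so strings are quoted and escaped correctly.
func setVueField(path string, value interface{}) (string, error) {
	lit, err := json.Marshal(value)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("document.querySelector('#Center').__vue__.%s = %s", path, lit), nil
}

func main() {
	for _, v := range []interface{}{"A", 2024, `it's "quoted"`} {
		js, err := setVueField("filters.kindCode", v)
		if err != nil {
			fmt.Println("encode failed:", err)
			continue
		}
		fmt.Println(js)
	}
}
```

The resulting string can be passed to chromedp.Evaluate exactly like the hand-formatted expressions.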
