Using Puppeteer + Node.js to scrape text from YouTube videos

David Jung
Nov 19, 2018

YouTube videos can be a great resource for learning about a variety of topics. I often find myself wanting to copy the code shown on screen straight into my code editor. While there are browser extensions that can do that, I wanted to build the functionality myself as a web app that runs in any browser, with no installation required.

This guide is intended for beginner developers who want to get acquainted with Puppeteer.

Goal

Create a RESTful API that takes two query parameters (a YouTube video ID and a timestamp) and returns the text visible in the video at that moment.

Note

If you’d prefer an alternative to using Google’s Cloud Vision API to read text, there are open source alternatives such as Tesseract.js.

Also, this guide works best with non-monetized videos. Videos with pre-roll or overlay ads will likely not work without additional tweaking.

Resources

Puppeteer API Documentation:
https://github.com/GoogleChrome/puppeteer/blob/v1.10.0/docs/api.md

Google Cloud Vision API Documentation:
https://cloud.google.com/vision/docs/

Other guides:
https://blog.georgi-yanev.com/projects/youtube-timestamp-screenshot/
https://codeburst.io/a-guide-to-automating-scraping-the-web-with-javascript-chrome-puppeteer-node-js-b18efb9e9921
https://timleland.com/headless-chrome-on-heroku/

Getting started

Note: we will be using async/await functionality with Puppeteer, which will require Node v7.6.0 or greater.

Create a project folder and initialize it:

$ mkdir youtube-text
$ cd youtube-text
$ npm init

Hit enter to use the default name, license, etc. or enter your own.

Then initialize the git repo:

$ git init

Don’t forget to create a .gitignore file:

node_modules
secrets.js

Download and install Express, Axios, and Puppeteer using NPM. Puppeteer will download a recent version of Chromium and will require a lot of space (~170MB Mac, ~282MB Linux, ~280MB Win).

$ npm install --save express puppeteer axios

If you haven’t yet, sign up for a Google Cloud API key. At the time of this article’s writing, you can sign up for a free trial which will provide credit. To use the Google Cloud Vision API you’ll need a billing-enabled account.

In our project root directory, let’s create a file named secrets.js: it will hold our API key when we run our app locally.
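A minimal sketch of what secrets.js might contain, assuming (as the Vision API section later suggests) that the key is exposed through an environment variable named GOOGLE_API_KEY. The value shown is a placeholder, not a real key.

```javascript
// secrets.js: loaded only during local development and kept out of git
// by the .gitignore entry above.
// Assumption: the rest of the app reads the key from process.env.GOOGLE_API_KEY.
process.env.GOOGLE_API_KEY = 'YOUR_GOOGLE_CLOUD_API_KEY'; // placeholder value
```

In production you would set the same variable through your host's configuration instead of shipping this file.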

In our project root directory, let’s create a file named app.js:

Line 1–2: Import express and assign the app variable.

Line 3: Sets the listening port, using either an environment variable or 8080.

Line 5: If the NODE_ENV variable isn’t ‘production’, we import the secrets file, which sets the Google API key variable. We need the key to call the Google Cloud Vision API.

Line 7–9: Our /api GET route. Just sends a 200 status for now.

Lines 11–13: App listener callback.
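For reference, here is a minimal sketch of an app.js that follows the walkthrough above. It is a reconstruction under assumptions rather than the original listing, so the line numbers will not match: the handler is pulled out into a named function, and the createApp helper with its express parameter is hypothetical, added only so the wiring can be exercised with a stub.

```javascript
// app.js (sketch; the layout differs from the listing the line numbers refer to)
const PORT = process.env.PORT || 8080;

// GET /api handler: for now it only reports success.
function apiHandler(req, res) {
  res.sendStatus(200); // Express sends the status code with the body "OK"
}

// Builds the Express app. The `express` parameter is an assumption of this
// sketch: it defaults to the real package and can be replaced by a stub.
function createApp(express = require('express')) {
  const app = express();
  // Outside production, load secrets.js so the Google API key is available.
  if (process.env.NODE_ENV !== 'production') {
    require('./secrets');
  }
  app.get('/api', apiHandler);
  return app;
}
```

To start the server, end the file with createApp().listen(PORT, () => console.log('Listening on port ' + PORT)) and run node app.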

Status check #1

Running our app with node app and navigating to http://localhost:8080/api should return an OK status.

Using Puppeteer to take screenshots

We’ll write the screenshot function now. For now, we’ll just take a screenshot of a webpage, and return it to the client.

Line 1: Import the module.

Line 3: We declare a function that takes a URL string and returns a base-64 encoded image. We make it async so that we can await each Puppeteer call and have the steps run in order.

Line 5: Launches a new browser instance. We could pass in an optional options object (see documentation), but that’s not necessary now.

Line 6: Opens a new tab.

Line 7: Navigates to the input URL.

Line 8: Change our browser viewport size.

Line 9: Generate the screenshot, with an options object. We are going to generate a base-64 encoded string, so we’ll use

{encoding: 'base64'}

Line 10–11: Closes the browser and returns the image. We don’t need to await the browser.close() since we’ve already gotten our image.

(The Puppeteer API documentation goes through each of the various methods and functions used here; it’s very helpful!)
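The steps above can be sketched roughly as follows. This is a reconstruction under assumptions, not the original listing: the viewport size is a guess, and the second parameter (launcher) is hypothetical, defaulting to the real puppeteer package so that a stub can be passed in when testing.

```javascript
// Takes a URL and resolves to a base-64 encoded PNG screenshot of that page.
// Assumption: `launcher` exists only for this sketch; by default it is the
// real puppeteer package, loaded the first time the function is called.
async function generateScreenshot(url, launcher = require('puppeteer')) {
  const browser = await launcher.launch(); // start headless Chromium
  const page = await browser.newPage();    // open a new tab
  await page.goto(url);                    // navigate to the input URL
  await page.setViewport({ width: 1280, height: 720 }); // assumed size
  const image = await page.screenshot({ encoding: 'base64' });
  browser.close(); // not awaited: the image is already in hand
  return image;
}
```

Calling generateScreenshot('http://youtube.com/') inside an async function would resolve to the base-64 string.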

After you’ve added the above code to app.js, we’ll edit our GET route.

Line 2: Generates a base-64 encoded string screenshot of the passed-in URL (we’ve hardcoded it to http://youtube.com/ for now).

Line 3: Takes the base-64 encoded image and turns it into a binary buffer. (If it seems unnecessary to generate an image as a base-64 string only to turn it into a binary buffer, don’t worry; we’re going to come back to this later.)

Line 4–8: Writes headers and sends the image file to the client.
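The decoding and sending steps could look something like the sketch below. The helper name sendImage is hypothetical, and the original route may have set different headers; the PNG content type assumes Puppeteer's default screenshot format.

```javascript
// Decodes a base-64 screenshot and writes it to the response as a PNG.
// `res` is the Express response object; writeHead() and end() come from
// Node's http.ServerResponse, which Express builds on.
function sendImage(res, base64Image) {
  const image = Buffer.from(base64Image, 'base64'); // base-64 string -> binary
  res.writeHead(200, {
    'Content-Type': 'image/png', // Puppeteer screenshots default to PNG
    'Content-Length': image.length,
  });
  res.end(image);
}
```

Inside the GET route, with the callback declared async, this could be used as sendImage(res, await generateScreenshot('http://youtube.com/')).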

Status check #2

At this point, when we navigate to http://localhost:8080/api, our Express app will generate a screenshot of the YouTube front page, then send it to the browser as an image.

Taking a screenshot of a video at a specific time

Let’s take a screenshot of a YouTube video now. How will we pass the video and time to the app? We’ll use query parameters!

For example, if we want to take a screenshot of this video at the specified time:
https://www.youtube.com/watch?v=o3ka5fYysBM&t=1740

If we want to be RESTful, we could navigate to /api?videoId=o3ka5fYysBM&t=1740.

Let’s change our generateScreenshot function:

Let’s go through the changes:

Line 9: We select the video player DOM element from the YouTube web page.

Line 10–13: We hide the YouTube controls from the player element. If you want to see the controls/time elapsed in the screenshot, you can comment this section out.

Line 14: We simulate a user hitting the space bar in order to make Puppeteer play the video.

Line 15: We take a screenshot, but not of the entire page. Instead, we are taking a screenshot only of the video element we selected on line 9.
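A rough sketch of the updated function is shown here. The selectors '#movie_player' and '.ytp-chrome-bottom' are assumptions about YouTube's markup and may change at any time, the viewport size is a guess, and the launcher parameter is hypothetical (it defaults to the real puppeteer package). Line numbers will not match the walkthrough.

```javascript
// Resolves to a base-64 screenshot of just the video player on a watch page.
async function generateScreenshot(url, launcher = require('puppeteer')) {
  const browser = await launcher.launch();
  const page = await browser.newPage();
  await page.goto(url);
  await page.setViewport({ width: 1280, height: 720 }); // assumed size

  // Assumed selector for the player element.
  const video = await page.$('#movie_player');
  if (!video) {
    browser.close();
    throw new Error('Video player element not found');
  }
  await page.evaluate(() => {
    // Runs inside the page: hide the control bar if it is present.
    const controls = document.querySelector('.ytp-chrome-bottom');
    if (controls) controls.style.display = 'none';
  });
  await page.keyboard.press('Space'); // simulate the space bar to play
  const image = await video.screenshot({ encoding: 'base64' }); // player only
  browser.close();
  return image;
}
```

Removing the page.evaluate call keeps the controls and elapsed time visible in the screenshot.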

So now our generateScreenshot function takes a screenshot of a YouTube video, but how do we pass in the correct parameters? We’ll edit the GET route.

Our GET route will now look at the query parameters named videoId and t, create a YouTube URL from them, then pass that URL to our generateScreenshot function.

Lines 2–5: Checks if the videoId and t params were given. If not, it returns a helpful error message and ends the callback.

Line 6: Uses destructuring to assign the videoId and t values from req.query.

Line 7: Construct a YouTube URL. (Note: this only works if the t parameter is a single number, in seconds).
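The parameter check and URL construction can be sketched as one small helper. The name buildYouTubeUrl is hypothetical, and the digits-only check on t is an addition that makes the "single number, in seconds" restriction explicit.

```javascript
// Returns a watch URL for the given query object, or null when videoId is
// missing or t is not a whole number of seconds.
function buildYouTubeUrl(query) {
  const { videoId, t } = query; // destructure the two expected params
  if (!videoId || !/^\d+$/.test(String(t))) {
    return null;
  }
  return `https://www.youtube.com/watch?v=${encodeURIComponent(videoId)}&t=${t}`;
}
```

In the route, a null result could be answered with res.status(400).send('Please provide videoId and t query parameters') before returning, and any other result passed on to generateScreenshot.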

Status check #3

Navigate to http://localhost:8080/api?videoId=o3ka5fYysBM&t=1740. We should get a screenshot of the YouTube video at the correct time.

Getting text with Google Cloud Vision API

Now we need to send our image to the Vision API.

First we’ll define a function that will take the screenshot we generated earlier, generate a request JSON object, then send that to the API with axios.

(Documentation: https://cloud.google.com/vision/docs/detecting-text)

Line 1: We read the GOOGLE_API_KEY variable from the environment variable, which should have been set when we imported secrets.js.

Line 3: We’ll define our getText function as async because of the POST request, and it will take the base64 encoded screenshot as an input.

Line 4–12: We define our request body JSON. We pass our screenshot in on line 6. Since it’s a base-64 encoded string, we can assign it directly to content. If you had an image URL instead, you would put it in the image’s source.imageUri field rather than in content. I set the request type to ‘DOCUMENT_TEXT_DETECTION’ on line 8 in case the image was text-dense.

Line 13–16: Our axios request to the API.

Line 17: We return the results of the API request.
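A minimal sketch of such a function follows, assuming the key lives in process.env.GOOGLE_API_KEY and the request goes to the Vision API's images:annotate endpoint. Unlike the walkthrough, it reads the key at call time, and the http parameter is hypothetical: it defaults to axios so that a stub can replace it in tests. Line numbers will not match.

```javascript
const VISION_URL = 'https://vision.googleapis.com/v1/images:annotate';

// Sends a base-64 encoded image to the Vision API and resolves to the
// parsed JSON response.
// Assumption: `http` exists only for this sketch and defaults to axios.
async function getText(base64Image, http = require('axios')) {
  const apiKey = process.env.GOOGLE_API_KEY; // read at call time
  const body = {
    requests: [
      {
        image: { content: base64Image }, // the screenshot itself
        features: [{ type: 'DOCUMENT_TEXT_DETECTION' }],
      },
    ],
  };
  const response = await http.post(`${VISION_URL}?key=${apiKey}`, body);
  return response.data; // axios exposes the parsed JSON body as .data
}
```

If the key is missing or invalid, the axios promise rejects, so the route that calls getText should be prepared to catch that error.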

Let’s edit our GET route to return just the scraped text to the client.

Line 10: We invoke getText with the screenshot we created earlier, and assign the resulting JSON to text.

Line 11: This reaches a couple levels deep into the returned Google Vision JSON result and assigns the text to our response variable.

Line 12: Send the string back.

(Alternatively, if you’d like to see the entire resulting JSON from Google Vision/getText, replace lines 11 and 12 with res.json(text), which will send the text along with position data.)
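Reaching into the result might look like the sketch below. It assumes the text sits at responses[0].fullTextAnnotation.text, which is where document text detection places the full string; the helper name extractText is hypothetical, and the empty-string fallback for frames with no detected text is an addition.

```javascript
// Pulls the detected text out of a Vision API result object.
// Returns an empty string when no text annotation is present.
function extractText(visionResult) {
  const first = (visionResult.responses || [])[0] || {};
  const annotation = first.fullTextAnnotation;
  return annotation ? annotation.text : '';
}
```

In the route this could be used as res.send(extractText(text)), where text is the value returned by getText.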

Status check #4

Navigating to http://localhost:8080/api?videoId=o3ka5fYysBM&t=1740 will return a string instead of an image.

Deploying to Heroku

If you’d like to deploy this to Heroku, there are some additions you’ll have to make:

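In the generateScreenshot function, pass an options object with an args array to launch(). The buildpack’s documentation lists the flags to include (--no-sandbox and --disable-setuid-sandbox), which let Chromium start inside Heroku’s container.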
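As a sketch, the options object could be defined like this. The two flags are the ones usually recommended for running Chromium on Heroku; treat the exact list as an assumption and check the buildpack's README. The constant name is hypothetical.

```javascript
// Launch options for Heroku: these flags turn off Chromium's sandbox,
// which typically cannot start inside a Heroku dyno.
const HEROKU_LAUNCH_OPTIONS = {
  args: ['--no-sandbox', '--disable-setuid-sandbox'],
};
```

In generateScreenshot, the launch call would then become puppeteer.launch(HEROKU_LAUNCH_OPTIONS).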

In package.json, make sure to specify a node version and a start script:

// ...
"engines": {
"node": "9.8.0"
},
"scripts": {
"start": "node app"
},
// ...

You’ll also need to include a buildpack for Puppeteer in addition to the Node buildpack. One way to add it is through the Heroku CLI:

$ heroku buildpacks:add https://github.com/jontewks/puppeteer-heroku-buildpack
$ heroku buildpacks:add heroku/nodejs

You should now be able to deploy the app to Heroku. However, you may end up getting H12 errors, because Heroku has a 30-second timeout limit for requests. I’ll leave that part up to you to figure out.

Final thoughts

Puppeteer is a pretty interesting tool. With a little bit of tweaking of the code I’ve outlined above, you can use it to generate things like GIFs. I’d love to get deeper into it and see what else is possible. If you have any comments, questions, or corrections, I’d love to hear them!

If you’d like to clone the code above, here’s the GitHub URL:
https://github.com/djung31/youtube-text
