Using Puppeteer + Node.js to scrape text from YouTube videos
YouTube videos can be a great resource for learning about a variety of topics. Sometimes I find myself wanting to copy the code shown on screen straight into my code editor. While there are browser extensions that let you do that, I wanted to build that functionality myself as a web app that runs in any browser, with no installation required.
This guide is intended for beginner developers who want to get acquainted with Puppeteer.
The goal: create a RESTful API that takes two query parameters (a YouTube video ID and a timestamp) and returns the text visible in the video at that moment.
If you’d prefer an alternative to using Google’s Cloud Vision API to read text, there are open source alternatives such as Tesseract.js.
Also, this guide works best with non-monetized videos. Videos with pre-roll or overlay ads will likely not work without additional tweaking.
Puppeteer API Documentation:
Google Cloud Vision API Documentation:
Note: we will be using async/await functionality with Puppeteer, which will require Node v7.6.0 or greater.
Create a project folder and initialize it:
$ mkdir youtube-text
$ cd youtube-text
$ npm init
Hit enter to use the default name, license, etc. or enter your own.
Then initialize the git repo:
$ git init
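Don’t forget to create a .gitignore file. At a minimum, it should exclude node_modules and the secrets.js file we’ll create shortly.
Then install the dependencies:
$ npm install --save express puppeteer axios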
If you haven’t yet, sign up for a Google Cloud API key. At the time of this article’s writing, you can sign up for a free trial which will provide credit. To use the Google Cloud Vision API you’ll need a billing-enabled account.
In our project root directory, let’s create a file named secrets.js: it will hold our API key when we run our app locally.
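A minimal sketch of what secrets.js could contain, assuming the key is exposed through an environment variable named GOOGLE_API_KEY (the name used later in this guide). The value shown is a placeholder, not a real key:

```javascript
// secrets.js (sketch): loaded only when running locally.
// Replace the placeholder with your own key and keep this file out of git.
process.env.GOOGLE_API_KEY = 'YOUR_GOOGLE_CLOUD_API_KEY';
```

Setting the key on process.env means the rest of the app reads it the same way locally and in production, where the hosting platform would provide the variable instead.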
In our project root directory, let’s create a file named app.js:
Lines 1–2: Import express and assign the app variable.
Line 3: Sets the listening port, using either an environment variable or 8080.
Line 5: If the NODE_ENV variable isn’t ‘production’, we import the secrets file, which sets the Google API key variable. This is necessary for using the Google Cloud Vision API.
Lines 7–9: Our /api GET route. It just sends a 200 status for now.
Lines 11–13: App listener callback.
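For reference, here is a rough sketch of an app.js with the pieces described above. It is not the original listing, so the line numbers in the walkthrough may not match it. The names apiStatus and startServer are hypothetical, and Express is required inside startServer so that nothing binds a port until the function is called:

```javascript
// app.js (sketch)
const PORT = process.env.PORT || 8080; // environment variable, else 8080

// GET /api handler: only confirms the server is reachable for now.
function apiStatus(req, res) {
  res.sendStatus(200);
}

// Builds the Express app and starts listening.
function startServer() {
  const express = require('express');
  const app = express();

  // Outside production, load the local secrets file that sets the API key.
  if (process.env.NODE_ENV !== 'production') require('./secrets');

  app.get('/api', apiStatus);
  return app.listen(PORT, () => console.log(`Listening on port ${PORT}`));
}

// In app.js, finish with a call to startServer() so `node app` starts listening.
```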
Status check #1
Running our app with node app and navigating to http://localhost:8080/api should return an OK status.
Using Puppeteer to take screenshots
Next we’ll write the screenshot function. For now, it will just take a screenshot of a webpage and return it to the client.
Line 1: Import the module.
Line 3: We declare a function that will take in a URL string and return a base-64 encoded image. We’ll make it async so that we can await each Puppeteer call and run the steps in order.
Line 5: Launches a new browser instance. We could pass in an optional options object (see documentation), but that’s not necessary now.
Line 6: Opens a new tab.
Line 7: Navigates to the input URL.
Line 8: Changes the browser viewport size.
Line 9: Generates the screenshot, with an options object. We want a base-64 encoded string, so we’ll use the encoding: 'base64' option.
Lines 10–11: Closes the browser and returns the image. We don’t need to await browser.close() since we’ve already gotten our screenshot.
(The Puppeteer API documentation goes through each of the various methods and functions used here; it’s very helpful!)
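The steps above could be sketched roughly as follows. This is a reconstruction rather than the original listing: the 1280x720 viewport is an assumed size, and the optional second parameter is an addition that lets you pass in a stand-in object instead of loading the real puppeteer module.

```javascript
// Takes a URL and resolves to a base-64 encoded PNG of the page.
// The second parameter is an assumption of this sketch; by default the
// real puppeteer module is loaded when the function is called.
async function generateScreenshot(url, puppeteer = require('puppeteer')) {
  const browser = await puppeteer.launch();              // new browser instance
  const page = await browser.newPage();                  // new tab
  await page.goto(url);                                  // navigate to the URL
  await page.setViewport({ width: 1280, height: 720 });  // assumed size
  const screenshot = await page.screenshot({ encoding: 'base64' });
  browser.close(); // not awaited: we already have the image
  return screenshot;
}
```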
After you’ve added the above code to app.js, we’ll edit our GET route.
Line 2: Generates a base-64 encoded string screenshot of the passed-in URL (we’ve hardcoded it to http://youtube.com/ for now).
Line 3: Converts the base-64 encoded image into a binary buffer. (If it seems unnecessary to generate an image as a base-64 string only to turn it into a binary buffer, don’t worry: we’re going to come back to this later.)
Lines 4–8: Writes headers and sends the image file to the client.
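As an illustration of the buffer conversion and headers, here is a small sketch. The helper name sendImage is hypothetical (the original route may do this inline), and it assumes the screenshot is a PNG, which is Puppeteer’s default format:

```javascript
// Sends a base-64 encoded PNG to the client as a binary image.
// sendImage is a hypothetical helper name, not part of Express.
function sendImage(res, base64Png) {
  const img = Buffer.from(base64Png, 'base64'); // base-64 string -> binary buffer
  res.writeHead(200, {
    'Content-Type': 'image/png',
    'Content-Length': img.length,
  });
  res.end(img);
}

// Inside the GET route, assuming generateScreenshot is defined in app.js:
//   app.get('/api', async (req, res) => {
//     const screenshot = await generateScreenshot('http://youtube.com/');
//     sendImage(res, screenshot);
//   });
```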
Status check #2
At this point, when we navigate to http://localhost:8080/api, our Express app will generate a screenshot of the YouTube front page, then send it to the browser as an image.
Taking a screenshot of a video at a specific time
Let’s take a screenshot of a YouTube video now. How will we pass the video and time to the app? We’ll use query parameters!
For example, say we want to take a screenshot of the video with ID o3ka5fYysBM at 1740 seconds.
If we want to be RESTful, we could navigate to http://localhost:8080/api?videoId=o3ka5fYysBM&t=1740.
Let’s change our generateScreenshot function.
Let’s go through the changes:
Line 9: We select the video player DOM element from the YouTube web page.
Lines 10–13: We hide the YouTube controls from the player element. If you want to see the controls/time elapsed in the screenshot, you can comment this section out.
Line 14: We simulate a user hitting the space bar in order to make Puppeteer play the video.
Line 15: We take a screenshot, but not of the entire page. Instead, we take a screenshot only of the video element we selected on line 9.
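A sketch of how generateScreenshot might look with these changes. The two CSS selectors are assumptions about YouTube’s current markup and may need updating, the viewport size is assumed, and the optional second parameter is an addition that allows a stand-in object to be used in place of the real puppeteer module:

```javascript
// Selectors are assumptions about YouTube's page structure.
const PLAYER_SELECTOR = '.html5-video-player';
const CONTROLS_SELECTOR = '.ytp-chrome-bottom';

async function generateScreenshot(url, puppeteer = require('puppeteer')) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  await page.setViewport({ width: 1280, height: 720 }); // assumed size

  // Wait for the player element and keep a handle to it.
  const video = await page.waitForSelector(PLAYER_SELECTOR);

  // Hide the player controls; skip this call to keep them in the image.
  await page.evaluate((selector) => {
    const controls = document.querySelector(selector);
    if (controls) controls.style.display = 'none';
  }, CONTROLS_SELECTOR);

  await video.press('Space'); // simulate the space bar on the player
  const screenshot = await video.screenshot({ encoding: 'base64' }); // element only
  browser.close();
  return screenshot;
}
```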
So now our generateScreenshot function takes a screenshot of a YouTube video, but how do we pass in the correct parameters? We’ll edit the GET route.
Our GET route will now look at query params named videoId and t, create a YouTube URL based on them, then pass that new URL to our generateScreenshot function.
Lines 2–5: Checks whether the videoId and t params were given. If not, it returns a helpful error message and ends the callback.
Line 6: Uses destructuring to assign the videoId and t values from req.query.
Line 7: Constructs a YouTube URL. (Note: this only works if the t parameter is a single number, in seconds.)
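The parameter check and URL construction could be sketched like this. The helper name videoUrlFromQuery is hypothetical, and the watch URL format with a trailing s on the timestamp is one of the forms YouTube accepts, not necessarily the one the original code used:

```javascript
// Returns the YouTube URL for a parsed query object, or null when either
// parameter is missing. Assumes t is a whole number of seconds.
function videoUrlFromQuery(query) {
  const { videoId, t } = query;
  if (!videoId || !t) return null;
  return `https://www.youtube.com/watch?v=${encodeURIComponent(videoId)}&t=${encodeURIComponent(t)}s`;
}

// Inside the GET route (req.query is Express's parsed query string):
//   const url = videoUrlFromQuery(req.query);
//   if (!url) return res.status(400).send('Please provide videoId and t.');
//   const screenshot = await generateScreenshot(url);
```

Returning early with a 400 status keeps a request with missing parameters from launching a browser at all.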
Status check #3
Navigate to http://localhost:8080/api?videoId=o3ka5fYysBM&t=1740. We should get a screenshot of the YouTube video at the correct time.
Getting text with Google Cloud Vision API
Now we need to send our image to Vision API.
First we’ll define a function that will take the screenshot we generated earlier, generate a request JSON object, then send that to the API with axios.
Line 1: We define the GOOGLE_API_KEY variable from the environment variable, which should have been set when we imported secrets.js.
Line 3: We’ll define our getText function as async because of the POST request, and it will take the base-64 encoded screenshot as an input.
Lines 4–12: We define our request body JSON. We pass our screenshot in on line 6. Since it’s a base-64 encoded string, we can assign it to content. If you had an image URL instead, you would set source.imageUri rather than content. I defined the type of request as ‘DOCUMENT_TEXT_DETECTION’ on line 8 in case the image was text-dense.
Lines 13–16: Our axios request to the API.
Line 17: We return the results of the API request.
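A sketch of what getText might look like, assuming the Vision REST endpoint images:annotate with the key passed as a query parameter. Returning response.data (the parsed JSON body) is an assumption about the original, and the optional second parameter is an addition that lets a stand-in replace axios:

```javascript
const GOOGLE_API_KEY = process.env.GOOGLE_API_KEY;

// Sends a base-64 encoded image to the Vision API and resolves to the
// parsed JSON result. The http parameter is an assumption of this sketch;
// by default axios is loaded when the function is called.
async function getText(screenshot, http = require('axios')) {
  const body = {
    requests: [
      {
        image: { content: screenshot }, // base-64 image data
        features: [{ type: 'DOCUMENT_TEXT_DETECTION' }],
      },
    ],
  };
  const response = await http.post(
    `https://vision.googleapis.com/v1/images:annotate?key=${GOOGLE_API_KEY}`,
    body
  );
  return response.data;
}
```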
Let’s edit our GET route to return just the scraped text to the client.
Line 10: We invoke getText with the screenshot we created earlier, and assign the resulting JSON to text.
Line 11: This reaches a couple of levels deep into the returned Google Vision JSON result and pulls out the recognized text as a string.
Line 12: Sends the string back.
(Alternatively, if you’d like to see the entire resulting JSON from Google Vision/getText, change lines 11 and 12 to res.json(text), which will send the text along with position data.)
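To illustrate the "couple levels deep" lookup, here is a sketch that assumes the result has the usual Vision shape, where the full text lives at responses[0].fullTextAnnotation.text. The helper name extractText is hypothetical, and it returns an empty string when no text was detected:

```javascript
// Pulls the recognized text out of a Vision API result object.
// Returns '' when the image contained no detectable text.
function extractText(visionResult) {
  const first = visionResult.responses && visionResult.responses[0];
  const annotation = first && first.fullTextAnnotation;
  return annotation ? annotation.text : '';
}

// Inside the GET route:
//   const text = await getText(screenshot);
//   res.send(extractText(text));
```

Guarding each level avoids a TypeError on frames with no text, where the response entry is an empty object.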
Status check #4
Navigating to http://localhost:8080/api?videoId=o3ka5fYysBM&t=1740 will return a string instead of an image.
Deploying to Heroku
If you’d like to deploy this to Heroku, there are some additions you’ll have to make:
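When launching Puppeteer, pass an options object with an args property to puppeteer.launch(). On Heroku, Chrome generally has to run without its sandbox, which is commonly done with the --no-sandbox and --disable-setuid-sandbox flags.
In package.json, make sure to specify a node version and a start script:
"start": "node app"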
You’ll also need to include a buildpack for puppeteer, in addition to the node buildpack. One way of adding it could be through the Heroku CLI:
$ heroku buildpacks:add https://github.com/jontewks/puppeteer-heroku-buildpack
$ heroku buildpacks:add heroku/nodejs
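The launch options for Heroku might be wired up as in the sketch below. The two flags are an assumption based on common guidance for running Chrome on Heroku, so check the buildpack’s README for the current list. Using NODE_ENV to decide when to apply them is also an assumption, and launchOptions is a hypothetical helper name:

```javascript
// Chrome flags commonly needed on Heroku dynos (assumed; verify against
// the buildpack's README).
const HEROKU_ARGS = ['--no-sandbox', '--disable-setuid-sandbox'];

// Returns the options object for puppeteer.launch(). Keying off NODE_ENV
// is an assumption of this sketch; any deployment flag would do.
function launchOptions(env = process.env) {
  return env.NODE_ENV === 'production' ? { args: HEROKU_ARGS } : {};
}

// Usage: const browser = await puppeteer.launch(launchOptions());
```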
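You should now be able to deploy the app to Heroku. However, you may end up getting H12 errors, because Heroku has a 30-second timeout limit for requests, and launching a browser, loading the video, and calling the Vision API can take longer than that. I’ll leave that part up to you to figure out.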
Puppeteer is a pretty interesting tool. With a little bit of tweaking of the code I’ve outlined above, you can use it to generate things like GIFs. I’d love to get deeper into it and see what else is possible. If you have any comments, questions, or corrections, I’d love to hear them!
If you’d like to clone the code above, here’s the GitHub URL: