Scraping Live Stream Video With Python

Reshawn Ramjattan
Jul 18, 2020 · 5 min read

Scraping live streams can be a good source of data for computer vision projects. There’s a huge variety of public online streams, including traffic, security, nature and entertainment feeds, but not all live streams provide on-demand archives, so scraping the stream itself can be useful.

Screenshot of Decorah Eagles Livestream by Explore.

In this tutorial we use Selenium to initialize and drive a headless live stream, and download the stream’s video chunks by grabbing their source URLs from the network logs.

What to Scrape

Live streams are fed and consumed in chunks of video. To access these chunks manually, you can do the following:

In a Chrome tab with a running live stream, open Dev Tools > Network tab > XHR. The stream of .ts (MPEG transport stream) files being fetched is the raw video data; each file can be downloaded and played on its own, much like an .mpeg file. These files are what we want to download automatically and put together.

Every so often there’ll be an .m3u8 file, a playlist listing the sources of the next set of .ts chunks. This can be used to grab the chunks in small batches instead of one by one, but either way the point is simply to grab the chunks from their source URLs.
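To make the relationship between a playlist and its chunks concrete, here is a minimal sketch that pulls the .ts segment URLs out of the text of an .m3u8 playlist. The function name, the sample playlist and the example.com URLs are all made up for illustration; real playlists may carry more tags than shown here.

```python
from urllib.parse import urljoin, urlparse

def segment_urls(playlist_text, playlist_url):
    """Return absolute URLs of the .ts segments listed in a playlist."""
    urls = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Lines starting with '#' are tags or comments; blank lines carry nothing.
        if not line or line.startswith('#'):
            continue
        # Segment entries may be relative to the playlist's own URL.
        absolute = urljoin(playlist_url, line)
        # Check the path only, so a query string does not hide the extension.
        if urlparse(absolute).path.endswith('.ts'):
            urls.append(absolute)
    return urls

# Hypothetical playlist content, for illustration only.
sample = """#EXTM3U
#EXT-X-TARGETDURATION:2
#EXTINF:2.000,
chunk_001.ts
#EXTINF:2.000,
https://cdn.example.com/live/chunk_002.ts?token=abc
"""

print(segment_urls(sample, "https://example.com/live/index.m3u8"))
```

Resolving each entry against the playlist URL matters because playlists often list segments by relative path.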

Note: For non-live video streams a full .mp4 file of the video might be served and the source URL could be found here, but for some sites and online players, a similar stream of .ts files is used.

Initializing Stream: Selenium

To automate the downloads we need something to initialize the stream and, in some cases, keep it alive. We can use Selenium, an automated front-end testing framework. The install is straightforward and the docs cover it well. The only thing needed besides a pip install is a driver for the browser you want Selenium to use; the docs explain this and link to the downloads. Note that the driver version needs to match the version of the browser currently installed.

from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://twitch.tv')

The above opens the automated browser at the provided link, but we can add arguments through options to change the window size or make it headless.

from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('headless')
driver = webdriver.Chrome(options=options)
driver.get('https://twitch.tv')

We also need to define our logging preferences as ALL so we can get the video chunk URLs from them.

from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument("--window-size=1920,1080")
# Enable Chrome's performance log so network events are recorded.
# Setting the capability on the options object also works on Selenium 4,
# where the desired_capabilities argument is no longer accepted.
options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})
driver = webdriver.Chrome(options=options)
driver.get('https://twitch.tv')

Some streams won’t start immediately: they might show ads first or need to be started by a click. Selenium commands can trigger clicks when we want by targeting elements through their HTML tags, CSS selectors, XPath, text, etc. See the Selenium docs for more.

from selenium.webdriver.common.by import By
driver.find_element(By.ID, "play-button").click()

Regular Python time.sleep() can also be useful for waiting on elements to exist or waiting out ads, but this part really depends on the target stream site itself.
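One way to avoid guessing a fixed sleep duration is to poll for a condition with short sleeps until a timeout. The helper below is a generic sketch using only the standard library; wait_until and the ready() stand-in are hypothetical names, and in practice the condition would be a check against the page, such as whether a play button can be found.

```python
import time

def wait_until(condition, timeout=30.0, interval=0.5):
    """Poll condition() until it returns True or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# Stand-in condition that becomes true on the third check.
attempts = {'n': 0}

def ready():
    attempts['n'] += 1
    return attempts['n'] >= 3

print(wait_until(ready, timeout=5, interval=0.01))
```

Selenium also ships its own explicit-wait utilities, which may be a better fit when the condition is about page elements.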

Grabbing Video Chunks

Once the stream is started we need the .ts file links from the network logs.

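import json
def process_browser_log_entry(entry):
    # Each log entry wraps a JSON string; the event itself sits under 'message'.
    response = json.loads(entry['message'])['message']
    return response
while True:
    browser_log = driver.get_log('performance')
    events = [process_browser_log_entry(entry) for entry in browser_log]
    # Keep only received responses; other Network.response* events have no 'response' entry.
    events = [event for event in events if event['method'] == 'Network.responseReceived']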

In the above we load the logs as JSON objects and keep only the network response events in the list, since that’s all we care about.
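To show the shape of the data being filtered, here is a small self-contained sketch that runs the same idea against hand-built log entries instead of a live browser. The fake_entry helper and the example.com URLs are invented for illustration; the nesting assumed is that each entry holds a JSON string whose 'message' key contains the event.

```python
import json

def ts_urls_from_log(browser_log):
    """Extract .ts response URLs from 'performance' log entries."""
    urls = []
    for entry in browser_log:
        event = json.loads(entry['message'])['message']
        if event.get('method') != 'Network.responseReceived':
            continue
        url = event['params']['response']['url']
        if url.endswith('.ts'):
            urls.append(url)
    return urls

def fake_entry(method, params):
    # Build an entry with the assumed nesting: a JSON string under 'message'.
    return {'message': json.dumps({'message': {'method': method, 'params': params}})}

sample_log = [
    fake_entry('Network.requestWillBeSent',
               {'request': {'url': 'https://example.com/1.ts'}}),
    fake_entry('Network.responseReceived',
               {'response': {'url': 'https://example.com/1.ts'}}),
    fake_entry('Network.responseReceived',
               {'response': {'url': 'https://example.com/index.m3u8'}}),
]

print(ts_urls_from_log(sample_log))
```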

From here, all we need to do is iterate through the events, checking the response URLs for any that end in .ts, fetch the data from that link and then write to a file.

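    # Continues inside the `while True:` loop; needs `import requests` at the top.
    for e in events:
        # .get() guards against events that carry no 'response' entry.
        url = e['params'].get('response', {}).get('url', '')
        if url.endswith('.ts'):
            r1 = requests.get(url, stream=True)
            if r1.status_code == 200:
                # Append in binary mode so successive chunks join into one file.
                with open('testvod.mpeg', 'ab') as f:
                    for chunk in r1.iter_content(chunk_size=1024):
                        f.write(chunk)
            else:
                print("Received unexpected status code {}".format(r1.status_code))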

To combine all the files, all we need to do is append the fetched .ts data to one file, which is what opening the file in 'ab' mode does.

The stream will be continuously downloaded to the resulting .mpeg file for as long as the browser instance is running. If you ran it in headless mode you can close all running instances with

driver.quit()

P.S:

While the above method can work for sites that stream video data in the described way, certain sites will require adaptations. Since the example focused on Twitch, I’ll discuss an adaptation that I missed at the time of writing and that was pointed out by Thesidehaseyes.

There are three key areas where adjustments need to be made for Twitch.

  1. Error 400: Some requests that look like valid .ts (transport stream) files actually return nothing.
  2. Incomplete logs: The Selenium network log doesn’t capture the entire session (at least when I tried it).
  3. Repeat URLs: On Twitch, requests are frequently made for the same chunk, so our code must prepare for this or else our video file won’t work as we expect.

To address the network activity, we can use a proxy server like BrowserMob Proxy (BMP). This will sit between our automated browser and Twitch, monitoring all network activity that passes through our connection and logging it.

With the full network activity logged, we can add two conditions that address the other adaptations. Checking that the response to the HTTP request for each .ts resource actually holds data helps us avoid the Error 400s, while maintaining a list of fetched URLs ensures we make no duplicate fetches.
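The two conditions can be sketched independently of any browser or proxy. In the sketch below, fetch is a hypothetical callable that returns the response body as bytes (empty on failure) and out is any writable binary file object; a dictionary stands in for the network so the logic can be run as is.

```python
import io

def append_new_chunks(urls, seen, fetch, out):
    """Write each not-yet-seen, non-empty chunk to out; return the number written."""
    written = 0
    for url in urls:
        if url in seen:
            continue  # the same chunk is often requested more than once
        seen.add(url)
        data = fetch(url)  # hypothetical fetcher: bytes, or b'' when nothing came back
        if not data:
            continue  # skip empty bodies, such as failed requests
        out.write(data)
        written += 1
    return written

# Stand-in for the network: 'b.ts' mimics a request that returns nothing.
fake_bodies = {'a.ts': b'AAAA', 'b.ts': b'', 'c.ts': b'CC'}
seen, out = set(), io.BytesIO()
count = append_new_chunks(['a.ts', 'a.ts', 'b.ts', 'c.ts'], seen, fake_bodies.get, out)
print(count, out.getvalue())
```

A set is used for the fetched URLs because membership checks stay fast as the session grows. Marking a URL as seen before fetching means a failed chunk is never retried, which is a simplification you may want to revisit.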

The following code should be sufficient to incorporate the above changes into our earlier method.

A couple things to note about the above code.

  1. You can download the browsermob binary here
  2. Since we’re using a proxy server, the selenium connection to Twitch won’t have HTTPS, so in the desired capabilities we tell selenium to accept that and prevent the Chrome warning.

With the stream running and BMP logging activity, we can scrape simultaneously or kill the server and browser and then scrape the logs.

The following code parses the HAR log file from BMP.

In the above code we parse the logs for requests ending in .ts that we haven’t fetched yet and that had valid data responses. The response body is also in the log, but its encoding was off, and a quick fix was to just send another request. Note that this creates the possibility of resources timing out before we request them again.
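As a rough sketch of that filtering step, the function below walks a HAR log that has already been loaded into a Python dictionary. It assumes the standard HAR layout (log, entries, request, response) and that responses carrying data record a positive content size; the function name and the sample data are made up for illustration and are not the original code.

```python
def ts_urls_from_har(har, seen=None):
    """Return unseen .ts URLs whose logged responses were 200 and non-empty."""
    seen = set() if seen is None else seen
    urls = []
    for entry in har['log']['entries']:
        url = entry['request']['url']
        response = entry['response']
        has_data = (response['status'] == 200
                    and response['content'].get('size', 0) > 0)
        if url.endswith('.ts') and has_data and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls

def fake_har_entry(url, status, size):
    # Minimal stand-in for one HAR entry; real entries hold many more fields.
    return {'request': {'url': url},
            'response': {'status': status, 'content': {'size': size}}}

sample_har = {'log': {'entries': [
    fake_har_entry('https://example.com/1.ts', 200, 188000),
    fake_har_entry('https://example.com/1.ts', 200, 188000),  # repeat request
    fake_har_entry('https://example.com/2.ts', 400, 0),       # empty error response
    fake_har_entry('https://example.com/index.m3u8', 200, 512),
]}}

print(ts_urls_from_har(sample_har))
```

Passing the same seen set on each call lets the function be run repeatedly against a growing log without returning a URL twice.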

You can quit the BMP server and selenium with

server.stop()
driver.quit()
