A self-driving car project. Pt. 4

Stefano C
8 min read · Feb 29, 2024


Streaming video for the Jetson Nano.

Testing YOLO v8.

Summary

Two important tools widely used in computer vision are OpenCV and GStreamer. In this story I am going to integrate them in two Python scripts for a very common use case: streaming video from the Jetson Nano to a remote PC over a Wi-Fi network. Let's consider two very common scenarios:

  • Control the car with the joystick in order to acquire data (lidar, video), driving with a first-person perspective. Example: I want to create a navigation map with the lidar and be able to move around without having to physically chase the car.
  • Acquire video remotely and then apply any kind of processing, in order to lighten the computational load on the Jetson Nano. Example: I want to send the captured video to a remote PC with a powerful GPU on board, where a neural network can classify objects and then send back the classification results.

If you missed my previous stories..

Before starting

Some considerations before starting:

  • Examples are implemented in Python, both on the Jetson and on the remote PC, so before moving further you will need to set up a proper Python environment.
  • On the Jetson you must have the OpenCV Python package installed with GStreamer (and optionally CUDA) support enabled.
  • In order to do this, you must build the OpenCV package from source (a quick way to verify the build is shown right after this list). I leave some of the many references you can find on the web:
  • How to build OpenCV with gstreamer and cuda enabled on Jetson Nano
  • How to build OpenCV with gstreamer and cuda enabled on Jetson Nano
  • Build OpenCV on Jetson Nano
  • I am using a remote PC with Windows. I built OpenCV from source with GStreamer support and Python bindings. I did not enable CUDA support because of the long build time (expect at least an hour on a modern PC) and because I am not interested in using the OpenCV library intensively; truth be told, the build process was driving me crazy :). Furthermore, in theory, I should get hardware acceleration (H.264 decoding to raw format) anyway by using the GStreamer backend with the NVIDIA plugins here on Windows, but I was not able to run any successful test so far.
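By the way, a quick way to check whether your OpenCV build really has GStreamer (and CUDA) support is to look at the build summary returned by cv2.getBuildInformation(). This is just a small sketch; the exact wording of the summary lines can differ between OpenCV versions:

# Sanity check: print the GStreamer/CUDA lines of the OpenCV build summary.
# This only inspects the build information string; it does not run a pipeline.
import cv2

def check_opencv_build():
    info = cv2.getBuildInformation()
    for line in info.splitlines():
        stripped = line.strip()
        # Typical lines look like "GStreamer: YES (1.20.3)" or "CUDA: NO"
        if stripped.startswith("GStreamer") or stripped.startswith("CUDA"):
            print(stripped)

if __name__ == "__main__":
    check_opencv_build()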

About GStreamer and OpenCV

GStreamer is a multimedia framework which allows you to create a chain of processing blocks connected together (a pipeline). Every block has a variable number of input and output pads (ports through which video or audio data can flow). OpenCV allows you to create objects to capture/write video (from/to different sources/sinks). OpenCV can use the GStreamer backend if you pass a string describing the pipeline together with the cv2.CAP_GSTREAMER flag as parameters.
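Before touching the network at all, you can verify that the GStreamer backend is actually wired up by opening a synthetic test source (videotestsrc is a standard GStreamer element that generates a test pattern). This is only a minimal sketch; on Windows remember to load the GStreamer DLLs first, as shown in the next section:

import cv2

# Minimal GStreamer pipeline: a synthetic test pattern delivered to OpenCV via appsink.
# If cap.isOpened() is False, the OpenCV build has no working GStreamer backend.
pipeline = "videotestsrc num-buffers=60 ! videoconvert ! appsink"
cap = cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)
print("GStreamer backend working:", cap.isOpened())

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break  # num-buffers=60 ends the stream after 60 frames
cap.release()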

On Windows PC

This first script shows how to capture a video stream coming from a UDP source and display it in a window.

# Load the GStreamer runtime DLLs before importing cv2 (Windows only),
# otherwise the GStreamer backend will not be available.
import python_cv_stk as stk

stk.load_gstreamer()

import cv2


def launch():
    # Receiving pipeline: UDP/RTP packets -> H.264 depayload -> decode -> raw frames
    pipeline = "udpsrc port=1234 ! " \
               "application/x-rtp, payload=96, encoding-name=(string)H264 ! " \
               "rtph264depay ! " \
               "h264parse ! " \
               "decodebin ! " \
               "videoconvert ! " \
               "appsink sync=false"

    try:
        cap = cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)
    except Exception as exc:
        print(str(exc))
        exit(1)

    while True:
        ret, frame = cap.read()
        if ret:
            cv2.imshow("Jetson Camera", frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    cap.release()
    cv2.destroyAllWindows()


if __name__ == '__main__':
    launch()
For reference, this is the load_gstreamer() function (part of my python_cv_stk framework):

import os
import ctypes


def load_gstreamer(base_path="C:/gstreamer/1.0/msvc_x86_64/"):
    # Make the GStreamer runtime visible to the Python process (Windows only).
    bin_path = os.path.join(base_path, "bin")
    gstreamer_path = os.path.join(base_path, "lib/gstreamer-1.0")
    modules_path = os.path.join(base_path, "lib/gio/modules")

    os.add_dll_directory(bin_path)
    os.add_dll_directory(gstreamer_path)
    os.add_dll_directory(modules_path)

    # Minimum set of plugin DLLs needed by the receiving pipeline.
    libs = [
        "gstudp.dll",
        "gstrtp.dll",
        "gstvideoparsersbad.dll",
        "gstd3d11.dll",
        "gstlibav.dll",
        "gstnvcodec.dll",
        "gstqsv.dll"
    ]

    for lib in libs:
        ctypes.cdll.LoadLibrary(lib)
  • In the first lines I call the function load_gstreamer(), which is part of my framework python_cv_stk. The function loads the minimum set of DLLs (remember, I'm on Windows!) needed to make GStreamer work with OpenCV. It is important to underline that this function must be called before importing the cv2 package, otherwise an error will be thrown.
pipeline = "udpsrc port=1234 ! " \
           "application/x-rtp, payload=96, encoding-name=(string)H264 ! " \
           "rtph264depay ! " \
           "h264parse ! " \
           "decodebin ! " \
           "videoconvert ! " \
           "appsink sync=false"

This string represents a GStreamer pipeline configuration. It lists the chain of processing blocks, where each block is connected to the next one with a ! character. This pipeline is composed of the following elements:

  • udpsrc: a network source that reads UDP packets from the network; port specifies the port number to receive packets on.
  • application/x-rtp: a caps (capabilities) filter that describes the data flowing between two pads as RTP.
  • rtph264depay: extracts the H.264 video from the RTP packets.
  • h264parse: parses the H.264 data and makes it compatible with downstream GStreamer elements, such as decodebin.
  • decodebin: decodes compressed multimedia data (such as H.264) into raw data.
  • videoconvert: converts the video from any color format to a common format.
  • appsink: hands the data to an external application, in our case OpenCV (a lower-latency variant is shown right after this list).
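A small variant I have not benchmarked here, but which is worth keeping in mind: appsink exposes the standard drop and max-buffers properties, so you can ask it to keep only the most recent frame when OpenCV reads more slowly than the packets arrive:

# Same receiving pipeline, but appsink drops old frames instead of queueing them.
# drop=true and max-buffers=1 are standard appsink properties; sync=false as before.
pipeline_low_latency = (
    "udpsrc port=1234 ! "
    "application/x-rtp, payload=96, encoding-name=(string)H264 ! "
    "rtph264depay ! h264parse ! decodebin ! videoconvert ! "
    "appsink sync=false drop=true max-buffers=1"
)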

The string is passed to the cv2.VideoCapture object, specifying the GStreamer backend with cv2.CAP_GSTREAMER.

cap = cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)

The reading call ret, frame = cap.read() is placed inside a while loop to process the video stream continuously. If the video ends or is interrupted by a key press, all resources are released and the program ends. This concludes the capture part on the remote machine; now let's move to the Jetson board.
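One thing to be aware of: cv2.VideoCapture usually does not raise an exception when the pipeline fails, it simply returns an object that is not opened, so the try/except above mostly catches construction errors. Here is a slightly more defensive sketch of the same loop, with a simple received-frame-rate printout I find useful for quick diagnostics (the receive function is only illustrative, not part of the script above):

import time
import cv2

def receive(pipeline: str):
    cap = cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)
    if not cap.isOpened():
        raise RuntimeError("could not open GStreamer pipeline")

    frames, t0 = 0, time.time()
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        frames += 1
        elapsed = time.time() - t0
        if elapsed >= 5.0:  # print the received frame rate every 5 seconds
            print(f"receiving ~{frames / elapsed:.1f} fps")
            frames, t0 = 0, time.time()
        cv2.imshow("Jetson Camera", frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    cap.release()
    cv2.destroyAllWindows()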

On Jetson

#!/home/polpybot/miniforge3/envs/PyTorchProjects/bin/python
import cv2
from cv_bridge import CvBridgeError


class CameraNode():

    def __init__(self):

        frame_width = 640
        frame_height = 480
        fps = 30

        # Capture pipeline: CSI camera -> NVMM buffer -> BGR frames for OpenCV
        camSettings = 'nvarguscamerasrc ! \
            video/x-raw(memory:NVMM), format=NV12, framerate=30/1 ! \
            nvvidconv flip-method=0 ! \
            video/x-raw, width=' + str(frame_width) + ', height=' + str(frame_height) + ', format=BGRx ! \
            videoconvert ! \
            video/x-raw, format=BGR ! appsink'

        # Software (CPU) encoding pipeline, kept as a reference
        output_streaming_settings_slow = 'appsrc ! \
            videoconvert ! \
            x264enc tune=zerolatency bitrate=500 speed-preset=superfast ! \
            rtph264pay ! \
            udpsink host=192.168.1.241 port=1234'

        # Hardware (GPU) encoding pipeline: BGR -> NVMM/NV12 -> nvv4l2h264enc -> RTP/UDP
        output_streaming_settings = 'appsrc ! video/x-raw, format=BGR ! \
            videoconvert ! \
            video/x-raw, format=BGRx ! \
            nvvidconv ! \
            video/x-raw(memory:NVMM), format=NV12 ! \
            nvv4l2h264enc ! \
            h264parse ! \
            rtph264pay pt=96 ! \
            udpsink host=192.168.1.241 port=1234 sync=false'

        img_with_color = True
        try:
            self.capture = cv2.VideoCapture(camSettings, cv2.CAP_GSTREAMER)

            self.recorder = cv2.VideoWriter(output_streaming_settings,
                                            cv2.CAP_GSTREAMER,
                                            fps,
                                            (frame_width, frame_height),
                                            img_with_color)
        except Exception as exc:
            print(exc)

        if not self.capture.isOpened():
            raise Exception("cannot open camera")

        print("open camera ok!")

    def release_resources(self):
        cv2.destroyAllWindows()
        self.capture.release()
        self.recorder.release()

    def run_main(self):
        while True:
            success, image = self.capture.read()
            if success:
                try:
                    self.recorder.write(image)
                except CvBridgeError as error:
                    print(error)

            if cv2.waitKey(1) == ord('e'):
                break


if __name__ == '__main__':

    cameraNode = CameraNode()
    try:
        cameraNode.run_main()
        cameraNode.release_resources()

    except Exception as exc:
        cameraNode.release_resources()
        print(str(exc))
        raise SystemExit(1)

Here we basically have a class CameraNode with two objects:

  • an instance self.capture of the cv2.VideoCapture class
  • an instance self.recorder of the cv2.VideoWriter class
self.capture = cv2.VideoCapture(camSettings, cv2.CAP_GSTREAMER)

self.recorder = cv2.VideoWriter(output_streaming_settings,
                                cv2.CAP_GSTREAMER,
                                fps,
                                (frame_width, frame_height),
                                img_with_color)
  • The first object captures the video from the Jetson camera; the second sends the acquired frames to a remote host over a UDP connection. You will notice that I am passing other useful information to the recorder object: the frame rate (fps), the video size (frame_width, frame_height) and a boolean which tells whether the frames are in color.
  • This is done inside the run_main() function:
while True:
    success, image = self.capture.read()
    if success:
        try:
            self.recorder.write(image)
        except CvBridgeError as error:
            print(error)

    if cv2.waitKey(1) == ord('e'):
        break
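One practical detail: with the GStreamer backend, the frames pushed into cv2.VideoWriter should match the size (and color flag) declared at construction time, otherwise, in my experience, the writes fail silently. A small safeguard, written as an illustrative helper (write_frame is not part of the original script):

import cv2

def write_frame(recorder, image, frame_width=640, frame_height=480):
    """Resize the frame if needed so it matches the size declared on the VideoWriter."""
    h, w = image.shape[:2]
    if (w, h) != (frame_width, frame_height):
        image = cv2.resize(image, (frame_width, frame_height))
    recorder.write(image)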

Let's look in detail at the pipeline settings of the capture object (a tidier way to build this string is sketched right after the list):

camSettings = 'nvarguscamerasrc ! \
    video/x-raw(memory:NVMM), format=NV12, framerate=30/1 ! \
    nvvidconv flip-method=0 ! \
    video/x-raw, width=' + str(frame_width) + ', height=' + str(frame_height) + ', format=BGRx ! \
    videoconvert ! \
    video/x-raw, format=BGR ! appsink'
  • nvarguscamerasrc: a proprietary NVIDIA element which uses the libargus library to acquire an image from the sensor and process it into a final output image.
  • video/x-raw(memory:NVMM), format=NV12, framerate=30/1: specifies raw video in NV12 format held in an NVMM memory buffer (the NVIDIA GPU buffer), at 30 frames per second.
  • nvvidconv flip-method=0: a proprietary NVIDIA element that converts between raw video formats and NVIDIA formats and vice versa. It can also rotate or flip the video; flip-method=0 means no rotation.
  • video/x-raw, width=..., height=..., format=BGRx: moves the video to CPU memory in the new BGRx format, with the width and height chosen by the user.
  • videoconvert: converts the previous video format to a new video format.
  • video/x-raw, format=BGR: the BGR format expected by OpenCV.
  • appsink: as before, it tells who is going to consume the frames: the OpenCV application.
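As an aside, concatenating the pipeline with + is error-prone (a missing space or comma breaks the whole pipeline), so I find it cleaner to generate the same string with an f-string. build_capture_pipeline below is only an illustrative helper, not part of the script above; the elements and parameters are exactly the ones just described:

def build_capture_pipeline(width=640, height=480, fps=30, flip=0):
    """Build the nvarguscamerasrc capture pipeline passed to cv2.VideoCapture."""
    return (
        "nvarguscamerasrc ! "
        f"video/x-raw(memory:NVMM), format=NV12, framerate={fps}/1 ! "
        f"nvvidconv flip-method={flip} ! "
        f"video/x-raw, width={width}, height={height}, format=BGRx ! "
        "videoconvert ! "
        "video/x-raw, format=BGR ! appsink"
    )

# Usage: cv2.VideoCapture(build_capture_pipeline(), cv2.CAP_GSTREAMER)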

Now, let’s see the output pipeline:

output_streaming_settings = 'appsrc ! video/x-raw, format=BGR ! \
    videoconvert ! \
    video/x-raw, format=BGRx ! \
    nvvidconv ! \
    video/x-raw(memory:NVMM), format=NV12 ! \
    nvv4l2h264enc ! \
    h264parse ! \
    rtph264pay pt=96 ! \
    udpsink host=192.168.1.241 port=1234 sync=false'
  • video/x-raw, format=BGR
  • videoconvert
  • video/x-raw, format=BGRx
  • nvvidconv
  • video/x-raw(memory:NVMM), format=NV12

These first elements in the chain do the opposite of what we did before: the raw video is converted from BGR to BGRx by the videoconvert element, then it is moved to the NVIDIA memory buffer (NVMM) in NV12 format, because we want to be able to use the NVIDIA proprietary encoder. This point is critical: using nvv4l2h264enc instead of x264enc can make the difference in terms of video latency, as we will see later.

  • nvv4l2h264enc: the NVIDIA proprietary H.264 encoder, which takes the raw video as input and produces a compressed H.264 output.
  • h264parse: a parser that splits the H.264 bitstream into NAL units.
  • rtph264pay pt=96: a payloader that packs H.264 NAL units into RTP packets (RTP is a protocol for streaming media over networks). The pt=96 parameter sets the payload type to 96, a dynamic value that can be negotiated with the receiver.
  • udpsink host=192.168.1.241 port=1234 sync=false: a sink element that sends UDP packets to the host specified by an IP address and a port. The sync=false parameter disables synchronization to the clock, which means the element sends packets as fast as possible.

As you have probably noticed, I left the "slow" streaming pipeline in the code as a reference; it does not use the NVIDIA hardware-optimized plugins, and encoding is done on the CPU by the x264enc element. This explains why the CPU version is roughly 2.5x slower than the GPU version (a helper that builds either pipeline follows the snippet below).

output_streaming_settings_slow = 'appsrc ! \
    videoconvert ! \
    x264enc tune=zerolatency bitrate=500 speed-preset=superfast ! \
    rtph264pay ! \
    udpsink host=192.168.1.241 port=1234'
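To avoid hard-coding the receiver address in two different strings, both pipelines can be generated from a single helper. build_output_pipeline is only illustrative; the element parameters are exactly the ones used above:

def build_output_pipeline(host, port, use_hw_encoder=True):
    """Build the streaming pipeline for cv2.VideoWriter (hardware or software H.264)."""
    if use_hw_encoder:
        return (
            "appsrc ! video/x-raw, format=BGR ! videoconvert ! "
            "video/x-raw, format=BGRx ! nvvidconv ! "
            "video/x-raw(memory:NVMM), format=NV12 ! "
            "nvv4l2h264enc ! h264parse ! rtph264pay pt=96 ! "
            f"udpsink host={host} port={port} sync=false"
        )
    return (
        "appsrc ! videoconvert ! "
        "x264enc tune=zerolatency bitrate=500 speed-preset=superfast ! "
        "rtph264pay ! "
        f"udpsink host={host} port={port}"
    )

# Usage: cv2.VideoWriter(build_output_pipeline("192.168.1.241", 1234),
#                        cv2.CAP_GSTREAMER, fps, (frame_width, frame_height), True)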

Some latency tests

I report here some latency tests I did with and without hardware acceleration enabled.

Setup

The tests are done by capturing a 50 fps YouTube video with the Jetson camera. Streaming is done using the two pipelines shown above. I recorded both the source and the streamed video with my mobile phone. The source video has a running timecode with a resolution of 1/50 s, which means that a measured difference of n frames between source and destination corresponds to x = n / 50 seconds of delay.

Setup for measuring video latency.

Without hw encoding:

Streaming video without hw encoding

I measure a difference of circa 12 frames between source and destination, which roughly corresponds to 240 ms of delay.

With hw encoding

hw encoding video streaming

As expected, things are definitely better when the video is encoded with hardware acceleration. I measure a difference of 5 frames, which roughly corresponds to 100 ms of delay. (Consider the measurement error due to the timecode resolution of circa 20 ms, and the actual camera frame rate of 30 fps, which implies an uncertainty of approximately 30 ms.)
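For reference, the conversion from counted frames to delay is simply n / 50; as a trivial helper (timecode_fps=50 matches the timecode of the source video used here):

def frames_to_delay_ms(frame_difference, timecode_fps=50):
    """Convert a frame-count difference read from the timecode into milliseconds."""
    return 1000.0 * frame_difference / timecode_fps

print(frames_to_delay_ms(12))  # ~240 ms, the software-encoding case
print(frames_to_delay_ms(5))   # ~100 ms, the hardware-encoding case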

Conclusions

With more or less 100 ms of delay, the remote driving experience with the joystick is quite satisfactory and pretty fun. I will keep making adjustments to get better performance, and I will update this story according to my findings. Next time we will see how to integrate a very common neural network for object detection, so stay tuned!

And if you like this content, please consider following me on Medium :)

Happy reading, Stefano.


Stefano C

Master's degree in Physics, working in the audio industry. Passionate about C++, Python, audio, robotics, electronics and programming. Modena, Italy.