MIT 6.S094: Deep Learning for Self-Driving Cars 2018 Lecture 5 Notes: Deep Learning for Human Sensing

  • Data:
    Enormous amounts of real data are required. Data collection is the hardest and most important part.
  • Semi-Supervised:
    The raw data needs to be reduced to meaningful, representative cases, and it needs to be annotated.
    We need to collect data and use semi-supervised techniques to find the pieces of data that can be used to train our networks.
  • Efficient Annotation:
    Good annotation allows good performance.
    Annotation techniques differ completely between scenarios, e.g. annotation tools for glance classification vs. annotation for body pose estimation vs. pixel-level image labelling for SegFuse.
  • Hardware:
    Large amounts of data need large-scale distributed compute and storage.
  • Algorithms:
    We want algorithms that can self-calibrate, which allows them to generalise.
  • Temporal Dynamics:
    Current algorithms are mostly image-based rather than temporal/sequence-based.

Human Imperfections

  • Distracted Driving:
    3,179 people were killed and 431k+ injured in crashes involving distracted driving during 2014.
  • Eyes off the road:
    5 seconds is the average time your eyes are off the road while texting.
  • Drunk Driving:
    Accounted for 31% of traffic fatalities in 2014.
  • Drugged Driving:
    23% of night drivers are drugged drivers (2014).
  • Drowsy Driving:
    3% of all traffic fatalities involved a drowsy driver.
  • Humans might tend to ‘over-trust’ the System.
  • Two + One Cameras.
  • Camera 1: Capturing HD Video of the Face for Glance Recognition and Estimating Cognitive Load.
  • Camera 2: (FishEye) Estimating Body Pose (Hands on Wheel), Activity Recognition.
  • Camera 3: Recording scenario outside for Full Scene Segmentation.
  • Human Behaviour.
  • Deploying Autonomy.
  • Design of Algorithms for Training the Deep Neural Nets for Perception Tasks.
  • The Dataset shows that the Physical Engagement remains the same with/without the Autopilot on.
  • So, Autopilot still allows the Driver to stay physically engaged, and the System isn’t over-trusted by the Driver.

Pedestrian Detection


  • Different appearances: Intra-class variation.
  • Different Articulations.
  • Occluding accessories.
  • Pedestrians occluding each other.
  • Haar Cascade.
  • HOG.
  • CNN.
  • Fast R-CNN.
  • Mask RCNN.
  • VoxelNet.
  • Using a CNN classifier to detect if there is an object of interest present.
  • Use Non-Maximum Suppression to remove overlapping bounding boxes.
  • 10 hours recorded every day.
  • Approx. 12,000 Pedestrians Crossing.
  • 21M+ samples of feature vectors.
  • RCNN does a bounding box detection of Pedestrians.
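The Non-Maximum Suppression step mentioned above can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: the `(x1, y1, x2, y2)` box format, the function names, the sample boxes and the 0.5 IoU threshold are all assumptions chosen for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes that survive, highest score first.

    A box is dropped when it overlaps an already-kept, higher-scoring box
    by more than iou_threshold (an assumed, tunable value).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Hypothetical detections: two heavily overlapping boxes on one pedestrian,
# plus a separate box on another pedestrian.
boxes = [(10, 10, 50, 90), (12, 12, 52, 92), (100, 20, 140, 100)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)
```

The second box overlaps the first almost entirely and has a lower score, so only the first and third boxes are kept.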

Body Pose Estimation


  • Finding Joints in the image.
  • The landmark points in the image.
  • To determine the alignment of the driver.
  • Note: The general airbags are deployed assuming the Driver is facing towards the front.
  • With Increased automation, this assumption might fail.
  • Detect the Hands first, then detect the shoulders and so on in steps.
  • Traditional Method
  • Powerful, successful for Multi-person, multi-pose detection.
  • Regresses the location of each part from the complete image individually, rather than detecting parts sequentially.
  • Later, it stitches the detected joints together.
  • Allows detection of varying poses, and joints that aren’t visible.
  • A CNN takes in a raw image and produces an estimated X-Y position for each joint.
  • Every estimation zooms in and produces repeated finer detection and estimation of joints.
  • We can use this approach to detect parts in a picture with multiple people.
  • First, the body parts are detected without doing individual person detection first.
  • Next connect them together.
  • Through Bipartite matching, stitch the different people together.
  • This is the approach used by MIT, to detect upper body parts.
  • Position of Driver Vs Standard Front facing position.
  • Plot of Time Vs Position of Neck.
  • Body Pose Estimation for Pedestrians.
  • This allows detecting the dynamics of ‘non-verbal’ communication that happen when a pedestrian crosses the road and looks at the vehicle.
  • Interesting Discovery: Most people look away from the approaching vehicle before crossing the road.
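The bipartite matching step that stitches detected parts into people can be illustrated with a toy example. This is a simplified sketch under stated assumptions: real multi-person pose estimators score candidate connections with learned association cues, whereas here plain Euclidean distance stands in for that score, and the matching is brute-forced over permutations, which is only practical for a handful of people. The function name and coordinates are hypothetical.

```python
from itertools import permutations
from math import dist


def match_parts(parts_a, parts_b):
    """Pair each point in parts_a with a distinct point in parts_b so that
    the total distance is minimal. Assumes len(parts_a) <= len(parts_b).

    Returns a list of (index_in_a, index_in_b) pairs.
    """
    best_cost, best_pairs = float("inf"), []
    for perm in permutations(range(len(parts_b)), len(parts_a)):
        cost = sum(dist(parts_a[i], parts_b[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_pairs = cost, list(enumerate(perm))
    return best_pairs


# Hypothetical detections for two people, listed in no particular order.
necks = [(100, 50), (300, 60)]
right_shoulders = [(320, 80), (120, 70)]
pairs = match_parts(necks, right_shoulders)
print(pairs)  # neck 0 goes with shoulder 1, neck 1 with shoulder 0
```

For larger scenes, an assignment solver such as the Hungarian algorithm would replace the permutation loop.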

Glance Classification:

  • Determining where the Driver is looking.
  • Note: This isn’t the same as gaze detection, where we try to find (x,y,z) pose.
    We classify two regions: On-Road/Off-Road.
    Or Six Regions:
    - On Road
    - Off Road
    - Left
    - Right
    - Instrument Panel
    - Rear-View Mirror
    The classification allows it to be addressed as a ML problem.
  • The same can be extended to Pedestrian to determine if they are looking at/away from the approaching vehicle.
  • Note: The ground data is provided by manual annotation.
  • Designing algorithms that can detect the individual landmarks of the face and estimate the pose of the head.
  • Source Video
  • Calibration:
    Determining where the sensor is, since it’s region-based.
  • Video Stabilisation.
  • Face Detection.
  • Face Alignment.
  • Eye/Pupil Detection.
  • Head (and eye) pose estimation.
  • Classification.
  • Decision Pruning.
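The last two pipeline stages, classification and decision pruning, can be sketched as follows. The notes do not spell out the pruning rule, so this example assumes one plausible version: a frame's prediction is kept only when the top class probability is at least `min_ratio` times the runner-up. The region names follow the six regions listed above; the function name, the threshold and the sample probabilities are hypothetical.

```python
REGIONS = [
    "on_road", "off_road", "left", "right",
    "instrument_panel", "rear_view_mirror",
]


def prune_decision(probs, min_ratio=2.0):
    """Return the predicted glance region, or None when the classifier is
    not confident enough.

    probs: one probability per entry in REGIONS, e.g. a softmax output.
    min_ratio: assumed confidence rule, top probability / second-best.
    """
    ranked = sorted(zip(probs, REGIONS), reverse=True)
    (p1, label), (p2, _) = ranked[0], ranked[1]
    if p2 > 0 and p1 / p2 < min_ratio:
        return None  # ambiguous frame: drop the decision
    return label


confident = prune_decision([0.70, 0.05, 0.10, 0.05, 0.05, 0.05])
ambiguous = prune_decision([0.35, 0.05, 0.30, 0.10, 0.10, 0.10])
print(confident, ambiguous)
```

Dropping ambiguous frames trades coverage for accuracy: fewer frames get a label, but the labels that remain are more reliable.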

Driver State Detection:

  • Many ways to taxonomize emotion.
  • General Case of detecting emotion.
  • Ex: Affectiva SDK.
  • These algorithms map our expressions to Emotion.
  • Application Specific Emotion Recognition:
    - Ex: Using Voice based GPS Interaction
    - Self Annotated.
    - The generic emotion detectors fail here because while driving, ‘smile = frustration’.
    - Thus Annotation matters. The data must be labelled to reflect these situations.

Cognitive Load:

Degree to which a person is mentally busy.

  • Eyes expand and contract based on Cognitive load, movements reflect deep thought too.
  • Cognitive load can be detected with blink dynamics, eye movement and pupil dilation.
  • However, in the real world, lighting conditions rule out pupil dilation as a signal.
  • Blink dynamics and eye movement are utilised instead.
  • 3D convolutional NN:
    - A sequence of images is the input, so we use 3D convolutions.
    - Convolve across multiple images/channels.
    - Allow learning dynamics through time.
  • Real World Data:
    N-back tasks to estimate cognitive load.
  • We detect face, extract eyes and feed these into a CNN.
  • Plot of Eye Movement Vs Cognitive Load.
  • Standard 3D CNN Architecture.
  • Accuracy on real world data.
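To make the 3D convolution idea concrete, here is a minimal pure-Python sketch of a single "valid" 3D convolution (strictly a cross-correlation, which is what deep learning frameworks compute) over a stack of frames. It is an illustration only, not the architecture from the lecture: the function name, the tiny 2×2 frames and the hand-written temporal-difference kernel are assumptions; a real network learns its kernels.

```python
def conv3d_valid(volume, kernel):
    """Slide a (kt, kh, kw) kernel over a (T, H, W) volume without padding.

    volume and kernel are nested lists indexed as [time][row][col].
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += (volume[t + dt][y + dy][x + dx]
                                    * kernel[dt][dy][dx])
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Three hypothetical 2x2 eye-region frames; pixel values change over time.
frames = [
    [[0, 0], [0, 0]],
    [[1, 0], [0, 0]],
    [[1, 0], [0, 3]],
]
# A 2x1x1 kernel that subtracts the previous frame from the current one,
# so the response is non-zero only where something changed through time.
temporal_diff = [[[-1]], [[1]]]
response = conv3d_valid(frames, temporal_diff)
print(response)
```

Because the kernel spans more than one frame, its output depends on how pixels change between frames, which is what lets a 3D CNN pick up dynamics such as blinks and eye movement.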

Human Centred Vision for Autonomous Vehicles:

  • Even though we are researching perception and utilising sensors for localisation and path planning, we are still far from solving this (Argument: 20+ years).
    So, the Human has to be involved.
  • Thus, the ‘Robot’ needs to understand the ‘Human’s activity’ and the Human-Robot interaction needs to be refined.
  • Path to Mass Scale Automation.
    (No more steering wheels)
  • Human Centred Autonomy:
    - An SDC is a personal robot rather than a Perception Control System.
    - The ‘Transfer of Control’ involves a ‘Personal’ Connection with the machine.
    - SDCs will be wide reaching.
  • Teaser: MIT SDC will debut on the public streets (Public Testing) in March, 2018.
  • What Next?
    - DeepTraffic
    - DeepCrash
    - SegFuse



Sanyam Bhutani

Machine Learning Engineer and AI Content Creator at H2O.ai, Fast.ai Fellow, Kaggle x3 Expert (Ranked in Top 1%), Twitter: https://twitter.com/bhutanisanyam1