Customize fastai to develop an in-game self-driving vehicle with American Truck Simulator

Mike Chaykowsky
Jan 23 · 17 min read

I want to give a big thanks to the fast.ai team for not only open-sourcing a fantastic library for deep learning but also providing the resources to learn it. I also want to thank the authors of the tutorials that gave me big insights into a lot of the different methods for screen display and manipulation using OpenCV and PIL.

I have been watching the fast.ai lessons for the past several weeks and wanted to test some of the methods out on a personal project. I thought an autonomous vehicle project would be especially intriguing and seemed to fit well with the methods we have been learning in the course. I’m operating on a 13" MacBook Pro with OS Sierra — this, unfortunately, leaves my options pretty limited for vehicle simulator games. However, I discovered the game American Truck Simulator on Steam and it seemed to fit my needs. My thought was to have two neural networks operating on two different tasks. The first was essentially inspired by the Nvidia Research paper from a while back on End-to-End Learning for Self-Driving Cars, where the researchers took video input and mapped it to steering wheel angle. If this method works, there’s no reason it shouldn’t work with key presses. This first neural network is similar in that I trained it entirely on video input data and key presses. The second neural network performs object detection to determine if there are other cars, trucks, trains, or people in the path of the truck. Most of the methods used here with the fastai library are from Lesson 3 and Lesson 9 of the deep learning course.

import numpy as np
import pandas as pd
import time
from numpy import ones,vstack
from numpy.linalg import lstsq
from statistics import mean
import os
from collections import Counter
from random import shuffle
from datetime import datetime
from IPython.display import Image
from mss import mss
from PIL import ImageGrab
import cv2
import pyautogui
from pynput import keyboard

First, we want to develop a method for pressing keys (testing the model) and tracking when keys are being pressed (saving training data) using python. For this, I use the libraries pyautogui and pynput, respectively.

The structure for the key press functionality was derived from another developer’s tutorial, but the method is quite different because I am on a Mac and he is on a PC. It turns out it’s actually much easier on a Mac!

This is the first bit of trickery I should point out — I’m developing in Jupyter Notebooks, and since the key presses are controlling my system remotely, I had to start my notebook from root to allow the key presses to work properly. So just go to a terminal and type sudo jupyter notebook. In addition to this, I had to uncomment c.NotebookApp.allow_root=True in my ~/.jupyter/jupyter_notebook_config.py file. If the file doesn’t exist you may have to generate it using the command jupyter notebook --generate-config. The file should be located in your home folder inside of .jupyter. If you don’t show all of your hidden files on your Mac, you should. Just go to your Finder and hit cmd + shift + .

Ok, now we are ready to get some functions to press keys and record key presses.

def on_press(key):
    try:
        print('alphanumeric key {0} pressed'.format(key.char))
        keys.append(str(key))  # record the key so record_frames() can read keys[-1]
        return False           # stop the listener after one key
    except AttributeError:
        print('special key {0} pressed'.format(key))
        return False

def on_release(key):
    print('{0} released'.format(key))
    if key == keyboard.Key.esc:
        # Stop listener
        return False

def key_check():
    with keyboard.Listener(
            on_press=on_press,
            on_release=on_release) as listener:
        listener.join()

I had keys_to_output() return a one hot encoded list indicating which key press was made, because at some point the model will have to predict left, right, or straight, and it will return a numpy array of the probabilities of each. Ultimately this results in a length-3 list, one hot encoded to the appropriate direction, so for consistency the function returns key presses in the same format.

def keys_to_output(keys):
    # one hot: a -> left, w -> straight, d -> right
    if "'a'" in keys:
        return [1,0,0]
    elif "'d'" in keys:
        return [0,0,1]
    elif "'w'" in keys:
        return [0,1,0]
    return [0,0,0]

def PressKey(key):
    pyautogui.keyDown(key)

def ReleaseKey(key):
    pyautogui.keyUp(key)
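Going in the other direction, when the trained model later returns three probabilities, the one hot scheme decodes back to a direction with an argmax. A quick sanity check of the round trip (decode_prediction is my own illustrative name, not from the original code):

```python
directions = ['left', 'straight', 'right']

def decode_prediction(probs):
    # invert the one hot scheme: pick the direction with the highest probability
    return directions[max(range(len(probs)), key=lambda i: probs[i])]

print(decode_prediction([0.1, 0.7, 0.2]))  # straight
print(decode_prediction([0.8, 0.1, 0.1]))  # left
```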

I think there’s actually a lot of room for improvement in the way left() and right() are constructed. I noticed while playing the game that I tended to hold the straight key continuously even when I was trying to turn left, so I tried to build this into the functions a bit. Alternatively, you could just build a model that predicts combinations of left, straight, and right as well.

def straight():
    PressKey('w')
    ReleaseKey('a')
    ReleaseKey('d')

def left():
    PressKey('w')  # keep some throttle while turning
    PressKey('a')
    time.sleep(0.09)
    ReleaseKey('a')

def right():
    PressKey('w')  # keep some throttle while turning
    PressKey('d')
    time.sleep(0.09)
    ReleaseKey('d')

def slow_ya_roll():
    ReleaseKey('w')

So to create training data we want a function that records the frame around the game window while recording what key presses are being made, and then saves each frame and each key press to a numpy array, so that we can later put the key presses in a csv that corresponds with the filename of the frame. Figuring out the bounding box was probably the most annoying part of this process when I was using PIL's ImageGrab function, because the pixel values were not as expected. Then I discovered a python package called mss which essentially took my life from being terrible to being great. ImageGrab was performing loops in about 1.2s and mss averaged around 0.05s. Also, in mss, defining the box to record is very clear. I just hit cmd + shift + 4 on my Mac, which brings up the little screenshot cursor, checked which pixel values it said were the top and left, then calculated the width and height.
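Since the mss monitor dict wants top/left plus width/height rather than two corner points, a tiny helper makes the conversion explicit (corners_to_monitor is my own name, not part of the original code):

```python
def corners_to_monitor(top_left, bottom_right):
    """Convert two (x, y) screen points to an mss-style monitor dict."""
    left, top = top_left
    right, bottom = bottom_right
    return {"top": top, "left": left,
            "width": right - left, "height": bottom - top}

# the corners read off with cmd + shift + 4
print(corners_to_monitor((265, 79), (1170, 665)))
# {'top': 79, 'left': 265, 'width': 905, 'height': 586}
```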

Code notes for record_frames():

- key_check() tracks what keys are being pressed and appends them to the keys list.
- The list should be initialized beforehand with some key value: keys = ["'w'"]
- Then the key press is mapped to [[1,0,0],[0,1,0],[0,0,1]] and appended to training_data

def record_frames(file_name):

    # keys = ["'w'"]

    # countdown so you have time to click into the game window
    for i in list(range(4))[::-1]:
        print(i+1)
        time.sleep(1)

    with mss() as sct:
        # Part of the screen to capture
        monitor = {"top":79,"left":265,"width":905,"height":586}
        while "Screen capturing":
            last_time = time.time()
            # Get pixels from the screen, save it to a Numpy array
            screen = np.array(sct.grab(monitor))
            print("fps: {}".format(1 / (time.time() - last_time)))
            # screen = np.array(ImageGrab.grab(bbox=(265 * 2,79 * 2,1170 * 2,665 * 2)))  # previous slower method
            screen = cv2.cvtColor(screen, cv2.COLOR_BGR2RGB)
            screen = cv2.resize(screen, (224,224))
            # # uncomment if you want to see what the screen recorder sees
            # cv2.imshow('window2', cv2.cvtColor(cv2.resize(screen, (800,600)), cv2.COLOR_BGR2RGB))
            key_check()
            output = keys_to_output([keys[-1]])
            training_data.append([screen, output])
            if cv2.waitKey(25) & 0xFF == ord('q'):
                cv2.destroyAllWindows()
                break
            if len(training_data) % 10 == 0:
                np.save(file_name, training_data)

I wasn’t able to play consistently for hours and hours, and I don’t expect you will be able to either, so this essentially creates a new numpy file each time you play, and you can concatenate them all together later. To append the numpy arrays, you can just read them all in and then do np.append(train_data1, train_data2, axis=0) for however many train_data files you have.
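As a small illustration with stand-in arrays rather than real recorded frames: np.append flattens everything into 1-D by default, so pass axis=0 to keep the per-frame row structure when stitching sessions together.

```python
import numpy as np

# stand-ins for two loaded session files (each row = one training example)
session1 = np.array([[1, 2], [3, 4]])
session2 = np.array([[5, 6]])

# axis=0 stacks along the first dimension instead of flattening
train_data = np.append(session1, session2, axis=0)
print(train_data.shape)  # (3, 2)
```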

for i in range(100):

    file_name = 'training_data_{filename}part{num}.npy'.format(filename = datetime.now().strftime("%m%d%y"), num = i+1)
    if os.path.isfile(file_name):
        print('File exists, moving on!')
    else:
        print('File does not exist, starting fresh!')
        training_data = [] # initialize training_data
        keys = ["'w'"] # initialize keys
        record_frames(file_name = file_name)
        break

Your dataset will be imbalanced because naturally you are just pressing w for most frames. To even this out, you can shuffle the data and then take only as many examples of each class (left, straight, right) as the minimum of the group. So if your Counter returns Counter({'[0, 1, 0]': 550, '[1, 0, 0]': 247, '[0, 0, 1]': 223}) then you can just take 223 of each at random. I grabbed this code, with a few slight edits, from another tutorial.

lefts = []
rights = []
forwards = []
shuffle(train_data1)
for data in train_data1:
    img = data[0]
    choice = data[1]

    if choice == [1,0,0]:
        lefts.append([img, choice])
    elif choice == [0,1,0]:
        forwards.append([img, choice])
    elif choice == [0,0,1]:
        rights.append([img, choice])
    else:
        print('no matches')

forwards = forwards[:len(lefts)][:len(rights)]
lefts = lefts[:len(forwards)]
rights = rights[:len(forwards)]
final_data = forwards + lefts + rights
train_data = np.load('training_data_v2_011919.npy')
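The same undersampling idea can be checked in isolation with toy labels (this sketch uses made-up counts matching the Counter example above, not the real frames):

```python
from collections import Counter
from random import shuffle, seed

seed(0)
# toy labels mimicking an imbalanced session: mostly 'straight'
labels = ['straight'] * 550 + ['left'] * 247 + ['right'] * 223
shuffle(labels)

# undersample every class to the size of the smallest one
n_min = min(Counter(labels).values())
kept = Counter()
balanced = []
for lab in labels:
    if kept[lab] < n_min:
        balanced.append(lab)
        kept[lab] += 1

print(Counter(balanced))  # each class capped at 223
```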

So now we have training data, so it’s time to start building the model. The first model that we will build is based on Lesson 3 of the fastai course, so you can see the similarities in the code.

from fastai.conv_learner import *
from fastai.core import *
from fastai.transforms import *
from fastai.dataset import *
PATH = "data/"
sz = 224
arch = resnet34
bs = 64

Save all of the training images to a folder and the key presses to a csv.

for i in range(len(train_data)):
    cv2.imwrite(f'{PATH}/train3/trucksim_{i}.jpg', train_data[i][0])

directions = ['left', 'straight', 'right']
labels = pd.DataFrame({'id':[f'trucksim_{i}' for i in range(len(train_data))],
                       'label':[directions[np.argmax(train_data[i][1])] for i in range(len(train_data))]})
label_csv = f'{PATH}labels_3.csv'
labels.to_csv(label_csv, index=False)
n = len(list(open(label_csv))) - 1 # header is not counted (-1)
val_idxs = get_cv_idxs(n) # random 20% data for validation set
label_df = pd.read_csv(label_csv)

Create the model data object and the transforms.

tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
data = ImageClassifierData.from_csv(PATH, 'train3', f'{PATH}labels_3.csv',val_idxs=val_idxs, suffix='.jpg', tfms=tfms, bs=bs)
def get_data(sz, bs):
tfms = tfms_from_model(arch, sz, aug_tfms = transforms_side_on, max_zoom=1.1)
data = ImageClassifierData.from_csv(PATH, 'train3', f'{PATH}labels_3.csv',val_idxs=val_idxs, suffix='.jpg', tfms=tfms, bs=bs)
return data

After you go through this process, save your model, and later want to load it back in, you can skip all of the .fit() steps and just do learn = ConvLearner.pretrained(arch, data, precompute = False) and then learn.load('model0'). The difference here is precompute = False: your saved model was already trained with the resnet34 architecture, so there is no need to precompute the activations again. As of now, I am only using 500 frames of training data because it takes a lot of playing the game to get more, so I will update as I play more.

learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(1e-2, 5)


epoch      trn_loss   val_loss   accuracy               
0 1.355581 0.955264 0.566667
1 1.202431 0.912144 0.544444
2 1.085148 0.811203 0.588889
3 0.983141 0.716187 0.666667
4 0.897174 0.66741 0.688889
[array([0.66741]), 0.6888888888888889]

So clearly I need more training data. 500 frames won’t cut it.

learn.set_data(get_data(224, bs))

This next one may take some time.

learn.fit(1e-2, 3, cycle_len=1, cycle_mult=2)


epoch      trn_loss   val_loss   accuracy                
0 0.655568 0.662868 0.677778
1 0.637863 0.6479 0.666667
2 0.631665 0.640404 0.677778
3 0.63021 0.648688 0.688889
4 0.608185 0.658299 0.7
5 0.594226 0.659508 0.688889
6 0.588468 0.657246 0.688889
[array([0.65725]), 0.6888888888888889]

Now we have our model so we can test it out on an image. Your model data object has a .val_ds attribute so you can extract the validation dataset simply.

fn = data.val_ds.fnames[1]
Image(filename = PATH + fn)
American Truck Simulator
trn_tfms, val_tfms = tfms_from_model(arch, sz)
img = val_tfms(open_image(PATH+fn))
log_pred = learn.predict_array(img[None])
data.classes[np.argmax(log_pred)]



The model predicted straight.
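predict_array returns log-probabilities, so exponentiating recovers the class probabilities, and rounding them (as the in-game loop later does with np.around(np.exp(log_pred))) yields the one hot move vector. A small sketch with made-up numbers standing in for a real prediction:

```python
import numpy as np

directions = ['left', 'straight', 'right']

log_pred = np.array([[-2.5, -0.2, -3.1]])   # made-up log-probabilities for one image
probs = np.exp(log_pred)                    # back to probabilities
moves = np.around(probs)                    # one hot move vector
print(moves)                                # [[0. 1. 0.]]
print(directions[int(np.argmax(probs))])    # straight
```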

Ok, now go take Lesson 9 of the fastai course. I’m not going to put all of the code from the class here, but essentially it goes through the process of creating a Single Shot Detector (SSD) object detection model using the Pascal VOC 2007 dataset. I will give you some notes on the process, though, with some suggestions. If you clone the fastai repo, the notebook you need to run the code is called pascal-multi and lives in the dl2 directory within courses. Within that directory I just created another directory called data with the following structure:

|- pascal
| |- tmp
| | |- mbb.csv
| | |- mc.csv
| |- models
| | |- drop4.h5
| | |- fl0.h5
| | |- prefocal.h5
| | |- tmp.h5
| |- VOCdevkit
| |- VOCdevkit2
| |- pascal_test2007.json
| |- pascal_train2007.json
| |- pascal_val2007.json

Most of this is just downloaded from the kaggle page and the models are generated as you work through the lesson. The model that we will ultimately use in the self-driving car is the drop4 model. I would also recommend reading through this article as you go through Lesson 9, it’s a great walkthrough with some really nice insights.

Once you have everything in the pascal-multi notebook ready to run, just run all the cells and then go make yourself some dinner or go to the gym or both because it’s going to take a while. Afterwards, you should have a model called drop4 and you can place that in your current working directory for this project.

Now we want to load in the drop4 model. This is a bit trickier than the last one because the model uses a custom model data object and a custom head on the architecture. We also want some aspects of the SSD lesson to test things out on our American Truck Simulator data, like, bounding boxes, annotations, anchors, etc.

PATH_pascal = Path('data/pascal')
trn_j = json.load((PATH_pascal / 'pascal_train2007.json').open())
IMAGES,ANNOTATIONS,CATEGORIES = ['images', 'annotations', 'categories']
FILE_NAME,ID,IMG_ID,CAT_ID,BBOX = 'file_name','id','image_id','category_id','bbox'
cats = dict((o[ID], o['name']) for o in trn_j[CATEGORIES])
trn_fns = dict((o[ID], o[FILE_NAME]) for o in trn_j[IMAGES])
trn_ids = [o[ID] for o in trn_j[IMAGES]]
JPEGS_pascal = 'VOCdevkit2/VOC2007/JPEGImages'
IMG_PATH_pascal = PATH_pascal/JPEGS_pascal
def get_trn_anno():
    trn_anno = collections.defaultdict(lambda:[])
    for o in trn_j[ANNOTATIONS]:
        if not o['ignore']:
            bb = o[BBOX]
            bb = np.array([bb[1], bb[0], bb[3]+bb[1]-1, bb[2]+bb[0]-1])
            trn_anno[o[IMG_ID]].append((bb, o[CAT_ID]))
    return trn_anno
trn_anno = get_trn_anno()

id2cat is all of the categories that a prediction could be; this will be very important for us in annotations and decision making for the self-driving vehicle.

id2cat = list(cats.values())

def hw2corners(ctr, hw): return torch.cat([ctr-hw/2, ctr+hw/2], dim=1)

anc_grids = [4,2,1]
anc_zooms = [0.7, 1., 1.3]
anc_ratios = [(1.,1.), (1.,0.5), (0.5,1.)]
anchor_scales = [(anz*i,anz*j) for anz in anc_zooms for (i,j) in anc_ratios]
k = len(anchor_scales)
anc_offsets = [1/(o*2) for o in anc_grids]

k is 9

anc_x = np.concatenate([np.repeat(np.linspace(ao, 1-ao, ag), ag)
for ao,ag in zip(anc_offsets,anc_grids)])
anc_y = np.concatenate([np.tile(np.linspace(ao, 1-ao, ag), ag)
for ao,ag in zip(anc_offsets,anc_grids)])
anc_ctrs = np.repeat(np.stack([anc_x,anc_y], axis=1), k, axis=0)
anc_sizes = np.concatenate([np.array([[o/ag,p/ag] for i in range(ag*ag) for o,p in anchor_scales]) for ag in anc_grids])
grid_sizes = V(np.concatenate([np.array([ 1/ag for i in range(ag*ag) for o,p in anchor_scales]) for ag in anc_grids]), requires_grad=False).unsqueeze(1)
anchors = V(np.concatenate([anc_ctrs, anc_sizes], axis=1), requires_grad=False).float()
anchor_cnr = hw2corners(anchors[:,:2], anchors[:,2:])
n_clas = len(id2cat)+1
n_act = k*(4+n_clas)
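The anchor bookkeeping above is easy to sanity-check by hand: 3 zooms times 3 ratios gives k = 9 scales per grid cell, the grids contribute 4² + 2² + 1² = 21 cells, and each anchor needs 4 box coordinates plus a score for each of the 21 classes (20 Pascal categories plus background):

```python
anc_grids = [4, 2, 1]
anc_zooms = [0.7, 1., 1.3]
anc_ratios = [(1., 1.), (1., 0.5), (0.5, 1.)]

anchor_scales = [(anz * i, anz * j) for anz in anc_zooms for (i, j) in anc_ratios]
k = len(anchor_scales)                           # scales per grid cell
n_anchors = sum(g * g for g in anc_grids) * k    # total anchor boxes
n_clas = 20 + 1                                  # 20 Pascal categories + background
n_act = k * (4 + n_clas)                         # activations per grid cell

print(k, n_anchors, n_act)  # 9 189 225
```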
MC_CSV = PATH_pascal/'tmp/mc.csv'
CLAS_CSV = PATH_pascal/'tmp/clas.csv'
MBB_CSV = PATH_pascal/'tmp/mbb.csv'
mc = [[cats[p[1]] for p in trn_anno[o]] for o in trn_ids]
id2cat = list(cats.values())
cat2id = {v:k for k,v in enumerate(id2cat)}
mcs = np.array([np.array([cat2id[p] for p in o]) for o in mc])
val_idxs_pascal = get_cv_idxs(len(trn_fns))
((val_mcs,trn_mcs),) = split_by_idx(val_idxs_pascal, mcs)
mbb = [np.concatenate([p[0] for p in trn_anno[o]]) for o in trn_ids]
mbbs = [' '.join(str(p) for p in o) for o in mbb]
df = pd.DataFrame({'fn': [trn_fns[o] for o in trn_ids], 'bbox': mbbs}, columns=['fn','bbox'])
df.to_csv(MBB_CSV, index=False)

Create the new model data object:

aug_tfms = [RandomRotate(3, p=0.5, tfm_y=TfmType.COORD),
            RandomLighting(0.05, 0.05, tfm_y=TfmType.COORD),
            RandomFlip(tfm_y=TfmType.COORD)]
tfms_pascal = tfms_from_model(arch, sz, crop_type=CropType.NO,
                              tfm_y=TfmType.COORD, aug_tfms=aug_tfms)
md_pascal = ImageClassifierData.from_csv(PATH_pascal, JPEGS_pascal, MBB_CSV, tfms=tfms_pascal, bs=bs, continuous=True, num_workers=4)

Then add the custom model data class:

class ConcatLblDataset(Dataset):
    def __init__(self, ds, y2):
        self.ds,self.y2 = ds,y2
        self.sz = ds.sz

    def __len__(self): return len(self.ds)

    def __getitem__(self, i):
        x,y = self.ds[i]
        return (x, (y,self.y2[i]))
trn_ds2 = ConcatLblDataset(md_pascal.trn_ds, trn_mcs)
val_ds2 = ConcatLblDataset(md_pascal.val_ds, val_mcs)
md_pascal.trn_dl.dataset = trn_ds2
md_pascal.val_dl.dataset = val_ds2

Then add the custom head to the architecture:

class StdConv(nn.Module):
    def __init__(self, nin, nout, stride=2, drop=0.1):
        super().__init__()
        self.conv = nn.Conv2d(nin, nout, 3, stride=stride, padding=1)
        self.bn = nn.BatchNorm2d(nout)
        self.drop = nn.Dropout(drop)

    def forward(self, x): return self.drop(self.bn(F.relu(self.conv(x))))

def flatten_conv(x,k):
bs,nf,gx,gy = x.size()
x = x.permute(0,2,3,1).contiguous()
return x.view(bs,-1,nf//k)
class OutConv(nn.Module):
    def __init__(self, k, nin, bias):
        super().__init__()
        self.k = k
        self.oconv1 = nn.Conv2d(nin, (len(id2cat)+1)*k, 3, padding=1)
        self.oconv2 = nn.Conv2d(nin, 4*k, 3, padding=1)
        self.oconv1.bias.data.zero_().add_(bias)

    def forward(self, x):
        return [flatten_conv(self.oconv1(x), self.k),
                flatten_conv(self.oconv2(x), self.k)]
drop = 0.4

class SSD_MultiHead(nn.Module):
    def __init__(self, k, bias):
        super().__init__()
        self.drop = nn.Dropout(drop)
        self.sconv0 = StdConv(512,256, stride=1, drop=drop)
        self.sconv1 = StdConv(256,256, drop=drop)
        self.sconv2 = StdConv(256,256, drop=drop)
        self.sconv3 = StdConv(256,256, drop=drop)
        self.out0 = OutConv(k, 256, bias)
        self.out1 = OutConv(k, 256, bias)
        self.out2 = OutConv(k, 256, bias)
        self.out3 = OutConv(k, 256, bias)

    def forward(self, x):
        x = self.drop(F.relu(x))
        x = self.sconv0(x)
        x = self.sconv1(x)
        o1c,o1l = self.out1(x)
        x = self.sconv2(x)
        o2c,o2l = self.out2(x)
        x = self.sconv3(x)
        o3c,o3l = self.out3(x)
        return [torch.cat([o1c,o2c,o3c], dim=1),
                torch.cat([o1l,o2l,o3l], dim=1)]
k = 9
head_reg4 = SSD_MultiHead(k, -4.)
models = ConvnetBuilder(arch, 0, 0, 0, custom_head=head_reg4)
learn_pascal = ConvLearner(md_pascal, models, precompute=False)
learn_pascal.opt_fn = optim.Adam

Now just load the model you created from Lesson 9 of the fastai course:

learn_pascal.load('drop4')
So now that we have the object detection model loaded, we should test it out on one of our images.

import matplotlib.cm as cmx
import matplotlib.colors as mcolors
from cycler import cycler
from matplotlib import patches, patheffects
def get_cmap(N):
color_norm = mcolors.Normalize(vmin=0, vmax=N-1)
return cmx.ScalarMappable(norm=color_norm, cmap='Set3').to_rgba
num_colr = 12
cmap = get_cmap(num_colr)
colr_list = [cmap(float(x)) for x in range(num_colr)]
def bb_hw(a): return np.array([a[1],a[0],a[3]-a[1]+1,a[2]-a[0]+1])

def show_ground_truth(ax, im, bbox, clas=None, prs=None, thresh=0.3):
bb = [bb_hw(o) for o in bbox.reshape(-1,4)]
if prs is None: prs = [None]*len(bb)
if clas is None: clas = [None]*len(bb)
ax = show_img(im, ax=ax)
for i,(b,c,pr) in enumerate(zip(bb, clas, prs)):
if((b[2]>0) and (pr is None or pr > thresh)):
draw_rect(ax, b, color=colr_list[i%num_colr])
txt = f'{i}: '
if c is not None: txt += ('bg' if c==len(id2cat) else id2cat[c])
if pr is not None: txt += f' {pr:.2f}'
draw_text(ax, b[:2], txt, color=colr_list[i%num_colr])
def get_y(bbox,clas):
bbox = bbox.view(-1,4)/sz
bb_keep = ((bbox[:,2]-bbox[:,0])>0).nonzero()[:,0]
return bbox[bb_keep],clas[bb_keep]
def actn_to_bb(actn, anchors):
actn_bbs = torch.tanh(actn)
actn_centers = (actn_bbs[:,:2]/2 * grid_sizes) + anchors[:,:2]
actn_hw = (actn_bbs[:,2:]/2+1) * anchors[:,2:]
return hw2corners(actn_centers, actn_hw)
def torch_gt(ax, ima, bbox, clas, prs=None, thresh=0.4):
return show_ground_truth(ax, ima, to_np((bbox*224).long()),
to_np(clas), to_np(prs) if prs is not None else None, thresh)
def show_img(im, figsize=None, ax=None):
    if not ax: fig,ax = plt.subplots(figsize=figsize)
    ax.imshow(im)
    ax.set_xticks(np.linspace(0, 224, 8))
    ax.set_yticks(np.linspace(0, 224, 8))
    return ax

def draw_outline(o, lw):
    o.set_path_effects([patheffects.Stroke(
        linewidth=lw, foreground='black'), patheffects.Normal()])

def draw_rect(ax, b, color='white'):
    patch = ax.add_patch(patches.Rectangle(b[:2], *b[-2:], fill=False, edgecolor=color, lw=2))
    draw_outline(patch, 4)

def draw_text(ax, xy, txt, sz=14, color='white'):
    text = ax.text(*xy, txt,
        verticalalignment='top', color=color, fontsize=sz, weight='bold')
    draw_outline(text, 1)

The first time I did this I had issues with passing the model just 1 image for prediction. I’m thinking that the issue was because the model expects a batch size of 64 and I’m just passing it 1 image, because the error seemed to originate in a BatchNorm layer, but I’m not 100% sure. What I discovered, though, is that calling learn_pascal.model.eval() allowed me to pass the single image, because it switches the BatchNorm layers to evaluation mode (they use their running statistics instead of per-batch statistics).


We can test out the model on the same image we tested earlier with the first model. We assigned the path to fn earlier, so we can open the image using fastai’s open_image() function, which is a very handy tool for pulling an image into a numpy array. Then we can perform the transforms on that image, and because the model expects a batch dimension we can just place [None] at the end. So we end up with val_tfms(open_image(PATH+fn))[None], which has shape (1, 3, 224, 224), whereas the original image has shape (224, 224, 3). next(iter(md_pascal.val_dl)) is the way of pulling the next batch from our model data object; we can use the second element of this to identify our classes (car, truck, bus, etc.).
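The shape shuffle in that last sentence is worth seeing in isolation: the transforms move the channel axis to the front (HWC to CHW), and [None] prepends a batch dimension of size 1. A numpy-only sketch of the same reordering (it emulates the effect of the fastai pipeline, not its exact internals):

```python
import numpy as np

img = np.zeros((224, 224, 3), dtype=np.float32)  # H x W x C, as opened from disk

chw = np.transpose(img, (2, 0, 1))  # move channels first: C x H x W
batch = chw[None]                   # [None] adds a leading batch dimension

print(img.shape, chw.shape, batch.shape)
# (224, 224, 3) (3, 224, 224) (1, 3, 224, 224)
```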

_,y = next(iter(md_pascal.val_dl))
y = V(y)
x = val_tfms(open_image(PATH+fn))[None]
# x = to_np(x)
b_clas_truck,b_bb_truck = learn_pascal.model(V(x))
ima = md_pascal.val_ds.ds.denorm(to_np(x))[0]  # de-normalize for display
ax = plt.gca()
bbox,clas = get_y(y[0][0], y[1][0])
a_ic_truck = actn_to_bb(b_bb_truck[0], anchors)
clas_pr_truck, clas_ids_truck = b_clas_truck[0].max(1)
clas_pr_truck = clas_pr_truck.sigmoid()
torch_gt(ax, ima, a_ic_truck, clas_ids_truck, clas_pr_truck, clas_pr_truck.max().data[0]*0.75)

Not bad.

We’re pretty much ready to test this thing out in-game. The object detection model is predicting a lot of classes for each image with many different bounding boxes. We want to weed out some of these superfluous ones and also see if we even care about that object being in the frame and if it’s a threat to us.

One thing we can do is set a probability threshold so that it only shows the classifications above a certain probability. In practice, I found 0.15 useful, but you can experiment with different values.

Next we want to determine whether or not that classification is a threat. I’m sure there are better ways to do this, but I defined a box surrounding the front of the truck called warning and if the bounding box for the classification overlapped with the warning bounding box I call that a threat. Then I check to see if that classification is to the left or the right of the center of the window and then turn the opposite direction.

Code notes:

- Notice all of the labels start in the top left corner; this is because the plotting function draws the text at b[:2] (the top left).
- If c == 20 (the length of id2cat), then that cell is background.
- Example: for bbox [ 68 116  18  29], [68 116] is the top left corner, 18 is the width, 29 is the height.
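That [top-left, width, height] form comes out of bb_hw(), which converts fastai's [row1, col1, row2, col2] corner format. Reproducing the example above (the input corners are back-computed by me for illustration):

```python
import numpy as np

def bb_hw(a):
    # [row1, col1, row2, col2] -> [x, y, width, height]
    return np.array([a[1], a[0], a[3] - a[1] + 1, a[2] - a[0] + 1])

box = np.array([116, 68, 144, 85])  # corners that produce the example bbox
print(bb_hw(box))  # [ 68 116  18  29] -> top left (68, 116), 18 wide, 29 tall
```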

bbox = to_np((a_ic_truck*224).long())
bb = [bb_hw(o) for o in bbox.reshape(-1,4)]; bb
clas = clas_ids_truck
prs = clas_pr_truck
thresh = 0.06
if prs is None: prs = [None]*len(bb)
if clas is None: clas = [None]*len(bb)
for i,(b,c,pr) in enumerate(zip(bb, clas, prs)):
c = float(to_np(c))
pr = float(to_np(pr))
if((b[2]>0) and (pr is None or pr > thresh)):
txt = f'{i}: '
if c is not None: txt += ('bg' if int(c)==len(id2cat) else id2cat[int(c)])
if pr is not None: txt += f' {pr:.2f}'
print(b, b[:2], txt)


[ 75 146  21  11] [ 75 146] 83: car 0.07
[ 58 130  44  20] [ 58 130] 89: train 0.07

Define the bounding box that is our danger zone and a function to check whether it overlaps with the predictions.

warning = np.array([64, 64, 96, 64])

def overlapping2D(box_a, box_b = warning):
xmin1, xmax1 = (box_a[0], box_a[0] + box_a[2])
xmin2, xmax2 = (box_b[0], box_b[0] + box_b[2])

ymin1, ymax1 = (box_a[1], box_a[1] + box_a[3])
ymin2, ymax2 = (box_b[1], box_b[1] + box_b[3])

check1Dx = xmax1 >= xmin2 and xmax2 >= xmin1

check1Dy = ymax1 >= ymin2 and ymax2 >= ymin1

if check1Dx and check1Dy and ((xmin1 + xmax1) / 2) < 112:
return np.array([0,0,1])
if check1Dx and check1Dy and ((xmin1 + xmax1) / 2) > 112:
return np.array([1,0,0])
return np.array([0,0,0])
def convert_warnings(warning):
directions = ['left', 'straight', 'right']
return directions[np.argmax(warning)]
def draw_bboxes(img, bboxes, color=(0, 0, 255), thickness=1):
for bbox in bboxes:
cv2.rectangle(img, tuple(bbox[:2]), tuple(bbox[:2]+bbox[-2:]), color, thickness)

Here is the main function to run the self-driving vehicle.

def main():

    last_time = time.time()

    _,y = next(iter(md_pascal.val_dl))
    y = V(y)

    # countdown so you have time to click into the game window
    for i in list(range(4))[::-1]:
        print(i+1)
        time.sleep(1)

    counter = 0
    with mss() as sct:
        monitor = {"top":79,"left":265,"width":905,"height":586}
        while "Screen capturing":
            last_time = time.time()
            counter += 1
            screen = np.array(sct.grab(monitor))
            print('loop took {} seconds'.format(time.time()-last_time))
            last_time = time.time()
            screen = cv2.cvtColor(screen, cv2.COLOR_BGR2RGB)
            screen = cv2.resize(screen, (224,224)).astype(np.float32)/255
            img = val_tfms(screen)
            log_pred = learn.predict_array(img[None])
            moves = np.around(np.exp(log_pred))
            print('Here are the moves:', moves)
            # run object detection model
            b_clas_truck,b_bb_truck = learn_pascal.model(V(img[None]))
            bbox, clas = get_y(y[0][0], y[1][0])
            a_ic_truck = actn_to_bb(b_bb_truck[0], anchors)
            clas_pr_truck, clas_ids_truck = b_clas_truck[0].max(1)
            clas_pr_truck = clas_pr_truck.sigmoid()
            bbox = to_np((a_ic_truck*224).long())
            bb = [bb_hw(o) for o in bbox.reshape(-1,4)]
            print('Here is the bb:', bb[0])
            clas = clas_ids_truck
            prs = clas_pr_truck
            thresh = 0.15
            if prs is None: prs = [None]*len(bb)
            if clas is None: clas = [None]*len(bb)
            move_warning = np.array([0,0,0])
            for i,(b,c,pr) in enumerate(zip(bb, clas, prs)):
                c = float(to_np(c))
                pr = float(to_np(pr))
                if((b[2]>0) and (pr is None or pr > thresh)):
                    move_warning = move_warning + overlapping2D(b)
                    cv2.rectangle(screen, tuple(b[:2]), tuple(b[:2]+b[-2:]), (0,0,255), 1)
                    txt = id2cat[int(c)]
                    cv2.putText(screen,txt,tuple(b[:2]), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (200,255,155), 2, cv2.LINE_AA)
            # # If you want to display the picture
            # cv2.imshow("OpenCV/Numpy normal", screen)
            # to save the screens
            cv2.imwrite(f'data/record/screen{counter}.png', cv2.cvtColor(screen, cv2.COLOR_RGB2BGR) * 255)
            print('Here is the move-warning:', np.argmax(move_warning), move_warning)
            # act on the model's move, overridden by any warning
            # (the key actions use the helpers defined earlier)
            if (moves == [1,0,0]).all():
                if np.sum(move_warning) != 0:
                    warning = convert_warnings(move_warning)
                    if warning == 'right':
                        right()
                    if warning == 'straight':
                        slow_ya_roll()
                else:
                    left()
            elif (moves == [0,1,0]).all():
                if np.sum(move_warning) != 0:
                    warning = convert_warnings(move_warning)
                    if warning == 'left':
                        left()
                    if warning == 'right':
                        right()
                else:
                    straight()
            elif (moves == [0,0,1]).all():
                if np.sum(move_warning) != 0:
                    warning = convert_warnings(move_warning)
                    if warning == 'left':
                        left()
                    if warning == 'straight':
                        slow_ya_roll()
                else:
                    right()
            else:
                if np.sum(move_warning) != 0:
                    warning = convert_warnings(move_warning)
                    if warning == 'left':
                        left()
                    if warning == 'straight':
                        slow_ya_roll()
                    if warning == 'right':
                        right()
            if cv2.waitKey(25) & 0xFF == ord('q'):
                cv2.destroyAllWindows()
                break


loop took 0.06589078903198242 seconds
Here are the moves: [[0. 1. 0.]]
Here is the bb: [14 13 28 24]
Here is the move-warning: 0 [0 0 0]
loop took 0.06248593330383301 seconds
Here are the moves: [[0. 1. 0.]]
Here is the bb: [14 14 28 24]
Here is the move-warning: 0 [0 0 0]
loop took 0.05881690979003906 seconds
Here are the moves: [[0. 1. 0.]]
Here is the bb: [14 13 31 27]
Here is the move-warning: 0 [0 0 0]
AI trained on 500 frames of training data — would not get in that truck!

That’s it!

Clearly, this is still a work in progress and there’s tons of room for improvement, but it’s pretty cool how easy it was to get up and running with a self-driving vehicle utilizing object detection with fastai.

Afterwards, if you want to turn your saved images into a video, I found ffmpeg helpful, albeit a tad confusing. One suggestion: brew install it rather than downloading it from the website, which is much easier. Then I learned that the default encoding does not play in most video players out of the box. You have to encode it with the pixel format yuv420p, and the order in which you pass the flags really matters! Here’s the command I ran from the terminal:

! ffmpeg -start_number 1 -framerate 4 -r 8 -i data/record/screen%01d.png -pix_fmt yuv420p output.mp4

Then if you just want to quickly watch the video you can run ! ffplay output.mp4 (the exclamation point runs bash from the notebook, assuming you’re working in a notebook as I am).

Data Driven Investor

from confusion to clarity, not insanity

Written by Mike Chaykowsky. ML/stats. Based out of Los Angeles, CA