Play The City | 60th GRAMMYS

How we turned New York into a musical instrument

Mar 5, 2018

“Play The City” is an Augmented Reality experience designed to celebrate the return of the 60th GRAMMYs to New York. Together with TBWA/Chiat/Day LA and Tool Of North America, we took an Uber ride and outfitted it with Computer Vision software, allowing unsuspecting passengers to play the city like a musical instrument.

As the custom-rigged vehicle drove around, music & visuals were generated in real time from the people & objects detected in the streets of New York. The entire experience was captured in real time with multiple cameras to create the final commercial, which launched at the GRAMMYs.

Besides design & development of custom tracking software to create unique songs & generate visuals, we also handled on-site support during the video shoot. Quite a challenging installation with a very short turnaround…

Curious how we pulled it off? Keep on reading!

Hardware Setup

To build the installation we used two computers (with a decent GPU), a GoPro camera and some extra hardware to glue everything together. The illustration below shows the full setup and how it is configured to run the applications.

The first machine runs the video server and the computer vision application. The second computer is used to generate the augmented reality layer and music. In total we developed four applications tying this all together. Let’s have a look at the different parts…

System Architecture — Clipart from *openclipart.org*

1. Video Server

Captures the realtime video feed from a GoPro camera, using a Blackmagic capture card, and streams it to a computer vision application running on the same machine.

Simultaneously it also streams the video to the augmented reality application on the second machine. This stream was used as a base layer, which we augmented with illustrations.

2. Computer Vision

The incoming video stream was analysed using the YOLO detector, which allowed us to track people & objects. Each time an object of interest was detected, an OSC message was sent to the augmented reality application and the Midi controller application.

3. Augmented Reality

Captures the incoming video stream and augments it with custom artwork by Musketon, based on the received OSC messages. The final visual output was then pushed to a display mounted outside the car.

4. Midi Controller

The Midi controller application transforms incoming data into Midi notes and Midi CC messages. These were then sent to Ableton Live to trigger different (pre-recorded) instruments & loops and to control the volume of each individual track.
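To make the two message types concrete, here is a minimal sketch of the raw bytes behind a Midi note and a Midi CC message. The status bytes are standard Midi; the helper names, and the choice of controller 7 (channel volume) in the usage note, are assumptions for illustration and not taken from our app.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

// A short Midi channel message: one status byte and two data bytes.
using MidiMessage = std::array<std::uint8_t, 3>;

// Note On: status 0x90 plus the channel (0-15), then note and velocity (0-127).
MidiMessage noteOn(int channel, int note, int velocity) {
    assert(channel >= 0 && channel < 16);
    assert(note >= 0 && note < 128 && velocity >= 0 && velocity < 128);
    return {{static_cast<std::uint8_t>(0x90 | channel),
             static_cast<std::uint8_t>(note),
             static_cast<std::uint8_t>(velocity)}};
}

// Control Change: status 0xB0 plus the channel, then controller number and value.
MidiMessage controlChange(int channel, int controller, int value) {
    assert(channel >= 0 && channel < 16);
    assert(controller >= 0 && controller < 128 && value >= 0 && value < 128);
    return {{static_cast<std::uint8_t>(0xB0 | channel),
             static_cast<std::uint8_t>(controller),
             static_cast<std::uint8_t>(value)}};
}

// Hypothetical helper: turn a 0..1 level into a 0..127 CC value.
int levelToCcValue(float level) {
    const float clamped = std::min(1.0f, std::max(0.0f, level));
    return static_cast<int>(std::lround(clamped * 127.0f));
}
```

A call such as `controlChange(0, 7, levelToCcValue(level))` would then set a track volume, provided that controller is mapped to the track's fader in Ableton Live.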

All applications were built using openFrameworks, an open source C++ toolkit for creative coding. We also used several add-ons developed by the community. Without these open source projects, we would not have been able to build this installation in such a short time.

*Some early testing mounting hardware with our standard media art toolkit: cardboard boxes and duct tape*

Video Server

The video server takes the live stream from the camera and distributes it to the other applications. Communication between different applications was handled with Spout (for apps on the same machine) and NDI (distributing the video to the other machine).

First tests were promising, but one of the bottlenecks seemed to be the standard openFrameworks video capture class. We started out by using a Logitech HD Pro Webcam C920, which sends out two video streams:

1. Raw video feed
2. H264 feed

It turns out the openFrameworks ‘ofVideoGrabber’ class does not support taking in H264-encoded streams from the webcam. This forced us to use the raw feed, which resulted in serious frame drops. There are workarounds, but we didn’t have time to explore them.

So on to plan B… using a GoPro Hero 5 and a Blackmagic Design DeckLink Mini Recorder 4K PCIe Capture Card. To get the captured video feed into openFrameworks, we used the ofxBlackmagic2 library by Elliot Woods.

During development, we also used the Blackmagic Design Intensity Shuttle because we didn’t have full-time access to the main setup.

Computer Vision

The most important challenge of this project was detecting people & objects in a live video stream. To do this we used Darknet by Joseph Redmon, which includes YOLO, a state-of-the-art real-time object detector.

To get the detector up & running in openFrameworks we used ofxDarknet, developed by Marcel Schwittlick, using the COCO dataset implementation.

On top of the object detection we implemented a ‘playhead’ system, which gave us more fine-grained control over when & how many times signals were sent through the system. We used OSC to send out the messages.

These ‘playheads’ could be placed anywhere inside the feed, each controlling different animations, instruments or global parameters (volume). The ‘region of interest’ of a certain playhead was defined by its width and position.
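A playhead of this kind can be sketched as a vertical band that fires when more objects of a type are inside it than in the previous frame. This is a simplified illustration and not our production code: the struct and field names are hypothetical, coordinates are assumed to be normalized to 0..1, and "inside" is assumed to mean that the centre of the bounding box falls within the band.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One detection in normalized frame coordinates (0..1). Names are illustrative.
struct Detection {
    std::string label;  // e.g. "person" or "car"
    float x, y;         // top-left corner of the bounding box
    float w, h;         // size of the bounding box
};

// A vertical band in the frame, defined by its position and width.
struct Playhead {
    float x;                              // left edge of the band
    float width;                          // width of the band
    std::map<std::string, int> previous;  // per-label counts from the last frame

    // True when the centre of the bounding box lies inside the band.
    bool contains(const Detection& d) const {
        const float centre = d.x + d.w * 0.5f;
        return centre >= x && centre <= x + width;
    }

    // Returns the labels whose count inside the band increased since the
    // previous frame, so an object that lingers does not retrigger every frame.
    std::vector<std::string> update(const std::vector<Detection>& detections) {
        assert(width > 0.0f);
        std::map<std::string, int> current;
        for (const auto& d : detections) {
            if (contains(d)) ++current[d.label];
        }
        std::vector<std::string> triggered;
        for (const auto& entry : current) {
            const auto it = previous.find(entry.first);
            const int before = (it == previous.end()) ? 0 : it->second;
            if (entry.second > before) triggered.push_back(entry.first);
        }
        previous = current;
        return triggered;
    }
};
```

Calling `update` once per processed frame returns the labels that should send out a message. Counting per label is a shortcut: it avoids tracking individual objects, at the cost of missing one object leaving while another of the same type enters in the same frame.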

Augmented Reality

The Augmented Reality app created the visual output for the experience, taking in the video stream from the Video Server and augmenting it with custom illustrations based on OSC messages received from the Computer Vision application.

Each OSC message contained the type of object that was detected, its position on screen and the size of its bounding box. Each detected object was then paired with an animation, which played back a sequence of PNG images.
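For readers unfamiliar with OSC, the sketch below encodes such a message as raw OSC 1.0 bytes: a padded address, a padded type tag string, then the arguments. In practice an OSC library normally does this for you. The address `/detection`, the argument order and the function names are assumptions for illustration, not our actual message format.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Append a string the OSC 1.0 way: null-terminated, then zero-padded to a
// multiple of 4 bytes. Assumes the buffer is already 4-byte aligned, which
// holds as long as every element is appended with these helpers.
void appendOscString(std::vector<std::uint8_t>& out, const std::string& s) {
    out.insert(out.end(), s.begin(), s.end());
    out.push_back(0);
    while (out.size() % 4 != 0) out.push_back(0);
}

// Append a 32-bit float in big-endian byte order.
void appendOscFloat(std::vector<std::uint8_t>& out, float value) {
    static_assert(sizeof(float) == 4, "OSC floats are 32-bit");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof bits);
    for (int shift = 24; shift >= 0; shift -= 8) {
        out.push_back(static_cast<std::uint8_t>((bits >> shift) & 0xFF));
    }
}

// Encode one detection: object type, position and bounding box size.
std::vector<std::uint8_t> encodeDetection(const std::string& label,
                                          float x, float y, float w, float h) {
    assert(!label.empty());
    std::vector<std::uint8_t> packet;
    appendOscString(packet, "/detection");  // hypothetical address
    appendOscString(packet, ",sffff");      // one string, four floats
    appendOscString(packet, label);
    appendOscFloat(packet, x);
    appendOscFloat(packet, y);
    appendOscFloat(packet, w);
    appendOscFloat(packet, h);
    return packet;
}
```

The resulting buffer can be sent as the payload of a single UDP datagram, which is how OSC messages usually travel between applications.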

Based on visual references from the agency, we brought in Musketon to create the artwork and combined it with generated design elements.

An exception was made for cars, trucks and buses: these were not paired with illustrations but controlled a ribbon that started dancing on the musical notation bars each time they were detected.

In the end, the incoming video and augmented reality layer were composited and shown on a display mounted on the outside of the car.

*Central ribbon dancing on the musical notation bars, controlled by the presence of cars*

Let’s play some music

All music was composed by Eddie Alonso and consisted of different instruments and loops, which were layered in Ableton Live.

The initial idea was to use different instruments for detected objects, which could play different notes at different scales. However, this became pretty complex and didn’t sound like the musical soundtrack we envisioned for this experience.

For the final experience we ended up with only one instrument, which was used whenever people were detected, paired with audio loops for other objects (e.g. fire hydrants, bicycles, traffic lights, stop signs, ...).

Again a special exception was made for cars, trucks and buses. Just like the central ribbon in the visual output, these controlled the volume of one of the three base tracks for the composition.

For the instrument attached to people, we laid out some ground rules:

1. The vertical position of a person defines the pitch of a note
2. People positioned at the top of the screen play higher notes
3. All notes should be in the scale of C-Minor

The musical notes being played are based on where an object is detected inside the video. We mapped the height of the video to two octaves, meaning notes could be anything between C2 and B3. Playing all these notes at random in a composition would sound horrible. To make it more musical, we used the Midi Scale effect in Ableton Live. This effect alters the pitch of incoming notes based on a predefined scale map.
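The mapping can be sketched in a few lines. This is an illustration under stated assumptions: it uses Ableton Live's note naming, where Midi note 60 is C3 (so C2 is 48 and B3 is 71), takes a vertical position normalized to 0 at the top and 1 at the bottom, and snaps out-of-scale notes down to the nearest note of C natural minor. In the installation the snapping was done by the Scale effect inside Ableton Live, whose map may differ from the one shown here.

```cpp
#include <algorithm>
#include <cassert>

// Assumption: Ableton Live's naming, where Midi note 60 is C3.
constexpr int kLowestNote = 48;  // C2
constexpr int kNoteRange = 24;   // two octaves: C2 up to and including B3

// Map a normalized vertical position (0 = top, 1 = bottom) to a chromatic
// Midi note. Positions near the top of the frame give higher notes.
int noteFromY(float y) {
    assert(y >= 0.0f && y <= 1.0f);
    int step = static_cast<int>((1.0f - y) * kNoteRange);
    step = std::min(step, kNoteRange - 1);  // y == 0 would otherwise give C4
    return kLowestNote + step;
}

// Snap a note down to C natural minor (C, D, Eb, F, G, Ab, Bb).
// The table maps each of the 12 pitch classes to an in-scale pitch class.
int snapToCMinor(int note) {
    assert(note >= 0 && note < 128);
    static const int inScale[12] = {0, 0, 2, 3, 3, 5, 5, 7, 8, 8, 10, 10};
    const int pitchClass = note % 12;
    return note - pitchClass + inScale[pitchClass];
}
```

With these assumptions, a person at the very bottom of the frame plays C2 (48), while a person at the very top gives B3 (71), which is snapped down to Bb3 (70).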

*The Scale effect maps all incoming Midi notes to the C Minor scale*

To facilitate all of this we created a Midi controller app which received OSC messages, handled the processing and sent the results on to Ableton Live, again through Midi and/or OSC.

For this we used ofxMidi by Dan Wilcox and also added a quantizer, both for incoming notes and for controlling the playback of audio loops.
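One common way to build such a quantizer is to delay each incoming note until the next line of a rhythmic grid. The sketch below shows that idea only. The function name, the tempo and the grid resolution are hypothetical, and it is not a description of how our quantizer was actually implemented.

```cpp
#include <cassert>
#include <cmath>

// Returns the time (in seconds since the start of the song) of the first
// grid line at or after timeSec. stepsPerBeat = 4 gives a sixteenth-note
// grid when one beat is a quarter note.
double quantizeUp(double timeSec, double bpm, int stepsPerBeat) {
    assert(timeSec >= 0.0 && bpm > 0.0 && stepsPerBeat > 0);
    const double stepSec = 60.0 / bpm / stepsPerBeat;
    return std::ceil(timeSec / stepSec) * stepSec;
}
```

At 120 BPM on a sixteenth-note grid the grid lines are 0.125 seconds apart, so a note arriving at 0.30 seconds would be held until 0.375 seconds. Notes that land exactly on a grid line pass through without delay.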