Talking Pets — Mobile App

One night last month, I had a strange dream. I was talking to my dog, Ginger (a.k.a. varadhunni). The interesting thing was that it wasn't a one-way conversation like in real life, where I would be telling him the most boring things ever and, out of sheer incapability, he would simply sit there with a nod of approval. This time, however, he was giving me a piece of his mind. It was like getting a taste of my own medicine: a rant of disappointment and sadness at the actions of his master, the things I was doing wrong and how they affected him. I felt guilty, but also happy in a way. I wish I hadn't woken up from that dream. It was so beautiful; the idea that my dog could talk, manoeuvre his mouth to say #%%*# and become my trusted advisor fascinated me.

Back to reality - the next thing I did was search for apps that could make a pet talk. I use an iPhone, and the App Store does have a few apps that can create a realistic talking effect from a still photo, but there were clearly some gaps out there. And then the idea of Talking Pets was born.

The problem - Computer vision is a fascinating field that most engineers would love to work in. The task at hand qualifies as a computer vision problem because it requires creating a sequence of manipulated images from the original still photo, merging them into a video, and adding a voice with some form of pitch modulation. The last two parts are fairly straightforward in the Apple ecosystem using the native APIs, so the real challenge was the first part. After some research I knew what I had to do. Technically, this would be achieved with a mesh deformation (or mesh displacement) technique, i.e. creating a mesh from the 2D photo of the pet's face, manipulating a specific region of interest of the mesh in and around the mouth/chin area (and the forehead), and then recreating the photo. Once the displaced cheek sample is created, a mouth shape is drawn using traditional drawing methods, and a slight Gaussian blur is applied to make the drawing look more realistic. Once a desirable sample is constructed, the above steps are repeated to generate multiple samples, or images, with different y-axis displacement values, and the samples are merged into a video using an asset writer. Muxing an audio recording with the video is pretty straightforward, just like the audio pitch control.
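To make the displacement idea concrete, here is a minimal, language-agnostic sketch in Python (the app itself would do this on the GPU via Apple's frameworks). It treats a grayscale image as a 2D list, shifts pixels vertically inside a mouth/chin region of interest with a smooth falloff so the deformation blends into the face, and generates a series of frames with oscillating amplitude to mimic a jaw opening and closing. The function names, the ROI format, and the sine-based falloff are my own illustrative choices, not the app's actual implementation; the mouth drawing and Gaussian blur steps are omitted.

```python
import math

def warp_rows(image, roi, amplitude):
    """Displace pixels vertically inside a region of interest (ROI).

    image: 2D list of grayscale values (rows of pixels).
    roi: (top, bottom, left, right) bounds of the mouth/chin region.
    amplitude: peak vertical displacement, in pixels, for this frame.
    """
    h = len(image)
    top, bottom, left, right = roi
    out = [row[:] for row in image]
    for y in range(top, bottom):
        for x in range(left, right):
            # Smooth falloff: strongest at the ROI centre, zero at its
            # edges, so the warped patch blends into the rest of the face.
            fx = math.sin(math.pi * (x - left) / (right - left))
            fy = math.sin(math.pi * (y - top) / (bottom - top))
            d = int(round(amplitude * fx * fy))
            src = min(max(y - d, 0), h - 1)  # clamp to image bounds
            out[y][x] = image[src][x]
    return out

def make_frames(image, roi, n_frames, max_amplitude):
    """One warped frame per step of an opening/closing mouth cycle."""
    frames = []
    for i in range(n_frames):
        # Oscillate the amplitude across the cycle to simulate jaw motion.
        a = max_amplitude * math.sin(math.pi * i / (n_frames - 1))
        frames.append(warp_rows(image, roi, a))
    return frames
```

In the real pipeline, each frame produced this way would then get the drawn mouth shape and blur applied before being handed to the asset writer.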

Here is a demo video of the same. It's not perfect by any means; it's a work in progress and something to look forward to a bit later.

You can download the Talking Pets app from here -