Part 2. Sketch3D: Creating 3D Objects from Sketches

Matthew Prasad Burruss · Published in The Startup · Apr 26, 2020 · 15 min read

This is the second part of a three-part series on Sketch3D, an application for sketch-based authoring in an augmented reality (AR) environment that turns 2D sketches into 3D objects.

What Was Covered in Part 1?

  • Introduction to the Unity Editor (an AR/VR development tool)
  • Using the Vuforia AR Engine to detect Image Targets and anchor virtual content
  • Setting up the application for deployment on an Android device
  • A quick start through GitHub (if Unity and Vuforia are already installed)

What’s Covered in Part 2?

  • Communication between a Python server and an AR Android application written in C#
  • Computer vision techniques to process 2D sketches
  • Dynamic rendering of 3D objects in the Unity environment
  • and of course… Code!

What’s Covered in Part 3?

  • The basics of Facebook’s PyTorch library, a deep learning toolbox, for training a 2D segmentation network
  • The heavy data augmentation used to create a dataset suitable for annotation segmentation
  • Training and testing the U-Net 2D segmentation network in the Google Colab environment
  • Deployment and integration of the segmentation network into Sketch3D

Resources

  • GitHub link: Sketch3D project repository
  • Google Colab: A Google Colab notebook (.ipynb) to train the segmentation network

Section 1. Designing a Simple Python Server

The project is separated into two parts: a Python server and the Android client application written in C# using the Unity development platform. The purpose of the Python server is to process the sketch captured by the camera of the Android phone and to send the necessary information to construct a 3D object back to the client. When the target image is detected, the client constructs the 3D object and renders it in a mixed reality environment viewable on the phone. Let’s begin with a very simple HTTP Python server to introduce the client/server architecture.
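Below is a minimal sketch of such a server using Python 3’s built-in http.server module. The IP address, port, and placeholder responses are illustrative only; the full implementation lives in server.py in the repository.

```python
# A minimal sketch of the HTTP server; the IP address, port, and placeholder
# responses are illustrative only (see server.py for the real implementation).
from http.server import BaseHTTPRequestHandler, HTTPServer


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Route GET requests based on the request path
        if self.path == "/data":
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(b"{}")  # placeholder payload
        else:
            self.send_response(404)
            self.end_headers()

    def do_POST(self):
        # Read the request body (e.g., an image) and acknowledge it
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        print("Received {} bytes at {}".format(len(body), self.path))
        self.send_response(200)
        self.end_headers()


if __name__ == "__main__":
    # Replace with the IP address of your machine
    server = HTTPServer(("192.168.1.100", 8080), Handler)
    server.serve_forever()
```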

You can replace the IP address field in the main method with the IP address of your device (the IP address can be found by running the terminal command ifconfig on Linux/macOS or ipconfig on Windows). The Handler class defines custom behavior for incoming HTTP requests: in the example above, do_GET() handles GET requests and do_POST() handles POST requests. The self.path instance variable of our Handler instance can be used to route each request to the appropriate logic based on its path.

The API is quite simple: the HTTP GET /data request retrieves the information necessary to construct the 3D object, and the HTTP POST request sends an image to the server for processing. Note: the HTTP GET /annotation request will be discussed in Part 3, when annotation support is introduced.

Section 2. Processing Sketches

The processing routines are encapsulated in a class called Processor. Upon receiving an image from the Android application, the server uses a Processor instance to maintain the state of the application and perform the necessary computer vision tasks. Sketch3D works by having the user take a picture of the side and the front of the 3D object to construct. When an image is sent to the server, the Processor instance keeps track of the type of image (“Side” or “Front”) and any useful information computed during the processing pipeline, such as the locations of the corners of the sketch.

High-Level Overview of Processing Pipeline

The general processing pipeline to convert a sketch into a 3D virtual object is as follows:

  1. The Android application sends an image to the server via HTTP POST request.
  2. The Processor instance crops the image to the region of interest.
  3. The Processor instance looks for any annotations with defined behavior and removes the annotations from the image (Part 3 discusses the annotation feature).
  4. The Processor instance extracts the corners of the sketch.
  5. The Processor instance constructs a list of faces, where each element of the list is a set of 3D points that define the vertices of that face.
  6. The server encodes and sends the information to construct the 3D object to the client through the HTTP GET /data request.

Receiving and Cropping an Image File

The server handles the POST request that contains the image file (PNG) captured by the AR Android application.
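A minimal sketch of such a handler is shown below. It assumes the PNG bytes form the request body and that the crop rectangle arrives as custom header fields; the field names here are placeholders, and the exact encoding used by server.py may differ.

```python
# A sketch of the image-receiving handler; the header field names are
# hypothetical placeholders (the real encoding lives in server.py).
from http.server import BaseHTTPRequestHandler

import cv2
import numpy as np


class ImageHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        png_bytes = self.rfile.read(length)

        # Decode the PNG bytes into a NumPy matrix (a BGR image)
        image = cv2.imdecode(np.frombuffer(png_bytes, dtype=np.uint8),
                             cv2.IMREAD_COLOR)

        # Hypothetical crop coordinates determined by the client UI
        x1 = int(self.headers.get("Crop-X1", 0))
        y1 = int(self.headers.get("Crop-Y1", 0))
        x2 = int(self.headers.get("Crop-X2", image.shape[1]))
        y2 = int(self.headers.get("Crop-Y2", image.shape[0]))

        # NumPy slicing performs the crop: rows are y, columns are x
        cropped = image[y1:y2, x1:x2]

        # ...hand `cropped` to the Processor here...
        self.send_response(200)
        self.end_headers()
```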

The image is cropped after being read into a NumPy matrix using OpenCV. The cropping coordinates are provided as additional information in the POST request and are determined by the client user interface.

Figure 1. AR Android Application UI. The region of interest is graphically indicated by the light-gray box.

Clicking the button in the upper-left or upper-right corner captures an image of the side or the front of our 3D object, respectively; a screenshot of the user interface and the bounding-box coordinates are then sent to the server. The coordinates used by the cropping function are those of the bounding box shown in light gray on the user interface (Figure 1).

Extracting the Corners of the Sketch

Now that we have isolated the sketch in our region of interest, we want to extract its corners so that we can reconstruct the polygon.
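A sketch of this corner-extraction step is shown below. The parameter values (maximum corner count, quality level, minimum distance) are illustrative; the full version lives in process.py.

```python
# A sketch of corner extraction; the goodFeaturesToTrack parameters are
# illustrative values, not necessarily those used in process.py.
import cv2
import numpy as np


def extract_corners(cropped):
    # Convert the cropped sketch to grayscale
    gray = cv2.cvtColor(cropped, cv2.COLOR_BGR2GRAY)

    # Detect up to 25 strong corners in the grayscale image
    corners = cv2.goodFeaturesToTrack(gray, maxCorners=25,
                                      qualityLevel=0.1, minDistance=10)
    corners = corners.reshape(-1, 2).astype(np.int32)

    # Keep only the corners on the convex hull, ordered clockwise
    hull = cv2.convexHull(corners, clockwise=True)
    return hull.reshape(-1, 2)
```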

In the code, the first step is to convert the image to grayscale. The OpenCV goodFeaturesToTrack() function is used to detect corners in the image. Then, the convexHull() function finds the smallest convex polygon that encloses the set of points extracted in the previous step. In mathematical terms, a convex hull is a shape with no interior angles greater than 180 degrees. You can imagine this function as tightening a closed loop around the detected corners so long as the loop never trends inwards or forms a ‘V’; any points touching this loop are kept and the rest are discarded. This is useful for several reasons. First, any points detected inside the sketch are ignored by this function, because touching them would cause a ‘V’, that is, an interior angle greater than 180 degrees. Furthermore, convexHull() returns the points in a clockwise orientation, which is useful because Unity requires a polygon’s points to be specified in clockwise winding order relative to the direction from which the polygon is viewed in the virtual environment.

The winding order must be specified because it determines whether an object is being viewed from the front or the back. This requirement is a bit confusing at first, but it simply means that in Unity, defining a polygon that can be viewed from any angle requires two sets of vertices: one in clockwise winding order and the other in counter-clockwise winding order. If only one is specified, viewing the polygon from the other side results in an invisible polygon (weird!). It’s not how things work in real life, but in our virtual environment these are the rules.

Creating 3D Faces

Let’s say at this point the client has sent over two pictures, a sketch of the front and a sketch of the side of the 3D object to create, and the server has detected the corners of both. The Android application now detects the Image Target and wants to construct the 3D object. To do so, the server must construct a set of faces that define the 3D object using only the vertices of the front and side faces. This process is described at a high level below.

First, the vertices are normalized across the width and height of the input sketch. Next, a z coordinate is added to each point to create 3D points. Then, the front face is rotated by 90 degrees around the z-axis using a rotation matrix. Next, the back face is extruded and the points of the front, side, and extruded back faces are matched together. Finally, the remaining sides of the 3D object are constructed and a list is returned, where each element of the list is a set of points defining one face of our 3D object. Each point is also scaled by a constant factor so that the object is rendered at a realistic size in our mixed reality environment.
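The first few of these steps might look like the sketch below (normalization, promotion to 3D, and a rotation matrix about the z-axis); the face matching, extrusion, and scaling logic is more involved and lives in process.py.

```python
# A sketch of the early steps only; the helper names are illustrative, and
# the face matching, extrusion, and scaling are handled in process.py.
import numpy as np


def normalize(corners, width, height):
    """Scale pixel corners into the [0, 1] range."""
    corners = np.asarray(corners, dtype=np.float64)
    return corners / np.array([width, height])


def to_3d(corners_2d, z=0.0):
    """Append a constant z coordinate to each 2D point."""
    corners_2d = np.asarray(corners_2d, dtype=np.float64)
    z_col = np.full((len(corners_2d), 1), z)
    return np.hstack([corners_2d, z_col])


def rotate_about_z(points_3d, degrees=90):
    """Rotate 3D points about the z-axis by the given angle."""
    theta = np.radians(degrees)
    rot = np.array([[np.cos(theta), -np.sin(theta), 0],
                    [np.sin(theta),  np.cos(theta), 0],
                    [0,              0,             1]])
    return points_3d @ rot.T
```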

As an aside, the back face is extruded by considering the scale of the side sketch and its number of corners. For example, if the side has 4 corners, then the relative size of the back face to the front face is the relative size of the front 2 corners to the back 2 corners. If the side has 3 corners, then the back face is a single point. Fewer than 3 or more than 4 corners is not supported by Sketch3D.

Sending over the 3D Faces

Finally, we are at the last step of the processing pipeline. At this point, we have a list of faces. All we have to do now is send this list to the Android client and let it deal with constructing the object. To send the list over, a custom encoding method is defined in our GET /data request handler.
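A stand-in sketch of this handler is shown below: it uses JSON for simplicity, whereas the project defines its own encoding, and processor.ready() and processor.build_faces() are hypothetical helpers standing in for the Processor’s actual interface.

```python
# A stand-in for the GET /data handler; JSON replaces the project's custom
# encoding, and processor.ready() / processor.build_faces() are hypothetical.
import json


def send_face_data(handler, processor):
    # Called from the Handler's do_GET when self.path == "/data"
    if processor.ready():  # both the front and side sketches were processed
        faces = processor.build_faces()
        payload = json.dumps({
            "faces": [[[float(c) for c in point] for point in face]
                      for face in faces]
        })
        handler.send_response(200)
        handler.send_header("Content-Type", "application/json")
        handler.end_headers()
        handler.wfile.write(payload.encode("utf-8"))
    else:
        # Missing data: the client will alert the user
        handler.send_response(500)
        handler.end_headers()
```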

The server sends a 200 response code if both the side and front sketches have been successfully processed, and a 500 if either one has not. The Android application, which will be discussed next, alerts the user when a 500 code is received.

You did it! That’s all for the server. The detailed implementations of each function discussed can be found in the GitHub project in the process.py and server.py files. Now, onto the client!

Section 3. The AR Android Application

This section covers the basics of the AR Android application. The application is created in the Unity game engine using the Vuforia developer package, which provides some handy AR computer-vision functionality. The application is written in C#, as required by Unity. The major functions of the AR application are outlined below.

  • Defining an interactive user-interface (buttons to capture the photos, alerts of errors, etc.)
  • Communication with the server (sending images of the sketch via POST requests and receiving data through GET requests)
  • Reconstruction of a 3D object in our mixed reality environment

Note: Part 1 discussed setting up the AR application, such as defining an Image Target, defining the empty mesh used to create a 3D object, and so on. This part is purely code-based; however, if the Unity environment is not populated with the correct objects and configurations, the code will not work.

Case Study: Constructing a Cube

The user interface has three states: when the image target is not detected, when the image target is detected, and when there is an error. Let’s walk through a simple example that hits all of these states: constructing a cube.

To construct a cube, a photo of a square is sent as both the side and the front face. Let’s imagine that, before capturing either face, the user points the phone’s camera at the target image. They will be presented with the alert in Figure 2, informing them that the server does not have enough information to construct the 3D object.

Figure 2. An alert if there is missing data.

Now, let’s assume that the user captures an image of the side and the server successfully extracts its four corners. The user now wants to take an image of the front square. Figure 3 shows what the UI looks like at this point. The side button is “green”, indicating that the side corners are known by the server, and the front button is “red”, indicating that the user must still capture an image of the front sketch. If the “red” button is clicked in the example UI shown in Figure 3, the contents of the gray box will be sent to and processed by the server. If the “green” button is clicked, the information defining the side face will be overridden.

Figure 3. An example screenshot of the user interface while collecting the sketches used to construct a 3D object. The side button is green, indicating that a side image has been successfully processed. The user can click the red button to capture an image of the front face; in this case, a square with a “B” in the center will be sent to the server for processing. The annotation capability will be discussed in Part 3.

Assuming that both sketches have now been successfully processed, upon detecting the target image, the 3D object is constructed as shown in Figure 4. The user can then interact with the object in three different ways: scaling, translation, and rotation. For each manipulation, the user specifies along which axis (X, Y, or Z) to perform the operation. In the case of scaling, the entire object can also be scaled up or down.

Figure 4. The constructed augmented reality cube

Also, you may have noticed the cube has a blue side! That is due to the annotation recognition system, a feature that will be described in Part 3. More on that later!

Communication with the Server

Now that we have covered an example, let’s dive into parts of the client code. As previously mentioned, the client must be able to use an HTTP GET request to receive data from the server and an HTTP POST request to send image data to the server. The client captures an image of the screen of the AR application and sends it to the server through an HTTP POST.
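A sketch of this step using Unity’s UnityWebRequest API is shown below. The server URL and form field names are placeholders, and the bounding-box coordinates that the real application attaches are omitted.

```csharp
using System.Collections;
using UnityEngine;
using UnityEngine.Networking;

// A sketch of the screenshot-and-POST step; the URL and field names are
// placeholders, and the gray-box coordinates are omitted for brevity.
public class SketchUploader : MonoBehaviour
{
    private const string ServerUrl = "http://192.168.1.100:8080/";

    // whichFace is "Side" or "Front"
    public void SendSketch(string whichFace)
    {
        StartCoroutine(CaptureAndPost(whichFace));
    }

    private IEnumerator CaptureAndPost(string whichFace)
    {
        // Wait until rendering has finished so the screenshot is complete
        yield return new WaitForEndOfFrame();

        // Read the screen into a texture and encode it as a PNG
        var tex = new Texture2D(Screen.width, Screen.height, TextureFormat.RGB24, false);
        tex.ReadPixels(new Rect(0, 0, Screen.width, Screen.height), 0, 0);
        tex.Apply();
        byte[] png = tex.EncodeToPNG();
        Destroy(tex);

        // Attach the PNG; the real application also sends the gray-box coordinates
        var form = new WWWForm();
        form.AddBinaryData("image", png, whichFace + ".png", "image/png");
        form.AddField("type", whichFace);

        using (UnityWebRequest request = UnityWebRequest.Post(ServerUrl, form))
        {
            yield return request.SendWebRequest();
            if (request.isNetworkError || request.isHttpError)
                Debug.LogError("Upload failed: " + request.error);
        }
    }
}
```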

The client also performs an HTTP GET /data request to receive the list of faces from the server and process it into a usable format. If there is an error, the Button.cs class is called to alert the user, as seen in Figure 2.
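A sketch of this request is shown below; decoding of the face list is omitted, and the Button.cs alert is represented by a log message.

```csharp
using System.Collections;
using UnityEngine;
using UnityEngine.Networking;

// A sketch of the GET /data request; parsing of the custom face encoding is
// omitted, and the Button.cs alert is replaced with a log message.
public class FaceDataClient : MonoBehaviour
{
    private const string ServerUrl = "http://192.168.1.100:8080/";

    public IEnumerator GetData()
    {
        using (UnityWebRequest request = UnityWebRequest.Get(ServerUrl + "data"))
        {
            yield return request.SendWebRequest();

            if (request.responseCode == 500)
            {
                // The server is missing a sketch: alert the user (see Figure 2)
                Debug.LogWarning("Not enough data to construct the 3D object.");
                yield break;
            }

            // The response body holds the encoded list of faces
            string body = request.downloadHandler.text;
            Debug.Log("Received face data: " + body);
            // ...decode `body` into a list of faces here...
        }
    }
}
```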

Constructing the Augmented 3D Object

In general, the AR application performs the following operations to construct the AR object:

  1. Detection of the image target
  2. An HTTP GET /data request
  3. Decoding the list of corners to construct a mesh
  4. Adjustment of the position of the 3D object

Vuforia is used to detect the target image by overriding the OnTrackableStateChanged() function and monitoring the Status variable.
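A sketch of this pattern, based on Vuforia’s ITrackableEventHandler interface, is shown below; exact class and method names vary between Vuforia versions, and the GetData()/CreateMesh()/UpdatePos() calls are replaced with placeholders.

```csharp
using UnityEngine;
using Vuforia;

// A sketch of the image-target handler; the GetData/CreateMesh/UpdatePos
// calls are represented by placeholder log statements.
public class TargetHandler : MonoBehaviour, ITrackableEventHandler
{
    private TrackableBehaviour trackableBehaviour;

    void Start()
    {
        trackableBehaviour = GetComponent<TrackableBehaviour>();
        if (trackableBehaviour)
            trackableBehaviour.RegisterTrackableEventHandler(this);
    }

    public void OnTrackableStateChanged(TrackableBehaviour.Status previousStatus,
                                        TrackableBehaviour.Status newStatus)
    {
        if (newStatus == TrackableBehaviour.Status.DETECTED ||
            newStatus == TrackableBehaviour.Status.TRACKED ||
            newStatus == TrackableBehaviour.Status.EXTENDED_TRACKED)
        {
            // Image target found: fetch the face data and build the object here
            Debug.Log("Image target detected");
        }
        else
        {
            // Image target lost: destroy any constructed 3D object here
            Debug.Log("Image target lost");
        }
    }
}
```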

Once the image target is detected, the data is received using the GetData() function, which invokes the HTTP GET /data request. The CreateMesh() and UpdatePos() functions add the object to our mixed reality environment and adjust its position so that it is viewable near the image target. Once the image target is no longer detected, any 3D object that was constructed is destroyed, allowing a new virtual object to be created dynamically.

Arguably the most important function is CreateMesh(), which creates our 3D object. In order to define a 3D face in Unity, three things must be specified: the vertices of the face, a set of triangles defining how the face is filled in, and the normals of the triangles, which determine the direction from which the face is rendered. Luckily, the function Mesh.RecalculateNormals() can be used to calculate the normals after the vertices and triangles properties of our new Mesh object have been set.

If you remember, each element in our face list contains a set of points that define a particular face by giving its 3D vertices in clockwise order. However, to construct the mesh, we also have to specify the triangles and normals. For example, in the case of a square, our face element has four vertices. We need at minimum two triangles, defined by splitting the square along one of its diagonals. We then construct a list of six indices, where each group of three indices defines one triangle in clockwise order. Each normal can then be inferred from this ordering. If this sounds like a lot, it is! Let’s simplify it.
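Concretely, that square face could be built like this (a hypothetical standalone example, not taken from the project code):

```csharp
using UnityEngine;

// A concrete version of the square example: four vertices, two triangles
// (six indices), and recalculated normals.
public class SquareFace : MonoBehaviour
{
    void Start()
    {
        var mesh = new Mesh();

        // Four vertices of a unit square in the X-Y plane
        mesh.vertices = new Vector3[]
        {
            new Vector3(0, 0, 0),
            new Vector3(0, 1, 0),
            new Vector3(1, 1, 0),
            new Vector3(1, 0, 0)
        };

        // Two triangles (six indices), split along one diagonal; the index
        // order sets the winding, and therefore which side is rendered
        mesh.triangles = new int[] { 0, 1, 2, 0, 2, 3 };

        // Normals are inferred from the winding order
        mesh.RecalculateNormals();

        gameObject.AddComponent<MeshFilter>().mesh = mesh;
        gameObject.AddComponent<MeshRenderer>().material =
            new Material(Shader.Find("Standard"));
    }
}
```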

To make life easier, we encapsulate the creation of triangles in a Triangulator class with a Triangulate() method.
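A simplified stand-in for such a class is shown below: a fan triangulation that assumes the polygon is convex and its vertices are ordered, which holds for the output of convexHull. The project’s Triangulator may handle the general case differently.

```csharp
using System.Collections.Generic;
using UnityEngine;

// Simplified stand-in for the Triangulator: fan triangulation of a convex,
// ordered 2D polygon. Returns an index list usable as Mesh.triangles.
public class SimpleTriangulator
{
    private readonly Vector2[] points;

    public SimpleTriangulator(Vector2[] points)
    {
        this.points = points;
    }

    public int[] Triangulate()
    {
        var indices = new List<int>();
        // Fan out from vertex 0: (0, 1, 2), (0, 2, 3), ...
        for (int i = 1; i < points.Length - 1; i++)
        {
            indices.Add(0);
            indices.Add(i);
            indices.Add(i + 1);
        }
        return indices.ToArray();
    }
}
```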

Triangulate() is used to extract the triangle list from a set of 2D points. For each 3D face, we can project its points onto two dimensions, extract the triangle list, and then re-project the points back into three dimensions. This is done inside createMesh(). Not going to lie, the createMesh() function is heavily involved, so we will go through it at a higher level.

In our face list, the first two elements define the front and back faces of our 3D object. As a bit of bookkeeping, I stored the number of vertices as the last element in this list. It could also be calculated by dividing the length of the original vertex list by 3, but sometimes it’s hard to remember why you make the choices you do :)

The outermost foreach loop goes through each list of vertices in our list of faces. We then construct a list of 2D vertices, because our triangulation method only works on 2D polygons. The front and back faces are projected onto the X and Y axes and the side faces onto the X and Z axes.

This list of 2D vertices is passed into the Triangulator class and the vertices are then re-projected back into 3D. This is done by simply adding the deleted dimension from the previous step.

Next, an empty mesh is created; its vertices property is assigned this list of 3D vertices, and its triangles property is the list returned by the Triangulate() function of the Triangulator instance. We then call RecalculateNormals() and RecalculateBounds() to properly initialize the last parts of the mesh. The second call ensures that the bounding volume is correctly computed.

To add the 3D object to our virtual scene, we first find the empty mesh asset that was created in Part 1 using GameObject.Find("newObject") and add a new child object to it. A few properties of the child object, including its material and its color, can be changed, but most importantly we must specify the parent-child relationship using child.transform.parent = gameObject.transform.
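Put together, attaching a single face might look like the sketch below; the child object name and the material setting are placeholders.

```csharp
using UnityEngine;

// A sketch of attaching one face mesh to the empty "newObject" created in
// Part 1; the "face" name and material are placeholders.
public static class FaceBuilder
{
    public static void AddFace(Vector3[] vertices3D, int[] triangles)
    {
        var faceMesh = new Mesh();
        faceMesh.vertices = vertices3D;   // 3D points of this face
        faceMesh.triangles = triangles;   // indices from the Triangulator
        faceMesh.RecalculateNormals();    // infer normals from the winding order
        faceMesh.RecalculateBounds();     // recompute the bounding volume

        // Find the empty parent object and attach the face as a child
        GameObject parent = GameObject.Find("newObject");
        var child = new GameObject("face");
        child.AddComponent<MeshFilter>().mesh = faceMesh;
        child.AddComponent<MeshRenderer>().material = new Material(Shader.Find("Standard"));
        child.transform.parent = parent.transform;
    }
}
```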

You’ll see that some of the code in the lower half of createMesh() is repeated, mainly making another mesh with the same coordinates and texture, followed by manipulation of that new mesh’s normals and triangles. You may be wondering what the purpose of this is. In short, it’s a trick that sacrifices a bit of time for simplicity.

Remember the whole thing about specifying the winding order of vertices so that you can view an object in Unity? Well, this part of the function makes a copy of the previously calculated mesh, inverts its normals, and reorders the triangles. Thus, we will see the face rendered whether we view it from the front or the back. This is necessary because one of the annotations that will later be incorporated into Sketch3D is the ability to make a hollow object by adding an “H” to the sketch. If we didn’t perform this step of adding an inverted mesh, it would look like we were missing some of the faces when looking through the hollow object.
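A sketch of that inversion step (flipping each normal and reversing the winding of each triangle) might look like this, assuming the source mesh already has its normals calculated:

```csharp
using UnityEngine;

// A sketch of the "inverted copy" trick: flip the normals and reverse the
// triangle winding so the face is also visible from the other side.
public static class MeshInverter
{
    public static Mesh Invert(Mesh source)
    {
        var inverted = new Mesh();
        inverted.vertices = source.vertices;

        // Flip every normal (assumes the source mesh already has normals)
        Vector3[] normals = source.normals;
        for (int i = 0; i < normals.Length; i++)
            normals[i] = -normals[i];

        // Reverse the winding order of each triangle by swapping two indices
        int[] triangles = source.triangles;
        for (int i = 0; i < triangles.Length; i += 3)
        {
            int tmp = triangles[i];
            triangles[i] = triangles[i + 1];
            triangles[i + 1] = tmp;
        }

        inverted.triangles = triangles;
        inverted.normals = normals;
        return inverted;
    }
}
```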

And that’s it! In summary, we iterate through the list of faces, create two meshes to represent the front and back of each face, and add the meshes as a child of our otherwise empty 3D object!

Section 4. Example Creations

Now that we have it all working, we can deploy the application to an Android device and start the server! The GitHub repository describes the exact steps for deployment, so I’m just going to show some sample results below.

  • A pyramid (side = triangle, front = square)
  • A prism with a scaled-down back (side = trapezoid, front = triangle)
  • A cylinder (side = rectangle, front = circle)

Section 5. What’s Next

In the final installment of the Sketch3D series, I will show how to add annotations to manipulate the objects that are created. Annotations are a great solution for sketch-based manipulation because they can easily be incorporated into a sketch and can be mapped to complex actions. For example, we can use annotations to manipulate the color of our object as shown in the green pyramid below.

Annotations can also be mapped to more complex manipulations, such as hollowing out the object, like the hollowed-out, extruded pentagon below.

To add this feature, we have to perform annotation detection and classification. However, one apparent issue is that annotations may interfere with our corner detection algorithm and affect the integrity of the constructed virtual object. To fix this, our mechanism for detecting annotations should be precise enough that the annotation can also be removed. To accomplish simultaneous detection, classification, and removal of the annotations, a segmentation network is trained to assign each pixel of the sketch a label that corresponds to a predicted annotation. Part 3 will discuss how this annotation segmentation network is trained and deployed in Sketch3D, because ultimately the benefit of sketch-based authoring lies in the ability to quickly customize the 3D virtual objects.

Section 6. Closing Words

This article discussed the core of Sketch3D and the major components of the client/server architecture that allows for the dynamic creation of 3D virtual content from 2D sketches. If there are any comments, questions, or concerns, feel free to reach out in the comments below. Cheers!
