Enhancing ChatGPT’s Capabilities: Reading Images and Generating HTML Code from Sketches using an Image-to-DSL Model with ChatGPT

Anchen
5 min read · Apr 20, 2023

ChatGPT, developed by OpenAI, is a powerful language model that has shown remarkable capabilities in various natural language processing tasks. However, as a language model, ChatGPT cannot read or interpret image-based layouts directly, which limits its ability to perform visually dependent tasks.

To overcome this limitation and enhance ChatGPT’s capabilities, I have been experimenting with integrating a smaller auxiliary model, an Image-to-DSL model, that takes an image as input and generates Domain-Specific Language (DSL) code as output. The generated DSL code serves as an intermediate representation of the image’s layout, allowing ChatGPT to read and interpret the visual structure of the image.

By integrating the Image-to-DSL model with ChatGPT, we not only enable ChatGPT to process image-based inputs but also extend its capabilities to tasks like generating HTML code from sketches. In this blog post, I will show you how to build the Image-to-DSL model, train it on a dataset of images and corresponding DSL code, and combine it with ChatGPT to create a system capable of generating HTML code from image layouts.

Dataset Preparation:

In the past, I have explored using deep learning models for sketch-to-code tasks with the pix2code dataset, so I will reuse that dataset for this experiment. Note that the dataset may not be diverse enough to avoid overfitting. However, the purpose of this tutorial is to demonstrate the potential of tackling visual tasks by combining models, rather than to attain perfect results.

Model Architecture:

I used a pre-trained Vision Transformer (ViT) model as an image feature extractor and a pre-trained GPT-2 model as a decoder to generate DSL code. This combination allows us to leverage the power of both models for our sketch-to-code tasks.

Figure: the Image-to-DSL model (ViT encoder + GPT-2 decoder)

The combination of a pre-trained Vision Transformer (ViT) and a pre-trained GPT-2 model is a powerful choice for the sketch-to-code task, because each model brings unique strengths.

Vision Transformer (ViT):

The ViT model is a state-of-the-art image classification model that has demonstrated remarkable performance on a wide range of vision tasks. Unlike traditional convolutional neural networks, which process images locally, ViT processes images globally by dividing them into a fixed number of non-overlapping patches and treating each patch as a token. This approach allows the model to capture long-range dependencies and global layout information, which is crucial for understanding the structure of a sketch.

Moreover, the attention mechanism employed by ViT allows it to focus on specific shapes or elements in the image that are relevant to the task at hand. This ability to selectively pay attention to specific parts of the image is particularly useful in the context of the sketch-to-code task, as it allows the model to recognize and focus on important elements of the sketch, such as buttons, text, or other UI components.
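As a rough illustration, here is how patch-level features could be extracted from a sketch with a pre-trained ViT. This is a minimal sketch assuming the Hugging Face transformers library and the google/vit-base-patch16-224-in21k checkpoint; the file name and preprocessing are placeholders, and the notebook may handle this step differently.

```python
# Minimal sketch: extracting patch features from a sketch image with a pre-trained ViT.
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

image = Image.open("sketch.png").convert("RGB")   # hypothetical input sketch
inputs = processor(images=image, return_tensors="pt")

# Each 16x16 patch becomes a token; the encoder outputs one hidden vector per patch
# (plus the [CLS] token), capturing the global layout of the sketch.
features = encoder(**inputs).last_hidden_state
print(features.shape)  # e.g. (1, 197, 768) for a 224x224 input
```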

GPT-2:

GPT-2 is a powerful language model that has been trained on an extensive corpus of text data. As a result, it has a strong understanding of natural language and can generate coherent and contextually relevant text. This knowledge extends to the structure and syntax of Domain-Specific Languages (DSLs) like HTML, as they are also represented in text form.

By leveraging GPT-2’s understanding of DSLs, we can fine-tune the model to generate HTML code based on the image features extracted by the ViT model. The GPT-2 model can leverage its knowledge of language and DSLs to create syntactically and semantically correct code corresponding to the input sketch, while the ViT model provides the necessary visual context.
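In practice, the two models can be wired together with Hugging Face’s VisionEncoderDecoderModel, which adds cross-attention layers so the GPT-2 decoder can attend to the ViT image features. The checkpoint names below are assumptions for illustration; refer to the notebook for the exact configuration.

```python
# Sketch of combining a ViT encoder with a GPT-2 decoder into one model.
from transformers import VisionEncoderDecoderModel, GPT2Tokenizer

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # image encoder
    "gpt2",                               # text decoder, cross-attention added automatically
)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Tell the decoder which special tokens to use during generation.
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.eos_token_id
```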

Training the Image-to-DSL Model

The Image-to-DSL model can be trained on a consumer-grade GPU with some adjustments, such as reducing the batch size. For instance, it’s possible to train the model on an NVIDIA GeForce GTX 1060 6GB GPU. However, training time may be longer compared to using more powerful GPUs.
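As an illustration, a memory-friendly training setup might look like the following. This is a sketch assuming the Hugging Face Trainer API; the dataset and collator names are placeholders, and the actual hyperparameters in the notebook may differ.

```python
# Illustrative training setup for a 6 GB GPU: small batch size plus
# gradient accumulation to keep the effective batch size reasonable.
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="image2dsl-checkpoints",
    per_device_train_batch_size=2,      # small batch to fit in 6 GB of VRAM
    gradient_accumulation_steps=8,      # effective batch size of 16
    fp16=True,                          # mixed precision to save memory
    num_train_epochs=10,
    logging_steps=50,
    save_strategy="epoch",
)

trainer = Seq2SeqTrainer(
    model=model,                  # the ViT + GPT-2 model defined above
    args=training_args,
    train_dataset=train_dataset,  # placeholder: images paired with DSL strings
    data_collator=collator,       # placeholder: pads pixel_values and labels
)
trainer.train()
```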

To train the Image-to-DSL model, I have put together a detailed walkthrough in the form of an IPython notebook (thanks to ChatGPT for being very helpful with the code explanations). You can access the notebook via the link below; simply upload it to your Google Colab workspace. Colab’s free tier is sufficient for training the model and is a more cost-effective alternative to a personal GPU.

Image-to-DSL Notebook Link: https://github.com/mzbac/image2dsl/blob/main/VIT%26GPT2.ipynb
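Once trained, generating DSL code from a new sketch is a single generate call on the model. The snippet below is a minimal sketch that reuses the processor, model, and tokenizer from the earlier examples; the decoding parameters are illustrative and may differ from the notebook.

```python
# Minimal inference sketch for the trained Image-to-DSL model.
from PIL import Image

image = Image.open("sketch.png").convert("RGB")   # hypothetical input sketch
pixel_values = processor(images=image, return_tensors="pt").pixel_values

generated_ids = model.generate(
    pixel_values,
    max_length=512,        # long enough for a full-page DSL description
    num_beams=4,           # beam search for a more stable layout prediction
    early_stopping=True,
)
dsl_code = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(dsl_code)
```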

Integrating ChatGPT with the Image-to-DSL Model:

Integrating ChatGPT with the Image-to-DSL model is straightforward, but it is crucial to craft well-designed prompts for the ChatGPT API call so that the model correctly understands the DSL code and generates the desired output. This article won’t delve into the details of prompt engineering, as that would be a lot to cover; instead, a complete example is linked at the end of the article to demonstrate the practical implementation of this approach.
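To give a sense of what the integration looks like, here is a minimal sketch of the ChatGPT API call that turns the generated DSL into HTML. The prompt wording and model name are illustrative assumptions rather than the exact prompt used in the repository.

```python
# Sketch: asking ChatGPT to convert the generated DSL into HTML.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

messages = [
    {
        "role": "system",
        "content": (
            "You are a front-end developer. Convert the given layout DSL "
            "into valid HTML with sensible placeholder content, preserving "
            "the structure described by the DSL."
        ),
    },
    {"role": "user", "content": dsl_code},  # DSL produced by the Image-to-DSL model
]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=messages,
    temperature=0.2,
)
html_code = response["choices"][0]["message"]["content"]
```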

Testing:

Refer to this video demo for a more interactive illustration of the model’s capabilities:

Conclusion:

In this tutorial, we have demonstrated how to enhance ChatGPT’s capabilities to generate HTML code from sketches by integrating it with an Image-to-DSL model. We have observed the following:

  1. By providing structured DSL code to ChatGPT, the model can effectively generate HTML code with the correct layout and fill out placeholder content. This approach takes advantage of ChatGPT’s powerful language understanding and generation capabilities, making it a suitable choice for creating web pages from sketches.
  2. One limitation we encountered was that the Image-to-DSL dataset was not diverse enough, resulting in overfitting behavior. The model consistently tried to convert square blocks into title, description, and button structures. This limitation could be addressed by creating a more diverse dataset or by exploring data augmentation techniques.
  3. Despite low accuracy in the generated DSL code, ChatGPT was still able to produce reasonably good HTML output. By leveraging ChatGPT’s conversational ability, we can use it as an interactive model to iteratively update the HTML code until it matches the desired outcome, as sketched after this list. This approach is more user-friendly and intuitive than describing page layouts purely through prompts.
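For instance, a follow-up request could be appended to the same conversation to refine the generated page. This is an illustrative sketch that reuses the messages and html_code variables from the earlier integration example; the follow-up request itself is made up.

```python
# Sketch: iterating on the generated HTML through a follow-up message.
messages.append({"role": "assistant", "content": html_code})
messages.append(
    {"role": "user", "content": "Make the buttons blue and add a footer with contact links."}
)

followup = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=messages,
    temperature=0.2,
)
updated_html = followup["choices"][0]["message"]["content"]
```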

In summary, combining ChatGPT with an Image-to-DSL model offers a novel way to generate HTML code from sketches, providing a more interactive and accessible way for developers and designers to create web pages. With further improvements to the dataset and model architecture, this approach can become even more robust and accurate, making it a valuable tool for web development.

Resources:

For further exploration and reference, please find below the relevant GitHub links:

  1. ChatGPT integration with Image-to-DSL: https://github.com/mzbac/image2dsl
  2. Sketch2Design frontend UI: https://github.com/mzbac/sketch2design

These repositories contain the source code, examples, and instructions to set up and use the models described in this tutorial. Feel free to contribute, report issues, and share your own implementations with the community.
