Imgcook: How are codes generated intelligently from design files in Alibaba?

Published in imgcook · 16 min read · Mar 4, 2020

Front-end intelligence is one of the four major technical directions of Alibaba's Front-end Committee, and people may wonder what the front end can do with AI, how to achieve it, and whether it will heavily impact the whole industry.

Based on pipcook, an open-source algorithm framework, we are launching our product, imgcook, a platform that automatically generates code from design files (Sketch, Photoshop, etc.). Pipcook is a JavaScript framework based on tfjs-node: we build machine learning pipelines in pipcook, and tfjs-node provides the algorithm and training capabilities.
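To make this concrete, here is a minimal sketch of what a pipcook pipeline definition can look like. The plugin package names and parameters below are illustrative assumptions rather than a verbatim imgcook pipeline; the point is that each stage (data collection, data access, model definition, training, evaluation) is filled by a pluggable package.

```typescript
// Illustrative pipeline definition; plugin names and params are assumptions,
// see the pipcook repository for the actual plugin list and config format.
const pipeline = {
  plugins: {
    dataCollect: {
      package: '@pipcook/plugins-image-classification-data-collect', // assumed name
      params: { url: 'http://example.com/design-draft-samples.zip' }, // hypothetical dataset
    },
    dataAccess: { package: '@pipcook/plugins-image-classification-data-access' },
    modelDefine: { package: '@pipcook/plugins-mobilenet-model-define' },
    modelTrain: {
      package: '@pipcook/plugins-image-classification-model-train',
      params: { epochs: 15 },
    },
    modelEvaluate: { package: '@pipcook/plugins-image-classification-model-evaluate' },
  },
};

export default pipeline;
```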

Focusing on the scenario of automatically generating code from design files, this article analyzes the topic from the perspectives of background, competitive products, and problem resolution.

Background Analysis

Machine learning is in full swing in the industry, and AI has become the consensus for the future. Kai-Fu Lee also pointed out in 'AI · Future' that nearly 50% of human work will be replaced by artificial intelligence within 15 years, especially simple and repetitive work. Moreover, white-collar work will be easier to replace than blue-collar work, since blue-collar work requires breakthroughs in robotics and in both software and hardware, whereas white-collar work can be replaced by software breakthroughs alone. Will our front-end "white-collar" work be replaced? When, and how much of it?

Looking back at 2010, software affected almost every industry, bringing prosperity to the whole software industry in the years since; in 2019, the software industry itself was affected by AI. For instance, Question-to-SQL has appeared in the DBA field: it generates SQL statements automatically when you ask a question about a domain. Meanwhile, TabNine, a machine-learning-based code completion tool, can assist in code generation. Moreover, in the design industry, the intelligent designer "Luban" has also been launched. What about the front-end field?
That brings us to a question we are all familiar with: how to automatically generate code from design files (Design2Code, or D2C for short). The Front-end Committee of Alibaba has focused on the direction of intelligence, and the current stage is to improve the efficiency of web development. We will try to eliminate simple and repetitive work, enabling web developers to focus on more challenging work!

Competitive Product Analysis

In 2017, a paper on generating code from images, Pix2Code, attracted attention in the industry. It describes how to generate source code directly from a design image with deep learning. Subsequently, similar ideas constantly emerged in the community. For instance, Microsoft AI Lab launched Sketch2Code in 2018, an open-source tool for converting sketches into code. At the end of the same year, Yotako drew attention as a platform for turning design drafts into code. Machine learning had officially caught front-end developers' attention.

Based on the analysis of competitive products, we can get the following inspirations:

  1. At present, the object-detection capability of deep learning on images is suitable for identifying reusable materials with larger granularity (module identification, basic component identification, and business component identification).
  2. A complete end-to-end model that generates code directly from images is highly complex, and the generated code is not reliable. To achieve higher quality, several sub-networks need to work together.
  3. When the model cannot reach the expected accuracy, hard-rule intervention on the design file can be used. On the one hand, manual intervention helps users get the desired results; on the other hand, these manual rule conventions are also high-quality samples, which can be used as training data to improve the model's recognition accuracy.

Problem Resolution

The goal of generating code from design files is to enable web developers to improve their efficiency and get rid of repetitive work. For a regular front-end developer, especially a client-side developer, the general workflow of daily work is as follows.

The general workload of web development mainly consists of view code, logic code, and front-end/back-end integration. Next, we will break down these goals and analyze them one by one.

View Code

In view-code development, HTML and CSS are generally written based on the design file. How can we improve efficiency here? When facing the repetitive work of UI view development, it is natural to think of packaging and reusing materials such as components and modules. Based on this idea, various UI libraries have accumulated, and there are even higher-level encapsulations such as visual website-building platforms. But reusable materials cannot cover all scenarios: there is a lot of personalized business logic and personalized views. Facing the problem itself, is it feasible to directly generate reliable HTML and CSS code?

To sum up, basically we are facing following problems:

  • Reasonable layout: converting absolute positions to relative positions, deleting redundant nodes, reasonable grouping, loop detection, etc.
  • Element self-adaptation: extensibility of the element itself, alignment between elements, maximum width, and fault tolerance of elements.
  • Semantics: multi-level semantics of class names.
  • CSS expression: background color, rounded corners, lines, etc.

The industry has been trying in this direction for a long time. The basic information of elements in a design file can be exported through a design-tool plug-in, but the problems remain: the requirements on the design file are high, and the generated code is hard to maintain. Let's continue to break down this core issue.

We are building an expert rule system for the layout algorithm. Yes, this part is better suited to a rule system at the current stage: for users, the layout algorithm needs to be close to 100% available, and most of the problems involved here are combinations of numerous attributes and values, where rules are currently more controllable.

However, when a problem is hard to solve with rules, we can use models to assist. For instance, we come across cases where we need to recognize groups and loops. Meanwhile, web developers often use existing UI libraries to build interfaces, so it is also important to recognize base components in design files. For these problems, we use pipcook to build an object-detection pipeline to train our models. Moreover, context-aware semantic recognition across elements is required, and this is exactly the kind of problem deep learning is solving. For example, if we want to recognize what an image in the design draft means, or why certain text was used in a certain place, we need image classification and text classification models, which are also built with pipcook on top of tfjs-node.
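As a rough illustration of the tfjs-node foundation such pipelines rely on, the sketch below builds a tiny image classifier for hypothetical design-draft image classes. The architecture, input size, and class names are illustrative only, not imgcook's production model.

```typescript
// A minimal tfjs-node sketch of the kind of classifier that sits underneath
// such a pipeline; shapes and classes are illustrative.
import * as tf from '@tensorflow/tfjs-node';

// Hypothetical semantic classes for images found in a design draft.
const CLASSES = ['icon', 'avatar', 'productImage', 'background'];

function buildImageClassifier(): tf.LayersModel {
  const model = tf.sequential();
  model.add(tf.layers.conv2d({ inputShape: [64, 64, 3], filters: 16, kernelSize: 3, activation: 'relu' }));
  model.add(tf.layers.maxPooling2d({ poolSize: 2 }));
  model.add(tf.layers.conv2d({ filters: 32, kernelSize: 3, activation: 'relu' }));
  model.add(tf.layers.maxPooling2d({ poolSize: 2 }));
  model.add(tf.layers.flatten());
  model.add(tf.layers.dense({ units: CLASSES.length, activation: 'softmax' }));
  model.compile({ optimizer: 'adam', loss: 'categoricalCrossentropy', metrics: ['accuracy'] });
  return model;
}

async function demoTrain(): Promise<void> {
  const model = buildImageClassifier();
  // Random tensors stand in for real cropped layer images and their labels.
  const xs = tf.randomNormal([32, 64, 64, 3]);
  const ys = tf.oneHot(tf.randomUniform([32], 0, CLASSES.length, 'int32'), CLASSES.length).toFloat();
  await model.fit(xs, ys, { epochs: 1 });
}

demoTrain();
```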

Logical code

Normally web development also includes logic code: data binding, dynamic effects (animations), and business logic. The part that can be improved is reusing dynamic effects and business logic code, which can be abstracted into basic components.

  • Data field binding: this is quite feasible. Candidate fields can be determined from the text or images in the design file, but the cost-benefit ratio is not high, because this is mostly business-specific rather than general logic.
  • Dynamic effects: the input here is the design file, and dynamic effects are delivered in various forms, some as animated GIF demos, some as text descriptions, or even orally. Generating animation code is better suited to visual generation tools; there is no reference for direct intelligent generation, so given the input-output ratio this is not a short-term problem.
  • Business logic: this part of the development is mainly based on the PRD, or even on logic dictated by the product manager. If we want to generate this logic code intelligently, there is too much input; specifically, we need to see which problems in this sub-field can actually be solved by intelligentization.

Thinking about generating logical code

Of course, the ideal plan is to learn from historical data, as in other fields such as poetry, painting, and music: given the PRD as input, new logic code would be generated directly. But can the generated code run directly without any error?

At present, although artificial intelligence is developing rapidly, the problems it can solve are still limited, and it is necessary to frame problems as the types it is good at solving: reinforcement learning is good at strategy optimization, while deep learning is better at computer vision, classification, and object detection.

For business logic code, the first thing that comes to mind is to use LSTM (Long Short-Term Memory) networks from NLP to obtain the semantics of function blocks. VS Code's intelligent code completion and TabNine use this strategy.

In addition, we found that intelligence can also help identify the location (timing) of logical points in the view and guess their semantics from the view.

To sum up, the advantages of intelligence at this stage are:

  • Analyzing and guessing the semantics of high-frequency function blocks (logical blocks) based on historical source code; in this way, code blocks can be recommended while editing code.
  • Guessing some reusable logical points from the design draft. For instance, to bind image or text data to the view, we can use NLP classification or image classification to recognize the contents of the elements.

Therefore, the problems that can currently be solved in business logic generation are relatively limited, especially when new business logic points appear with new logic orchestration; those references exist only in the PRD or in the product manager's mind. For business logic generation, the current strategies are as follows:

  • Field binding: deep learning is used to intelligently identify the semantic classification of text and images in the design draft, especially the text part (a simplified sketch follows this list).
  • Reusable business logic points: identified intelligently based on views. These include small logic points (a single expression, or a few lines of code that are generally not enough to encapsulate into a component), basic components, and business components.
  • New business logic that cannot be reused: structured (visualized) collection of PRD requirements is a difficult task and is still being explored.
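The field-binding idea can be sketched as follows: once a classifier has tagged a text node with a semantic label, the label is mapped to a data field and the static text is replaced with a binding expression. The label names, field paths, and binding syntax here are hypothetical, purely to illustrate the mechanism.

```typescript
// Hypothetical field-binding step: map classifier output to data fields.
interface TextNode {
  id: string;
  text: string;           // static text from the design draft, e.g. "$29.99"
  semanticLabel?: string; // e.g. "price", produced by the text classifier
}

const LABEL_TO_FIELD: Record<string, string> = {
  price: 'item.price',
  title: 'item.title',
  shopName: 'item.shopName',
};

function bindField(node: TextNode): string {
  const field = node.semanticLabel && LABEL_TO_FIELD[node.semanticLabel];
  // Fall back to the original static text when no reusable logic point is found.
  return field ? `{{${field}}}` : node.text;
}

// Example: a node classified as "price" becomes a binding expression.
console.log(bindField({ id: 'n1', text: '$29.99', semanticLabel: 'price' })); // {{item.price}}
```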

Summary

From the above analysis, we have described strategies to intelligently generate HTML + CSS + part of the JS + part of the data. This is the main process of D2C (design to code), and the product we developed from this idea is imgcook. In recent years, the third-party plug-in ecosystems of popular design tools (Sketch, PS, XD, etc.) have matured, and deep learning has developed rapidly, in some recognition tasks even approaching or exceeding human performance; these are the strong background for the birth and continuous evolution of D2C.

(Object detection 2014–2019 paper)

Technical solution

Based on the general analysis of the front-end intelligent development mentioned above, we have made an overview and architecture of the existing D2C intelligent technology system, which is mainly divided into the following three parts:

  • Recognition capability : the ability to identify the design file, intelligently analyzing it along multiple dimensions, including layers, basic components, business components, layout, semantics, data fields, and business logic. If the intelligent recognition is not accurate, human intervention is used to correct errors: on the one hand, highly available code is generated from low-cost intervention; on the other hand, the manual corrections can be used as samples for online training.
  • Expression capability : mainly outputs the data and connects to the engineering side:
    • Uses a DSL to produce a standard structured description (Schema2Code)
    • Handles project access through IDE plug-ins
  • Algorithm engineering : to better support the intelligence required by D2C, high-frequency capabilities are provided as services, mainly including data generation, data processing, and model services:
    • Sample generation: mainly processes the sample data from each channel and generates samples
(Summary layering of front-end intelligent D2C capabilities)

Throughout the project, we use the same data protocol specification (D2C Schema) to connect the different parts of the architecture shown above. This ensures that recognition results can be mapped to specific fields, and that the expression layer, through mechanisms such as the code-generation engine, can generate correct code.
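The article does not publish the D2C Schema itself, so the following is only an illustrative guess at the kind of node structure such a protocol might carry; it shows how each layer could attach its output to the same tree before the expression layer generates code.

```typescript
// Illustrative sketch only, not the official D2C Schema specification.
interface D2CNode {
  id: string;
  type: 'block' | 'image' | 'text' | 'component';
  rect: { x: number; y: number; width: number; height: number }; // from the design file
  style: Record<string, string | number>; // CSS-like properties
  smart?: {
    layerProtocol?: string; // manual convention, e.g. a loop marker
    semanticClass?: string; // produced by the semantic layer
    fieldBinding?: string;  // produced by the field binding layer
  };
  children: D2CNode[];
}
```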

Intelligent Identification Layer

In the entire D2C project, the core is the recognition capability. Its specific decomposition is as follows; subsequent articles in this series will focus on each of these layers.

  • Material identification layer : identifies materials in the image through image recognition, including module recognition, atomic-module recognition, basic-component identification, and business-component identification.
  • Layer processing layer : mainly separates the layers in the design file or image and, combined with the recognition results of the previous layer, sorts out the layer meta information.
  • Layer reprocessing layer : further normalizes the data from the previous layers.
  • Layout algorithm layer : converts absolute positioning into relative positioning and Flex layout (a simplified sketch appears after this list).
  • Semantic layer : uses the multi-dimensional features of the layers to give the generated code semantic expression.
  • Field binding layer : binds and maps the static data in the layers to the actual back-end data.
  • Business logic layer : generates business logic code through the business logic identifier and expresser.
  • Output engine layer : finally outputs the code processed by each layer through various DSLs.
(Technology layering of D2C identification ability)
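To illustrate the layout algorithm layer mentioned above, here is a deliberately simplified sketch that groups absolutely positioned siblings into rows by vertical overlap, so they can later be expressed as relative/Flex layout. The real engine handles far more cases (nesting, redundant nodes, loops, alignment, fault tolerance).

```typescript
// Simplified illustration of one step of a layout algorithm: row grouping.
interface Box { id: string; x: number; y: number; width: number; height: number }

function groupIntoRows(nodes: Box[]): Box[][] {
  const sorted = [...nodes].sort((a, b) => a.y - b.y);
  const rows: Box[][] = [];
  for (const node of sorted) {
    const last = rows[rows.length - 1];
    // Same row if the node vertically overlaps the previous row's first element.
    if (last && node.y < last[0].y + last[0].height) {
      last.push(node);
    } else {
      rows.push([node]);
    }
  }
  // Within a row, order children left to right, as a Flex row would.
  rows.forEach(row => row.sort((a, b) => a.x - b.x));
  return rows;
}
```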

Technical difficulties

Of course, incomplete recognition and low recognition accuracy have always been major topics for D2C, and they are also our core technical challenge. We try to analyze the causes of this problem from the following perspectives:

  1. Inaccurate problem definition : an inaccurate definition of the problem is the primary factor behind inaccurate model recognition. Many people think that samples and models are the main factors, but before that, the definition of the problem itself may already be wrong. We need to judge whether a model is suitable for the problem at all, and if so, how to define its rules clearly.
  2. Lack of high-quality datasets: the intelligent recognition capability of each layer depends on different datasets. How many front-end development scenarios do our samples cover? What is the data quality in each scenario? Are the data standards uniform, is the feature-engineering processing unified, are the samples free of ambiguity, and how well do they interoperate? These are the problems we are facing now.
  3. Low model recall and misjudgment: we often pile up many different kinds of samples from different scenarios as training data, hoping to solve all identification problems with one model. However, this often leads to low recall for some classes, and misjudgment also exists for classes with ambiguity.

Problem definition

At present, the computer vision models in deep learning are better suited to solving classification and object-detection problems. The premise for deciding whether to use a deep model for a recognition problem is whether we can judge and understand the problem ourselves and whether the problem is free of ambiguity; if we cannot judge it accurately, then the recognition problem may not be appropriate for a deep model.

If the problem is judged to be suitable for deep-learning classification, you then need to define all of the classes, and the definition needs to be rigorous, mutually exclusive, and completely enumerable. For example, when working on the semantic labeling of images: what are the common class names for common images? The analysis process is as follows (a purely illustrative label set is sketched after the list):

  • Step 1: find as many relevant design files as possible and enumerate the types of images that appear.
  • Step 2: reasonably summarize and classify the image types. This is the easiest place for controversy; a bad or ambiguous definition will cause problems for the model.
  • Step 3: analyze the features of each image type, and whether these features are typical and core feature points, because they determine the inference and generalization ability of the subsequent model.
  • Step 4: check whether data samples for each image type are available, and if not, whether they can be generated automatically; if samples cannot be obtained, the problem is not suitable for a model, and hard rules can be used instead to see the effect first.
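As a purely hypothetical example of what step 2 might produce, a label set for design-draft images could look like the following; the classes and feature notes are illustrative, not the taxonomy imgcook actually uses.

```typescript
// Hypothetical, exclusive and enumerable label set for image semantics.
const IMAGE_SEMANTIC_CLASSES = [
  { name: 'icon',         feature: 'small, simple shape, often monochrome' },
  { name: 'avatar',       feature: 'face or figure, usually square or circular crop' },
  { name: 'productImage', feature: 'object on a clean background, large area' },
  { name: 'decoration',   feature: 'background texture or atmosphere, low information' },
] as const;
```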

There are many such problems in D2C. The definition of the problem itself needs to be very accurate and scientifically grounded, which is relatively difficult because there is no precedent to reference; you can only try with known experience first and fix issues after user testing exposes them. This is a pain point that requires continuous iteration and improvement.

Sample Quality

To improve sample quality, we need to establish standard specifications for these datasets, build multi-dimensional datasets for different scenarios, and process and provide the collected data uniformly; we expect to establish a standardized data system.

We use the standard data format provided by pipcook. We provide a unified sample-evaluation tool for different problems (classification and object detection) to evaluate the quality of each dataset. For some specific models, feature engineering that works better (normalization, edge amplification, etc.) can be adopted, and we also expect that samples for similar problems will be able to circulate and be compared across different models in the future, to evaluate the accuracy and efficiency of different models.
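As an illustration only (not pipcook's official format), a single annotated object-detection sample in such a standardized dataset might be shaped like this; a uniform shape of this kind is what allows the same samples to circulate between different models for comparison.

```typescript
// Illustrative shape of one annotated object-detection sample.
interface DetectionSample {
  imagePath: string; // screenshot or exported artboard of a design draft
  width: number;
  height: number;
  annotations: Array<{
    label: string;                            // e.g. a basic component name
    bbox: [number, number, number, number];   // x, y, width, height in pixels
  }>;
}
```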

(Data sample engineering system)

Model

Regarding model recall and misjudgment, we try to converge scenarios to improve accuracy. Samples from different scenarios often share similar features, or a few key features distort local feature points and cause misjudgment, resulting in a low recall rate. We expect to improve model accuracy by converging the scenarios in which models are applied, and we converge on the following three: the wireless (mobile) marketing scenario, the mini-app scenario, and the PC scenario. Each of these scenarios has its own characteristic patterns, and designing a separate recognition model for each one can efficiently improve the recognition accuracy within that scenario.

(D2C scenario)

Thoughts of the process

Since a deep model is used, a very real problem is that the model cannot identify data whose features were not present in the training samples, and the accuracy rate cannot be 100% satisfactory to users. Besides improving the samples, what else can we do?

Throughout the D2C process, we follow a methodology for applying recognition models: design a set of protocols or rules that can cover the cases where deep learning gives wrong results. This ensures that users can still fulfill their requirements when model recognition is inaccurate: manual convention > rule policy > machine learning > deep learning. For example, suppose you need to identify a loop in the design draft:

  • At the beginning, a loop can be marked manually in the design file by convention.
  • Based on the context information of the layers, rule judgments can determine whether something is a loop body.
  • Machine learning on layer features can then be used to optimize the rules.
  • Finally, positive and negative samples of loops can be generated and learned by a deep learning model.

Among these, the manually agreed design-file convention has the highest priority, which ensures that subsequent processes are not blocked or disturbed by recognition errors.
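A hedged sketch of this "manual convention > rule policy > model" fallback for loop recognition might look like the following; the protocol marker and the model call are illustrative assumptions, not imgcook's actual implementation.

```typescript
// Illustrative fallback chain for loop recognition.
interface LayerNode { name: string; width: number; height: number; childCount: number }

async function detectLoop(
  siblings: LayerNode[],
  modelPredict?: (nodes: LayerNode[]) => Promise<boolean>,
): Promise<boolean> {
  // 1. Manual convention has the highest priority, e.g. a hypothetical "#loop#" marker in the layer name.
  if (siblings.some(n => n.name.includes('#loop#'))) return true;
  // 2. Rule policy: several siblings with near-identical size and structure.
  if (siblings.length >= 3) {
    const [first, ...rest] = siblings;
    const similar = rest.every(
      n =>
        Math.abs(n.width - first.width) < 2 &&
        Math.abs(n.height - first.height) < 2 &&
        n.childCount === first.childCount,
    );
    if (similar) return true;
  }
  // 3. Fall back to a learned model when rules are inconclusive.
  return modelPredict ? modelPredict(siblings) : false;
}
```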

Business Landing

2019 Double Eleven

After nearly two years of refinement, D2C completed its first closed-loop use in marketing-module development for the 2019 Double Eleven. This closed loop includes module creation, view code generation, logic code generation, supplementary writing of logic code, and debugging.

In the Double Eleven scenario, it covered the new modules of Tmall and Taobao across various scenarios; 31 modules were supported. About 79.34% of the code was generated by D2C, including automatically generated view code and some logic code, and 98% of simple modules were generated fully automatically. The main reasons for manual changes to the code were new business logic, animations, field-binding recognition errors, and loop recognition errors; these issues still need to be gradually improved.

(D2C code generation user changes)

Overall landing situation

As of 2019.11.09, the data is as follows:

  • Number of modules: 12,681, with about 540 newly added this week;
  • Number of users: 4,315, with about 150 new users added every week;
  • Number of teams: 24;
  • Custom DSLs: 109.

Currently, the services available are as follows:

Follow-up planning

  • Continue to reduce the requirements on the design file by improving the intelligent identification accuracy for grouping and loops and reducing the manual intervention cost on the design file.
  • Improve the accuracy of component identification; currently it is only 72%, so its availability for business applications is low.
  • Improve page-level and project-level restoration capabilities, which depend on the accuracy of page segmentation.
  • Improve page-level restoration for mini-apps and PC, and improve the overall restoration of complex forms, tables, and charts.
  • Improve the ability to generate code from static images so that it can be used in production.
  • Improve the algorithm-engineering products and diversify the sample-generation channels.
  • Open source.

In the future, we hope that through the front-end co-construction project we can use collective strength to make front-end intelligent technology solutions inclusive, accumulate more competitive samples and models, and provide services with higher accuracy and availability. We hope to reduce simple and repetitive work and let everyone focus on more challenging work!

Contact us

Github Page: https://github.com/taofed/imgcook
HomePage: https://www.imgcook.com/
Pipcook (front-end machine learning framework): https://github.com/alibaba/pipcook
