Electrifying AI: Leveraging Generative AI in Hardware & Chip Design

Daniel Steinigen
Published in AILECTRIFY · May 7, 2024
Digital artwork of an electronic brain-shaped chip embedded on a detailed circuit board by Dall-E 3.

Generative AI and especially large language models (LLMs) are currently at the forefront of artificial intelligence research and application, opening up new possibilities in a range of fields. In the field of electronics engineering, generative AI can also be employed for a variety of tasks. LLMs demonstrate the ability to understand and generate computer languages, such as programming languages or markup languages. Furthermore, they can be adapted to other computer languages using dedicated fine-tuning techniques. Computer languages are also utilized in the domain of hardware and chip design. Hardware description languages (HDLs), such as VHDL or Verilog, are used to describe the structure and behavior of electronic circuits and are employed in the design of integrated circuits, including application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs).

LLM applications for Chip Design

This article explores the convergence of LLMs and the domain of hardware and chip design. The focus is on the potential applications and approaches of these AI models in enhancing productivity and efficiency in chip design. Generative AI offers promising solutions for improving Electronic Design Automation (EDA) algorithms and the entire chip design process. In particular, LLMs present an unprecedented opportunity to automate language-related chip design tasks. The most apparent task is code generation, in which the LLM is employed to generate code for chip designs, test benches, assertions, or scripts for EDA tools. In a question-answering setup, the LLM responds to technical questions about designs, tools, or infrastructures in natural language and explains complex design topics by understanding internal hardware designs. LLMs can also assist with bug triage by debugging design or tool issues based on logs or reports and finally generating bug summaries with technical details. Another potential task would be the analysis of data and the preparation of reports, including summaries that focus on various details and audiences.

Nowadays, LLMs are typically pre-trained on trillions of tokens (e.g., the Llama 3 model was trained on over 15 trillion tokens). Tokens are the basic units of data processed by LLMs and consist of a single word or parts of a word. The training data contains a significant amount and variety of source code, which is publicly accessible via GitHub, for instance. Through this and the emergent abilities learned during training, the pre-trained models already exhibit a basic understanding of programming, even for hardware description languages that are comparatively underrepresented in the training data. The figure below shows some hardware design tasks that were solved by prompting OpenAI's GPT-4 model.

Prompting a pre-trained LLM to solve hardware design tasks. User prompts are from Thakur et al.⁴, responses are generated with GPT-4 (Temperature=0).
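To make the notion of a token more concrete, the following minimal sketch tokenizes a single line of Verilog. It uses the openly available GPT-2 tokenizer from the Hugging Face transformers library as a stand-in for illustration; this is an assumption, and token counts differ between models such as Llama 3 or GPT-4.

```python
# Minimal illustration of how an LLM sees text as tokens, using the open GPT-2
# tokenizer as a stand-in (an assumption; other models tokenize differently).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

verilog_line = "module counter(input clk, input rst, output reg [7:0] count);"
tokens = tokenizer.tokenize(verilog_line)

print(len(tokens))   # number of tokens the model would process for this line
print(tokens[:8])    # the individual sub-word pieces
```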

Nevertheless, the reliability of the models and the quality of the answers are crucial, particularly in safety-critical application scenarios. This article examines advanced approaches aimed at increasing correctness beyond simply prompting off-the-shelf LLMs. It also considers the verification of actual performance, which is essential for the practical application of these models in real-world scenarios. The aim is to present a comprehensive view of how generative AI can be applied in the chip design process. To this end, the subsequent sections will examine various methodologies for enhancing the performance of the models.

Incorporating feedback from simulations

During the design process, the developed code is typically evaluated and verified through the use of test benches and simulations. These components can also be integrated into an LLM pipeline.

Figure 1: LLM Pipeline for incorporating feedback from simulations

In an iterative setup, as illustrated in Figure 1, the output from Verilog simulations can be combined with the interactive capabilities of LLMs to enhance the Verilog modules with each iteration. The Verilog modules generated by the LLM are compiled and then evaluated using a test bench and a simulator. The input to the LLM comprises the initial design prompt for a module and the context from compilation errors and debugging messages, which highlight discrepancies between the expected and actual outputs. This enables the LLM to enhance Verilog designs over multiple iterations by automatically identifying and rectifying compilation errors and functional bugs. In a different setup, it is also conceivable to have the evaluation conducted by a hardware design expert who provides interactive feedback to the LLM⁷ on the generated design, thereby optimizing it with regard to the specified requirements.
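A minimal sketch of such a feedback loop is shown below. It is not the AutoChip implementation discussed next, but an illustration of the idea under a few assumptions: `llm_complete()` is a hypothetical placeholder for whichever model API is used, the open-source Icarus Verilog tools `iverilog` and `vvp` serve as compiler and simulator, and a testbench `tb.v` is assumed to exist and print "FAIL" on mismatches.

```python
# Sketch of an iterative generate-compile-simulate loop as in Figure 1.
# llm_complete() is a hypothetical placeholder for an LLM API call; iverilog
# and vvp are the open-source Icarus Verilog compiler and simulator; tb.v is
# an assumed testbench that prints "FAIL" when outputs do not match.
import subprocess

def llm_complete(prompt: str) -> str:
    """Placeholder: call the LLM of choice and return the generated Verilog."""
    raise NotImplementedError

def compile_and_simulate(module_file: str, testbench_file: str) -> tuple[bool, str]:
    """Compile module and testbench, run the simulation, and return
    (success, tool output to be used as feedback for the next iteration)."""
    compile_res = subprocess.run(
        ["iverilog", "-o", "sim.out", module_file, testbench_file],
        capture_output=True, text=True)
    if compile_res.returncode != 0:
        return False, compile_res.stderr            # compilation errors as feedback
    sim_res = subprocess.run(["vvp", "sim.out"], capture_output=True, text=True)
    return "FAIL" not in sim_res.stdout, sim_res.stdout

design_prompt = "Write a Verilog module for an 8-bit synchronous counter with reset."
prompt = design_prompt
for iteration in range(10):
    verilog = llm_complete(prompt)
    with open("design.v", "w") as f:
        f.write(verilog)
    ok, feedback = compile_and_simulate("design.v", "tb.v")
    if ok:
        break                                       # compiles and passes the testbench
    # Feed the compiler/simulator messages back so the model can repair the design
    prompt = (f"{design_prompt}\n\nPrevious attempt:\n{verilog}\n\n"
              f"Tool feedback:\n{feedback}\nPlease fix the reported issues.")
```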

In a paper published by Thakur et al.¹, the authors describe an application of this approach, which they call AutoChip. For their experiments they used the HDLBits dataset which contains both problem descriptions and testbenches. Their findings indicate that incorporating context from compiler tools, such as Icarus Verilog, improves the efficacy of the generation, resulting in 24% more functionally correct Verilog code when compared to no feedback. The iterative feedback enabled the generation of valid Verilog that passed tests within four iterations and achieved an 89% success rate after 10 iterations.

Domain-specific adaptation of LLMs

In addition to the utilization of proprietary LLMs, such as GPT-4 or Claude 3, the use of open-source models, including Llama 3 and Mixtral, is also a viable option. Open-source models can be deployed on-premises, avoiding the security risks associated with sending proprietary chip design data to third-party LLMs via APIs. Furthermore, these models can be adapted to specific domains in order to enhance their efficacy in specialized applications, such as chip design. Figure 2 depicts two distinct training techniques for customizing the LLMs. A pre-trained foundation model can be further trained with the Domain-Adaptive Pre-Training (DAPT) method using a few billion tokens from chip design documents and code. The resulting domain-specific foundation model can then be aligned with supervised fine-tuning (SFT) to specific tasks or human behavior using domain-specific instructions. The data necessary for training may be comprised of a combination of natural language datasets, including hardware specifications, documentation, and data sheets, as well as hardware-related code, such as software components, register transfer level (RTL) code, and test benches.

Figure 2: Domain-specific adaptation of LLMs with amount of training data and computational effort
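As a rough illustration of the DAPT stage in Figure 2, the following sketch continues causal language model training on a domain corpus using the Hugging Face transformers library. The model name, corpus file, and hyperparameters are illustrative assumptions; in practice this step runs on billions of tokens across many GPUs.

```python
# Rough sketch of Domain-Adaptive Pre-Training (DAPT): continue the causal
# language modelling objective on unlabeled domain text (specs, RTL, docs).
# Model name, corpus file, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token        # Llama tokenizers have no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Unlabeled domain corpus: plain text gathered from specifications, RTL code,
# test benches, and internal documentation (hypothetical file name).
corpus = load_dataset("text", data_files={"train": "chip_design_corpus.txt"})["train"]
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dapt_checkpoints",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           num_train_epochs=1),
    train_dataset=tokenized,
    # mlm=False selects the next-token (causal LM) objective used by DAPT
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
trainer.train()
```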

It is also noteworthy that DAPT is an unsupervised method, whereby the data can be used in an unlabeled format. In contrast, SFT requires a labeled dataset consisting of instructions with the corresponding responses. There are also advanced fine-tuning techniques, such as Parameter-Efficient Fine-Tuning (PEFT) and, as part of it, the Low-Rank Adaptation (LoRA) approach. This approach freezes the pre-trained model weights and injects small trainable low-rank matrices into the model layers, enabling efficient fine-tuning for downstream tasks.
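A rough sketch of such a LoRA setup with the Hugging Face peft library is shown below: the base model's weights stay frozen and only small adapter matrices are trained. The model name and target modules are illustrative assumptions and need to match the chosen open-source model.

```python
# Rough sketch of a LoRA fine-tuning setup with the Hugging Face peft library:
# the base model's weights stay frozen and only small low-rank adapter matrices
# are trained. Model name and target_modules are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                 # rank of the adapter matrices
    lora_alpha=32,                        # scaling factor for the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters

# From here, a standard instruction-tuning loop (e.g. transformers.Trainer on
# domain-specific instruction/response pairs) trains only the adapter weights.
```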

In a paper published by NVIDIA, Liu et al.² describe the use of domain adaptation of LLMs for chip design to train domain-specific models, which they call ChipNeMo. The results demonstrated that Domain-Adaptive Pre-Training of language models can lead to superior performance in domain-related downstream tasks compared to the base models, without any degradation in generic capabilities. Their ChipNeMo-70B has been demonstrated to outperform the highly capable model GPT-4 on the tasks of engineering assistant chatbot and EDA script generation, while exhibiting competitive performance on bug summarization and analysis. Thakur et al.³ fine-tuned different pre-trained LLMs on Verilog datasets sourced from GitHub and a broad search of 70 textbooks about the Verilog HDL. The study revealed that fine-tuning resulted in LLMs demonstrating enhanced capabilities for producing syntactically correct code, with an overall improvement of 25.9%. Notably, only 11.9% of the completions generated by pre-trained LLM models were found to compile, in contrast to 64.6% of those produced by fine-tuned LLM models. In terms of the functional correctness of the designs, the approach resulted in an increase of only 6.5% compared to the original LLMs. In the paper by Liu et al.⁵, the authors constructed a synthetic dataset of problem-code pairs for SFT through a bootstrapping process involving code descriptions generated by LLMs. The authors also propose the benchmark VerilogEval, which has been tailored specifically for evaluating LLM performance in the context of Verilog code generation for hardware design, ranging from simple combinational circuits to complex finite state machines.

Providing precise context through retrieval

LLMs are typically trained using the Causal Language Modeling (CLM) approach, where the model learns to predict the next token in a sequence based on the previous tokens. This encourages the model to capture language patterns, grammatical structures, and semantic relationships, but not necessarily to generate factually correct text continuations. As a result, LLMs can generate inaccurate text, known as hallucination, which is particularly problematic in hardware design tasks where accuracy is critical. So-called Retrieval-Augmented Generation (RAG), visualized in Figure 3, incorporates fact-based background knowledge into the generation of the answer, providing LLMs with precise context for user queries. A semantic search is used to select relevant passages from a specialist database, which are then included in the prompt along with the question. This encourages the LLM to ground its answer in the retrieved facts and generate more accurate responses.

Figure 3: RAG approach for incorporating fact-based background knowledge into LLM answer generation
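A minimal sketch of the retrieval step in Figure 3 could look as follows. The passages, the embedding model (from the sentence-transformers library), and the placeholder LLM call are illustrative assumptions; production systems typically use a vector database and a domain-adapted retriever instead.

```python
# Minimal sketch of the retrieval step: embed documentation passages, select
# the most similar ones for a question, and prepend them to the LLM prompt.
# Passages and the embedding model are illustrative; llm_complete() is a
# hypothetical placeholder for the actual model call.
import numpy as np
from sentence_transformers import SentenceTransformer

passages = [
    "The APB interface of the timer block uses a 32-bit data bus.",
    "Register TIMER_CTRL bit 0 enables the counter; bit 1 selects the clock source.",
    # ... further passages from specifications, data sheets, and design documents
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
passage_vecs = embedder.encode(passages, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k passages most similar to the question (cosine similarity)."""
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    scores = passage_vecs @ q_vec
    return [passages[i] for i in np.argsort(scores)[::-1][:k]]

question = "How is the timer counter enabled?"
context = "\n".join(retrieve(question))
prompt = (f"Answer the question using only the context below.\n\n"
          f"Context:\n{context}\n\nQuestion: {question}")
# answer = llm_complete(prompt)  # hypothetical LLM call, as in the earlier sketch
```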

The authors of ChipNeMo² created a RAG benchmark with 88 questions on a collection of 1.8k documents, including architecture, design, and verification specifications, as well as test bench regression and build infrastructure documentation. Among their findings, RAG shows significant improvement in grounding the model to the context of a particular question. They also observed a significant 30% improvement in retrieval hit rate when fine-tuning a pre-trained retrieval model with domain data over a pre-trained state-of-the-art retriever. Nevertheless, retrieval still struggles with queries that do not map directly to passages in the document corpus, or that require more context than is present in the passage. Unfortunately, these queries are also more representative of the queries that engineers will ask in real situations. One way to address this issue is to use a domain-adapted language model (see previous section) for RAG, which significantly improved the answer quality on their domain-specific questions.

Formal Verification

With the increasing complexity of chip designs, test benches and simulations may no longer be sufficient to verify the correctness of the generated designs. Test benches typically employ constrained-random or directed tests that are unable to cover the entire state space. More powerful techniques are required to identify and address potential corner-case bugs. Formal verification is a mathematical and algorithmic solution that exhaustively verifies the design with all possible combinations of legal input values by exploring the entire state space. Formal verification is therefore more reliable than simulation-based verification, as it ensures that the implementation aligns with the design specification rather than merely testing the design functionality.
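Industrial formal verification operates directly on the RTL with dedicated model-checking and equivalence-checking tools. The sketch below only illustrates the underlying principle with the Z3 SMT solver: instead of simulating sampled test vectors, the solver searches the complete input space for any assignment where implementation and specification disagree.

```python
# Illustration of the exhaustive nature of formal verification using the Z3 SMT
# solver (not an RTL formal tool): prove that a gate-style implementation of a
# 2:1 multiplexer matches its behavioural specification for ALL possible input
# values, instead of checking a handful of simulated vectors.
from z3 import BitVec, BitVecVal, If, Solver, sat

a, b = BitVec("a", 8), BitVec("b", 8)
sel = BitVec("sel", 1)

# Behavioural specification: out = sel ? a : b
spec = If(sel == 1, a, b)

# Gate-level style implementation: out = (a & mask) | (b & ~mask)
mask = If(sel == 1, BitVecVal(0xFF, 8), BitVecVal(0x00, 8))
impl = (a & mask) | (b & ~mask)

# Ask the solver for ANY input assignment where spec and impl disagree.
s = Solver()
s.add(spec != impl)
if s.check() == sat:
    print("Counterexample found:", s.model())
else:
    print("Proven equivalent for all 2^17 input combinations (8 + 8 + 1 bits).")
```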

In their work, Gadde et al.⁶ employed formal verification to examine the existence of Common Weakness Enumerations (CWEs) in hardware designs generated by LLMs. CWE is a community-developed categorization system for common software and hardware weaknesses that can affect the safety and security of applications. In conducting their experiments, the authors utilized a self-generated dataset of SystemVerilog RTL code with three complexity levels, focusing on 10 CWEs. The findings indicated that approximately 60% of the hardware designs generated by pre-trained LLMs are prone to CWEs, which could potentially lead to safety and security risks. To mitigate this, designers using LLMs should provide a detailed description of the specification to enhance the probability of generating high-quality RTL code. They also observed that most LLMs are not aware of hardware CWEs. Domain-specific training of the model, as described earlier in this article, could also be beneficial in this case.

Conclusion

It has been demonstrated that LLMs are already capable of generating hardware designs using hardware description languages. For relatively simple problems, the models exhibit satisfactory performance; however, they encounter difficulties when confronted with more complex tasks. The utilization of pre-trained models requires prompt engineering with specific prompting techniques, as this has a significant impact on model performance. Prompt engineering techniques were not addressed in detail in this article; however, other approaches with an even greater influence on model performance were examined, including iterative generation, domain-specific training, and fact-based retrieval. It was also demonstrated that double-checking by an experienced engineer and reliable verification are essential for safety-critical applications. However, LLMs are already capable of assisting hardware engineers and can accelerate the chip development process.

In Blocklove et al.⁷, the authors also observed in their case study that the LLM produces errors in both the specification and the implementation, requiring the intervention of an experienced hardware designer. However, the use of the LLM can facilitate a rapid exploration and iteration of the design space, thereby enhancing the productivity of the hardware engineer. In their paper the authors present a case study in which a hardware engineer engages in an interactive co-architectural process with GPT-4 to develop a novel 8-bit accumulator-based microprocessor architecture according to real-world hardware constraints. The authors then proceeded to send the processor to tapeout, which suggests that their study resulted in the world's first fully AI-written HDL for tapeout.

References

[1] Thakur, Shailja, et al. "AutoChip: Automating HDL Generation Using LLM Feedback." arXiv preprint arXiv:2311.04887 (2023).

[2] Liu, Mingjie, et al. "ChipNeMo: Domain-Adapted LLMs for Chip Design." arXiv preprint arXiv:2311.00176 (2023).

[3] Thakur, Shailja, et al. "Benchmarking Large Language Models for Automated Verilog RTL Code Generation." 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2023.

[4] Thakur, Shailja, et al. "VeriGen: A Large Language Model for Verilog Code Generation." ACM Transactions on Design Automation of Electronic Systems (2023).

[5] Liu, Mingjie, et al. "VerilogEval: Evaluating Large Language Models for Verilog Code Generation." 2023 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2023.

[6] Gadde, Deepak Narayan, et al. "All Artificial, Less Intelligence: GenAI through the Lens of Formal Verification." arXiv preprint arXiv:2403.16750 (2024).

[7] Blocklove, Jason, et al. "Chip-Chat: Challenges and Opportunities in Conversational Hardware Design." 2023 ACM/IEEE 5th Workshop on Machine Learning for CAD (MLCAD). IEEE, 2023.
