Generative Artificial Intelligence for Circuit Design Automation

A retrospective discussion of the poster “Capabilities of Generative Artificial Intelligence for Circuit Design Automation.”

Motivation

Generative Artificial Intelligence (GAI) has exploded into the public zeitgeist since the release of ChatGPT a few years ago. Companies, developers, and individuals are quickly seeking to exploit this valuable new technology. Our work sought to evaluate exactly how effective GAI is at producing RTL code for specifying circuits. One other work had established a GAI model fine-tuned specifically on Verilog code, but all of the other commercially available models had no specific training in Verilog-related concepts. Additionally, the dearth of high-quality, production-grade RTL leaves inferences by general-purpose GAI systems highly constrained. How, then, can someone adopt this technology into their synthesis pipeline in any trustworthy capacity?

Challenges of Closed Model Evaluation

For models where you don’t have access to the execution and inference in progress, the answer is “you can’t.” The synthesis capabilities of these systems are severely constrained by the fact that their output is of unknown quality. One of the primary goals of this work was to establish a correlation between prompts and the quality of their outputs; however, we struggled to quantify this performance on closed-source models because of their lack of explainability.

That explainability suffers in closed models is by no means a novel observation; but when the question being asked is ‘can I trust what I’m seeing?’ and no answer is forthcoming, what is an engineer to do? Our hypothesis was that prompts could be a way to ‘inject’ structure, and therefore some degree of explainability and trust, into the output of the LLMs being evaluated. Some of the literature suggests that directing LLMs to explicitly discuss their intermediate steps leads to more correct responses.
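As a purely illustrative example of what ‘injecting structure’ through a prompt might look like, the sketch below assembles a prompt that asks the model to narrate its intermediate steps before emitting any RTL. The step list and the counter specification are hypothetical, not the prompts used in this study.

```python
# Illustrative only: one way to build a "structured" prompt that asks the model
# to work through intermediate steps before writing any Verilog.

def build_structured_prompt(spec: str) -> str:
    steps = [
        "Restate the specification in your own words.",
        "List the module's inputs, outputs, and internal state.",
        "Describe the intended behavior clock cycle by clock cycle.",
        "Only then write the synthesizable Verilog module.",
    ]
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return (
        "You are writing synthesizable Verilog.\n\n"
        f"Specification:\n{spec}\n\n"
        "Work through the following steps explicitly before giving any code:\n"
        f"{numbered}\n"
    )

print(build_structured_prompt("An 8-bit up-counter with synchronous reset."))
```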

Our work found no specific correlation between prompt structure and the quality of the responses. Problems that are easier (objectively, requiring fewer tokens; subjectively, being discussed more often in the training data) should have been completed at higher rates than more difficult problems. This largely held in aggregate, but when examined more granularly, the specific capabilities of the models evaluated did not follow the trend.

Every one of these problems could have been controlled for and examined. It would have been difficult and time-consuming, but that’s research. The ultimate problem that halted this entire line of enquiry was the opacity of the models in question. Bard (as it was known at the time; it has since been renamed Gemini with the introduction of a new model) published no specific information about the pre-transformations, model versions, or inference parameters used to generate the solutions. If another researcher wanted to validate our work, would they be able to produce similar output at all? The signs all pointed to no. If results cannot be replicated, what value does examining these models from the consumer side provide? These are questions an AI researcher may well be qualified to answer, but an application-focused researcher such as myself lacks the time and resources to do so.

Challenges of Specification Benchmarking

Beyond the technical limitations imposed by the environments available for interacting with these systems, there are larger structural problems with benchmarking specification implementation, as I began to allude to in the previous paragraph. Specifications are, by definition, high-level descriptions of a system. There is room for interpretation. There is ambiguous language. The exact implementation of the specification could vary in any number of ways while still satisfying the spec.

How could this be evaluated? I had planned to apply an open-coding methodology to assess the quality of responses provided by GAI systems, evaluated in terms of how much effort would be required from a human overseeing these tools to ‘complete’ the design to specification. Developing strong inter-rater reliability (IRR) would obviously be paramount to this process, but I think it’s worth discussing more specifically what achieving that IRR would entail.
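To make the IRR requirement concrete: Cohen’s kappa is one common way agreement between two raters could be quantified. The sketch below is generic, with a hypothetical three-level ‘effort to complete’ rating; it is not the rubric or data from this work.

```python
# Generic Cohen's kappa between two raters; the ratings below are hypothetical.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    # Chance agreement, assuming each rater's label frequencies are independent.
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# How much human effort would each GAI-generated design need to reach spec?
a = ["none", "minor", "major", "minor", "none", "major"]
b = ["none", "minor", "minor", "minor", "none", "major"]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # 0.75 for this toy data
```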

To agree on how well a given piece of RTL code satisfies a specification, the rating team must collaboratively and iteratively come to a consensus about the design the specification expresses. This process is very natural for teams of humans, and I have no doubt that a team of knowledgeable hardware engineers could agree on the problem specifications for these simple RTL designs with little trouble; however, there is no guarantee that the GAI model would share this derivation of a design from the spec.

This points to a larger, more structural problem in this space. There are two problems baked into the one question this work was asking: what internal model of the design does the GAI construct, and how effectively is that design implemented in the generated response? This distinction is important in the AI space, as it asks fundamental questions about the boundaries of knowledge in highly specialized systems. Without answering these questions, any behavioral analysis of GAI algorithms will necessarily be anthropomorphic, because we lack a concrete understanding of the higher-level mechanisms within a GAI agent.

Quickly Becoming Stale

An additional interesting feature of the work presented here was how quickly it became outdated. As a review of the synthesis capabilities of generative AI tools, it was out of date by the time I got to the poster session. The speed at which new models are being developed makes benchmarking and evaluating any specific model for a task outside the standard AI benchmarks very difficult, in a way I hadn’t foreseen.

Any specific work about generative AI is clearly going to run into similar problems. If the frontier models are already outdated by the time a report reaches the scientific community, how can researchers create a valuable discussion? I think the clear answer is by providing structural, repeatable tools for evaluating performance on benchmarks rather than discussing the benchmark results themselves; however, this clearly answers different questions than those explored in a solely evaluative work.
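To sketch what such a structural, repeatable tool might look like: the outline below pins the details whose absence made our own results hard to reproduce (model identifier, prompt inputs, inference parameters) and scores each generated design against a fixed testbench. It assumes Icarus Verilog (iverilog/vvp) is installed and that each testbench exits non-zero on failure (e.g., via $fatal); the generate_rtl callable and the benchmark-case layout are placeholders, not a real API.

```python
# A rough outline of a repeatable evaluation harness, not a finished tool.
import json
import subprocess
import tempfile
from pathlib import Path

def simulates_correctly(rtl: str, testbench: Path) -> bool:
    """Compile the candidate RTL with its testbench under Icarus Verilog and
    report whether the simulation exits cleanly."""
    with tempfile.TemporaryDirectory() as tmp:
        dut = Path(tmp) / "dut.v"
        dut.write_text(rtl)
        sim = Path(tmp) / "sim.out"
        build = subprocess.run(["iverilog", "-o", str(sim), str(dut), str(testbench)])
        if build.returncode != 0:
            return False
        return subprocess.run(["vvp", str(sim)]).returncode == 0

def run_benchmark(cases, model_id, temperature, generate_rtl):
    """Record everything needed to rerun the experiment alongside each result."""
    results = []
    for case in cases:  # each case: {"name": ..., "spec": ..., "testbench": ...}
        rtl = generate_rtl(model_id, case["spec"], temperature)  # placeholder call
        results.append({
            "case": case["name"],
            "model": model_id,
            "temperature": temperature,
            "passed": simulates_correctly(rtl, Path(case["testbench"])),
        })
    return json.dumps(results, indent=2)
```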

The research presented in this discussion highlights the challenges and limitations of using GAI for circuit design automation. The lack of explainability and transparency in closed models hinders any ability to trust the quality of their outputs. Benchmarking the implementation of specifications poses its own difficulties because of the ambiguity and room for interpretation involved, and the rapid pace of model development further complicates evaluation, making it challenging to keep up with the latest advancements. To address these issues, it is crucial to focus on developing structural, repeatable tools for evaluating GAI performance on benchmarks. By approaching the problem from this meta-level, researchers can provide valuable insights and discussions that go beyond the specific models themselves.
