Code That Writes Code Needs Multi-Level Tests

Code that writes code is fun in a meta sort of way, but it requires extra testing.

We’ll start by considering generic systems and their tests, then look at testing for code generation. 

“I’d rather write programs to write programs than write programs.”

Dick Sites, quoted in More Programming Pearls

The Black-Box View

Consider a system as a black box: 

A system as a black box, with inputs and outputs

One way to test is to do system-level testing: for a given set of inputs, does the system produce the right outputs? 

A system-level test, comparing output to expected output

These tests have challenges, though:

  • It can be hard to create the expected output. Consider code that produces binary data: even with higher-level tools, it’s tedious to specify the values.
  • It can be hard to get the expected output right. Consider an optimizing compiler with multiple inter-related optimizations. Can we accurately predict their impact?
  • Multiple outputs may be acceptable. For a web page, we might care that something is centered, but not care how this is done. Many combinations of HTML and CSS could yield the right effect. Our “expected output” may be a family of possible outputs, and the “diff” must be ready for this.
  • Our system may be non-deterministic. Sometimes we can make it deterministic (e.g., seed a random number generator), but some results may be inherently non-deterministic. For example, you may be using threads to spin off work, and they may complete in different orders. We’re back to the problem of multiple acceptable outputs.
  • The tests tend to be relatively slow – many systems are slow to spin up.
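One way to cope with multiple acceptable outputs is to compare a normalized form rather than the raw output. Here is a minimal sketch, assuming the order of results doesn't matter; `square_all` and `same_items` are made-up names for illustration, not part of any real system.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def square_all(values):
    """Square each value on a worker thread; completion order is not guaranteed."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(lambda v: v * v, v) for v in values]
        # as_completed yields futures in the order they finish, which can vary.
        return [f.result() for f in as_completed(futures)]


def same_items(actual, expected):
    """Order-insensitive comparison: sort both sides before comparing."""
    return sorted(actual) == sorted(expected)


result = square_all([1, 2, 3, 4])
assert same_items(result, [1, 4, 9, 16])
```

The "diff" here accepts a whole family of outputs (any ordering of the same items), but it would still reject a missing or wrong value.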

Inside the System – Unit Tests

Systems are built from many smaller parts – lots more than this: 

Inside a system are many components

We can get more confidence in our system if we know its parts work. This could come from TDD, TCR, test-after, or whatever.  

Testing units is usually a smaller and simpler problem than testing the whole system, and smaller problems are easier to deal with. :) In effect, each unit (or cluster of them) is itself another system.

Code That Writes Code

Some systems produce their effect by converting a specification of some sort into code that is then run.

Examples include a compiler, a domain-specific language, a control panel, even an “AI” tool.  

A system with a spec as input and code as output

It’s like the first picture – but the generated code is a new system, itself amenable to testing!

Testing Code That Writes Code

You might test your generated code in several ways:

Unit Tests Pay Off Well

Using TDD, TCR, or heavy unit testing pays off well for code-writing programs. Situations that are tricky to set up in the full input can be more easily created in unit tests. Outputs that are hard to test in the full output may be much easier to explore with microtesting.
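As a rough illustration, suppose the generator has a small helper that emits a single comparison expression. The helper `emit_comparison` and its operator names are hypothetical, invented for this sketch. A microtest can poke at it directly, without building a full spec:

```python
def emit_comparison(field, op, value):
    """Hypothetical generator helper: emit one Python comparison expression."""
    allowed = {"eq": "==", "lt": "<", "gt": ">"}
    if op not in allowed:
        raise ValueError(f"unknown operator: {op}")
    # repr() quotes strings and leaves numbers bare.
    return f"record[{field!r}] {allowed[op]} {value!r}"


def test_string_values_are_quoted():
    assert emit_comparison("name", "eq", "Ann") == "record['name'] == 'Ann'"


def test_unknown_operator_is_rejected():
    try:
        emit_comparison("age", "between", 3)
    except ValueError:
        return
    raise AssertionError("expected ValueError")


test_string_values_are_quoted()
test_unknown_operator_is_rejected()
```

The rejected-operator case is the kind of situation that can be awkward to reach through a complete input, but is trivial to set up at this level.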

High-Level Testing May Be Needed

For complex systems, at least some system-level testing is needed. But, as mentioned above, it’s challenging to specify acceptable output. (Have you ever looked at generated code? It’s often… opaque.)

We can do a system-level test, looking at the generated code

We can run classic “diff” tests, provided we can specify the expected output.
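A "diff" test for generated code might look something like this sketch, which uses Python's standard `difflib` module. The generator `generate_greeter` is a toy stand-in invented for the example; a real generator's expected output would usually live in a file.

```python
import difflib


def generate_greeter(name):
    """Hypothetical generator: returns Python source for a greeting function."""
    return (
        f"def greet_{name}():\n"
        f"    return 'Hello, {name}!'\n"
    )


EXPECTED = """\
def greet_world():
    return 'Hello, world!'
"""


def diff_against_expected(actual, expected):
    """Return unified-diff lines; an empty list means the texts match."""
    return list(difflib.unified_diff(
        expected.splitlines(keepends=True),
        actual.splitlines(keepends=True),
        fromfile="expected",
        tofile="actual",
    ))


diff = diff_against_expected(generate_greeter("world"), EXPECTED)
assert diff == [], "".join(diff)
```

When the test fails, the assertion message carries the diff itself, which is usually the quickest way to see what changed.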

Check the Syntax

You may have noticed something about unit tests and system tests: they’re built around expectations. 

In this case, those expectations are code. If there’s something we’ve learned about writing code, it’s that it’s very hard to get it right the first time. 

That opens a scary possibility: we’re getting the code we expect, but it’s not what we want. 

One way to check our code is to use another tool to validate it – a syntax checker (possibly as part of a compiler). If our expected code is right, it certainly should pass that check. (I’ve seen samples of AI-generated code that fails this test.) 

Feed generated code to a syntax checker

In principle, this is a step forward, but it also has challenges – is the tool that checks the syntax perfectly conformant with the language spec, or does it have its own foibles?
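If the generated code happens to be Python, the standard library can do the syntax check for us. This is a minimal sketch using `ast.parse`; the helper name `syntax_error_in` and the two sample strings are made up for illustration.

```python
import ast


def syntax_error_in(source):
    """Return None if source parses as Python, else a short error description."""
    try:
        ast.parse(source)
    except SyntaxError as err:
        return f"line {err.lineno}: {err.msg}"
    return None


good = "def double(x):\n    return x * 2\n"
bad = "def double(x)\n    return x * 2\n"  # missing colon after the signature

assert syntax_error_in(good) is None
assert syntax_error_in(bad) is not None
```

Note that `ast.parse` accepts whatever syntax the running interpreter version accepts, so this check inherits that tool's view of the language.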

Does the Generated Code Work?

If I write a bunch of code, and it all compiles, does it necessarily work? Of course not. (You might even want to bet the other way.)

To really know that our code-writing code works, we want to make sure that the code it generates really works. 

Test the generated code with its "real" input and expected output

Now we need a whole new set of test cases, not in the form of our input language, but the inputs to our final running system. 
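Here is one way such a second-level test could look, sketched in Python under the assumption that the generated code is itself Python and is trusted enough to `exec`. The generator `generate_scaler` and the loader `load_function` are hypothetical names for this example.

```python
def generate_scaler(factor):
    """Hypothetical generator: emit source for a function that scales by `factor`."""
    return f"def scale(x):\n    return x * {factor!r}\n"


def load_function(source, name):
    """Compile and run generated source, then pull out one function by name."""
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace[name]


# The spec here is just the number 3; the test cases below are inputs
# to the *generated* function, not to the generator.
scale = load_function(generate_scaler(3), "scale")
cases = [(0, 0), (2, 6), (-1, -3)]
for given, expected in cases:
    assert scale(given) == expected
```

The test cases never mention the generated source text at all. They only care about what the generated function does.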

Combinations

To build code that writes code, we need multiple tests:

  • You’ll almost certainly benefit from unit level / microtests.
  • You probably want at least a few system-level tests for your code generation.
  • You’ll absolutely want the second-level system tests: you must make sure the whole system works, end-to-end-to-end. (As the generated code has to be compiled or interpreted, you’ll also get syntax checking from that process.)

We can see both levels of tests in one picture: 

Two-level checks: the generated code, and its effects

What About TDD?

I’ve long been an advocate of TDD, and I’d absolutely use it for the code-writing system.

But note that you can also use it on the generated system. Nothing says those tests have to appear from your head all at once; you may find it pays to develop those tests incrementally too.

Conclusion

Code that writes code creates a two-level system, and it’s important to test both levels. If you neglect testing the generated code, your users will have to test it for you.