What if a free AI model could write game code, run the tests, and move on to the next task — all by itself?

That’s exactly what I tested this week. Three experiments, two instructive failures, and finally: a task completed end-to-end by an autonomous AI agent, with no human intervention, for zero dollars.

The setup

I’m working on a DragonRuby project: reproducing the Sonic 1 bonus stage (Mega Drive) using lookup tables, at Game Boy resolution (160×144). The code is Ruby (mRuby), tests use dr_spec, and the workflow follows git flow.

For automation, I’m using:

The experiment

I gave Ralph a simple task: create a sin/cos lookup table module in fixed-point Q8 (integers, no floats, just like the Mega Drive did). With specs, tests, and a commit.
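For readers unfamiliar with Q8, here’s a minimal sketch of what such a module looks like. The module and method names are hypothetical (not the agent’s actual output); in Q8, the integer 256 represents 1.0, and the table is generated once at load time so that runtime lookups are pure integer reads:

```ruby
# Hypothetical Q8 trig lookup sketch (not the agent's actual code).
# Q8 fixed point: 256 == 1.0; one full turn = 256 angle steps.
module TrigQ8
  STEPS = 256  # angle resolution
  ONE   = 256  # Q8 scale factor

  # Floats are used only once, at table-generation time;
  # every runtime lookup is an integer array read.
  SIN = Array.new(STEPS) { |i| (Math.sin(i * 2 * Math::PI / STEPS) * ONE).round }

  def self.sin(angle)
    SIN[angle & (STEPS - 1)]  # bitmask wraps the angle into 0..255
  end

  def self.cos(angle)
    SIN[(angle + STEPS / 4) & (STEPS - 1)]  # cos(x) = sin(x + quarter turn)
  end
end
```

So `TrigQ8.sin(64)` (a quarter turn) returns 256, i.e. 1.0 in Q8, and a position on a circle is just `cx + (radius * TrigQ8.cos(angle)) / 256` — integer math all the way down.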

Attempt 1: the silent failure

Ralph runs 5 iterations. M2.5 writes the code. It’s clean. But:

Attempt 2: the debug loop

I fix the DragonRuby path in the docs. Ralph relaunches. M2.5 finds the binary, runs the tests… and discovers the test runner is broken. It spends 10 minutes debugging dragon_specs.rb (a file copied from another project with hardcoded requires pointing to files that don’t exist). I kill it.

Attempt 3: completion

I fix everything: the test runner, mRuby-incompatible syntax, non-existent matchers. I rewrite the PRD with ultra-precise instructions: numbered steps, full binary path, exact matcher list, mRuby pitfalls documented.

Result:

|                 | Exp 001   | Exp 002       | Exp 003          |
|-----------------|-----------|---------------|------------------|
| Tasks completed | 0/3       | 0/3           | 1/1              |
| Duration        | 7m28s     | 10m+ (killed) | 2m23s            |
| DR tests        | Never ran | Crashed       | 25 tests, exit 0 |
| COMPLETE signal | No        | No            | Yes              |
| Cost            | $0        | $0            | $0               |

What I learned

1. A free model can code — but you need to chew the work for it

M2.5 produced correct Q8 code on the very first attempt. But it didn’t know:

Each of these points required a fix in the project docs.

2. The real work is the template, not the code

Across 3 experiments, I spent 80% of my time writing docs and 20% watching the AI code. It’s counterintuitive: you think AI will save you time, but the time shifts toward preparation.

The good news: template work is cumulative. Every error fixed in the docs prevents dozens of future failures.

3. The cost is unbeatable

Claude Opus costs ~$0.50–1.00 per run ($5/$25 per million tokens). M2.5 free = $0. Over dozens of Ralph iterations, the difference is massive.

And the code quality? Nearly identical. Both produce correct Q8, clean specs, and Sandi Metz-compliant methods.
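To give a flavor of what “correct Q8” means in practice, here are the kinds of sanity checks the specs boil down to, written as plain-Ruby assertions (the project’s real specs use dr_spec, whose matcher syntax differs):

```ruby
# Plain-Ruby sanity checks for a Q8 sine table (illustrative, not dr_spec).
SCALE = 256  # Q8: 256 == 1.0
SIN = Array.new(256) { |i| (Math.sin(i * 2 * Math::PI / 256) * SCALE).round }

raise "sin(0) must be 0"            unless SIN[0] == 0
raise "sin(quarter turn) must be 1.0" unless SIN[64] == SCALE
raise "all values must stay in Q8 range" unless SIN.all? { |v| v.between?(-SCALE, SCALE) }
puts "all Q8 checks passed"
```

Checks like these are cheap to run and catch the classic fixed-point mistakes (off-by-one scale factors, unrounded truncation) regardless of which model wrote the table.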

What’s next

I have 7 TDD issues lined up on the project. The goal: get Ralph to run the first 3 end-to-end, with no intervention. Then benchmark different free models on the same tasks.

Follow this series

This is the first post in a series about automating game development with AI agents. Coming up:

You can find me on GitHub or follow this blog — every step is documented in detail, code included.