What if a free AI model could write game code, run the tests, and move on to the next task — all by itself?
That’s exactly what I tested this week. Three experiments, two failures, and finally a task completed end-to-end by an autonomous AI agent, with no human intervention, for zero dollars.
The setup
I’m working on a DragonRuby project: reproducing the Sonic 1 bonus stage (Megadrive) using lookup tables, at Game Boy resolution (160×144). The code is Ruby (mRuby), tests use dr_spec, and the workflow follows git flow.
For automation, I’m using:
- Ralph TUI: an AI agent loop orchestrator. It reads a task list (PRD), launches the agent, detects completion, and moves to the next task.
- OpenCode: an AI agent CLI (like Claude Code, but multi-model).
- MiniMax M2.5 free: a free AI model scoring 80.2% on SWE-bench — nearly on par with Claude Opus (80.8%).
The experiment
I gave Ralph a simple task: create a sin/cos lookup table module in fixed-point Q8 (integers, no floats — just like the Megadrive did). With specs, tests, and a commit.
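To make the task concrete, here is a minimal sketch of what a Q8 sin/cos lookup module can look like. The module name, table size, and method names are my own illustration, not the code M2.5 actually generated. Q8 stores a value v as the integer (v × 256), so 1.0 becomes 256; the table is built once at load time, and the game loop then does pure integer lookups, just like the Megadrive.

```ruby
module Trig
  TABLE_SIZE = 256 # one full turn split into 256 angle steps

  # Precomputed once at load time; Math.sin floats never reach the game loop.
  SIN_Q8 = Array.new(TABLE_SIZE) do |i|
    (Math.sin(2 * Math::PI * i / TABLE_SIZE) * 256).round
  end

  # angle is an integer 0..255; returns sin(angle) in Q8 (-256..256)
  def self.sin_q8(angle)
    SIN_Q8[angle & 0xFF]
  end

  # cosine is sine shifted a quarter turn (64 of the 256 steps)
  def self.cos_q8(angle)
    SIN_Q8[(angle + 64) & 0xFF]
  end
end
```

Masking with `& 0xFF` makes angles wrap for free, which is exactly the property that made these tables so cheap on 16-bit hardware.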
Attempt 1: the silent failure
Ralph runs 5 iterations. M2.5 writes the code. It’s clean. But:
- It never ran the tests (the DragonRuby binary lives outside the project directory, so it was never found)
- It never signaled completion (Ralph expects a specific marker: <promise>COMPLETE</promise>)

Result: 7 minutes wasted, 0 tasks completed.
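For readers unfamiliar with this pattern: a loop orchestrator typically just scans the agent’s output for a sentinel string. A minimal sketch of that check, my own illustration and not Ralph’s actual implementation:

```ruby
# Hypothetical completion check for an agent-loop orchestrator.
# If the agent never prints the sentinel verbatim, the loop can only
# give up on a timeout or iteration cap -- which is what happened here.
COMPLETE_SIGNAL = "<promise>COMPLETE</promise>".freeze

def task_complete?(agent_output)
  agent_output.include?(COMPLETE_SIGNAL)
end
```

The fragility is obvious from the sketch: the model has to emit the exact string, and nothing in the model itself knows that, unless the docs say so.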
Attempt 2: the debug loop
I fix the DragonRuby path in the docs. Ralph relaunches. M2.5 finds the binary, runs the tests… and discovers the test runner is broken. It spends 10 minutes debugging dragon_specs.rb (a file copied from another project with hardcoded requires pointing to files that don’t exist). I kill it.
Attempt 3: completion
I fix everything: the test runner, mRuby-incompatible syntax, non-existent matchers. I rewrite the PRD with ultra-precise instructions: numbered steps, full binary path, exact matcher list, mRuby pitfalls documented.
Result:
| | Exp 001 | Exp 002 | Exp 003 |
|---|---|---|---|
| Tasks completed | 0/3 | 0/3 | 1/1 |
| Duration | 7m28s | 10m+ (killed) | 2m23s |
| DR tests | Never ran | Crashed | 25 tests, exit 0 |
| COMPLETE signal | No | No | Yes |
| Cost | $0 | $0 | $0 |
What I learned
1. A free model can code, but you have to spoon-feed it the context
M2.5 produced correct Q8 code on the very first attempt. But it didn’t know:
- Where to find the test binary
- That mRuby doesn’t support the same syntax as standard Ruby
- Which test matchers actually exist in the framework
- How to signal Ralph that it’s done
Each of these points required a fix in the project docs.
2. The real work is the template, not the code
Across 3 experiments, I spent 80% of my time writing docs and 20% watching the AI code. It’s counterintuitive: you think AI will save you time, but the time shifts toward preparation.
The good news: template work is cumulative. Every error fixed in the docs prevents dozens of future failures.
3. The cost is unbeatable
Claude Opus costs ~$0.50–1.00 per run ($5/$25 per million tokens). M2.5 free = $0. Over dozens of Ralph iterations, the difference is massive.
And the code quality? Nearly identical. Both produce correct Q8, clean specs, and Sandi Metz-compliant methods.
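To be concrete about what “correct Q8” means (my own refresher, not either model’s output): multiplying two Q8 values gives a Q16 intermediate, so the product must be shifted back down by 8 bits, and getting that shift right without ever touching a float is the whole point.

```ruby
# Q8 stores v as v * 256, so (a*256) * (b*256) = a*b * 65536 -- a Q16
# result. Shifting right by 8 brings the product back to Q8.
def q8_mul(a, b)
  (a * b) >> 8
end

HALF = 128 # 0.5 in Q8
ONE  = 256 # 1.0 in Q8

q8_mul(HALF, HALF) # 0.5 * 0.5 = 0.25, i.e. 64 in Q8
q8_mul(ONE, 300)   # multiplying by 1.0 is the identity
```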
What’s next
I have 7 TDD issues lined up on the project. The goal: get Ralph to run the first 3 end-to-end, with no intervention. Then benchmark different free models on the same tasks.
Follow this series
This is the first post in a series about automating game development with AI agents. Coming up:
- Running Ralph on 3 chained tasks with dependencies
- Benchmarking different free models on identical tasks
- Building a rotating checkerboard Megadrive-style — entirely coded by an agent
You can find me on GitHub or follow this blog — every step is documented in detail, code included.