So, you know how everyone's benchmarking LLMs these days? Like, "oh look, this model scored 87.3% on this synthetic test suite" or whatever. Boring, right? I mean, who cares if a model can reverse a string perfectly when you'll never ask it to reverse a string in real life? 😛
That's why I built RaceBench — a benchmarking tool that actually tests something useful: how well different LLMs write real, working JavaScript code. Not some "hello world" nonsense, but actual creative coding tasks — games, visualisations, interactive stuff you might actually want to build.
The idea (because every project needs an origin story)
One day I was curious: if I ask different AI models to write the same game, which one would actually produce playable code? Not "technically correct but crashes on the third frame" code, but stuff you could run and enjoy.
So I created a benchmark where I ask models to build a 2D scroll shooter game using Three.js. Not exactly "easy mode" for an AI — it requires understanding 3D graphics, game loops, user input, and making everything work together without blowing up the browser. Spoiler: some models handle it beautifully, others... well, let's say it's educational to watch them try 😊
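To give you a feel for what the models are up against, here's the bare skeleton such a game needs: scene, camera, renderer, input handling, and a loop. This is just an illustration of the moving parts, not the benchmark prompt and not any model's output:

```javascript
// The bare minimum a model has to get right: scene, camera, renderer, input,
// and a game loop. Illustration only, not the benchmark prompt or a model's answer.
// Assumes "three" resolves via an import map or bundler.
import * as THREE from "three";

const scene = new THREE.Scene();
const camera = new THREE.OrthographicCamera(-8, 8, 6, -6, 0.1, 100);
camera.position.z = 10;

const renderer = new THREE.WebGLRenderer();
renderer.setSize(window.innerWidth, window.innerHeight);
document.body.appendChild(renderer.domElement);

// The player's ship: a humble box standing in for glorious pixel art.
const ship = new THREE.Mesh(
  new THREE.BoxGeometry(1, 1, 1),
  new THREE.MeshBasicMaterial({ color: 0x44ccff })
);
scene.add(ship);

// User input: track which keys are currently held down.
const keys = {};
window.addEventListener("keydown", (e) => (keys[e.key] = true));
window.addEventListener("keyup", (e) => (keys[e.key] = false));

// The game loop: move, then render, every frame, without blowing up the browser.
function tick() {
  if (keys.ArrowUp) ship.position.y += 0.1;
  if (keys.ArrowDown) ship.position.y -= 0.1;
  renderer.render(scene, camera);
  requestAnimationFrame(tick);
}
tick();
```

Now imagine generating all of that, plus enemies, bullets, collisions and scoring, in one shot from a prompt. That's the test.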
What it actually does
RaceBench is a static HTML dashboard that displays results from testing various LLMs on JavaScript code generation. Here's the cool part:
- Real-world benchmarking — we test creative coding tasks, not synthetic "reverse this string" nonsense
- Executable results — every model's output can be run directly. See the code, click preview, judge for yourself
- Multi-dimensional analysis — not just "did it work?" We track prompt tokens, completion tokens, TTFT (time to first token), generation time, and success rate
- Educational value — compare how different models structure their code, their approaches to the same problem, and learn from the variety (even the bad ones!)
How it works (the technical bits, minus the boring parts)
The setup
Everything's static — pure HTML/CSS/JS with no backend. Deploy it anywhere. GitHub Pages? Sure. Netlify? Why not. Your Raspberry Pi at home? Absolutely. The whole thing weighs about as much as a small hamster 🐹
```bash
git clone https://github.com/DimitriGilbert/racebench.git
# Deploy to GitHub Pages (or your static host of choice)
# I mean, whatever floats your boat
```
The test case
The benchmark uses a 2D scroll shooter game built with Three.js as the test scenario. Each model gets the same prompt and has to produce working code. We extract the last `runjs` code block from each response (because models love to chat before giving you the goods, you know?).
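The extraction itself boils down to a regex over the raw response. Here's a minimal sketch of the idea (the actual RaceBench code may look different, and `extractLastRunjsBlock` is just an illustrative name):

```javascript
// Minimal sketch of the extraction step (the real RaceBench code may differ).
// The fence string is built dynamically so this snippet doesn't break the
// blog's own code fences.
const FENCE = "`".repeat(3);

function extractLastRunjsBlock(response) {
  const pattern = new RegExp(`${FENCE}runjs\\s*\\n([\\s\\S]*?)${FENCE}`, "g");
  const blocks = [...response.matchAll(pattern)];
  return blocks.length ? blocks[blocks.length - 1][1].trim() : null;
}

// Models love to chat before handing over the goods, so only the last block counts.
const reply = `Sure, here's your game!\n${FENCE}runjs\nconsole.log("pew pew");\n${FENCE}\nEnjoy!`;
console.log(extractLastRunjsBlock(reply)); // -> console.log("pew pew");
```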
What we track
For each model, RaceBench captures:
- Prompt tokens — how much it needed to understand what we wanted
- Completion tokens — how verbose it got with the solution
- TTFT (Time To First Token) — responsiveness metrics
- Generation time — total time to produce the code
- Success rate — did it actually run? Or did it explode?
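To make that concrete, here's roughly the shape of a per-model result record. The field names and numbers are illustrative, not RaceBench's actual schema or real measurements:

```javascript
// Illustrative shape of a per-model result record. Field names and numbers are
// made up for the example, not RaceBench's actual schema or real measurements.
const exampleResult = {
  model: "some-model-v1",
  promptTokens: 412,        // how much it needed to understand the task
  completionTokens: 2873,   // how verbose the solution got
  ttftMs: 640,              // time to first token, in milliseconds
  generationTimeMs: 28500,  // total time to produce the code
  success: true,            // did the extracted code actually run?
  code: "/* extracted runjs block goes here */",
};

// Success rate over a batch of results is then a one-liner:
const successRate = (results) =>
  results.filter((r) => r.success).length / results.length;

console.log(successRate([exampleResult])); // -> 1
```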
The UI (because I care about aesthetics, surprisingly)
A nice dark-mode UI that doesn't look like it was designed by a sysadmin in 2003. Two view modes:
- Card grid — pretty cards for each model with key metrics at a glance
- Sortable table — for when you want to data-nerd out and sort by TTFT or token count
Click any model and you get:
- Live preview (opens in a new tab, runs the actual code)
- Raw response view (see exactly what the model said)
- All the metrics broken down
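Fun fact: you don't need a backend for the live preview either. One browser-only way to do it (a sketch, not necessarily the exact approach RaceBench uses) is to wrap the extracted code in a tiny HTML page and open it as a Blob URL:

```javascript
// Wrap the extracted code in a minimal HTML page, turn it into a Blob URL, and
// open it in a new tab. A sketch only; the real dashboard may do this differently.
function openPreview(generatedCode) {
  const page = `<!DOCTYPE html>
<html>
  <head><meta charset="utf-8"><title>Preview</title></head>
  <body>
    <!-- a real setup would also load Three.js here for the game code to use -->
    <script type="module">${generatedCode}<\/script>
  </body>
</html>`;
  const url = URL.createObjectURL(new Blob([page], { type: "text/html" }));
  window.open(url, "_blank");
}
```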
Versions (because software never ends)
v1: The brave beginning
Tested 18 models. Learned a lot. Realised some models are impressively creative while others are... trying their best. Let's put it that way. 😛
v2: The refinement
Narrowed it down to 10 models. Better metrics extraction. Cleaner UI. Still watching models struggle with Three.js in entertaining ways.
Tech stack (kept it simple, you're welcome)
- Node.js — because JavaScript is life
- Tailwind CSS (via CDN) — styling without the build step headaches
- Simple-Datatables — for when you want sortable tables but don't want to write them yourself
- Three.js (dynamic) — loaded on demand for the live previews
Minimal dependencies. No complex build pipelines. Just works. I like that philosophy.
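Wiring the pieces together is about as complicated as it sounds, which is to say: not very. A rough sketch, where the element ID, options, and CDN URL are illustrative rather than copied from the RaceBench source:

```javascript
// Rough wiring of the stack (the ID, options, and CDN URL are illustrative).

// Sortable table view: simple-datatables upgrades a plain <table>. This assumes
// the library's UMD bundle is already loaded via a <script> tag from a CDN,
// which exposes the global `simpleDatatables`.
const table = new simpleDatatables.DataTable("#results-table", {
  searchable: true,
  perPage: 25,
});

// Three.js only matters for the live previews, so it can be pulled in on demand
// with a dynamic import of an ESM build from a CDN.
async function loadThree() {
  return import("https://unpkg.com/three@0.160.0/build/three.module.js");
}
```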
The models (participants in the chaos)
We've tested a bunch of the big names:
- Claude Sonnet 4
- DeepSeek R1
- GPT-4.1
- Gemini 2.5
- Grok 4
- Kimi K2
- Qwen3
- ...and more
Each one brings its own personality to the code. Some are terse and efficient. Others are verbose and over-engineered. A few produce code that makes you question everything you know about software development. All educational in their own way 😊
See it in action
Don't just take my word for it — go play with the actual dashboard:
Click around. Run some code. See which model you'd want on your team. Spoiler: the answer might surprise you (or it might be exactly what you expect. LLMs are weird like that).
What makes this different from other benchmarks?
I'm glad you asked (I know you didn't, but pretend you did) 😛
- Real tasks, not toy problems — we test actual creative coding, not "sort this array" nonsense
- You can run the results — not just a score, actual working code you can inspect and execute
- No server needed — static HTML means you can host it anywhere or even run it locally
- Transparent metrics — see exactly what we measured and how
- Educational — compare approaches, learn patterns, see what works (and what definitely doesn't)
What I learned (because building stuff teaches you things)
- Some models are surprisingly good at 3D graphics code
- Others clearly haven't seen many Three.js examples in their training
- TTFT doesn't always correlate with code quality
- Verbose models aren't necessarily better — sometimes they're just... chatty
- The best model depends on what you value: speed, efficiency, code clarity, or just getting something that runs
Future plans (the "if I get around to it" section)
I'd like to:
- Add more test cases (different types of creative coding challenges)
- Test more models (new ones keep popping up like mushrooms)
- Improve the metrics (maybe add code quality scoring?)
- Add a "leaderboard" view for those who love rankings
- Maybe, possibly, contribute the data to some broader benchmark effort (if such a thing exists and wants this kind of real-world data)
But let's be honest — I'll probably add a new test case before I remember to do any of that. Priorities, right? 😊
Wrapping up
RaceBench isn't trying to be the definitive AI benchmark. It's just me trying to answer a simple question: "Which AI writes the best JavaScript code for real projects?" And having fun seeing them all try.
If you find it useful, great. If you learn something from the different code styles, even better. If you just enjoy clicking through live previews of AI-generated games and thinking "wow, that actually works," then I've done my job.
Feel free to fork it, add your own models, test different prompts. The more data, the better. Just don't ask me which model is "best" — that depends on what you're building, your timeline, your budget, and probably the phase of the moon. LLMs are like that 😛
GitHub: [DimitriGilbert/racebench](https://github.com/DimitriGilbert/racebench)
Now go race some models. May the best code win.
{{% goodbye %}}