I recently took a week off work to recharge and ended up going on a bit of a binge planning and building out a new system. It gave me a chance to explore some non-Gemini/non-Google stuff with more energy than I’m normally able to in my spare time, and I figured I’d share some thoughts:
Models
No real surprises here, but Opus 4.6 is really fantastic at planning and reviewing. Sonnet 4.6 doesn’t seem to produce any worse code than Opus, but it does make more mistakes when it comes to decisions. Codex 5.3 is by far the fastest, and also the most focused and direct. Gemini is faster than Claude and tends to take a more meandering (if thorough) route than Codex. Opus feels like the best partner of the batch in terms of design, but they all have useful aspects. Where I’ve settled is to iterate in Opus and periodically run it by ChatGPT and Gemini for feedback, which has been fruitful. I’ve done most of the early coding work with Claude because Claude Code is just a little ahead of the other CLI tools, but the models are all good enough for pretty much anything.
Workflow
This was a greenfield project and is now about 100k lines, so it went through a few phases pretty quickly over ~30 hours. I spent a lot of time in a chat session just planning it out before building anything, so I started the build with a 60+ page design doc and a similarly sized architecture doc that I’d iterated on over probably ~10 hours. Claude came up with a pretty good phased approach, so I had it go through this for a few steps. I kept a tight leash on the first couple of phases, but once I’d established enough patterns I went to YOLO.
As I iterated, I would use the Claude chat, which was now managing the docs in GitHub in a branch. This made it much easier to review the decisions via the resulting PRs to make sure there weren’t any side effects or lossiness. The chat then creates GitHub issues based on the changes. Then I go to Claude Code/Codex/Gemini and tell it to fix a specific issue or just fix them all. Claude takes 10–20 minutes to handle most things, up to 40 for bigger batches or bigger changes. Sometimes it does them in parallel, sometimes not; I don’t think it’s really dialed in yet on where to split things up, but it errs on the side of serial, so it almost never conflicts with itself.
Code
I don’t review the code closely, but I do read it and it all looks really good. There aren’t many examples of the issues we’ve come to expect from these things: no significant cases of overcommenting, creating multiple versions of the same thing, or naively structured files/classes. I think this is a combination of:
- The models getting better
- Starting from scratch, with no legacy decisions, tech debt, or “this is how we used to do it” to consider.
- Having a thorough (though not formal in any sense) design and architecture spec with derived artifacts like roadmaps. Major changes are tracked in ADRs, so it’s only tried to undo a past decision once.
Context window limits and compaction are challenges at this point for design, less so for coding, as the fairly rigorous design approach yields tighter iteration loops, narrower scopes, and smaller blast radii for changes.
Biology
I’m not tooting my own horn here, as this is much more “this is what these things can do if you let them”, but what I’ve built in a week, both in terms of capabilities and polish and raw metrics (200+ pages of design/docs/tutorials, 100k lines of code, 1k+ tests, dozens of E2E tests), is way beyond 10x. I’m a fairly prolific coder when possible and a good big-picture thinker, but keeping up with this has been exhilarating and exhausting in a novel way. It’s less like a creative Flow state where time slips away and more like a good video game. “Just one more feature” feels a lot like “just one more quest”. I don’t think I could keep this up indefinitely, or it would at least take a while to adapt. A typical session looks like this:
- Run through the app, trying previous/new things, typing notes into the design chat.
- Iterate a bit there, it updates docs, creates issues.
- Have the agent work on the issues.
- Repeat, doing step 1 while the previous iteration of step 3 is happening.
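Step 3 above can be sketched as a small shell loop. This is a hypothetical sketch, not what I actually run: it assumes the GitHub CLI (`gh`) and Claude Code’s non-interactive print mode (`claude -p`), and the prompt wording is made up for illustration.

```shell
# Hypothetical sketch of "have the agent work on the issues".
# Assumes the GitHub CLI (gh) and Claude Code's print mode (claude -p);
# the prompt text is invented for illustration.
fix_open_issues() {
  # Pull the numbers of all open issues the chat created...
  gh issue list --state open --json number --jq '.[].number' |
  while read -r issue; do
    # ...and hand them to the agent one at a time (serial, so it
    # never conflicts with itself).
    claude -p "Fix GitHub issue #$issue, run the tests, and commit."
  done
}

# fix_open_issues   # uncomment to run against a real repo
```

While that runs, you’re back in step 1, clicking through the app and taking notes for the next batch.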
The step change is that this is a ~30 minute cycle, not a 2–3 week sprint, and these can be pretty significant or deep changes. It’s literally building things faster than you can design and try them (not even including the self-improvement loop). And it’s doing them well: this isn’t a simple project and it’s not making garbage code. It’s novel because it’s more productive than Flow but also less comfortable. I’ve only been spending like 3–4 hours a day on it and my brain and dopamine circuits still haven’t really figured out how to react to it yet, so you end up in a contradictory state of doing smart things with your lizard brain. That said, it’s been really fun and I recommend trying it if you can!