As of June 2026 — caps change. This method still holds.
Rules that keep the lab honest
| Do | Don't |
|---|---|
| Copy each prompt exactly — character for character | Rewrite per platform "because it likes it better" |
| New chat per task per platform (clean compare) | Stack all four tasks in one thread and wonder why it got confused |
| Note model/mode (fast vs thinking) in your log | Compare Plus thinking on one app to free fast on another without saying so |
| Pick a winner you'd actually send or post | Crown the one that sounds most like a textbook |
| Log "cap hit" or "not on my tier" and move on | Paste real data to "get a fair test" |
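The first rule is easy to break by accident: a smart quote or a trailing space is enough to make two prompts differ. Here is a minimal Python sketch for checking that, assuming you saved the text you pasted into each app. `first_difference` is a hypothetical helper name for illustration, not part of any platform mentioned here.

```python
from typing import Optional


def first_difference(reference: str, candidate: str) -> Optional[int]:
    """Return the index of the first differing character, or None if identical."""
    for i, (a, b) in enumerate(zip(reference, candidate)):
        if a != b:
            return i
    # One string is a prefix of the other: they differ where the shorter one ends.
    if len(reference) != len(candidate):
        return min(len(reference), len(candidate))
    return None


# Placeholder prompt text, for illustration only.
master = "Write three social posts for a flash sale."
pasted = "Write three social posts for a flash sale. "  # stray trailing space

position = first_difference(master, pasted)
if position is None:
    print("Prompts match character for character.")
else:
    print(f"Prompts differ at character {position}.")
```

If the check reports a difference, paste the master prompt again instead of fixing the copy by hand.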
What you're not measuring
- Speed — unless slowness made the answer useless
- Word count — longer isn't better
- Brand loyalty — you're renting tools, not joining a cult
Winner log template
Copy this table. Add rows as you complete each task page.
| Task | ChatGPT | Claude | Gemini | Grok | Winner | One-line why |
|---|---|---|---|---|---|---|
| 1 Social posts | notes + model | | | | | |
| 2 Flash-day table | | | | | | |
| 3 Late-payment email | | | | | | |
| 4 Image (optional) | N/A typical | | | | | |
Platform column tips: one short note each, e.g. "good hooks", "generic", "best table", "cap after 2 tries".
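If you would rather keep the log in a script than edit the table by hand, the template above can be sketched in Python as follows. The `LogRow` class and `render_log` function are hypothetical names chosen for this example, and the sample row holds placeholder notes, not real results.

```python
from dataclasses import dataclass, field

# Column order matches the winner log template.
PLATFORMS = ["ChatGPT", "Claude", "Gemini", "Grok"]


@dataclass
class LogRow:
    task: str
    notes: dict = field(default_factory=dict)  # platform name -> short note + model
    winner: str = ""
    why: str = ""


def render_log(rows):
    """Render the rows as a Markdown table with the template's seven columns."""
    header = ["Task", *PLATFORMS, "Winner", "One-line why"]
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "---|" * len(header),
    ]
    for row in rows:
        # Platforms without a note yet get an empty cell.
        cells = [row.task, *(row.notes.get(p, "") for p in PLATFORMS), row.winner, row.why]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Placeholder entries, for illustration only.
rows = [
    LogRow("1 Social posts", {"ChatGPT": "notes + model"}),
    LogRow("2 Flash-day table"),
]
print(render_log(rows))
```

Add one `LogRow` per task page as you finish it, and leave `winner` empty until you have seen all four outputs.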
I still catch myself rooting for whichever app I paid for most recently. The log forces the honest question — "would I actually post this?" — and that's the only score that matters before you start building your own stack.
Continue — Task 1: social posts.