Stop Multitasking. Parallel Agent Workflows Are Making You Slower And Burning You Out
You’re waiting 60 seconds for your model to respond, watching the spinner, fingers drumming. It feels like wasted time. So when Conductor and Cursor 3 say “run other agents in parallel while you wait,” it sounds right. Keep your hands moving. Don’t sit idle. But you’re fragmenting yourself across five problem spaces while your attention thrashes between them. You’re managing anxiety, not producing work.
I’ve had Claude Code Max since launch. I’ve been on Conductor since the alpha. Cursor 3 shipped this week with “run many agents in parallel” as a headline feature. JetBrains Air launched last month with the same pitch. I ran parallel workflows for months. Isolated worktrees, different tasks, one dashboard.
I was faster single-threaded.
The speed tax: model anxiety in numbers
When you default to Opus because you’re not confident Haiku will catch the edge case, you wait 50-80 seconds for a response. That dead time feels wasted. Conductor and Cursor 3 say: fill it with another agent. Haiku is 3-4x faster than Opus for typical coding tasks. You’re paying a 3x latency tax. But the tax costs more than seconds.
The cognitive cost of filling dead time
Gloria Mark at UC Irvine found it takes 23 minutes to recover full focus after an interruption. [1] Carnegie Mellon found engineers on five concurrent workstreams spend 80% of their time on switching overhead. [2]
Bug rates double under fragmented attention. [3]
Reviewing code IS cognitive work. Siegmund et al.’s 2014 fMRI study showed code comprehension activates five brain regions associated with working memory, attention, and language processing. The same regions active when writing code. [4] Their 2017 follow-up found that when programmers could rely on semantic cues (meaningful variable names, familiar patterns), brain activation dropped. The brain worked more efficiently, not less. Strip those cues away and activation spiked, particularly in Brodmann areas 6, 21, and 44 in the left hemisphere. Your brain burns more working memory to compensate. [5]
Floyd et al. confirmed this from a different angle. In a 29-participant fMRI study, code review and code comprehension produced distinct but overlapping neural signatures. Reviewers could be distinguished from comprehenders by brain activity alone with 79% accuracy. Code review recruits additional judgment and evaluation processes on top of the comprehension load. [6]
Ivanova et al. at MIT pinned down which system does the heavy lifting. Code comprehension activates the brain’s “multiple demand” network, a frontoparietal system shared with math, logic, and executive reasoning. It does not activate the language system. [7] Reading code looks like reading prose. It isn’t. Your brain treats it as a problem-solving task, recruiting the same executive resources you need for planning and sustained attention. Peitek et al. confirmed this in a 2021 follow-up (ICSE Distinguished Paper): as code complexity grows, activation increases in BA6, BA21, BA39, and Broca’s area (BA44/45), scaling the demand on working memory and attention in proportion to the complexity of what you’re reading. [8]
You load a mental model of an agent’s changes. You check it against your understanding of the system. You decide whether the approach holds. That sequence, comprehend then judge, stacks two expensive operations drawing on the same frontal and parietal regions. Running five review contexts in rotation degrades each one.
Sophie Leroy’s research on attention residue puts a name to the mechanism. [9] When you switch away from a task before finishing it, cognitive processing from that task persists, consuming working memory you need for the next one. You don’t cleanly swap contexts. You drag fragments of the previous review into the next. Parnin and Rugaber measured this in programmers: across 10,000 recorded sessions from 86 developers, only 10% resumed coding within one minute of an interruption, and interrupted tasks took roughly twice as long to complete. [10]
So the equation is:
- Opus takes 60 seconds (feels unproductive)
- You spin up four more agents to fill that time
- You’re now context-switching between five problem spaces
- Your bug rate doubles
- You take 23 minutes to refocus when you’re done
The dead time wasn’t the problem.
Why model anxiety exists (and why it’s often wrong)
Opus scores 80.8% on SWE-bench Verified. Haiku scores 73.3%. A 7.5-point gap feels significant. You don’t know which side of that gap your task falls on, so you default to Opus to be safe.
That 7.5-point gap is concentrated in maybe 5-10% of your work. The complex multi-file refactoring. The subtle architectural decision. The bug that requires holding three systems in your head at once.
The other 90% (linting, test writing, error handling, simple refactors, boilerplate) both models handle equally well. I benchmarked them on real coding tasks across the full spectrum. Trivial fixes, simple refactors, moderate test writing, complex multi-file work, hard race condition debugging. Across all task types, quality matched. Opus was slower.
If your work is mostly complex multi-file refactoring and subtle architectural decisions, your ratio will look different. But be honest about how much of your day is that versus the routine work surrounding it.
You’re paying a 3x latency tax to insure against a failure mode that doesn’t happen 90% of the time. And while you’re paying that tax, your brain is fragmenting, you’re tempted to fill the dead time, and you end up slower and more error-prone.
The teams behind Conductor and Cursor 3 see a real problem. Engineers feel slow. Engineers wait 60 seconds. Their solution: run agents in parallel.
They’re solving the wrong problem. Engineers feel slow because they’re using the wrong model. The fix is model selection, not more orchestration.
Five things that work
Five things made me faster than any parallel setup I tried:
Precise problem linking
Stop telling agents “I think there’s a bug in the auth flow.” Instead: “Fix the race condition in AuthProvider.tsx around line 118 where the token refresh fires twice before the first response returns.” Tag the files, give it the context, but let it decide the approach. Without a file path, the agent burns cycles grepping through your codebase trying to find the right file. With one, it reads it and starts solving. Over-prescribing the steps (“first do X, then do Y”) leads to worse outcomes than giving precise inputs and letting the agent figure out the how.
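To make the contrast concrete, here is a minimal sketch of the difference between the two prompts. The helper and its field names (`file`, `line`, `symptom`) are my own illustration, not any agent’s API; the point is that the prompt pins down where and what, but not how.

```python
# Illustrative helper: build a prompt that names the exact file and line
# but leaves the approach to the agent. Field names are assumptions.

def precise_prompt(file: str, line: int, symptom: str) -> str:
    """Assemble a precisely-linked task prompt for a coding agent."""
    return (
        f"Fix the bug in {file} around line {line}: {symptom}. "
        f"Read the file first, then decide the approach yourself."
    )

vague = "I think there's a bug in the auth flow."
precise = precise_prompt(
    "AuthProvider.tsx", 118,
    "the token refresh fires twice before the first response returns",
)
```

The vague version forces the agent to grep; the precise version lets it start reading the right file on the first tool call.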
Aggressive model selection
Haiku handles 70% of what I throw at agents. Linting, small refactors, test scaffolding, boilerplate. Sonnet covers most of the rest. Most engineers default to the biggest model for everything. That’s like driving a lorry to the corner shop.
Use Haiku by default. Smaller models don’t reason worse. They reason with better tool grounding and less overhead. [11] You’ll know when something requires architectural reasoning. That’s when you reach for Opus.
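In practice this is a routing decision you can make explicit. A minimal sketch, assuming hypothetical task categories of my own; the category names and thresholds are illustrations to tune against your own workload, not a benchmark result:

```python
# Minimal model-routing sketch. Task categories are assumptions for
# illustration; the principle is "smallest model by default, escalate
# only on known-hard work".

ROUTINE = {"lint", "test-scaffold", "boilerplate", "small-refactor"}
MODERATE = {"multi-file-refactor", "moderate-debug"}

def pick_model(task_type: str) -> str:
    """Default to the smallest model; escalate only when needed."""
    if task_type in ROUTINE:
        return "haiku"    # fast default: covers the bulk of agent work
    if task_type in MODERATE:
        return "sonnet"
    return "opus"         # architectural reasoning, subtle concurrency
```

The useful property is that escalation is a conscious choice rather than an anxious default.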
Smaller tasks, not parallel tasks
A refactor that touches six files tempts you to reach for Opus and wait. Break it into six single-file changes instead. Each one finishes in seconds on Haiku or Sonnet. You review one result, fire the next, review, fire. The gap between tasks shrinks below the threshold where your brain starts looking for something else to do. Parnin and Rugaber’s data showed programmers recover fast from short interruptions but poorly from long ones. [10] Smaller tasks keep each interruption short enough that you stay in the same problem space.
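The fire-review-fire loop above can be sketched as a sequential driver. `run_agent` here is a hypothetical placeholder for whatever CLI or SDK call you use; the shape of the loop is the point:

```python
# Sequential per-file task loop: fire one small task, review its result,
# fire the next. run_agent is a placeholder for your agent call.

from typing import Callable

def refactor_file_by_file(files: list[str], change: str,
                          run_agent: Callable[[str], str]) -> list[str]:
    """Split a multi-file refactor into single-file tasks, run in order."""
    results = []
    for f in files:
        # Small scope -> seconds on a small model, and each review
        # stays inside the same problem space as the last.
        results.append(run_agent(f"In {f} only, {change}."))
    return results
```

Each iteration is short enough that there is no dead time worth filling with another agent.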
I learned this building Ralph, an autonomous coding agent. Every layer of task orchestration I added (dependency graphs, parallel tracks, adaptive strategies) was complexity the agent had to work around rather than with. The version that worked best did one feature per session, sequential, committed to git. Don’t parallelise.
Low-reasoning modes wherever possible
Sonnet without extended thinking returns in seconds. The feedback loop gets tight enough that I hold the full mental model of what the agent is building. I catch drift before it compounds. An agent I correct after 30 seconds produces cleaner output than one that wanders for 10 minutes while I’m reviewing three other agents.
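On the Anthropic API, extended thinking is opt-in via the `thinking` field, so “low-reasoning mode” is simply a request that omits it. A sketch of building such a request; the model name is an assumption for illustration, so check your provider’s current model list:

```python
# Sketch: request kwargs with extended thinking left off by default.
# The model name is an assumed example, not a recommendation.

def build_request(prompt: str, *, extended_thinking: bool = False) -> dict:
    """Return kwargs for a messages.create-style call. Omitting the
    `thinking` field keeps responses fast and the feedback loop tight."""
    kwargs = {
        "model": "claude-sonnet-4-5",   # illustrative model name
        "max_tokens": 4096,
        "messages": [{"role": "user", "content": prompt}],
    }
    if extended_thinking:
        # Opt in only when the task genuinely needs deliberation.
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": 8000}
    return kwargs
```

The default path is the fast one; deliberation becomes something you reach for, not something you pay for on every call.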
Verify programmatically, not visually
The parallel workflow pitch assumes you need to review agent output. You often don’t. Write a test, run it, move on. If the test passes, the code works. You skip the most expensive part of the loop: loading the agent’s changes into your working memory and judging whether the approach holds.
This is where the fMRI data has practical implications. Every visual review activates your multiple demand network. Every review you replace with a programmatic check is cognitive load you don’t spend. The agents that cost me the least mental energy are the ones I never review because they run their own verification and either pass or revert.
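A programmatic check gate can be as small as this sketch: run the verification command, keep the change on success, revert on failure. It assumes the working tree was clean before the agent’s change so that `git checkout -- .` discards exactly that change; the revert command is parameterised so you can swap in your own.

```python
# Verify-or-revert gate: run the test command; discard the agent's
# change if it fails. Assumes a clean pre-change working tree.

import subprocess

def verify_or_revert(test_cmd: list[str],
                     revert_cmd: tuple = ("git", "checkout", "--", ".")) -> bool:
    """Return True when the change survives its own verification."""
    result = subprocess.run(test_cmd)
    if result.returncode == 0:
        return True                      # tests pass: no visual review
    subprocess.run(list(revert_cmd))     # tests fail: discard the change
    return False
```

Either outcome costs you nothing in working memory: the change proves itself or removes itself.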
The exception: truly autonomous workflows
Parallel workflows work for fully autonomous tasks. I built auto-claude, an autonomous edit-commit-benchmark-revert loop. It runs for hours without intervention, tries dozens of experiments to complete a goal, and demands zero cognitive load from me. I can be deep in a focused single-agent session while auto-claude chips away at a performance optimisation in the background.
That’s what parallel agents should look like: autonomous, with no dashboard to check and no “input required” interrupts. The agent either passes its own acceptance criteria or it reverts and tries again.
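The shape of that loop, stripped to a sketch: `apply_experiment`, `benchmark`, and `revert` are placeholders for your agent call, timing harness, and git commands; this is not auto-claude’s actual code, just the control flow it implies.

```python
# Autonomous edit-benchmark-revert loop (lower score is better).
# The three callables are placeholders for agent edit, benchmark run,
# and `git checkout`/`git revert` respectively.

from typing import Callable

def optimise(apply_experiment: Callable[[], None],
             benchmark: Callable[[], float],
             revert: Callable[[], None],
             attempts: int = 20) -> float:
    """Try experiments; keep only those that improve the benchmark."""
    best = benchmark()
    for _ in range(attempts):
        apply_experiment()
        score = benchmark()
        if score < best:
            best = score       # improvement: a commit would go here
        else:
            revert()           # regression or no change: discard it
    return best
```

Because the loop carries its own acceptance criterion, it never needs to interrupt you.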
Simon Willison, co-creator of Django, described this problem: “I can fire up four agents in parallel and have them work on four different problems. And by like 11 AM, I am wiped out for the day.” He warns there’s “an element of sort of gambling and addiction to how we’re using some of these tools.”
Conductor, Cursor, JetBrains Air are shipping a different version of parallelism. One where you sit in the middle of five agents, reviewing their output in rotation, context-switching between problem spaces, approving diffs with half your attention. Managed distraction with a progress dashboard.
The real problem
If you’re reaching for parallelism now, you haven’t maxed out single-thread throughput. The tooling just makes parallelism look easier than mastering the fundamentals.
A thought experiment: if Opus ran at 10,000 tokens per second, would you parallelise? I wouldn’t. I’d do everything in sequence, fast. If latency disappeared, no one would choose to juggle five contexts. Parallelism is a coping mechanism for slow models.
Use a faster model.
Update: Comprehension debt and the verification ceiling
Addy Osmani published “Your Parallel Agent Limit”, making a complementary argument. I focused on why parallelism fails in the moment. He names the downstream cost: comprehension debt, the gap between how much code exists in your system and how much of it any human understands.
My argument has an implicit assumption: if the tests pass, the code is fine. True for correctness. Not true for maintenance. Agent-generated code that passes every test but that you never built a mental model of is a liability. A study of 110,000 agent-authored PRs found that agent code has higher churn rates than human-authored code, with more rework in subsequent commits. [12] The code ships. It passes CI. Then someone rewrites it.
Osmani also names “background vigilance,” the ambient anxiety of maybe something is going wrong in that other worktree. You never switch contexts. Your primary thread degrades anyway. My post covers the cost of switching; his covers the cost of not switching but knowing you should be.
His solution space is management patterns: check-in cadence, predictable status formats, scoped mandates. Mine is model selection and task sizing. Both miss the deeper variable: verification strength.
The ceiling on parallelism is how much of the review loop you can close without a human:
| Verification | Parallelism | Human role |
|---|---|---|
| Agent self-reports “done” | None | Must review everything |
| Tests + linters | Some | Reviews failures |
| Domain-specific multi-pass review | More | Reviews edge cases |
| Formal verification (proof checker) | Unlimited | Reviews nothing |
Most of us are in row two. You can parallelise to the extent that your tests are comprehensive, and no further. Mistral’s Leanstral demonstrated the extreme end: a formal proof checker as the verifier, enabling parallel speculative generation where you run N candidates and take the first that passes. [13] That works because the verifier is sound. Passing means correct. Normal test suites aren’t sound. Passing tests doesn’t guarantee correctness, which is why “write tests and parallelise” has a ceiling, and you keep hitting it.
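The speculative pattern from the last row reduces to a small sketch: launch N candidate generations concurrently and take the first one the verifier accepts. `generate` and `verify` are placeholders; the pattern is only sound when `verify` is (a proof checker qualifies, an ordinary test suite does not).

```python
# Verifier-gated speculative generation: run N candidates, return the
# first that the verifier accepts. generate/verify are placeholders.

from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Optional

def first_verified(generate: Callable[[int], str],
                   verify: Callable[[str], bool],
                   n: int = 4) -> Optional[str]:
    """Run n candidate generations concurrently; keep the first
    candidate the verifier passes, or None if none pass."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(generate, seed) for seed in range(n)]
        for fut in as_completed(futures):
            candidate = fut.result()
            if verify(candidate):
                return candidate    # sound verifier => correct result
    return None
```

With a sound verifier, parallelism here is free of review cost; with a leaky test suite, the same loop silently returns plausible-but-wrong candidates.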
Parallelism is bounded by verification strength on the correctness axis and comprehension debt on the maintenance axis. Buying better orchestration tooling won’t move either boundary. Faster models reduce your temptation to parallelise. Stronger verification raises the ceiling on how much you safely can. Comprehension debt accrues either way, and no one has a good answer for it yet.