Agents Just Jumped From 12% to 66%. But They Still Can't Read a Clock.
The Signal: Stanford's 2026 AI Index dropped this week. The headline: AI agents went from 12% to 66.3% success on real computer tasks in twelve months. The footnote everyone is missing: the same models that solve Olympiad math read an analog clock correctly only 50.1% of the time. This is the jagged frontier. And if you delegate to agents without understanding it, you will lose money fast.
Most coaches are reading the wrong half of the report.
They see the 66% number.
They get scared. Or excited. Or both.
Then they hand their entire onboarding flow to an agent... and wonder why, three weeks later, two clients have ghosted and one got billed twice.
Here's what the report actually said.
April 2025: 12%. April 2026: 66.3%.
OSWorld is the benchmark. It tests agents on real computer tasks. Spreadsheets. Browsers. Email. The kind of stuff your VA does every Monday.
One year ago, agents could barely click a button without crashing.
Today they're 6 percentage points off human-level performance.
That's the part that gets the headlines.
This next part is the part that matters.
The Jagged Frontier
Stanford's researchers used a phrase that should be tattooed on every coach's forearm before they touch an agent:
The jagged frontier. The same model that solves Olympiad-level mathematics reads an analog clock correctly only 50.1% of the time.
Read that twice.
The thing that can write your entire 90-day onboarding curriculum...
Cannot reliably tell if it's 3:15 or 3:45.
That's not a bug. That's the shape of intelligence right now.
Agents are not generally smart. They are spikily smart. World-class at some things. Worse than your nephew at others. And nothing about the way they look or talk tells you which is which.
Now look at your own business through that lens.
Some of your workflows are Olympiad math. Defined inputs. Defined outputs. Repeatable. Verifiable.
Some of your workflows are analog clocks. They look simple. They're actually weird. The agent will fail in ways you don't catch for two weeks.
The Delegation Map Most Coaches Get Wrong
Here's the rule most coaches have been working with:
If a human can do it... an agent can do it.
That rule is dead.
The new rule is sharper:
If the task has a defined start, a defined end, and a verifiable output, agents are ready. If the task requires judgment under ambiguity, keep it human.
Map it across what you actually do in a week.
Delegate to Agents This Quarter
- Lead intake and qualifying questions
- Calendar booking and reminder sequences
- Invoice generation and follow-up
- Onboarding email sequences with defined steps
- Repurposing one piece of content into 5 formats
- Pulling client data into a weekly dashboard
- SEO meta and schema generation
- FAQ responses with documented answers
Keep Human This Quarter
- Diagnosing what a client actually needs
- Reading energy on a discovery call
- Pricing decisions for unusual deals
- Holding space when a client is in crisis
- Deciding when to fire a client
- The first 60 seconds of any new relationship
- Choosing what to build next
- Anything that requires saying "no" with love
Notice the pattern.
The left column is verifiable. You can look at the output and know if it's right.
The right column is felt. You only know it's right by being there.
Agents are world-class at the verifiable. They're terrible at the felt.
What This Means For Your Stack This Week
Stop asking "Can an agent do this?"
Start asking "Can I write a test that proves it did it right?"
If you can write the test, ship the agent.
If you can't write the test, that's still your job.
This single reframe will save you from the wave of coaches who are about to automate the wrong half of their business and watch retention crater.
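What does "write the test" actually look like? Here's a minimal sketch for one task from the A column: an invoice follow-up email. Every name and check below is a hypothetical example, not a prescribed tool; the point is that the pass/fail criteria are written down before the agent runs.

```python
# A minimal sketch of "write the test before you ship the agent."
# All names and checks here are illustrative; adapt them to your own workflow.

def verify_invoice_followup(email_text: str, client_name: str, amount_due: str) -> list:
    """Return a list of failures; an empty list means the agent's draft passes."""
    failures = []
    if client_name not in email_text:
        failures.append("missing client name")
    if amount_due not in email_text:
        failures.append("missing amount due")
    if "unsubscribe" in email_text.lower():
        failures.append("marketing boilerplate leaked into a billing email")
    return failures

draft = "Hi Jordan, a quick reminder that invoice #1042 for $1,500.00 is due Friday."
print(verify_invoice_followup(draft, "Jordan", "$1,500.00"))  # [] -> safe to send
```

If a task resists this kind of checklist, that's the signal it belongs in the right-hand column.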
The Deeper Read
Here's the part the AI Index doesn't say but every coach should know.
The jagged frontier isn't just a model problem. It's a human problem. We are also spikily smart. We are world-class at the parts of our craft we have practiced ten thousand times. We are clumsy at everything outside that.
The shift coaches need to make isn't "humans vs agents."
It's "what is my actual zone of mastery, and how do I delegate everything else to the layer that is best at it."
Sometimes that layer is an agent.
Sometimes that layer is a human team member.
Sometimes that layer is you, fully present, doing the one thing only you can do.
Stanford gave us the data. The interpretation is on us.
Your Move
Open your calendar from last week.
Look at every block of time you spent on something that was not a client transformation.
For each block, write one of three letters next to it:
A if an agent could do it with a clear test.
H if it needs a human but not you.
Y if only you can do it.
Count the A's.
That's the number of hours per week you can buy back this quarter.
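The tally itself is simple enough to sketch. The calendar blocks below are made-up examples; swap in your own week.

```python
# Tallying the A/H/Y audit across one week of calendar blocks.
# Each entry: (task, label, hours). These blocks are invented examples.

calendar_blocks = [
    ("lead intake form review", "A", 2.0),
    ("discovery call",          "Y", 1.5),
    ("invoice follow-ups",      "A", 1.0),
    ("weekly dashboard update", "A", 1.5),
    ("client crisis session",   "Y", 1.0),
    ("inbox triage",            "H", 2.0),
]

hours = {"A": 0.0, "H": 0.0, "Y": 0.0}
for task, label, duration in calendar_blocks:
    hours[label] += duration

print(hours["A"])  # hours per week you could buy back with agents
```

Run it against a real week and the A number is usually bigger than you expect.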
Most coaches I work with find 12 to 18 hours hiding in the A column.
That's a full second business worth of capacity sitting on the table.
Pick the highest-frequency A task. Build the agent. Write the test. Ship it this week.
The 66% number is real.
So is the 50%.
Coaches who learn to tell the two apart will own the next decade.
Want help mapping your A/H/Y workflows?
Free 30-minute AI strategy call. We'll go through your week, find the A column tasks, and design the first agent that buys you back 10+ hours without dropping a client.
Book Your Free Call →