The answer is always judgment (and that’s the problem)

Every team that adopts AI in their development workflow eventually has the same meeting:

Something broke. A bad merge, an outage, a security hole that shipped because nobody read the diff.

The response is always a version of the same sentence: we need to be more careful. Tighten up review. Write better prompts. Maybe run a training. Keep a human in the loop. Be more disciplined.

I've sat in these meetings *(I've also been the reason for them)*

Except that “be more disciplined” doesn’t scale when everyone in 2026 is expected to be a 100× product developer, fluent with AI. The coding assistants are confident, declaring with foolhardy optimism that they’ve solved every problem with 100% success.

Being careful only works if you can feel when you’ve crossed out of your depth. Oftentimes, you can’t.

Everything, every job, all at once

We expect our agents to be PMs, designers, engineers, code reviewers, teachers, SREs, at the same time.

And why shouldn't you have high expectations? You're paying good money for those tokens.

When we ask an agent to do that in a domain where we can evaluate its work, we have a fighting chance to know if what it’s doing makes any sense.

We lose that the moment we’re working in a new environment, company, or language.

And as AI writes more of the code, the answer can’t be training every engineer to be an expert in every domain the AI touches. (Besides, how can we be 100× engineers if we’re reading developer docs all day?)

When you can’t tell good from bad

Up front, with a fresh context window and a pocketful of dreams, we crack our knuckles and start a prompt. Some of us even have AGENTS.md instructions and SKILLS.md superpowers. But in the thick of it—asking an LLM to debug an issue on a system we’ve never touched, in a language we barely know—things get a lot murkier.

APPROVE ALL. YOLO. Turbo.

Whatever it’s called. If you’re lucky, the linters and compilers balk. If you’re not, your customers do, when the code ships to prod.

Reach for a package the agent recommended and you might not get what you think.

A security researcher kept seeing LLMs hallucinate a Python package called huggingface-cli—the real install is huggingface_hub—so he registered the fake name as an empty package, just to see.

The package drew more than 30,000 downloads in three months and showed up in the README of one of Alibaba’s own research repos. Nobody could tell the suggestion was fake, so they ran it.

Even if you’re lucky—the package is real, the thing compiles—how do you know you’re moving toward a solution when you’re a stranger in a strange codebase?

It’s a gross word, but Slopsquatting is the name for this pattern, and it’s becoming a real threat vector. And while there are more and more mousetraps to catch these, the mice get smarter and smarter, and the cheese ever more valuable. Great, now I’m hungry.

We’ll talk more about this in another article. For now, let’s assume you manage to avoid anything this catastrophic.

Hopefully it works.

Yesterday, I was trying to write a small internal application that greeted students by their name, using their internal credentials. The prototype was Kotlin and Angular, on Google’s internal infrastructure—bespoke, and unfamiliar to me.

I was up against a bunch of systems that handled credentials with Google-level security, and getting to an answer was quite a journey.

As my productivity ebbed and flowed, I tried to fix it the right way: new context windows, few-shot examples, research sub-agents, rephrased prompts, more tools.

Ultimately, I found myself skimming through different suggestions, all of which seemed plausible (yet none worked):

You should add a --use-database=dev flag.
No, wait, download mock-login packages
Try turning off the proxy and manually editing /etc/hosts

By hour two, I just wanted a dev server running.

"You are technically correct, the best kind of correct"

Imagine my surprise after it worked and I scrolled up through the walls of text to see my name hardcoded.

It did shut me up, and move us forward. When the dust settled, I discovered we were using a lightweight JWT with no profile claims, so the token had a subject but no name to map to.

Earlier (not in the screenshot), the agent had suggested switching to a full OIDC client, which would request the user’s full profile, instead of the bare token, but it was buried in a wall of text.

So I didn’t pick up on this at the time. I should have. I’m good at React, and I was an Identity PM for all of Google Accounts for four years—a JWT with no name claim is my wheelhouse.

This wasn’t a language problem or an OAuth problem. The answer just sat in an unfamiliar environment, buried in a wall of text, and so judgment was moot.

The agent never rolled up a virtual newspaper to thwack me on the nose with the reality that my approach couldn’t show a display name, given the implementation we’d chosen. The agent’s quality wasn’t the problem. Wait, was it me? Am I the problem? Impossible.

‘Apply better judgment’ isn’t the fix

In this tale, the easy problem, relatively speaking, is better training our agents to know when they’re encountering hard constraints, and to rein in their humans.

But the broader challenge, as I see it, is how we might have detected that I was in “get it working, you have unlimited budget, just greet me by my name” mode in this particular session.

In a single session, more skillful use of AI (the willpower to read walls of AI text?) may have helped me. But as we know, one session is never enough. Why code one thing when you can have 5 agents? 10? 50? Hundreds?

And how do we catch the cases where the irony of automation creeps in (the better the agent gets, the less we practice the work, so we’re least able to judge it right when it slips), and equip our engineers to be discerning?

Everyone on the internet says: “the new skill is judgment.” Sure. Wonderful. But how do we develop it in this era of engineering? Hint: a system instruction isn’t it.

Some stacks you can judge, some you can’t

My skill level determined my level of comfort and ability to judge:

Comfort zone

I’m halfway decent with React. The questions I’d ask an agent for a Next.js project would be pretty directive. I would say I wanted the app router, local fonts, and Turbopack. And I would have an opinion about nested routes and server actions.

I’d stop the agent if it was trying to re-implement middleware or strayed from the latest idiomatic offerings that reduce boilerplate for things like OpenGraph tags, image optimizations and other site metadata.

Stretching out

Move me to an Angular codebase, and I’m less experienced. Luckily, it’s similar enough to React, and is in the Node world, so I can hit the ground running and can read the TypeScript it generates.

I don’t know exactly what Angular Universal means, or if that’s truly their solution to bundling all front-end and back-end logic into one integrated codebase. But I know enough to ask the question, and can spot JS/TS slop signals to intervene.

Unknown

I know a little more than basic data structures and the high-level grammar of Java or Kotlin. Spring boot sounds like a shoe sale I’d drag my friends to.

I know next to nothing about dependency injection, resolving build graphs, or the syntactic sugar an experienced developer would know like the back of their hand.

Last time I touched Java was 2005.

But like most engineering folks, I’m willing to take on any language and any production environment, regardless of how well I know it. And I feel like React, Angular, Kotlin, and then Google’s specific apps framework sort of represent the continuum of difficulty to effectively manage an AI agent.

When I think about where we are right now in the software development world, we’re essentially managers of a squadron of AI employees. We have only the slightest bit of knowledge about what our employees actually do—and how to tell if they’re doing a good job.

Two skills

So far, my working hypothesis is that judging agents requires two skills:

enough of a language/platform’s syntax, norms and conventions to tell whether the output is any good
precise communication of the outcome you want, down to the steps that matter

Syntax, norms and conventions

You don’t need to be an expert to work with an agent. You just need enough to tell whether what comes back is right.

And what counts as enough keeps shrinking.

Every so often, a skill that felt permanent turns out to have been a rung, not the ground—and each time, plenty of people were sure the engineers who skipped it would never really know what they were doing.

Elevator operator. Someone rode with you, worked the lever, leveled the car, called out each floor. Now you press a button and never wonder whether you’re qualified. (Fun fact: during the transition from manual operation to automatic operation, elevator operators would basically ride with you and push the button, not so dissimilar to what I think steering wheels are to the future automatic driving.)

Punch cards. You wrote a program by punching holes in cardstock and feeding the deck to the machine. Drop it and you’re re-sorting by hand.
Assembly. Then we stopped writing in the processor’s own language and let a compiler do the translating.
Memory Management. Then we stopped tracking allocations by hand—the dangling pointer, the off-by-one that corrupts the heap—and let the runtime manage it.

`malloc()` me this, `malloc()` me that... — via ProgrammerHumor

← we are here. The agent writes the code, and knowing the stack cold is starting to look optional.
Fully driverless Cars: Driving—the one skill we all still assume you need—is already being handed to the car.
??

Each one felt essential right up until it wasn’t. You can ship great software today and never punch a card, write a line of assembly, or free a pointer.

The abstraction does its job 99% of the time

Writing code and judging whether it's right are two different skills, and right now, agents only get you the first half of that sentence.

Joel Spolsky described why in 2002, in his law of leaky abstractions. An abstraction is a level you can work at without knowing the level below it. That holds until you hit a case it wasn’t built for, and then you need the level below.

Think of your car. Even if it’s not a self-driving one, you probably have an automatic shifter. Some parking controls. Maybe even adaptive cruise control.

The more a car automates, the more it hides from you—and the bigger the jolt when that hidden complexity breaks. Ask anyone born after 1980 to drive stick shift, or born after 1990 to park with only rearview mirrors.

Spolsky’s conclusion was that you can skip the lower level’s work, but you still have to learn it.

The agent is an abstraction over writing code: it saves you the work of writing the syntax, but not the knowledge of the language. Checking whether the code is right means reading it with knowledge you already have. What you need to know to produce code keeps dropping; what you need to know to judge it doesn’t. The second is what people mean by judgment.

What’s different at Big Tech scale is the 1%

Hyrum’s law, named for a Google engineer, says that with enough users, every behaviour your system has is one somebody depends on, intended or not. Even the corners get exercised at that scale, and handling them takes a person who knows the layer below. And, same as before, you can’t feel when you’ve dropped into one.

Precise communication

Precise communication is more subtle, and it brings me back to my first Computer Science class in college: an introduction to algorithms. The opening icebreaker exercise was to describe, in plain language, how to wash your hair in as much unambiguous detail as possible:

Some classmates, ever the clever ones, started with:

start smelling the soap,
find a shower,
make sure the water pressure is to your liking.

Some class clowns (present company included) would go even further: the wash-my-hair algorithm started when the very first cave-dwelling hominid with a flair for grooming learned how to violently smash fermented berries against a rock just to degrease their embarrassingly crusty pelt.

"Hey Google, condition my hair" —via Global Times

The real intent was to have you realize that the traditional shampooing instructions—rinse, lather, repeat—are themselves ambiguous:

Rinse for how long, or until what condition is true?
Would you even be allowed to stop rinsing?
When one lathers, how does one determine exactly that a lather has occurred?
And once it occurs, how immediately do you proceed, versus enjoying a nice little lather?

You don’t want to be shampooing until the end of time. Or until your hair falls out, which is at least a well-defined stopping condition.

But the lesson that came out of this was that even simple instructions—how to butter some toast (where our group started with trash the wheat), wash your hair—are often filled with nondeterministic criteria.

Prompting an agent is the same problem. You describe the outcome you want, and the key steps you believe will affect it, without leaving the stopping conditions ambiguous.

A prompt that leaves the stopping condition open is something you can spot from the outside—one more place a nudge would help.

That’s much different from the engineer taking on a task that ranges across:

a technology in which they have expertise
a technology where they understand the key mental models
a technology they may know very little about, but have an outcome in mind

A prompt that leaves the stopping condition open is something you can spot from the outside—one more place a nudge would help.

Built, not summoned

The be more disciplined verdict from that post-mortem—be more careful, tighten review, write better prompts—assumes you can tell when you’re in trouble.

You can’t. A 2025 METR trial put 16 experienced developers on real tasks from their own projects, with and without AI. They felt faster the whole time. They were 19% slower—and the gap between the two is the problem, because you can’t correct for a slowdown you never notice.

Neither could I: I’d have caught that JWT bug in my comfort zone in a second, but in an unfamiliar environment nothing in the work told me I’d crossed out of my depth.

More discipline doesn’t fix that. Neither does a system instruction that says APPLY BETTER JUDGMENT, or weeks of training before anyone touches the stack—none of it reaches a problem you can’t feel.

What you can’t feel from the inside, though, shows up on the outside, as you work:

The same build error comes back three times running, each fix trading it for a new one.
A diff lands and you approve it unread, trusting the agent’s account of its own code.
Your prompts stop describing the outcome and turn into pasted stack traces.
You never hand it a convention, because in this language you don’t have one yet.

When one of those shows up, you fix the missing piece right where it appears—the convention, or the sharper ask.

Judgment, then, is two skills: knowing a stack well enough to size up what comes back, and saying what you want clearly enough to get it. You can see both fail in the work, and you can get better at both.

Picking up a stack you don’t know should feel like the job, not a deficiency.