Unpacking Anthropic's AI Fluency Index — The Better AI Gets, the Less You Question It
Better outputs don't mean less to verify. They just feel that way.
The Report
In February 2026, Anthropic published the AI Fluency Index — an analysis of 9,830 multi-turn Claude.ai conversations from a single week in January 2026. The question wasn't whether people use AI. It was whether they use it well.
"Well" was defined through 11 observable behavioral indicators: iterating on responses, questioning AI reasoning, flagging missing context — and setting collaboration conditions upfront. Each conversation was measured for which appeared.
Two patterns came out. Read separately, they look like findings about user behavior. Read together, they show users running verification and model quality as a single process — when they're not.
Finding 1: Longer Conversations Produce More Verification
85.7% of conversations showed iteration and refinement — not accepting AI's first response, but pushing back, redirecting, asking follow-up questions.
Those conversations scored higher on every other indicator:
Figures from the report: with iteration, n=8,424; without, n=1,406.
5.6x more likely to question AI's reasoning. 4x more likely to flag missing context. The longer the conversation runs, the more carefully people examine the output.
People are using conversation length as a proxy for how much to verify. Long conversation → verify more. Short conversation → verify less.
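A quick note on what "5.6x more likely" means: it reads as a ratio of rates between the with-iteration and without-iteration groups. A small worked sketch below uses the report's group sizes but hypothetical per-group rates, since the underlying rates aren't published in the summary.

```python
# Group sizes from the report.
n_with_iteration = 8_424
n_without_iteration = 1_406

# Hypothetical share of each group that questioned AI reasoning,
# chosen only to show how a "5.6x" figure arises.
rate_with = 0.28      # assumed
rate_without = 0.05   # assumed

print(f"{rate_with / rate_without:.1f}x more likely")  # -> 5.6x under these assumptions

# The absolute counts those rates would imply, for a sense of scale.
print(round(rate_with * n_with_iteration), "vs", round(rate_without * n_without_iteration))
# -> 2359 vs 70
```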
Finding 2: Polished Outputs Produce Less Verification
12.3% of conversations produced a concrete artifact — code, a document, an app, an interactive tool. (The other 87.7% presumably just talked.)
The report tracks three categories of behavior: direction-setting, iteration, and evaluation.
When an artifact was the goal, direction-setting shot up — people front-loaded their instructions. Then the artifact appeared. And evaluation behaviors dropped:
Figures from the report: artifact conversations, n=1,209; non-artifact, n=8,621.
More careful going in. Less critical coming out.
People are using output appearance as a proxy for whether to verify. Something unfinished triggers "wait, is this right?" Something polished doesn't fire the same reflex.
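For contrast with the multipliers above, the drop on this side of the data is described in percentage points (the summary later puts it at 3 to 5 points across evaluation behaviors). A small sketch with hypothetical group rates, again using only the group sizes from the report:

```python
# Group sizes from the report; the per-group rates are hypothetical,
# chosen only to show what a drop of a few percentage points looks like.
n_artifact = 1_209
n_non_artifact = 8_621

rate_non_artifact = 0.21  # assumed share flagging missing context, no artifact
rate_artifact = 0.17      # assumed share flagging missing context, artifact

drop_pp = (rate_non_artifact - rate_artifact) * 100
print(f"evaluation behavior down {drop_pp:.0f} percentage points in artifact conversations")
```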
The Wrong Proxy
Two different signals — conversation length, output quality — are being used to answer the same question: do I need to verify this?
Neither is valid. How long a conversation ran has no bearing on whether the output is correct. How polished an artifact looks has no bearing on whether it's accurate. Whether something needs verification is determined by the stakes of the decision and the known error rate of the model — not by how the conversation felt or how finished the result looks.
Verification and model performance are two independent processes. But users are running them as one: output quality feeds into the verification decision. And as models improve, the proxy becomes more misleading. Better-looking outputs trigger less verification, even as the stakes of those decisions grow larger.
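Here is a hedged sketch of the two decision rules side by side: the proxy the data suggests people actually run, and the rule the stakes-and-error-rate framing implies. The function names, thresholds, and inputs are illustrative, not anything the report prescribes.

```python
def should_verify_by_proxy(conversation_turns: int, looks_polished: bool) -> bool:
    """The heuristic the data suggests people run: verify when the conversation
    ran long or the output looks unfinished."""
    return conversation_turns > 10 or not looks_polished

def should_verify_by_stakes(stakes: float, model_error_rate: float, tolerance: float) -> bool:
    """The independent rule: verify whenever the expected cost of an unchecked
    error (stakes times error rate) exceeds what you can tolerate."""
    return stakes * model_error_rate > tolerance

# A polished artifact from a short conversation, feeding a high-stakes decision:
print(should_verify_by_proxy(conversation_turns=3, looks_polished=True))              # False
print(should_verify_by_stakes(stakes=10_000, model_error_rate=0.02, tolerance=50.0))  # True
```

With these inputs the proxy says skip the check while the expected cost of an unchecked error says run it, which is exactly the gap the two findings describe.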
30% Built the Check In. 70% Didn't.
"Interaction mode setting" — telling AI things like "push back if my assumptions are wrong" or "flag anything you're uncertain about" — appeared in only 30% of conversations.
These are the people who built the check into the instructions before work began — regardless of how the conversation went or what the output looked like. They didn't wait for signals. The other 70% kept verification reactive. When the signals said "looks fine," the check didn't happen.
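What building the check into the instructions can look like in practice, as a hypothetical instruction block rather than language taken from the report:

```python
# A hypothetical upfront instruction block (not from the report) that makes
# verification a condition of the work rather than a reaction to how the output looks.
INTERACTION_MODE = """\
Before we start:
- Push back if any of my assumptions look wrong.
- Flag anything you are uncertain about instead of smoothing over it.
- When you deliver the final artifact, list the claims I should verify myself.
"""

print(INTERACTION_MODE)
```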
What the Data Does and Doesn't Say
What the data says:
Users are treating verification as a response to output signals rather than as an independent process. Longer conversations produce more critical behavior — 5.6x more likely to question AI reasoning, 4x more likely to flag missing context. Polished artifacts produce less — evaluation behaviors down 3–5 percentage points across the board. Only 30% set verification conditions upfront, independent of how the output looks.
What the data doesn't say:
This is a one-week snapshot from January 2026 — change over time isn't captured. The drop in evaluation during artifact generation may partly reflect verification happening outside the conversation. And Claude.ai users skew toward early adopters; the general population's numbers would likely look worse.
Why This Gets Worse, Not Better
Using quality signals as safety signals is a normal human heuristic. An expensive restaurant probably has clean kitchens. A well-designed product probably went through QA. In most domains, the appearance of quality correlates with the reality of quality — because producers who care about one tend to care about the other.
AI breaks this in a specific way. A model can produce a beautifully structured, confidently written, internally consistent output and still be wrong about the thing that matters. The surface quality and the factual accuracy are produced by separate mechanisms. One is about fluency and coherence. The other is about whether the training data, the context, and the reasoning chain happened to land correctly for this particular query. They don't move together.
This wouldn't be a structural problem if models stayed roughly the same. But they're improving fast — and the improvement shows up in exactly the signals people are using as proxies. Outputs look more polished. First responses get sharper. The conversation feels resolved sooner.
The need for verification doesn't decrease. The feeling that it's needed does. That gap is what the data is measuring — and it widens as models get better.