Qualitative analysis, the discipline of finding patterns in messy, unstructured data, is practiced by researchers, product managers, ethnographers, and support teams, and it has a well-known scaling problem. Reading interview transcripts, mining agent logs for failure modes, synthesizing user research sessions: each of these tasks takes hours per data point and does not compress easily. AI agents appear to offer a solution. Shreya Shankar, a researcher finishing a faculty job search, ran a structured set of experiments to find out how much of that solution is real.
The central question Shankar raises on her blog is not whether agents can assist with qualitative analysis. They clearly can. The question is which parts they can do reliably and which parts still require a human with taste.
To test this, Shankar scraped 451 tweets replying to a question from Anthropic researcher Sholto Douglas asking when users reach for models other than Claude. She then ran six agentic conditions using Claude Sonnet through the Agent SDK, varying two things: how much grounded-theory methodology the agent was prompted to follow, and where a human could intervene. Grounded theory is the method of building a theory up from data itself, moving from open codes to clustered categories to a core narrative, rather than starting from a fixed hypothesis.
The experimental design matters because its results can be checked. The corpus is small enough to inspect manually, the question has a knowable answer, and Shankar had her own prior judgment about what the data should say. That combination made it possible to catch agent failures that would be invisible in a purely automated pipeline.
What she found: agents paraphrase rather than analyze. In the fully automated condition, the number of open codes generated per tweet correlated with tweet length at 0.81. Longer tweets got more codes not because they contained more distinct complaints but because the agent restated the same complaint in multiple forms. When Shankar added a feedback channel and used it to tell the agent not to summarize, that correlation dropped to 0.15. The same text, the same agent, a single added instruction: the correlation fell from 0.81 to 0.15, a drop of roughly 80 percent in the measure of a core failure mode.
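The length-versus-code-count check described above can be reproduced on any coded corpus. The following is a minimal sketch, not Shankar's actual analysis code: it assumes the correlation is a Pearson coefficient and that tweet length is measured in characters, neither of which the summary here specifies. The function names and the input shape (a list of text and code-list pairs) are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def length_code_correlation(coded_tweets):
    """Correlate tweet length with the number of codes assigned to it.

    coded_tweets: list of (tweet_text, list_of_codes) pairs (hypothetical shape).
    Length is measured in characters here; that is an assumption.
    """
    lengths = [len(text) for text, _ in coded_tweets]
    code_counts = [len(codes) for _, codes in coded_tweets]
    return pearson(lengths, code_counts)
```

A value near 1 would suggest the agent is producing codes in proportion to how much text there is, which is the paraphrasing pattern described above; a value near 0 would suggest code counts are driven by something other than length.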
Code reuse showed the same pattern. In four of the six conditions, between 74 and 100 percent of codes were used exactly once across the entire corpus, meaning the agent effectively invented a new label for every passage instead of recognizing recurring themes. The only condition that produced meaningful reuse was a multi-agent setup where two independent coders worked from a predefined category set and then reconciled disagreements. That structure forced the kind of convergence that a human analyst builds naturally over repeated passes through the data.
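The reuse figure is straightforward to compute once each passage has a list of assigned codes. This is a minimal sketch under stated assumptions: it treats "used exactly once" as the share of distinct code labels that appear a single time across the corpus, and the function name and input shape are hypothetical rather than taken from Shankar's setup.

```python
from collections import Counter


def singleton_code_fraction(assignments):
    """Fraction of distinct codes that are used exactly once in the corpus.

    assignments: list of code lists, one list per passage (hypothetical shape).
    Returns 0.0 for an empty corpus.
    """
    usage = Counter(code for codes in assignments for code in codes)
    if not usage:
        return 0.0
    singletons = sum(1 for count in usage.values() if count == 1)
    return singletons / len(usage)
```

A fraction close to 1.0 means nearly every label is a one-off, so the codebook is not converging on recurring themes. Note that this exact-match count treats near-duplicate labels (for example, differing only in wording) as distinct codes, which is part of what makes the failure visible.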
The author's incentives are worth naming. Shankar is a researcher who works on AI-for-research tooling, so she has a stake in the answer being “agents can do this.” Her framing is optimistic. The finding is not. Agents can execute the mechanical stages of grounded theory at speed: batch processing, initial labeling, generating a memo. They cannot, at present, do the thing that makes qualitative analysis valuable, which is deciding what is interesting and what is just noise.
This gap maps directly onto a problem many product teams are building around without naming it. Any system that offers to “talk to your data,” whether that is customer support transcripts, sales call recordings, or user interview libraries, is running qualitative analysis at scale. The same taste-versus-mechanics split applies. The system can cluster and tag and generate a summary. Whether the clusters it surfaces are the ones that matter for a specific product decision, audience, or research frame is a judgment that current agents do not reliably make.
The parallel to LLM evaluation debates is close. The ongoing discussion about whether LLMs can reliably evaluate other LLMs runs into the same wall: scoring a response is mechanical; deciding what the right criteria are is not. In both domains, the hard part is the editorial layer, not the throughput layer.
For builders working on research or data-analysis products, Shankar’s experiments suggest the practical insertion point for human judgment is earlier than most pipelines assume. Not at the final summary review, but at the coding stage, where a single note redirecting the agent’s framing produces a materially different output. Teams that treat human review as a quality gate at the end will miss the leverage available at the beginning.
Posted by Shreya Shankar on her blog on 2026-05-22.