Your LLM already has the answer before it says a word. That is the core observation in a June 10 post by James Padolsey on blog.j11y.io, and it has direct cost implications for anyone running high-volume classification on top of a language model.
The standard pipeline for structural text classification looks like this: send the text to a model with a rubric, receive prose output, parse the result. It works. It is also slow and expensive, and it produces confidence scores that Padolsey describes as “vibes,” not calibrated probabilities. A judge’s “7/10” is not a probability of anything.
Hidden-state probes take a different path entirely. When a model reads a prompt that ends with a judgment cue, the forward pass performs the comparison between content and criterion before any token is generated. The decision sits in the model’s residual stream as geometry. Generation is just the model serializing a conclusion it has already reached. Padolsey’s technique skips that serialization step: take the hidden-state activation at the final prompt token, feed it to a small multilayer perceptron trained on a few thousand labeled examples, calibrate with isotonic regression, done. The frozen base model plus the probe returns a calibrated probability in tens of milliseconds for roughly the cost of an embedding lookup.
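To make the calibration step concrete, here is a minimal sketch of isotonic regression using the pool-adjacent-violators algorithm in plain Python. It is not Padolsey's code: the function names `fit_isotonic` and `calibrate` and the toy scores are illustrative assumptions, and it presumes the MLP has already produced one raw score per labeled example.

```python
from bisect import bisect_left


def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function from raw probe score to probability.

    Uses pool-adjacent-violators. `scores` are raw probe outputs and
    `labels` are 0/1 ground truth (both hypothetical inputs here).
    """
    # Group tied scores so each distinct score starts as one block.
    grouped = {}
    for score, label in zip(scores, labels):
        total, count = grouped.get(score, (0.0, 0))
        grouped[score] = (total + label, count + 1)

    # Each block holds [sum_of_labels, count, highest_score_in_block].
    blocks = []
    for score in sorted(grouped):
        total, count = grouped[score]
        blocks.append([total, count, score])
        # Merge backwards while the previous block's mean exceeds this one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper

    thresholds = [block[2] for block in blocks]
    values = [block[0] / block[1] for block in blocks]
    return thresholds, values


def calibrate(score, thresholds, values):
    """Map a new raw score to a calibrated probability, clipping at the ends."""
    index = bisect_left(thresholds, score)
    return values[min(index, len(values) - 1)]


# Toy data: raw scores rank the examples imperfectly (0.2 vs 0.3 are swapped).
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
truth = [0, 1, 0, 1, 1, 1]
thresholds, values = fit_isotonic(raw_scores, truth)
print(calibrate(0.25, thresholds, values))
```

The fitted map is monotone by construction, so a higher raw score never yields a lower probability. With a few thousand labeled examples the steps become fine enough that the output can be read as an empirical frequency rather than a rank.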
The important detail is how the training set is structured. The criterion varies across examples, so the probe learns to answer “does this content satisfy the criterion” in general, not for any specific criterion. At inference time, you write the criterion in plain English. The same frozen model and the same tiny head handle sentiment, toxicity, intent, topic, sarcasm, and structural questions that embedding models miss entirely.
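One way to check that a probe has learned the general question rather than a handful of specific criteria is to hold out entire criteria at evaluation time. The sketch below illustrates that idea under stated assumptions: the `Triple` record, the `split_by_criterion` helper, and the sample rows are all hypothetical, and the post does not say this is how Padolsey evaluates.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One training example: does `content` satisfy `criterion`?"""
    content: str
    criterion: str
    label: int  # 1 = satisfies the criterion, 0 = does not


def split_by_criterion(triples, holdout_fraction=0.25, seed=0):
    """Split so that no criterion appears in both train and test sets."""
    criteria = sorted({t.criterion for t in triples})
    rng = random.Random(seed)
    rng.shuffle(criteria)
    n_holdout = max(1, int(len(criteria) * holdout_fraction))
    held_out = set(criteria[:n_holdout])
    train = [t for t in triples if t.criterion not in held_out]
    test = [t for t in triples if t.criterion in held_out]
    return train, test


# Hypothetical rows: the criterion changes from example to example.
triples = [
    Triple("I loved every minute of it.", "expresses positive sentiment", 1),
    Triple("The package arrived late.", "expresses positive sentiment", 0),
    Triple("Cancel my subscription now.", "requests an account change", 1),
    Triple("What time do you open?", "requests an account change", 0),
    Triple("Oh great, another outage.", "is sarcastic", 1),
    Triple("Thanks for the quick fix.", "is sarcastic", 0),
    Triple("Here is the quarterly report.", "mentions a deadline", 0),
    Triple("Submit the form by Friday.", "mentions a deadline", 1),
]
train, test = split_by_criterion(triples)
print(len(train), len(test))
```

If accuracy on the held-out criteria stays close to accuracy on the training criteria, that is evidence the head is reading a general "satisfies the criterion" signal instead of memorizing individual tasks.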
This sits in an interesting position relative to the current debate about where to spend compute. Test-time-compute research from labs like OpenAI argues that spending more on generation at inference time unlocks reasoning capability. That is correct for hard, open-ended tasks. Padolsey’s point is that a large class of production tasks is not hard or open-ended: it is classification, and it can be handled with a forward pass and a probe. Both arguments are correct at different points on the task spectrum. Frontier reasoning needs more generation compute; commodity classification needs less, perhaps none.
This maps onto the routing conversation that has accelerated since DeepSeek began taking token volume from frontier providers. Teams routing simpler calls to cheaper models are already making the same underlying bet: not every request needs the full generative pipeline. Hidden-state probes are the logical endpoint of that bet for classification workloads. They use the model’s understanding without its mouth.
The technique is not without constraints. Probes require a labeled calibration set, which means you need ground-truth data for the task. They can be brittle when the input distribution shifts significantly from training. Padolsey notes that KV-caching for multi-criterion scoring introduces a cross-encoder versus late-interaction tradeoff: cached content is encoded before the model sees the criterion, which loses something on complex counterfactuals. These are known failure modes, not dealbreakers, and Padolsey is candid about them.
The implementation components are commodity: a small open-weight model (Padolsey uses IBM’s Granite 4.0 micro, a model in the few-billion-parameter range), a fixed prompt template ending with a judgment seed token, a few thousand frontier-generated training triples, an MLP, and isotonic regression. He estimates the plumbing at an afternoon of work.
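The fixed prompt template is the piece that makes the final-token activation meaningful, because every input ends on the same judgment cue. A minimal sketch of such a template follows. The wording of `TEMPLATE`, the `Answer:` seed, and the `build_prompt` helper are assumptions for illustration, not the template Padolsey uses.

```python
# Hypothetical template: the exact wording and the "Answer:" seed are
# assumptions. What matters is that every prompt ends on the same cue.
JUDGMENT_SEED = "Answer:"
TEMPLATE = (
    "Content:\n{content}\n\n"
    "Criterion: {criterion}\n\n"
    "Does the content satisfy the criterion? " + JUDGMENT_SEED
)


def build_prompt(content: str, criterion: str) -> str:
    """Render one probe prompt; the criterion is written in plain English."""
    return TEMPLATE.format(content=content.strip(), criterion=criterion.strip())


prompt = build_prompt(
    "Oh great, another outage.",
    "The message is sarcastic.",
)
print(prompt)
```

Because the suffix never changes, the hidden state read at the last prompt token always corresponds to the same point in the model's computation, which is what lets a single small head serve many criteria.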
Padolsey built this to power a tool called Predicate inside NOPE’s content safety stack, where running a structural question against every message in every conversation made LLM-judge economics unworkable. The technique generalizes to any scenario where you are paying for generation on tasks that only need classification: content moderation, retrieval filtering, routing decisions, intent detection.
The builder takeaway is direct. Audit the LLM calls in your production stack. Any call where you send a prompt, receive text, and then parse that text for a yes/no or categorical answer is a probe candidate. At scale, the per-unit cost difference between generation and a hidden-state probe is large enough to change the economics of a product.
James Padolsey, blog.j11y.io, published June 10, 2026.