Inference now consumes the majority of compute spend at most AI teams running models in production, yet visibility into where that time and money go has lagged badly behind the tooling built for training. Graphsignal, an open-source inference profiling platform, is designed to close that gap by offering continuous profiling across the full serving stack, with overhead low enough to run against live traffic.
The tool wraps any inference workload with a single command. Engineers prefix their launch command with graphsignal-run, point it at a running inference engine such as vLLM or SGLang, and the profiler begins collecting operation durations, resource utilization, token throughput, and latency breakdowns at the per-step level. CUDA kernel activity is captured through NVIDIA’s CUPTI instrumentation layer using low-overhead APIs, and all analysis runs in a sidecar process so the main workload is not blocked.
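To make the per-step measurements concrete, here is a minimal Python sketch of how step durations and token throughput could be recorded around a decode loop. It is an illustration only and not Graphsignal's API: StepTimer, fake_decode_step, and run_steps are hypothetical names, and the sleep call stands in for real engine work.

```python
import time
from dataclasses import dataclass, field


@dataclass
class StepTimer:
    """Hypothetical recorder for per-step durations and token counts."""
    durations_s: list = field(default_factory=list)
    tokens: list = field(default_factory=list)

    def record(self, duration_s: float, n_tokens: int) -> None:
        self.durations_s.append(duration_s)
        self.tokens.append(n_tokens)

    def throughput_tokens_per_s(self) -> float:
        # Total tokens divided by total measured step time.
        total_time = sum(self.durations_s)
        return sum(self.tokens) / total_time if total_time > 0 else 0.0


def fake_decode_step(batch_size: int) -> int:
    """Stand-in for one engine decode step; returns tokens produced."""
    time.sleep(0.001)  # placeholder for real GPU work
    return batch_size  # assume one token per sequence per step


def run_steps(timer: StepTimer, n_steps: int, batch_size: int) -> None:
    for _ in range(n_steps):
        start = time.perf_counter()
        produced = fake_decode_step(batch_size)
        timer.record(time.perf_counter() - start, produced)


timer = StepTimer()
run_steps(timer, n_steps=5, batch_size=8)
print(f"steps={len(timer.durations_s)} "
      f"throughput={timer.throughput_tokens_per_s():.0f} tok/s")
```

A real profiler would gather these numbers from inside the engine and from CUPTI rather than from wall-clock timing in application code; the sketch only shows what "per-step" data looks like once collected.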
That architectural choice matters in production. Most profiling tools impose enough overhead that teams only run them episodically, getting a snapshot rather than a stream. Continuous high-resolution timelines tell a different story: they expose variance, reveal cold-start penalties, and surface hardware-specific bottlenecks that only appear under sustained load.
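The difference between a snapshot and a stream can be shown with a small Python sketch. The timeline below is synthetic data made up for illustration (three slow cold-start steps, an occasional spike, and an otherwise steady 10 ms step time), and summarize is a hypothetical helper, not part of any profiler.

```python
import statistics


def summarize(durations_ms):
    """Return median, nearest-rank p99, and standard deviation."""
    ordered = sorted(durations_ms)
    # Nearest-rank p99: rank = ceil(0.99 * n), computed with integer math.
    rank = max(1, -(-99 * len(ordered) // 100))
    return {
        "p50": statistics.median(ordered),
        "p99": ordered[rank - 1],
        "stdev": statistics.pstdev(ordered),
    }


# Synthetic per-step durations in milliseconds (assumed values).
timeline = []
for i in range(200):
    if i < 3:
        timeline.append(120.0)   # cold-start steps
    elif i % 50 == 0:
        timeline.append(45.0)    # periodic spike under sustained load
    else:
        timeline.append(10.0)    # steady state

snapshot = timeline[101:121]     # an episodic 20-step profiling window

snapshot_stats = summarize(snapshot)
continuous_stats = summarize(timeline)
cold_start_penalty_ms = timeline[0] - continuous_stats["p50"]

print("snapshot:  ", snapshot_stats)
print("continuous:", continuous_stats)
print("cold-start penalty (ms):", cold_start_penalty_ms)
```

In this synthetic case the 20-step window sees only steady-state steps and reports zero variance, while the full timeline surfaces both the spikes and the cold-start cost.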
Graphsignal covers GPU and CPU metrics alongside the inference engine layer, which means engineers can correlate application-level latency with device-level behavior without stitching together multiple observability tools. Error monitoring for device-level failures is included, closing the loop between infrastructure health and inference quality.
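As a rough illustration of what correlating the two layers involves, the Python sketch below joins application-level latency samples with device-level samples on a shared timestamp and computes a Pearson correlation. The sample values, the choice of GPU utilization as the device metric, and the helper names join_by_second and pearson are all assumptions made for this example; they do not describe how Graphsignal performs the correlation.

```python
import math


def join_by_second(latency_samples, device_samples):
    """Pair samples that fall in the same whole-second bucket."""
    device_by_ts = {int(ts): value for ts, value in device_samples}
    pairs = [
        (latency, device_by_ts[int(ts)])
        for ts, latency in latency_samples
        if int(ts) in device_by_ts
    ]
    return [p[0] for p in pairs], [p[1] for p in pairs]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# (timestamp_s, request latency in ms) -- made-up application-level samples
latency_samples = [(0.2, 10.0), (1.1, 12.0), (2.4, 20.0), (3.3, 31.0), (4.0, 40.0)]
# (timestamp_s, GPU utilization in %) -- made-up device-level samples
device_samples = [(0, 30.0), (1, 35.0), (2, 55.0), (3, 80.0), (4, 95.0), (5, 50.0)]

latencies, utilization = join_by_second(latency_samples, device_samples)
r = pearson(latencies, utilization)
print(f"paired samples: {len(latencies)}, correlation: {r:.3f}")
```

When the two streams come from separate tools, the join step is usually the hard part, since clocks and sampling intervals rarely line up; a single collector that timestamps both removes that work.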
The privacy posture is worth noting. Graphsignal records telemetry about how inference runs, not what it processes. Prompts and completions are not captured. For teams running models on sensitive data, that distinction separates a tool they can deploy in production from one that sits in staging. The profiler connects outbound only to the Graphsignal API; inbound connections are not supported.
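The distinction between recording how inference runs and what it processes can be sketched in a few lines of Python. The record below is hypothetical: InferenceTelemetry, to_telemetry, the field names, and the whitespace-based token count are assumptions for illustration and do not reflect Graphsignal's actual schema.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class InferenceTelemetry:
    """Hypothetical record: how the request ran, never what it contained."""
    model: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float


def to_telemetry(model, prompt, completion, latency_ms):
    # Only sizes and timings are derived; the text itself is discarded.
    return InferenceTelemetry(
        model=model,
        prompt_tokens=len(prompt.split()),          # crude whitespace count
        completion_tokens=len(completion.split()),  # crude whitespace count
        latency_ms=latency_ms,
    )


record = to_telemetry(
    "example-model",
    "patient id 4411 reports chest pain",
    "Recommend ECG",
    182.5,
)
payload = asdict(record)
print(payload)
```

Because the record type has no field that can hold text, the sensitive content cannot leak through the telemetry path even by accident, which is the property that matters for teams handling regulated data.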
Where does this sit relative to adjacent tooling? Tracing libraries such as OpenTelemetry and LLM-specific frameworks like LangSmith capture what a model produces and how calls chain together. Eval frameworks assess output quality. Graphsignal occupies the layer below both: it tells you how the inference hardware and engine are actually behaving, independent of what the model outputs. Teams running multiple model variants or multiple inference engines benefit most, because the tool spans heterogeneous stacks rather than locking to a single framework.
The coding-agent integration is a practical addition. Graphsignal exposes a skill that lets Claude Code, Codex, or Gemini fetch profiling signal context directly during a session, so an agent investigating a latency regression can pull the relevant profiling data without a human intermediary. That workflow becomes more valuable as teams automate inference optimization rather than doing it on an ad-hoc basis.
The release includes integration documentation for PyTorch, vLLM, and SGLang. Installation uses uv or pip with CUDA 12.x or 13.x variants. The project is available on GitHub under the graphsignal organization.
Teams that expect to run inference at scale in the next quarter should check whether their current observability stack covers hardware-level behavior or only application-level metrics. If it covers only the application level, the resulting blind spot becomes more expensive as serving volume grows.
Source: Graphsignal, open-source project documentation published at github.com/graphsignal/graphsignal-profiler.