Hugging Face published a guide this week showing developers how to run a full vLLM inference server on its serverless Jobs infrastructure using a single terminal command, no Kubernetes cluster required. The endpoint surfaces the OpenAI-compatible API, meaning any client already pointed at OpenAI can be redirected to a self-hosted model by changing one URL and one API key.

The command itself is a thin wrapper: hf jobs run behaves like docker run executed on Hugging Face’s GPU fleet. Developers specify a hardware flavor (--flavor a10g-large for the entry tier, H200 pairs for larger workloads), pass a timeout, expose a port, and supply a standard vLLM container image. Within minutes the server URL is printed to the terminal and the endpoint accepts bearer-token requests authenticated via an HF token. The Hugging Face blog post by Quentin Gallouedec details the full flow, including a Python snippet that swaps the OpenAI client’s base_url with no other code changes.

Pricing is billed per second on hardware usage. Gallouedec notes that an A10G large instance runs at $1.50 per hour; the post links to a hf jobs hardware command for the full price list. Critically, the endpoint auto-stops at the developer-set timeout and can be cancelled explicitly to avoid idle billing. That per-second model is the key economic difference from provisioning a dedicated server or a standing Inference Endpoint: you pay for a two-hour evaluation window, not a monthly reserved instance.

The distinction between HF Jobs and Hugging Face Inference Endpoints (the platform’s managed serving product) is spelled out directly in the post. Jobs offer raw flexibility: any image, any vLLM flag, any hardware combination, billed only while running. Inference Endpoints add scale-to-zero billing during idle periods, finer access control (public, protected, or private), and the operational defaults a long-lived production service needs. The practical split is experiments and batch generation on Jobs; persistent production APIs on Endpoints. Developers who have avoided self-hosting because of standing infrastructure costs now have a middle path: the operational simplicity of managed inference at the price of a short-lived job.

The pattern also scales non-trivially. The guide walks through running Qwen3.5-122B (a 122-billion-parameter mixture-of-experts model) across two H200 GPUs using vLLM’s tensor-parallel sharding flag. That is a frontier-scale deployment achievable from a single terminal command and a credit balance. The release announcement does not include latency benchmarks or throughput comparisons against commercial inference APIs, so teams will need to run their own evals before committing to Jobs for latency-sensitive workloads.

One practical detail worth noting: the exposed endpoint is gated by HF token authentication and scoped to the job owner’s namespace. It is not a public URL. That access model is appropriate for individual evaluation but means teams needing multi-user or external access must add a proxy layer, or graduate to Inference Endpoints. The post flags this explicitly and does not oversell the Jobs approach for production use.

The broader implication is a shift in the baseline assumption about self-hosted inference. For most of the past two years, running your own model meant provisioning compute, managing containers, and absorbing fixed monthly costs even during low-usage periods. A pay-per-second serverless layer reduces that friction to an SDK install and a credit card. Teams currently evaluating whether to self-host a model for evals, red-teaming, or batch generation should benchmark their actual usage patterns against the per-hour Jobs rate before defaulting to a managed API at per-token pricing.

Announced on the Hugging Face blog on June 26, 2026, in a post by Quentin Gallouedec.