Krea, the creative AI platform, has published a technical report for Krea 2, a series of open-weight text-to-image foundation models released under the names K2 Raw and K2 Turbo. The release lands at a moment when the image-generation market is crowded with capable models that, by Krea’s own assessment, have converged on a narrow band of acceptable-looking output.
That convergence is the animating problem behind K2. Conventional diffusion and flow-matching models have optimized hard for reliability: sharp photorealism, stable structure, dense text rendering. The side effect is that they push toward a default aesthetic, a kind of statistical middle ground that suits production pipelines but frustrates anyone trying to search across styles, moods, and visual directions before committing to one. Krea’s technical report frames this explicitly: the goal is not a more polished output but a more navigable output space.
Multi-stage training as the mechanism
Krea 2 is trained through five sequential stages: pretraining, midtraining, supervised finetuning (SFT), preference optimization, and reinforcement learning. Each stage is designed to push the model’s output distribution toward a specific property, with RL used as the final layer of refinement rather than a shortcut to capability. The architecture is a diffusion transformer (DiT) that borrows conventions from the LLM world, including grouped-query attention (GQA), SwiGLU layers, gated sigmoid attention for training stability, and Qwen3-VL as the text encoder. The team ran systematic ablations across every major architectural choice before settling on the final design.
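Two of the LLM-style components named above, SwiGLU and grouped-query attention, can be illustrated with a minimal pure-Python sketch. This is not Krea's implementation: the function names, the weight layout (lists of rows), and the head counts in the usage note are assumptions chosen for illustration only.

```python
import math


def silu(x: float) -> float:
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(w, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def swiglu(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward layer: W_down @ (SiLU(W_gate @ x) * (W_up @ x)).

    The weight names are hypothetical; real models also batch this
    over tokens and usually omit biases.
    """
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Grouped-query attention: map a query head to the K/V head it shares.

    Query heads are split into n_kv_heads equal groups, and every head
    in a group attends using the same key/value head.
    """
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be divisible by n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size
```

With 8 query heads and 2 key/value heads (illustrative numbers, not K2's configuration), query heads 0 to 3 share key/value head 0 and heads 4 to 7 share head 1, which shrinks the key/value cache by the group size.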
Data curation built around stylistic diversity
Most image-generation pipelines filter aggressively for aesthetic quality, using models trained to surface “good” images. Krea argues this creates a problem: aesthetic-score models encode their own biases, classifying motion blur or deliberate softness as low quality when those are valid stylistic choices. K2’s pretraining data is filtered on a narrower set of criteria: duplicates, images where the captioning model fails to capture what matters, samples that introduce systematic artifacts or biases, and images generated by AI. That last exclusion is notable. Krea found that even a small proportion of synthetic images in the training mix introduces a ceiling on model quality, because synthetic images are easier to learn and pull the model’s distribution toward them. The team built in-house classifiers to detect and remove AI-generated samples.
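The filtering criteria described above can be sketched as a simple pass over candidate samples. This is an illustration under stated assumptions, not Krea's pipeline: the `Sample` fields, the `ai_threshold` value, and the use of exact SHA-256 hashing for deduplication are all hypothetical (a production system would likely need near-duplicate detection as well).

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    image_bytes: bytes
    caption_ok: bool           # hypothetical: captioner captured what matters
    has_artifacts: bool        # hypothetical: systematic artifact or bias flag
    ai_generated_score: float  # hypothetical classifier output in [0, 1]


def filter_samples(samples, ai_threshold=0.5):
    """Keep samples that pass the four exclusion criteria.

    Note what is absent: there is no aesthetic-score cutoff, so an image
    with motion blur or deliberate softness is kept if it passes the rest.
    """
    seen_digests = set()
    kept = []
    for sample in samples:
        # 1. Drop byte-identical duplicates (assumption: exact match only).
        digest = hashlib.sha256(sample.image_bytes).hexdigest()
        if digest in seen_digests:
            continue
        seen_digests.add(digest)
        # 2. Drop images the captioning model failed to describe.
        if not sample.caption_ok:
            continue
        # 3. Drop samples that introduce systematic artifacts or biases.
        if sample.has_artifacts:
            continue
        # 4. Drop samples the classifier considers AI-generated.
        if sample.ai_generated_score >= ai_threshold:
            continue
        kept.append(sample)
    return kept
```

Because the report says even a small proportion of synthetic images lowers the quality ceiling, a real threshold would presumably be tuned to favor recall on AI-generated content; the default of 0.5 here is only a placeholder.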
Midtraining uses a different philosophy: instead of starting from a broad pool and filtering down, domains and sources are chosen first, then sampled to prioritize long-tail visual concepts and world-knowledge coverage. A sparse autoencoder trained on SigLIP-2 embeddings provided an unsupervised tagging system for identifying and filtering artifact patterns without requiring a dedicated classifier for each pattern.
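The idea of using sparse-autoencoder features as unsupervised tags can be sketched as follows. The encoder form (a ReLU over an affine map), the top-k selection, and the notion of a hand-reviewed set of artifact feature indices are assumptions for illustration; the report as summarized here does not specify these details.

```python
def sae_encode(embedding, w_enc, b_enc):
    """Sparse autoencoder encoder: ReLU(W_enc @ embedding + b_enc).

    w_enc is a list of rows, one per learned feature (hypothetical layout).
    """
    return [
        max(0.0, sum(w * x for w, x in zip(row, embedding)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def top_k_tags(activations, k):
    """Return indices of the k strongest active features, strongest first."""
    active = [(a, i) for i, a in enumerate(activations) if a > 0.0]
    active.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in active[:k]]


def is_flagged(tags, artifact_features):
    """True if any tag is a feature index previously judged to mark an artifact."""
    return any(tag in artifact_features for tag in tags)
```

In this sketch, someone inspects what each feature fires on once, marks the indices that correspond to artifact patterns, and then any image whose embedding activates one of those features can be filtered without training a separate classifier for that pattern.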
Prompt expander and style-reference system
Two inference-time systems address a gap Krea describes as structural to image generation: the model learns from rich, detailed captions during training, but users at inference time write shorter, vaguer, more personal prompts. The prompt expander maps underspecified inputs into richer visual directions, trained through SFT and RL to encourage variation rather than collapse to a single interpretation. The objective is not just better output quality; it is more output spread.
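One way to make "output spread" concrete is to embed several generations for the same prompt and measure how far apart they are. The sketch below uses mean pairwise cosine distance; this metric and the function names are assumptions, not something the report is known to use, and the embeddings would have to come from an image encoder of the evaluator's choosing.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def mean_pairwise_distance(embeddings):
    """Average cosine distance over all unordered pairs of embeddings.

    Returns 0.0 when there are fewer than two embeddings.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)
```

A value near zero would indicate that the outputs for a prompt collapse to one interpretation, while larger values would indicate more varied outputs. The number is only comparable across models when the same encoder and prompts are used.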
The style-reference system handles the case where text is insufficient for expressing visual intent. Users can supply one or more reference images and the model extracts the style or mood with minimal content leakage. Fine-grained controls adjust style strength and allow weighted mixing across multiple references.
Together, the two systems are designed to let users steer from both text and image inputs without requiring the model to choose a default.
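The weighted mixing of style references can be pictured as a weighted average of per-image style embeddings, scaled by a strength control. This is a hypothetical sketch: how K2 actually represents style, combines references, or applies strength is not described here, and `mix_styles` and its parameters are invented for illustration.

```python
def mix_styles(style_embeddings, weights, strength=1.0):
    """Blend style embeddings by normalized weights, then scale by strength.

    style_embeddings: list of equal-length vectors, one per reference image.
    weights: non-negative mixing weights, one per reference (need not sum to 1).
    strength: value in [0, 1]; 0 disables the style signal in this sketch.
    """
    if not style_embeddings or len(style_embeddings) != len(weights):
        raise ValueError("need one weight per style embedding")
    if any(w < 0 for w in weights) or sum(weights) == 0:
        raise ValueError("weights must be non-negative and not all zero")
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")

    total = sum(weights)
    dim = len(style_embeddings[0])
    mixed = [
        sum((w / total) * emb[i] for emb, w in zip(style_embeddings, weights))
        for i in range(dim)
    ]
    return [strength * value for value in mixed]
```

For example, weights of 3 and 1 across two references give the first reference three quarters of the blend, and halving `strength` halves the resulting style vector before it would be passed to the model as conditioning.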
Where K2 sits
Krea reports that K2 ranks in the top 10 on the Artificial Analysis text-to-image leaderboard and second among models from independent labs. The report does not include independent third-party evaluations of the stylistic diversity claims, which are harder to capture in standard benchmarks than photorealism or prompt adherence. That gap between the report’s framing and what benchmarks actually measure is worth tracking.
Teams building image-generation products where style control is a differentiator should evaluate K2 directly against their use cases. The open weights make that possible without API costs. Because the prompt expander was trained with RL to encourage variation, its behavior under diverse creative prompts is likely to be more informative than leaderboard scores alone.
Source: Krea technical report for Krea 2 (K2), published June 2026 at krea.ai/blog/krea-2-technical-report.