GitHub published the GitHub Multilingual Repositories Dataset on June 15, making available a repository-level metadata index of over 40 million public repositories with evidence of non-English natural-language content. The dataset is released under CC0-1.0, meaning researchers and developers can use, modify, and redistribute it without restriction.

The scarcity of domain-specific multilingual training data is a real bottleneck for anyone building code-adjacent language models outside an English-first context. General web crawls like Common Crawl capture broad natural language but miss the precise vocabulary of software collaboration: installation instructions, bug report patterns, feature request phrasing, and review comment conventions. Developer content carries its own register, and that register has been almost entirely English in the datasets most labs actually use for training and evaluation.

The dataset does not ship repository source code. It is a metadata and classification layer. For each public repository, the index records the detected language of three text artifacts: a project’s documentation file, the issue thread that drew the heaviest discussion, and the pull request with the most comments. Only the opening 150 characters of each artifact are passed to the detectors; samples shorter than 20 characters are dropped, and any guess below 0.5 confidence is filtered out. Across all repositories, that adds up to more than 80 million classification rows.
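The sampling and filtering rules described above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds, not GitHub's pipeline: the function and constant names are hypothetical, and it assumes the 20-character minimum is checked on the truncated sample without any whitespace stripping, which the announcement does not specify.

```python
from typing import Optional

SAMPLE_CHARS = 150     # only the opening 150 characters are classified
MIN_CHARS = 20         # samples shorter than this are dropped
MIN_CONFIDENCE = 0.5   # guesses below this confidence are filtered out


def prepare_sample(text: str) -> Optional[str]:
    """Return the slice a detector would see, or None if it is too short.

    Assumption: length is measured on the raw truncated text.
    """
    sample = text[:SAMPLE_CHARS]
    return sample if len(sample) >= MIN_CHARS else None


def keep_guess(confidence: float) -> bool:
    """Keep a detector's guess only if it reaches the confidence floor."""
    return confidence >= MIN_CONFIDENCE
```

Under these assumptions, a 200-character README opening is cut to 150 characters before classification, and a 5-character issue title never reaches a detector.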

Every text artifact is passed through three detectors that run independently, and each detector attaches its own confidence score. Two of them, Meta’s fastText and Google’s gcld3, are widely used in the field; the third is the lingua-py library. GitHub chose not to merge the three verdicts into one consolidated label. The practical benefit is precision control: a team building a high-confidence Greek evaluation set can require all three classifiers to agree, while a researcher doing exploratory work on Romance language representation can accept a label from any one of them. That design choice matters most for lower-resource languages, where classifier calibration diverges the most.
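Because the three verdicts stay separate, the agreement threshold becomes a parameter the consumer picks. A minimal sketch of that idea follows; the row layout (a mapping from detector name to language code) and the function name are assumptions made for illustration, not the dataset's actual schema.

```python
def matches_language(verdicts: dict[str, str], target: str, min_agree: int) -> bool:
    """True if at least `min_agree` detectors labelled the artifact as `target`."""
    votes = sum(1 for lang in verdicts.values() if lang == target)
    return votes >= min_agree


# Hypothetical rows: detector name -> detected language code.
unanimous = {"fasttext": "el", "gcld3": "el", "lingua": "el"}
split = {"fasttext": "pt", "gcld3": "es", "lingua": "pt"}

# High-precision Greek set: all three detectors must agree.
strict_greek = matches_language(unanimous, "el", min_agree=3)

# Exploratory Portuguese search: one vote is enough.
loose_portuguese = matches_language(split, "pt", min_agree=1)

# The same split row is rejected under the strict rule.
strict_portuguese = matches_language(split, "pt", min_agree=3)
```

Raising `min_agree` trades recall for precision, which is the control the separate verdicts are meant to provide.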

Each repository entry also includes standard metadata: creation timestamp, disk usage, star and fork counts, primary programming language, SPDX license identifier, and issue and PR volume. That metadata lets researchers filter by activity level or by language ecosystem before touching any classification data.
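A metadata pre-filter of that kind might look like the sketch below. The record type and its field names are hypothetical stand-ins for the metadata listed above; the real column names are not given in the announcement.

```python
from dataclasses import dataclass


@dataclass
class RepoMeta:
    # Hypothetical field names, chosen to mirror the metadata described above.
    name: str
    stars: int
    issue_count: int
    primary_language: str
    license_spdx: str


def prefilter(repos: list[RepoMeta], min_stars: int, ecosystem: str) -> list[RepoMeta]:
    """Keep repositories in one language ecosystem above an activity floor."""
    return [
        r for r in repos
        if r.stars >= min_stars and r.primary_language == ecosystem
    ]


# Made-up example records, for illustration only.
repos = [
    RepoMeta("example/a", 120, 40, "Python", "MIT"),
    RepoMeta("example/b", 3, 1, "Python", "Apache-2.0"),
    RepoMeta("example/c", 500, 90, "Rust", "MIT"),
]

active_python = prefilter(repos, min_stars=50, ecosystem="Python")
```

Only the surviving repositories would then need their classification rows looked up.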

The distributions themselves are not uniform across content types. GitHub noted that Korean is the most common non-English language in issue text but ranks only fifth in READMEs. Portuguese leads the non-English README category with more than 3 million repositories. Those gaps matter for evaluation set design: a model that passes an English-focused coding benchmark may still fail on a Korean issue thread or a Portuguese project walkthrough, and this dataset gives teams the infrastructure to construct those tests without starting from scratch.

GitHub acknowledged the caveats directly. A 150-character sample is often too short to characterize a whole repository, particularly when that text contains badges, code snippets, or mixed-language content. The dataset is positioned as a discovery tool, not a ground-truth language benchmark.

The timing is not incidental. Back in 2025, Microsoft pledged through a set of European policy commitments to widen access to language data for open-source builders, and this release delivers on part of that promise. GitHub said it would present the work at a Strasbourg policy gathering on June 16, an event it is staging with the Council of Europe and Microsoft’s open innovation arm.

Teams building code-completion or documentation tools for non-English markets, and model builders constructing multilingual evaluation sets, should evaluate this dataset as a starting index: it is the most structured public signal currently available for locating developer-context text at the repository level.

Source: GitHub Blog, published June 15, 2026, authored by Natalie Guevara.