Seed-Coder

Benchmark performance of instruct and reasoning variants of Seed-Coder-8B.

Introduction

We are thrilled to introduce Seed-Coder, a powerful, transparent, and parameter-efficient family of open-source code models at the 8B scale, featuring base, instruct, and reasoning variants. Seed-Coder contributes to promote the evolution of open code models through the following highlights.

1

Model-centric: Seed-Coder predominantly leverages LLMs instead of hand-crafted rules for code data filtering, minimizing manual effort in pretraining data construction.

2

Transparent: We openly share detailed insights into our model-centric data pipeline, including methods for curating GitHub data, commits data, and code-related web data.

3

Powerful: Seed-Coder achieves state-of-the-art performance among open-source models of comparable size across a diverse range of coding tasks.

Seed-Coder demonstrates that, with minimal human effort, LLMs can effectively curate code training data by themselves to drastically enhance coding capabilities. 🚨 The Bitter Lesson holds true — this time in code LLMs!

Seed-Coder represents our initial step towards contributing to the open-source LLM ecosystem. We look forward to seeing Seed-Coder drive advances in code intelligence and empower broader applications in the open-source LLM community!

Code Pretraining Data Recipe

The diagram below illustrates our model-centric code data pipeline, where LLMs play a central role in filtering and curating diverse data sources — including GitHub code, GitHub commits, and code-related web data — with minimal reliance on hand-crafted rules.

Processing pipeline for pretraining data

Processing pipeline for producing the pretraining data of Seed-Coder.

LLM Filters vs. Rule-based Filters

Current open-source code LLMs often heavily rely on human effort to produce their code pretraining data, such as using hand-crafted rules as code quality filters. Instead, our model-centric data pipeline leverages LLMs for scoring and filtering code data.

1

Rule-based filters for code quality are prone to subjective biases, limited in scalability, and costly to extend and maintain across diverse programming languages.

2

LLM filters can effectively capture nuanced standards of code quality that are difficult to quantify explicitly, and provide scalable, self-contained, consistent judgement of code quality.

Below are two real-world examples illustrating cases where LLM-based filter differs traditional rule-based filters.

Left: this Python snippet is incorrectly dropped by rule-based filters due to a high numeric-character ratio. In contrast, LLM filter identifies that the code serves a meaningful purpose: showing a Pikachu image on an LED display.

Right: this visually well-structured Python script passes rule-based filters. However, LLM-based filter identifies logical error (if temp < 0 and temp > 1) and an inconsistent docstring (it claims to return a string but actually returns None), accurately rejecting this low-quality code.

Python snippet for LED display controlling.
LLM: ✅ vs. Rules: ❌
(LLM correctly accepts; Rules incorrectly reject)

Visually decent but logically flawed Python script.
LLM: ❌ vs. Rules: ✅
(LLM correctly rejects; Rules incorrectly accept)

Performance

We briefly showcase benchmark performance of Seed-Coder-8B-Instruct and Seed-Coder-8B-Reasoning as follows. More evaluation results can be found in our Technical Report.

💬 Seed-Coder-8B-Instruct

Seed-Coder-8B-Instruct demonstrates superior capabilities in solving complex software engineering tasks.

Our model achieves the best performance among ~8B models on both SWE-bench Verified and Multi-SWE-bench mini, even outperforming some much larger models. Notably, Seed-Coder excels with both pre-defined workflows (Agentless) and fully autonomous agent (OpenHands).

Performance of instruct models on SWE-bench Verified and Multi-SWE-bench mini

Performance on SWE-bench Verified and Multi-SWE-bench mini.

🧠 Seed-Coder-8B-Reasoning

Seed-Coder-8B-Reasoning strikes impressive performance on competitive programming, demonstrating that smaller LLMs can also be competent on complex reasoning tasks. Our model surpasses QwQ-32B and DeepSeek-R1 on IOI'2024, and achieves an ELO rating comparable to o1-mini on Codeforces contests.

Performance on IOI'2024.

Performance on Codeforces.

Seed-Coder : Let the Code Model Curate Data for Itself