Current open-source code LLMs often heavily
rely on human effort to produce their code pretraining data, such as using hand-crafted rules as code quality filters.
Instead, our model-centric data pipeline leverages LLMs for scoring and filtering code data.
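To make the idea concrete, below is a minimal sketch of such an LLM-based quality filter. The prompt wording, the 0-10 score range, the threshold, and the `llm_complete` callable are all illustrative assumptions, not the exact setup of our pipeline.

```python
from typing import Callable

# Illustrative prompt; the actual scoring rubric is an assumption.
PROMPT = (
    "Rate the quality of the following code on a scale of 0-10, "
    "considering correctness, readability, and usefulness. "
    "Reply with a single integer.\n\n{code}"
)

def score_code_quality(code: str, llm_complete: Callable[[str], str]) -> int:
    """Ask an LLM for a 0-10 quality score of a code snippet."""
    reply = llm_complete(PROMPT.format(code=code))
    # Extract the first run of digits from the reply; default to 0 if none.
    digits = "".join(ch for ch in reply if ch.isdigit())
    return int(digits) if digits else 0

def keep_for_pretraining(code: str, llm_complete: Callable[[str], str],
                         threshold: int = 6) -> bool:
    """Keep the snippet only if its LLM score clears the threshold (assumed value)."""
    return score_code_quality(code, llm_complete) >= threshold
```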
Below are two real-world examples illustrating cases where the LLM-based filter differs from traditional rule-based filters.
Left: this Python snippet is incorrectly dropped by rule-based filters due to a high numeric-character ratio.
In contrast, the LLM-based filter identifies that the code serves a meaningful purpose: showing a Pikachu image on an LED display.
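For reference, a hedged sketch of the kind of rule-based filter that misfires here is shown below; the exact ratio definition and the 0.3 threshold are assumptions for illustration, not the rules used by prior pipelines.

```python
def numeric_char_ratio(code: str) -> float:
    """Fraction of non-whitespace characters that are digits."""
    chars = [c for c in code if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isdigit() for c in chars) / len(chars)

def rule_based_keep(code: str, max_ratio: float = 0.3) -> bool:
    """Drop files dominated by digits -- a rule that misfires on
    pixel-art data such as a hard-coded Pikachu bitmap for an LED display."""
    return numeric_char_ratio(code) <= max_ratio
```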
Right: this visually well-structured Python script passes rule-based filters.
However, the LLM-based filter identifies a logical error (if temp < 0 and temp > 1) and an inconsistent docstring (it claims to return a string but actually returns None), accurately rejecting this low-quality code.
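A hypothetical snippet resembling this right-hand example is sketched below: visually clean Python that rule-based filters pass, yet containing exactly the flaws described above. The function name and surrounding details are invented for illustration.

```python
def classify_temperature(temp: float):
    """Return a string label for the temperature."""
    if temp < 0 and temp > 1:   # logical error: no value satisfies both conditions
        print("out of range")   # prints instead of returning a label
    # Falls through and implicitly returns None,
    # contradicting the docstring's claim of returning a string.
```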