OpenAI Benchmark

HumanEval: Measuring Functional Correctness in Code Generation

A hand-written evaluation set of 164 original programming problems used to benchmark the code generation capabilities of large language models.


What is HumanEval?

164

Hand-Written Problems

Each problem includes a function signature, docstring, body, and several unit tests, with an average of 7.7 tests per problem.

pass@k

Functional Correctness

Evaluates models using the pass@k metric: the probability that, for a given problem, at least one of k generated samples passes all of its unit tests.

∞

Language Agnostic

Originally written in Python, the benchmark has since been extended to other languages, including JavaScript, Java, Go, and Rust.
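
The pass@k definition above can be made concrete with a short sketch. It assumes the commonly used unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass every unit test. The names `pass_at_k`, `mean_pass_at_k`, `n`, and `c` are illustrative and do not come from this page.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: samples generated for the problem (assumed name)
    c: samples that passed all unit tests (assumed name)
    k: sample budget being scored
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the chance that a random size-k subset holds only failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)
```

For example, `pass_at_k(4, 1, 2)` considers the six ways to pick two of four samples. Three of those pairs contain only failing samples, so the estimate is 0.5.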

Benchmark Problems

Problems cover language comprehension, reasoning, algorithms, and simple mathematics. Each problem is self-contained and comes with its own unit tests.

# Example: HumanEval/0
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Check if in given list of numbers, are any two numbers closer to
    each other than given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """

Get Started

HumanEval is a widely used benchmark for evaluating code generation in LLMs. Generate code with AI and test your models today.
