HumanEval is a hand-written evaluation set of 164 original programming problems used to benchmark the code generation capabilities of large language models.
Each problem includes a function signature, docstring, body, and several unit tests, with an average of 7.7 tests per problem.
Models are evaluated with the pass@k metric: the probability that at least one of k generated samples passes all unit tests.
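As a rough illustration of how pass@k can be estimated in practice, the sketch below uses the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass all unit tests. The function name and the parameter names n and c are assumptions made for this example, not something the text above specifies.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples generated for the problem (assumed n >= k)
    c: number of those samples that passed all unit tests
    k: the k in pass@k
    """
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k
        # samples must contain at least one passing sample.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples, 5 of which pass.
estimate_k1 = pass_at_k(n=10, c=5, k=1)
estimate_k2 = pass_at_k(n=4, c=1, k=2)
```

A benchmark-level score would then be the mean of this per-problem estimate over all problems.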
Originally written in Python, the benchmark has since been extended to other languages, including JavaScript, Java, Go, and Rust.
Problems cover language comprehension, reasoning, algorithms, and simple mathematics. Each problem is self-contained with comprehensive test cases.
# Example: HumanEval/0
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Check if in given list of numbers, are any two numbers closer to
    each other than given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
HumanEval is a widely used benchmark for evaluating code generation in LLMs.