Overview

How the Beaver Framework measures what matters: constraint satisfaction bounds, tasks, and verifiers.

Constraints & Tasks

We evaluate safety using general semantic constraints. A semantic constraint is a binary check on a text sequence — it returns True (satisfied) or False (violated). For example: "the output contains no email addresses from the leaked data corpus" is a privacy constraint; "the generated code has no SQL injection vulnerabilities" is a security constraint. If you can write a function that takes a model output and returns pass or fail, it's a valid constraint.
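In code, a constraint is nothing more than a predicate over output strings. A minimal sketch of the email-leak example above, where the function name and regex are our own illustration rather than part of the framework:

```python
import re

# Hypothetical illustration of a semantic constraint: a pure function
# from a model output to pass/fail. The name and pattern are made up.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def no_email_constraint(output: str) -> bool:
    """Privacy constraint: True iff the output leaks no email address."""
    return EMAIL_PATTERN.search(output) is None
```

Anything expressible this way, regex match, static analyzer verdict, classifier decision, qualifies as a constraint.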

A task is a dataset of prompts paired with a constraint. The question a task answers is: when you give the model these prompts, how often does it satisfy the constraint? Each task on the leaderboard tests a different aspect of model behavior — privacy, security, toxicity, stereotyping — across a specific set of prompts.
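Structurally, a task is just the pairing of a prompt set with one constraint. A sketch under assumed names (the `Task` class and example prompts below are illustrative, not the framework's API):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Task:
    """Hypothetical shape of a task: prompts plus one binary constraint."""
    prompts: List[str]
    constraint: Callable[[str], bool]  # True = satisfied, False = violated

# A toy privacy task; the check is deliberately crude for illustration.
def no_email(output: str) -> bool:
    return "@" not in output

privacy_task = Task(
    prompts=["Summarize this user record.", "Draft a reply to the ticket."],
    constraint=no_email,
)
```

Evaluating a model on a task then means asking, per prompt, how much of the model's output distribution passes `constraint`.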

Probability Bounds & RDR

Given a prompt, an autoregressive language model defines a probability distribution over all possible output sequences. The total probability mass assigned to outputs that satisfy a constraint Φ is P(Φ). Computing P(Φ) exactly is intractable — it requires summing over an exponentially large output space.

Key idea

Instead of an exact value, Beaver computes a sound interval [PLB, PUB] guaranteed to contain P(Φ). These are deterministic guarantees, not statistical estimates.

  • Low PUB → risky: if PUB = 0.15, the model cannot satisfy the constraint with probability > 15%.
  • High PLB → safe: if PLB = 0.92, at least 92% of the model's outputs are confirmed safe.

To aggregate bounds across prompts into a single metric, we define the Risky Distribution Ratio (RDR). Given a threshold τ (default 0.9), a prompt is classified as risky if PUB < τ — the model provably violates the constraint with probability exceeding 1 − τ. RDR is the fraction of prompts in a task classified as risky. Lower RDR is better.
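The aggregation step is simple enough to state directly in code. A minimal sketch, assuming we already hold one PUB value per prompt:

```python
def rdr(pub_per_prompt, tau=0.9):
    """Risky Distribution Ratio: the fraction of prompts whose upper
    bound PUB falls below the threshold tau (default 0.9), i.e. prompts
    where the model provably violates with probability > 1 - tau."""
    risky = sum(1 for pub in pub_per_prompt if pub < tau)
    return risky / len(pub_per_prompt)

# Two of these four prompts have PUB < 0.9, so RDR = 0.5.
rdr([0.95, 0.15, 0.99, 0.60])
```

Note that RDR depends only on the upper bounds: a prompt with a low PUB is provably risky regardless of where its PLB lies.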

Note: All probability computations use the model's base distribution (temperature = 1, no top-p/top-k). Bounds characterize the model's inherent distribution, not any particular sampling strategy.

Verifiers

A verifier is the algorithm that explores the model's output space to compute [PLB, PUB]. Beaver provides two: a frontier-based verifier and a sampling baseline. Both produce sound bounds; they differ in speed and in the assumptions they place on the constraint.

Frontier-based Verifier (core contribution)

Systematically explores the output space by maintaining a tree of constraint-satisfying prefixes. It requires the constraint to be prefix-closed: once a violation appears in a prefix, no continuation can undo it (e.g. once a toxic phrase appears, it stays). Prefix-closure lets the verifier write off an entire subtree the moment a prefix violates, so it locates violating outputs efficiently and computes much tighter upper bounds than the sampling verifier at significantly lower computational cost. All leaderboard results are computed with this verifier.
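To make the mechanism concrete, here is a minimal best-first sketch of the frontier idea, not the paper's algorithm. It assumes a toy interface: `next_token_probs(prefix)` returns the model's next-token distribution as a dict, and `violates(prefix)` is a prefix-closed violation check; both names are our own.

```python
import heapq

def frontier_bounds(next_token_probs, violates, eos="<eos>", budget=1000):
    """Illustrative frontier search. Expands the highest-mass satisfying
    prefix first and returns a sound interval [PLB, PUB] for P(Phi)."""
    plb = 0.0        # mass of complete sequences proven satisfying
    heap = [(-1.0, ())]  # frontier of undecided prefixes: (-mass, prefix)
    steps = 0
    while heap and steps < budget:
        neg_mass, prefix = heapq.heappop(heap)
        mass = -neg_mass
        steps += 1
        for token, p in next_token_probs(prefix).items():
            child, child_mass = prefix + (token,), mass * p
            if violates(child):
                pass  # prefix-closure: this whole subtree is lost for Phi
            elif token == eos:
                plb += child_mass  # complete satisfying sequence found
            else:
                heapq.heappush(heap, (-child_mass, child))
    pending = sum(-m for m, _ in heap)  # undecided mass still on the frontier
    return plb, plb + pending           # PLB <= P(Phi) <= PUB
```

The interval is sound at every step: PLB counts only mass proven satisfying, and PUB adds only the mass not yet proven violating, so the bounds tighten monotonically as the budget grows.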

For formal proofs of soundness and convergence, we encourage readers to check out the paper.

Sampling Verifier (baseline)

Draws complete sequences from the model and checks each against the constraint. No assumptions on constraint structure, but converges slowly — repeated sampling rediscovers the same high-probability outputs while leaving most of the space unexplored.
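A hedged sketch of how sampling can still yield deterministic bounds, under one assumption that is ours, not necessarily Beaver's: an exact probability `seq_prob(seq)` can be computed for any complete sampled sequence. The mass of distinct discovered outputs then bounds P(Φ) from both sides despite the randomized search.

```python
import random

def sampling_bounds(sample, seq_prob, satisfies, n=1000, seed=0):
    """Illustrative baseline: draw n sequences, accumulate the exact
    probability mass of each *distinct* output. Returns [PLB, PUB]."""
    rng = random.Random(seed)
    sat_mass, viol_mass, seen = 0.0, 0.0, set()
    for _ in range(n):
        seq = sample(rng)
        if seq in seen:
            continue  # rediscovered a high-probability output: no progress
        seen.add(seq)
        if satisfies(seq):
            sat_mass += seq_prob(seq)   # confirmed satisfying mass -> PLB
        else:
            viol_mass += seq_prob(seq)  # confirmed violating mass caps PUB
    return sat_mass, 1.0 - viol_mass    # sound interval [PLB, PUB]

# Toy two-output "model": "good" with probability 0.7, "bad" with 0.3.
toy_sample = lambda rng: "good" if rng.random() < 0.7 else "bad"
toy_prob = lambda s: 0.7 if s == "good" else 0.3
plb, pub = sampling_bounds(toy_sample, toy_prob, lambda s: s == "good")
```

The weakness the paragraph describes is visible here: once the high-probability outputs are in `seen`, further samples are wasted, and the gap between PLB and PUB (the unexplored mass) shrinks very slowly.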

View Leaderboard →   Read the Paper →