Reproducibility

This repository keeps the benchmark, checked-in reference results, figure-generation scripts, and release assets together so users can verify both the code and the published numbers.

Reproduction Levels

Goal	Command	Cost
Smoke test the install	`asb evaluate --suite mini --provider mock -d D0 -o results/smoke`	Free
Validate the checked-in paper comparison pipeline	`./scripts/reproduce_main_table.sh --provider mock`	Free
Rebuild submission figures from checked-in metrics	`./scripts/reproduce_all_figures.sh`	Free
Re-run the full reference sweep on a real provider	`./scripts/reproduce.sh --provider openai-compatible --base-url ... --model gpt-4o`	Paid API usage

Reference Environment

The validated maintainer snapshot is pinned in:

requirements/reproducibility-1.0.2.txt

Use it when you want a close match to the environment used to validate the v1.0.2 release.

python -m venv .venv
source .venv/bin/activate
pip install -r requirements/reproducibility-1.0.2.txt
pip install -e .

The repository is also exercised in CI on Python 3.10, 3.11, and 3.12.
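To confirm that an existing environment still matches the pinned snapshot, the installed package versions can be compared against the pin file. The following is a minimal sketch, not part of the repository: the helper names `parse_pins` and `find_mismatches` are hypothetical, and it assumes the file consists mostly of plain `name==version` lines. Lines in other forms (editable installs, extras, URLs) are skipped rather than checked.

```python
import re
from importlib import metadata

# Matches "name==version" at the start of a line; anything after the version
# (environment markers, hashes, comments) is ignored.
PIN = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)==([^\s;#]+)")


def parse_pins(text):
    """Return {normalized_name: version} for every `name==version` line."""
    pins = {}
    for raw in text.splitlines():
        match = PIN.match(raw.strip())
        if match:
            # Normalize names the way package indexes do (PEP 503).
            name = re.sub(r"[-_.]+", "-", match.group(1)).lower()
            pins[name] = match.group(2)
    return pins


def find_mismatches(pins, installed_version=metadata.version):
    """Return {name: (pinned, installed_or_None)} for packages that differ."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            found = installed_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is pinned but not installed
        if found != pinned:
            mismatches[name] = (pinned, found)
    return mismatches
```

Calling `find_mismatches(parse_pins(open("requirements/reproducibility-1.0.2.txt").read()))` inside the activated virtual environment should return an empty dictionary when the environment matches the snapshot.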

Project Manifest

The machine-readable release manifest is stored at:

artifacts/project-manifest.json

It records:

package version and Python requirement
benchmark file counts and per-file checksums
reference result artifacts and submission figures
reproduction scripts and their expected outputs
the documented smoke-check commands used in CI

Verify that the checked-in manifest matches the current repository state with:

python scripts/generate_project_manifest.py --check

Checksum Manifest

The repository ships a machine-readable checksum file:

artifacts/reproducibility-checksums.sha256

It covers:

all 20 JSONL files in data/full_benchmark/
the main experiment configuration and aggregated statistics
supplemental 565-case summary outputs
adaptive-attack and attack-type summary artifacts
the two submission figures generated under paper/figures/

Verify the manifest with:

shasum -a 256 -c artifacts/reproducibility-checksums.sha256
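On systems without `shasum`, GNU coreutils `sha256sum -c` performs the same check. Where neither tool is available, the check can be approximated in Python. The sketch below is an illustration rather than a script shipped with the repository: `verify_manifest` is a hypothetical helper, and it assumes the manifest uses the standard `<hex digest>  <path>` line format with paths relative to the repository root.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large artifacts are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path, root="."):
    """Return (path, status) pairs for failing entries; empty means all OK."""
    failures = []
    for line in Path(manifest_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Each line is "<digest> <mode><path>", where mode is " " or "*".
        expected, _, rest = line.partition(" ")
        name = rest[1:]
        target = Path(root) / name
        if not target.is_file():
            failures.append((name, "missing"))
        elif sha256_of(target) != expected.lower():
            failures.append((name, "mismatch"))
    return failures
```

Run from the repository root, `verify_manifest("artifacts/reproducibility-checksums.sha256")` returning an empty list corresponds to every entry passing the `shasum` check.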

Artifact Map

Paper artifact	Script entry point	Primary code path	Output location
Main matched-subset comparison	`scripts/reproduce_main_table.sh`	`experiments/run_full_evaluation.py`, `experiments/statistical_analysis.py`, `experiments/error_analysis.py`	`results/full_eval/`, `results/stats/`
CIV ablation	`scripts/reproduce_ablation.sh`	`experiments/run_civ_ablation.py`	`results/civ_ablation/`
Adaptive attacks	`scripts/reproduce_adaptive.sh`	`experiments/run_adaptive_attack.py`	`results/adaptive_attack/`
Defense composition	`scripts/reproduce_composition.sh`	`experiments/run_targeted_composition.py`	`results/composition/`
Submission figures	`scripts/reproduce_all_figures.sh`	`experiments/generate_figures.py`	`paper/figures/pareto_frontier.pdf`, `paper/figures/model_comparison.pdf`

Recommended Verification Flow

1. Run the test suite and coverage check.
2. Verify the machine-readable project manifest.
3. Verify the checksum manifest.
4. Run the documented smoke checks.
5. Run the mock reproduction scripts.
6. Compare regenerated outputs with the checked-in summaries.
7. Only then spend API budget on a full external rerun.

pytest tests/ --cov=agent_security_sandbox --cov-report=term-missing:skip-covered
python scripts/generate_project_manifest.py --check
shasum -a 256 -c artifacts/reproducibility-checksums.sha256
python scripts/docs_smoke_check.py
./scripts/reproduce_main_table.sh --provider mock
./scripts/reproduce_all_figures.sh
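Because each step is only meaningful if the previous ones passed, the free commands above can be chained so that the run stops at the first failure. The following is a sketch under that assumption: `run_flow` and `STEPS` are hypothetical names, the commands are copied from this page, and the runner is injectable so the logic can be exercised without executing anything.

```python
import shlex
import subprocess

# The free verification commands, in the recommended order.
STEPS = [
    "pytest tests/ --cov=agent_security_sandbox --cov-report=term-missing:skip-covered",
    "python scripts/generate_project_manifest.py --check",
    "shasum -a 256 -c artifacts/reproducibility-checksums.sha256",
    "python scripts/docs_smoke_check.py",
    "./scripts/reproduce_main_table.sh --provider mock",
    "./scripts/reproduce_all_figures.sh",
]


def run_flow(steps=STEPS, runner=subprocess.run):
    """Run each command in order; return the first failing command, or None."""
    for command in steps:
        print(f"==> {command}")
        result = runner(shlex.split(command))
        if result.returncode != 0:
            return command  # stop here so later steps do not mask the failure
    return None
```

Calling `run_flow()` from the repository root returns `None` when every step exits with status 0, and otherwise returns the command that failed.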

Notes On Determinism

Mock-provider runs should be deterministic and are the right baseline for CI and installation validation.
Real-provider runs can drift due to model updates, rate limits, retry behavior, and third-party endpoint changes.
The checked-in summaries and checksums are the canonical reference for this release branch.
Tagged GitHub releases attach a generated SBOM and project manifest, and the release workflow emits provenance and SBOM attestations for the published assets.
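The determinism expectation for mock-provider runs can be spot-checked by hashing the output trees of two separate runs and listing any files that differ. The sketch below is illustrative only: `tree_digests` and `compare_trees` are hypothetical helpers, how the two output copies are produced is left to the reader, and a byte-level comparison will flag files that embed timestamps or other run-specific metadata even when the underlying metrics agree.

```python
import hashlib
from pathlib import Path


def tree_digests(root):
    """Map each file's path relative to `root` to its SHA-256 hex digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_trees(first, second):
    """Return sorted relative paths that are missing from one tree or differ."""
    a, b = tree_digests(first), tree_digests(second)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty result from `compare_trees` means the two trees are byte-identical. The same function can be used to compare regenerated outputs with the checked-in summaries.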