Repeatable Validation
Device Benchmarking
The maintained workflow for running the same detector benchmark on multiple devices and comparing both output parity and practical throughput.
Tooling
Main entry point:
software/client/scripts/device_detector_benchmark.py
Supported commands:
list-presetsbundleruncomparereport
Core idea
We benchmark from a portable bundle, not from ad hoc local files.
A benchmark bundle contains:
- a fixed image set
- manifest metadata
- canonical exported model artifacts
- source run metadata
That gives us one reproducible input for every device.
Standard workflow
1. Build the bundle on the dev machine
From software/client:
uv run python scripts/device_detector_benchmark.py list-presets
uv run python scripts/device_detector_benchmark.py bundle \
--preset chamber_zone_pair \
--output blob/device_benchmarks/chamber_zone_pair_bundle \
--archive
The current standard preset is chamber_zone_pair. It packages the current chamber-zone YOLO11s and NanoDet pair together with the shared 50-image benchmark subset.
2. Run the bundle on the target device
Example CPU run:
uv run python scripts/device_detector_benchmark.py run \
--bundle blob/device_benchmarks/chamber_zone_pair_bundle \
--output-dir blob/device_benchmarks/pi5_aihat_cpu_ort123 \
--runtime onnx \
--tag pi5-aihat-cpu-ort123
Example Mac CoreML run:
uv run python scripts/device_detector_benchmark.py run \
--bundle blob/device_benchmarks/chamber_zone_pair_bundle \
--output-dir blob/device_benchmarks/local_m4_coreml_20260406 \
--runtime coreml \
--tag local-m4-coreml-20260406
Example Hailo run:
uv run python scripts/device_detector_benchmark.py run \
--bundle blob/device_benchmarks/chamber_zone_pair_bundle \
--output-dir blob/device_benchmarks/spencer_pi5_hailo \
--runtime hailo
Example Orange Pi RKNN run:
uv run python scripts/device_detector_benchmark.py run \
--bundle blob/device_benchmarks/chamber_zone_pair_bundle \
--output-dir blob/device_benchmarks/orangepi_npu_rknn_20260406 \
--runtime rknn \
--rknn-model 20260331-zone-classification-chamber-yolo11s=/path/to/yolo11s.rknn \
--rknn-model 20260331-zone-classification-chamber-nanodet=/path/to/nanodet.rknn
Compare two result sets
Use compare when you want a machine-readable quality diff:
uv run python scripts/device_detector_benchmark.py compare \
--left-results-dir blob/device_benchmarks/local_m4_cpu_20260406 \
--right-results-dir blob/device_benchmarks/spencer_pi5_hailo \
--output blob/device_benchmarks/local_cpu_vs_spencer_pi_hailo_20260406.json
This is the main parity check before trusting a new target path.
Render a human-friendly report
Use report when you want a visual HTML comparison from benchmark JSONs:
uv run python scripts/device_detector_benchmark.py report \
--bundle blob/device_benchmarks/chamber_zone_pair_bundle \
--results-dir blob/device_benchmarks \
--output blob/device_benchmarks/chamber_zone_pair_report.html
We also maintain two higher-level renderers:
software/client/scripts/render_platform_matrix_report.pysoftware/client/scripts/render_legacy_parallel_matrix_report.py
Those are summary outputs, not the source of truth. The source of truth is still:
- bundle
- result JSONs
- compare JSONs
Runtime notes
ONNX
Use this as the main correctness path on CPU targets.
Validated:
- Mac Mini M4 CPU
- Orange Pi CPU
- Raspberry Pi 5 CPU with
onnxruntime==1.23.2
CoreML
This path currently means ONNX Runtime with CoreMLExecutionProvider.
We are not maintaining a separate exported CoreML model package yet.
NCNN
NCNN support exists in the benchmark tooling, but the current chamber-zone exports are not quality-approved on the tested Pi path.
Treat NCNN as experimental until parity is improved.
RKNN
RKNN support exists in the benchmark tooling, but current quality depends entirely on whether the .rknn files were rebuilt from the exact current ONNX exports.
The current Orange Pi RKNN artifacts should be treated as experimental.
Hailo
Hailo runs need compiled HEF files. The compile workflow is documented in Hailo HEF Workflow.
Artifact policy
To avoid accumulating random JSON and HTML files:
- Keep one canonical benchmark bundle per detector set.
- Keep one canonical result directory per device, runtime, and date.
- Keep compare JSONs for approved decision points.
- Keep only the summary HTML reports that still matter.
- Delete or regenerate exploratory reports and logs freely.
The current canonical set is listed in Runtime Status.