Accuracy claims in DeFi are usually marketing. We publish three distinct accuracy surfaces because they answer different questions and because conflating them is how bad allocation decisions get made.
The three surfaces
Test set
The accuracy on the held-out evaluation slice from training time:
predictions made on data the model has never seen. Cross-validated
test-set accuracy lands in the high 60s for the 1-day directional
model, with per-mechanism specialists scoring in the low 80s on
their own protocol family.
Live, full universe
The accuracy across every pool we monitor, post-deployment, on
forward-resolved data. This is the most conservative number and
the right one for “how does the model actually behave in the
wild”. Live full-universe sign-based accuracy currently sits in
the low 50s for the 1-day model, lifted by a calibration retrain
that landed in the last week.
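A live accuracy gate of the kind described later in this page can be sketched as a rolling window over forward-resolved calls. The window size and the 0.50 floor are illustrative assumptions, not Path's production settings:

```python
# Sketch of a live accuracy gate: flag drift when rolling sign-based
# accuracy over forward-resolved predictions falls below a floor.
from collections import deque

class AccuracyGate:
    def __init__(self, window: int = 200, floor: float = 0.50):
        self.hits = deque(maxlen=window)  # rolling record of hit/miss
        self.floor = floor

    def record(self, predicted_sign: int, realized_sign: int) -> None:
        self.hits.append(predicted_sign == realized_sign)

    def drifting(self) -> bool:
        """True when the rolling accuracy has dropped below the floor."""
        if not self.hits:
            return False
        return sum(self.hits) / len(self.hits) < self.floor

gate = AccuracyGate(window=4, floor=0.50)
for p, r in [(1, 1), (1, -1), (-1, 1), (1, -1)]:
    gate.record(p, r)
print(gate.drifting())  # 1 of 4 recent calls correct -> True
```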
Live, carve-out cohort, high confidence
The accuracy on the subset of pools where Path has the deepest
coverage and the highest signal density, filtered to predictions
where the model assigned ≥85% confidence. This is the headline
number we cite to institutional partners because it is the
operational range a Strategy Manager would actually trade.
Carve-out high-confidence sign-based accuracy: ~59.5%, vs. the
DeFiLlama random-walk baseline of 53.5%. The lift is statistically
significant at p ≈ 0.0001.
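The significance claim can be reproduced with a one-sided one-proportion z-test of the 59.5% accuracy against the 53.5% baseline. The sample size n below is a hypothetical stand-in; the actual evaluation count is not stated here:

```python
# Sketch: one-sided one-proportion z-test against a null accuracy p0.
# n = 1000 is an assumed sample size for illustration only.
import math

def one_proportion_z_test(p_hat: float, p0: float, n: int) -> tuple[float, float]:
    """Return (z, one-sided p-value) under the null accuracy p0."""
    se = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return z, p_value

z, p = one_proportion_z_test(p_hat=0.595, p0=0.535, n=1000)
print(f"z = {z:.2f}, p = {p:.5f}")
```

With a sample on the order of a thousand forward-resolved calls, a 6-point lift over the baseline lands near the p ≈ 0.0001 cited above.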
Why the gap between test and live
Test accuracy and live accuracy diverge for three reasons:
- Distribution shift. Live data drifts from training data over time. We mitigate with a three-day retrain cadence and a live accuracy gate that flags drift.
- Per-pool heterogeneity. Some pools have richer signal coverage than others. The carve-out cohort exists to surface the operational range where coverage and signal density are strongest.
- Confidence-band stratification. Calibrated models concentrate accuracy at the high-confidence end. Reporting the un-stratified mean obscures that. The high-confidence subset is the relevant one for an allocator who will only act on high-confidence calls.
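The stratification point above can be sketched as a per-band accuracy report. The band edges mirror the ≥85% threshold in the text; the records are illustrative:

```python
# Sketch: stratify (confidence, hit) records into confidence bands and
# report accuracy per band. Band edges are illustrative assumptions.
from collections import defaultdict

def accuracy_by_band(records, bands=(0.5, 0.7, 0.85, 1.01)):
    """records: (confidence, hit) pairs. Returns {band_floor: accuracy}."""
    tallies = defaultdict(lambda: [0, 0])  # band_floor -> [hits, total]
    for conf, hit in records:
        for lo, hi in zip(bands, bands[1:]):
            if lo <= conf < hi:
                tallies[lo][0] += hit
                tallies[lo][1] += 1
                break
    return {lo: hits / total for lo, (hits, total) in tallies.items()}

records = [(0.55, 0), (0.60, 1), (0.72, 1), (0.78, 0),
           (0.86, 1), (0.90, 1), (0.92, 1), (0.95, 0)]
print(accuracy_by_band(records))  # high-confidence band (>=0.85) scores 3/4
```

Reporting the per-band breakdown, rather than the unstratified mean, is what lets an allocator see the operational range they would actually trade.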