FlipSmart Autotune Dashboard

Auto-refreshes every 10s (the heartbeat age counter ticks up every second)
What is this dashboard showing?

Right now, FlipSmart recommends items to flip using a scoring formula. Each item gets a score based on factors such as margin, volume, ROI, and how often users actually act on the suggestion, plus market features like spread stability, volume recency, and absolute GP throughput per limit cycle. The formula has tunable weights and modifiers that control how much each factor matters. The question is: are the current weights the best ones, or can we do better?
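A minimal sketch of what a formula of this shape looks like; the feature names and weight values below are illustrative stand-ins, not the production configuration:

```python
# Illustrative weighted-sum scorer. Feature names and weight values are
# hypothetical, not FlipSmart's production configuration.
WEIGHTS = {
    "margin": 0.30,
    "volume": 0.20,
    "roi": 0.20,
    "action_rate": 0.10,
    "spread_stability": 0.10,
    "volume_recency": 0.05,
    "gp_per_cycle": 0.05,
}

def score(features: dict) -> float:
    """Weighted sum of an item's normalized features (assumed scaled to 0-1)."""
    return sum(weight * features.get(name, 0.0) for name, weight in WEIGHTS.items())
```

Tuning means searching over the numbers in WEIGHTS (and the modifiers) rather than changing the structure of the formula itself.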

This dashboard monitors an autonomous optimization loop that tries to answer that question. The loop works like this: it takes our historical data (every suggestion we've made and whether it resulted in a profitable flip), then systematically tries hundreds of different weight combinations to see if any of them would have produced better outcomes than what we're currently running in production. Think of it like backtesting a trading strategy — we replay past trades under different rules and measure what would have happened.

Why "generations"?
Each generation is one complete search-and-test cycle. If a generation fails (the champion doesn't convincingly beat production), the loop adjusts its search strategy and tries again. The generation number shifts the search toward different regions of the parameter space — early generations explore aggressively, later ones get more conservative. The loop keeps going until it finds a winner or exhausts its budget.
What counts as "better"?
A candidate must beat production on every metric that matters, not just one. The primary gate is P25 ROI (protecting the worst-case user), plus mean ROI (average experience) and completion rate (how often suggestions actually turn into finished flips). And it's not enough to look better — the improvement must be statistically provable, meaning even if we got slightly unlucky with the data, the improvement would still hold.
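A minimal sketch of these gates, assuming a candidate's backtest is summarized as deltas against production; the field names and threshold wiring are illustrative, not the real gate code:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    delta_p25_roi_ci_low: float  # lower bound of the 95% CI on the P25-ROI improvement
    delta_mean_roi: float        # mean-ROI improvement vs production
    delta_completion: float      # completion-rate improvement vs production

def passes_gates(c: Candidate) -> bool:
    """Promote only if every gate clears: the worst-case (P25) gain must hold
    even at the pessimistic end of its confidence interval, and the average
    experience and completion rate must not regress."""
    return (
        c.delta_p25_roi_ci_low > 0    # statistically provable P25 improvement
        and c.delta_mean_roi >= 0     # average ROI not worse
        and c.delta_completion >= 0   # suggestions still turn into finished flips
    )

print(passes_gates(Candidate(0.004, 0.012, 0.01)))   # True -> promote
```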
~300 candidates per broad sweep
6–8 top seeds refined in the narrow sweep
1 champion validated with 600 bootstraps

Top Models (Latest Generation)

Ranked by conservative objective: confidence-floor P25 → mean → completion → −slot-hours.
# | Name | top_k | Mean ROI | P25 ROI | Δ Mean | Δ P25 | Completion | Slot-Hrs

Quick Reference

Click any card to expand
Core concepts powering the recommendation algorithm and its evaluation pipeline. Understanding these building blocks makes every chart and metric on this dashboard interpretable.
WF
Walk-Forward Validation
Method
Train on older data, test on the next block, slide the window forward. Proves the model across multiple market conditions — not just one lucky week.
How it works
Imagine 6 months of trading history. We train on months 1–2, test on month 3. Then train on months 2–3, test on month 4. Each window is an independent exam.
Window 1: train → test
Window 2: train → test
If a model wins across all windows, it's not just fitting noise — it's learning real patterns.
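A minimal sketch of the sliding split, assuming monthly blocks like the example above; the window lengths here are illustrative, not the tuned configuration:

```python
def walk_forward_windows(n_periods: int, train_len: int = 2, test_len: int = 1):
    """Yield (train, test) period indices, sliding the window forward one period
    at a time. Each yielded pair is one independent exam."""
    start = 0
    while start + train_len + test_len <= n_periods:
        train = list(range(start, start + train_len))
        test = list(range(start + train_len, start + train_len + test_len))
        yield train, test
        start += 1

# Six months of history: train on months 0-1, test on 2; then 1-2 / 3; and so on.
for train, test in walk_forward_windows(6):
    print("train:", train, "test:", test)
```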
HO
Holdout Set
Method
The most recent data chunk, sealed away during tuning. Opened exactly once as a final exam — peeking and re-tuning would memorize the test.
Why it matters
Think of it like an exam in a sealed envelope. You study (tune) using practice problems (walk-forward windows). The holdout is the real exam — you only open it once. If you peek, fix your answers, and open it again, you're not proving competence — you're memorizing answers.
One look = honest signal • Multiple peeks = overfitting
BS
Bayesian Shrinkage
Stats
Items with few observations get pulled toward the global average. Prevents over-trusting a bucket that looks amazing based on just 5 trades.
Example
A new item shows 12% ROI from 4 trades. The global average is 3%. With shrinkage, the estimate becomes ~5.5% — blending toward the average because 4 trades isn't enough evidence. After 200 trades at 12%, shrinkage barely moves it: the data speaks for itself.
4 trades → ~5.5% · 200 trades → ~11.6%
More data → less shrinkage → estimate trusts the evidence.
BR
Bootstrap Resampling
Stats
Resamples the holdout data (with replacement) thousands of times to see if the improvement is robust or just luck. The 95% confidence range must clear zero.
Intuition
You have 500 trades in the holdout. Bootstrap draws 500 trades with replacement (some trades sampled twice, others skipped) and measures improvement. Repeat 1,000 times. The range of outcomes tells you how sensitive your result is to which trades happened to land in the test set.
Histogram: distribution of resampled improvements
If the whole bell shape sits above zero, the improvement is real. If it dips below, we can't be sure.
LP
Lockup Penalty
Scoring
OSRS players have 8 GE slots. A slot stuck on an illiquid trade can't be redeployed: even if the margin is good, a 6-hour lockup forgoes every faster flip that slot could have completed in the meantime.
Slot opportunity cost
Dragon Claws: 600k margin, but takes ~6 hours to fill. That's 1 slot locked for 6h.
Rune Platebody: 800gp margin, fills in 20 min. Same slot does 18 cycles in 6h.
D Claws: 6h lock · Rune PB: 20m lock
Higher penalty → model avoids slot-hogging items unless the margin is overwhelming.
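A minimal sketch of a lockup adjustment, assuming a simple linear cost per expected slot-hour; the penalty shape and the 0.05-per-hour weight are assumptions for illustration, not tuned values:

```python
def lockup_adjusted_score(base_score: float, expected_fill_hours: float,
                          lockup_penalty: float = 0.05) -> float:
    """Subtract a cost for every hour a GE slot is expected to stay locked."""
    return base_score - lockup_penalty * expected_fill_hours

# Dragon Claws ties up a slot ~6h; Rune Platebody ~20 minutes.
print(lockup_adjusted_score(1.00, 6.0))      # 0.70 -- big raw score, heavily taxed
print(lockup_adjusted_score(0.60, 20 / 60))  # ~0.58 -- modest score, barely touched
```

The larger the lockup weight, the more overwhelming a margin has to be before a slow-filling item can outrank a fast one.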
TR
Tail Risk Penalty
Scoring
Penalizes items where the worst-case outcome (P25 ROI) is negative. A single −5% loss wipes out gains from several good trades.
Asymmetric pain
You flip an item 4 times: +3%, +2%, +4%, −8%. Average is +0.25% — technically profitable. But that −8% trade erased nearly all gains and introduced psychological loss aversion.
+3% · +2% · +4% · −8%
Reduced by stable spread-to-price ratios and calm spread regimes.
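Using the card's four flips, here is how the average can look fine while the 25th percentile goes negative; the 0.5 penalty weight in the sketch is an assumption, not the tuned value:

```python
import statistics

rois = [3.0, 2.0, 4.0, -8.0]   # percent ROI of the four example flips

mean_roi = statistics.mean(rois)                                   # +0.25%: "profitable"
p25_roi = statistics.quantiles(rois, n=4, method="inclusive")[0]   # -0.5%: negative tail

# Only a negative tail is punished; a positive P25 costs nothing.
tail_risk_penalty = 0.5 * max(0.0, -p25_roi)
print(mean_roi, p25_roi, tail_risk_penalty)
```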
TP
Throughput (GP/Cycle)
Scoring
Absolute GP per buy-limit cycle: margin × buy_limit. High-cashstack players optimize for GP per slot, not just ROI percentage.
Comparing items
Chaos Runes: 18k limit × 4gp margin = 72k GP/cycle
Eye of Ayaka: 8 limit × 600k margin = 4.8M GP/cycle
Chaos: 72k · Ayaka: 4.8M
ROI% can be misleading — what matters is how much GP each slot earns per cycle.
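The card's arithmetic as a two-line check (limits and margins taken from the examples above):

```python
def gp_per_cycle(margin_gp: int, buy_limit: int) -> int:
    """Absolute GP earned by one full buy-limit cycle in a single slot."""
    return margin_gp * buy_limit

print(gp_per_cycle(4, 18_000))    # Chaos Runes:  72,000 GP per cycle
print(gp_per_cycle(600_000, 8))   # Eye of Ayaka: 4,800,000 GP per cycle
```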
SS
Spread Stability
Market
How consistent the buy-sell spread is over 24 hours. A stable margin is more actionable than a volatile one, even if the volatile one is sometimes larger.
Stable vs volatile spread
Rune Platebody: spread holds between 700–900gp all day — you can confidently place offers.
Dragon Claws: spread swings 50k–400k hourly — you might land 400k or get stuck at 50k.
Stable (Rune PB) · Volatile (D Claws)
Multiplies into expected value as a confidence factor.
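The card doesn't spell out the formula, so the sketch below assumes a coefficient-of-variation-based factor purely for illustration: the steadier the spread samples, the closer the factor gets to 1.

```python
import statistics

def spread_stability(spread_samples_gp: list[float]) -> float:
    """0-1 confidence factor: low relative variation over 24h -> near 1.
    (Assumed form; the production definition may differ.)"""
    mean = statistics.mean(spread_samples_gp)
    cv = statistics.pstdev(spread_samples_gp) / mean   # coefficient of variation
    return 1.0 / (1.0 + cv)

rune_pb = [700, 820, 780, 900, 750, 810]        # holds 700-900gp all day
d_claws = [50_000, 400_000, 120_000, 300_000]   # swings 50k-400k hourly
print(spread_stability(rune_pb))   # close to 1 -> margin is actionable
print(spread_stability(d_claws))   # much lower -> expected value gets discounted
```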
VR
Volume Recency
Market
What fraction of 24h volume happened in the last 4 hours (aligned with GE limit reset). High daily volume is useless if that volume was 12 hours ago.
Recency matters
An item shows 50,000 units traded today. Sounds liquid. But if 48,000 of those traded at 2am and only 2,000 in the last 4 hours, your offer is competing for scraps during a dead window.
12h ago: 48k · Last 4h: 2k
Multiplies into the liquidity score — stale volume is discounted.
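A minimal sketch of the recency fraction, using the card's 50,000 / 2,000 example:

```python
def volume_recency(volume_last_4h: int, volume_24h: int) -> float:
    """Share of today's volume that traded inside the current GE limit window."""
    return volume_last_4h / volume_24h if volume_24h else 0.0

print(volume_recency(2_000, 50_000))   # 0.04 -> the liquidity score is heavily discounted
```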
VD
Volume Decline Penalty
Market
Asymmetric penalty for declining weekly volume trend. Only falling volume is penalized — it means fewer counterparties to trade with.
Why only downside?
Rising volume might signal a price crash (more sellers = panic), so we don't reward it — it could be a trap. Declining volume unambiguously means fewer buyers and sellers, making it harder to fill your offers.
↓ declining → penalty applied
Asymmetric by design: only penalize the unambiguously bad signal.
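A minimal sketch of the asymmetry, assuming the weekly trend arrives as a percentage change; the 0.5 weight is illustrative:

```python
def volume_decline_penalty(weekly_volume_change_pct: float,
                           weight: float = 0.5) -> float:
    """Penalize only falling volume; rising volume earns nothing either way."""
    return weight * max(0.0, -weekly_volume_change_pct / 100.0)

print(volume_decline_penalty(-30.0))   # volume down 30% -> penalty 0.15
print(volume_decline_penalty(+40.0))   # volume up 40%   -> 0.0 (no reward, no penalty)
```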

Δ P25 ROI Across Generations

Must be positive for promotion. Protects worst-case users.

Δ Mean ROI Across Generations

Must be ≥ 0 for promotion. Average user experience.

Champion Weight Evolution

How the winning model's scoring weights shift across generations.

Search Funnel

How many candidates were tested at each stage.

Confidence Bands Across Generations

The shaded region shows the 95% confidence range. The line must stay above zero (dashed) for promotion.

Bayesian Scoring Intuition

The backtest uses Bayesian methods to avoid over-trusting thin data. Drag the sliders to see how the math responds to different inputs.

Beta-Binomial Posterior

— drag to explore how belief updates with evidence
Here we are looking at how confident we should be that users will act on an item. Imagine we suggest Dragon Claws and 3 out of 3 users place an offer: is the true action rate really 100%? Probably not; we likely just got lucky with a small sample. This chart shows how we blend a skeptical starting assumption (the prior, purple curve) with the actual evidence to get a more honest estimate (the posterior, green curve). When you drag the sliders, you'll see that with very few observations the prior dominates and keeps the estimate conservative, but as evidence piles up, the data takes over and the posterior sharpens around the real rate. This is basically showing how we avoid recommending items just because 3 early users happened to like them.
Scenario: 3 successes out of 3 total trials
Live Calculation: raw rate = 100% → posterior mean = (3 + 2) / (3 + 2 + 2) = 71.4%
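The arithmetic above corresponds to a Beta(2, 2) prior (mildly skeptical, centered on 50%) updated with 3 actions out of 3 suggestions; here it is as a function you can push other numbers through:

```python
def posterior_mean(successes: int, trials: int,
                   alpha: float = 2.0, beta: float = 2.0) -> float:
    """Mean of the Beta posterior after observing `successes` in `trials`."""
    return (successes + alpha) / (trials + alpha + beta)

print(posterior_mean(3, 3))       # 0.714 -- not 100%, despite a perfect raw rate
print(posterior_mean(300, 300))   # 0.993 -- with real evidence, the data dominates
```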

ROI Shrinkage

— how skeptical should we be of small-sample stars?
Here we are looking at what we actually believe an item's return will be, given limited data. Say a specific bucket of items (like mid-price, low-volume rares) shows a 12% ROI from just 8 completed flips. That sounds amazing, but 8 flips is not a lot — those could have been fluky winners. This chart shows how we "shrink" that flashy observed number back toward the boring global average. The blue curve is our honest estimate at each sample size: with few flips it hugs the global mean (we don't trust the data yet), but as more flips complete, it gradually moves toward the observed value. When you drag the "inspect at n" slider, you can see exactly how much influence the data vs. the prior has at any point. This is basically showing how we avoid chasing items that look like stars but just got lucky.
At current n: shrunk estimate = (12.0 × 8 + 3.5 × 20) / (8 + 20) = 5.9% (29% data · 71% prior)
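The same calculation as a function, using the panel's observed 12.0% ROI over n = 8 flips, a 3.5% global prior, and a prior strength of 20 pseudo-observations:

```python
def shrunk_roi(observed_roi: float, n: int,
               prior_mean: float = 3.5, prior_strength: float = 20.0) -> float:
    """Pseudo-count-weighted blend of the observed ROI and the global prior."""
    return (observed_roi * n + prior_mean * prior_strength) / (n + prior_strength)

print(shrunk_roi(12.0, 8))     # 5.9% -- the prior carries 20/28, about 71% of the weight
print(shrunk_roi(12.0, 200))   # 11.2% -- with 200 flips the data speaks for itself
```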

Bootstrap Confidence Interval

— would the result survive different luck?
Here we are looking at whether an improvement is real, or just a lucky streak. After the tuner finds a model that beats the baseline, we need to ask: "If the exact same trades had landed slightly differently, would we still see an improvement?" This simulator answers that by resampling the holdout results (with replacement) 1,000 times and measuring the delta each time. The histogram shows the spread of possible outcomes, and the green-shaded region is the 95% confidence interval. The key gate is simple: the lower bound of the CI must be above zero. If even the pessimistic resamples still show positive improvement, we know it's real. When you drag the sliders, notice how more data tightens the interval (easier to prove), while more noise widens it (harder to prove). Hit "Re-sample" to re-run with your settings.
Simulate · 95% CI: [?, ?] · Gate: CI lower bound must clear zero
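A minimal sketch of the gate itself, assuming the holdout is summarized as per-trade ROI deltas (candidate minus baseline); the input data below is simulated for illustration, not real holdout results:

```python
import random
import statistics

def bootstrap_ci(deltas: list[float], n_resamples: int = 1000,
                 seed: int = 0) -> tuple[float, float]:
    """95% bootstrap CI on the mean improvement: resample the deltas with
    replacement, recompute the mean each time, take the 2.5th/97.5th percentiles."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(deltas, k=len(deltas)))
        for _ in range(n_resamples)
    )
    return means[int(0.025 * n_resamples)], means[int(0.975 * n_resamples)]

# 500 simulated holdout trades with a small true improvement and lots of noise.
gen = random.Random(1)
deltas = [gen.gauss(0.4, 2.0) for _ in range(500)]

lo, hi = bootstrap_ci(deltas)
print(f"95% CI: [{lo:.2f}, {hi:.2f}]", "-> PROMOTE" if lo > 0 else "-> HOLD")
```

More trades shrink the interval and more noise widens it, which is exactly the behaviour the sliders above demonstrate.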