FlipSmart Autotune Dashboard

Auto-refreshes every 10s (the heartbeat age counter ticks up every second)
What is this dashboard showing?

Right now, FlipSmart recommends items to flip using a scoring formula. Each item gets a score based on factors such as margin, volume, ROI, and how often users actually act on the suggestion, plus market features like spread stability, volume recency, and absolute GP throughput per limit cycle. The formula has tunable weights and modifiers that control how much each factor matters. The question is: are the current weights the best ones, or can we do better?
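A minimal sketch of what a formula of this shape looks like; the feature names and weight values below are illustrative stand-ins, not the production configuration:

```python
# Illustrative weighted-sum scorer. Feature names and weight values are
# hypothetical, not FlipSmart's production configuration.
WEIGHTS = {
    "margin": 0.30,
    "volume": 0.20,
    "roi": 0.20,
    "action_rate": 0.10,
    "spread_stability": 0.10,
    "volume_recency": 0.05,
    "gp_per_cycle": 0.05,
}

def score(features: dict) -> float:
    """Weighted sum of an item's normalized features (assumed scaled to 0-1)."""
    return sum(weight * features.get(name, 0.0) for name, weight in WEIGHTS.items())
```

Tuning means searching over the numbers in WEIGHTS (and the modifiers) rather than changing the structure of the formula itself.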

This dashboard monitors an autonomous optimization loop that tries to answer that question. The loop works like this: it takes our historical data (every suggestion we've made and whether it resulted in a profitable flip), then systematically tries hundreds of different weight combinations to see if any of them would have produced better outcomes than what we're currently running in production. Think of it like backtesting a trading strategy — we replay past trades under different rules and measure what would have happened.

Why "generations"?
Each generation is one complete search-and-test cycle. If a generation fails (the champion doesn't convincingly beat production), the loop adjusts its search strategy and tries again. The generation number shifts the search toward different regions of the parameter space — early generations explore aggressively, later ones get more conservative. The loop keeps going until it finds a winner or exhausts its budget.
What counts as "better"?
A candidate must beat production on every metric that matters, not just one. The primary gate is P25 ROI (protecting the worst-case user), plus mean ROI (average experience) and completion rate (how often suggestions actually turn into finished flips). And it's not enough to look better — the improvement must be statistically provable, meaning even if we got slightly unlucky with the data, the improvement would still hold.
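A minimal sketch of these gates, assuming a candidate's backtest is summarized as deltas against production; the field names and threshold wiring are illustrative, not the real gate code:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    delta_p25_roi_ci_low: float  # lower bound of the 95% CI on the P25-ROI improvement
    delta_mean_roi: float        # mean-ROI improvement vs production
    delta_completion: float      # completion-rate improvement vs production

def passes_gates(c: Candidate) -> bool:
    """Promote only if every gate clears: the worst-case (P25) gain must hold
    even at the pessimistic end of its confidence interval, and the average
    experience and completion rate must not regress."""
    return (
        c.delta_p25_roi_ci_low > 0    # statistically provable P25 improvement
        and c.delta_mean_roi >= 0     # average ROI not worse
        and c.delta_completion >= 0   # suggestions still turn into finished flips
    )

print(passes_gates(Candidate(0.004, 0.012, 0.01)))   # True -> promote
```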
~300 candidates per broad sweep
6–8 top seeds refined in the narrow sweep
1 champion validated with 600 bootstraps

Top Models (Latest Generation)

Ranked by conservative objective: confidence-floor P25 → mean → completion → −slot-hours.
# | Name | top_k | Mean ROI | P25 ROI | Δ Mean | Δ P25 | Completion | Slot-Hrs

Quick Reference

Click any card to expand
Core concepts powering the recommendation algorithm and its evaluation pipeline. Understanding these building blocks makes every chart and metric on this dashboard interpretable.
WF
Walk-Forward Validation
Method
Train on older data, test on the next block, slide the window forward. Proves the model across multiple market conditions — not just one lucky week.
How it works
Imagine 6 months of trading history. We train on months 1–2, test on month 3. Then train on months 2–3, test on month 4. Each window is an independent exam.
Window 1: train → test
Window 2: train → test
If a model wins across all windows, it's not just fitting noise — it's learning real patterns.
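A minimal sketch of the sliding split, assuming monthly blocks like the example above; the window lengths here are illustrative, not the tuned configuration:

```python
def walk_forward_windows(n_periods: int, train_len: int = 2, test_len: int = 1):
    """Yield (train, test) period indices, sliding the window forward one period
    at a time. Each yielded pair is one independent exam."""
    start = 0
    while start + train_len + test_len <= n_periods:
        train = list(range(start, start + train_len))
        test = list(range(start + train_len, start + train_len + test_len))
        yield train, test
        start += 1

# Six months of history: train on months 0-1, test on 2; then 1-2 / 3; and so on.
for train, test in walk_forward_windows(6):
    print("train:", train, "test:", test)
```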
HO
Holdout Set
Method
The most recent data chunk, sealed away during tuning. Opened exactly once as a final exam — peeking and re-tuning would memorize the test.
Why it matters
Think of it like an exam in a sealed envelope. You study (tune) using practice problems (walk-forward windows). The holdout is the real exam — you only open it once. If you peek, fix your answers, and open it again, you're not proving competence — you're memorizing answers.
One look = honest signal • Multiple peeks = overfitting
BS
Bayesian Shrinkage
Stats
Items with few observations get pulled toward the global average. Prevents over-trusting a bucket that looks amazing based on just 5 trades.
Example
A new item shows 12% ROI from 4 trades. The global average is 3%. With shrinkage, the estimate becomes ~5.5% — blending toward the average because 4 trades isn't enough evidence. After 200 trades at 12%, shrinkage barely moves it: the data speaks for itself.
4 trades → ~5.5% · 200 trades → ~11.6%
More data → less shrinkage → estimate trusts the evidence.
BR
Bootstrap Resampling
Stats
Resamples the holdout data (with replacement) thousands of times to see if the improvement is robust or just luck. The 95% confidence range must clear zero.
Intuition
You have 500 trades in the holdout. Bootstrap draws 500 trades with replacement (some trades sampled twice, others skipped) and measures improvement. Repeat 1,000 times. The range of outcomes tells you how sensitive your result is to which trades happened to land in the test set.
Histogram: distribution of resampled improvements
If the whole bell shape sits above zero, the improvement is real. If it dips below, we can't be sure.
LP
Lockup Penalty
Scoring
OSRS players have 8 GE slots. A slot stuck on an illiquid trade can't be redeployed: even if the margin is good, a 6-hour lockup forgoes every faster flip that slot could have completed in the meantime.
Slot opportunity cost
Dragon Claws: 600k margin, but takes ~6 hours to fill. That's 1 slot locked for 6h.
Rune Platebody: 800gp margin, fills in 20 min. Same slot does 18 cycles in 6h.
D Claws: 6h lock · Rune PB: 20m lock
Higher penalty → model avoids slot-hogging items unless the margin is overwhelming.
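A minimal sketch of a lockup adjustment, assuming a simple linear cost per expected slot-hour; the penalty shape and the 0.05-per-hour weight are assumptions for illustration, not tuned values:

```python
def lockup_adjusted_score(base_score: float, expected_fill_hours: float,
                          lockup_penalty: float = 0.05) -> float:
    """Subtract a cost for every hour a GE slot is expected to stay locked."""
    return base_score - lockup_penalty * expected_fill_hours

# Dragon Claws ties up a slot ~6h; Rune Platebody ~20 minutes.
print(lockup_adjusted_score(1.00, 6.0))      # 0.70 -- big raw score, heavily taxed
print(lockup_adjusted_score(0.60, 20 / 60))  # ~0.58 -- modest score, barely touched
```

The larger the lockup weight, the more overwhelming a margin has to be before a slow-filling item can outrank a fast one.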
TR
Tail Risk Penalty
Scoring
Penalizes items where the worst-case outcome (P25 ROI) is negative. A single −5% loss wipes out gains from several good trades.
Asymmetric pain
You flip an item 4 times: +3%, +2%, +4%, −8%. Average is +0.25% — technically profitable. But that −8% trade erased nearly all gains and introduced psychological loss aversion.
+3% · +2% · +4% · −8%
Reduced by stable spread-to-price ratios and calm spread regimes.
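Using the card's four flips, here is how the average can look fine while the 25th percentile goes negative; the 0.5 penalty weight in the sketch is an assumption, not the tuned value:

```python
import statistics

rois = [3.0, 2.0, 4.0, -8.0]   # percent ROI of the four example flips

mean_roi = statistics.mean(rois)                                   # +0.25%: "profitable"
p25_roi = statistics.quantiles(rois, n=4, method="inclusive")[0]   # -0.5%: negative tail

# Only a negative tail is punished; a positive P25 costs nothing.
tail_risk_penalty = 0.5 * max(0.0, -p25_roi)
print(mean_roi, p25_roi, tail_risk_penalty)
```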
TP
Throughput (GP/Cycle)
Scoring
Absolute GP per buy-limit cycle: margin × buy_limit. High-cashstack players optimize for GP per slot, not just ROI percentage.
Comparing items
Chaos Runes: 18k limit × 4gp margin = 72k GP/cycle
Eye of Ayaka: 8 limit × 600k margin = 4.8M GP/cycle
Chaos: 72k · Ayaka: 4.8M
ROI% can be misleading — what matters is how much GP each slot earns per cycle.
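The card's arithmetic as a two-line check (limits and margins taken from the examples above):

```python
def gp_per_cycle(margin_gp: int, buy_limit: int) -> int:
    """Absolute GP earned by one full buy-limit cycle in a single slot."""
    return margin_gp * buy_limit

print(gp_per_cycle(4, 18_000))    # Chaos Runes:  72,000 GP per cycle
print(gp_per_cycle(600_000, 8))   # Eye of Ayaka: 4,800,000 GP per cycle
```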
SS
Spread Stability
Market
How consistent the buy-sell spread is over 24 hours. A stable margin is more actionable than a volatile one, even if the volatile one is sometimes larger.
Stable vs volatile spread
Rune Platebody: spread holds between 700–900gp all day — you can confidently place offers.
Dragon Claws: spread swings 50k–400k hourly — you might land 400k or get stuck at 50k.
Stable (Rune PB) · Volatile (D Claws)
Multiplies into expected value as a confidence factor.
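The card doesn't spell out the formula, so the sketch below assumes a coefficient-of-variation-based factor purely for illustration: the steadier the spread samples, the closer the factor gets to 1.

```python
import statistics

def spread_stability(spread_samples_gp: list[float]) -> float:
    """0-1 confidence factor: low relative variation over 24h -> near 1.
    (Assumed form; the production definition may differ.)"""
    mean = statistics.mean(spread_samples_gp)
    cv = statistics.pstdev(spread_samples_gp) / mean   # coefficient of variation
    return 1.0 / (1.0 + cv)

rune_pb = [700, 820, 780, 900, 750, 810]        # holds 700-900gp all day
d_claws = [50_000, 400_000, 120_000, 300_000]   # swings 50k-400k hourly
print(spread_stability(rune_pb))   # close to 1 -> margin is actionable
print(spread_stability(d_claws))   # much lower -> expected value gets discounted
```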
VR
Volume Recency
Market
What fraction of 24h volume happened in the last 4 hours (aligned with GE limit reset). High daily volume is useless if that volume was 12 hours ago.
Recency matters
An item shows 50,000 units traded today. Sounds liquid. But if 48,000 of those traded at 2am and only 2,000 in the last 4 hours, your offer is competing for scraps during a dead window.
12h ago: 48k · Last 4h: 2k
Multiplies into the liquidity score — stale volume is discounted.
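A minimal sketch of the recency fraction, using the card's 50,000 / 2,000 example:

```python
def volume_recency(volume_last_4h: int, volume_24h: int) -> float:
    """Share of today's volume that traded inside the current GE limit window."""
    return volume_last_4h / volume_24h if volume_24h else 0.0

print(volume_recency(2_000, 50_000))   # 0.04 -> the liquidity score is heavily discounted
```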
VD
Volume Decline Penalty
Market
Asymmetric penalty for declining weekly volume trend. Only falling volume is penalized — it means fewer counterparties to trade with.
Why only downside?
Rising volume might signal a price crash (more sellers = panic), so we don't reward it — it could be a trap. Declining volume unambiguously means fewer buyers and sellers, making it harder to fill your offers.
↓ declining → penalty applied
Asymmetric by design: only penalize the unambiguously bad signal.
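A minimal sketch of the asymmetry, assuming the weekly trend arrives as a percentage change; the 0.5 weight is illustrative:

```python
def volume_decline_penalty(weekly_volume_change_pct: float,
                           weight: float = 0.5) -> float:
    """Penalize only falling volume; rising volume earns nothing either way."""
    return weight * max(0.0, -weekly_volume_change_pct / 100.0)

print(volume_decline_penalty(-30.0))   # volume down 30% -> penalty 0.15
print(volume_decline_penalty(+40.0))   # volume up 40%   -> 0.0 (no reward, no penalty)
```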

Δ P25 ROI Across Generations

Must be positive for promotion. Protects worst-case users.

Δ Mean ROI Across Generations

Must be ≥ 0 for promotion. Average user experience.

Champion Weight Evolution

How the winning model's scoring weights shift across generations.

Search Funnel

How many candidates were tested at each stage.

Confidence Bands Across Generations

The shaded region shows the 95% confidence range. The line must stay above zero (dashed) for promotion.

Bayesian Scoring Intuition

The backtest uses Bayesian methods to avoid over-trusting thin data. Drag the sliders to see how the math responds to different inputs.

Beta-Binomial Posterior

— drag to explore how belief updates with evidence
Here we are looking at how confident we should be that users will act on an item. Imagine we suggest Dragon Claws and 3 out of 3 users place an offer: is the true action rate really 100%? Probably not; we likely just got lucky with a small sample. This chart shows how we blend a skeptical starting assumption (the prior, purple curve) with the actual evidence to get a more honest estimate (the posterior, green curve). When you drag the sliders, you'll see that with very few observations the prior dominates and keeps the estimate conservative, but as evidence piles up, the data takes over and the posterior sharpens around the real rate. This is basically showing how we avoid recommending items just because 3 early users happened to like them.
Scenario: 3 successes out of 3 total trials
Live Calculation: raw rate = 100% → posterior mean = (3 + 2) / (3 + 2 + 2) = 71.4%
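The arithmetic above corresponds to a Beta(2, 2) prior (mildly skeptical, centered on 50%) updated with 3 actions out of 3 suggestions; here it is as a function you can push other numbers through:

```python
def posterior_mean(successes: int, trials: int,
                   alpha: float = 2.0, beta: float = 2.0) -> float:
    """Mean of the Beta posterior after observing `successes` in `trials`."""
    return (successes + alpha) / (trials + alpha + beta)

print(posterior_mean(3, 3))       # 0.714 -- not 100%, despite a perfect raw rate
print(posterior_mean(300, 300))   # 0.993 -- with real evidence, the data dominates
```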

ROI Shrinkage

— how skeptical should we be of small-sample stars?
Here we are looking at what we actually believe an item's return will be, given limited data. Say a specific bucket of items (like mid-price, low-volume rares) shows a 12% ROI from just 8 completed flips. That sounds amazing, but 8 flips is not a lot — those could have been fluky winners. This chart shows how we "shrink" that flashy observed number back toward the boring global average. The blue curve is our honest estimate at each sample size: with few flips it hugs the global mean (we don't trust the data yet), but as more flips complete, it gradually moves toward the observed value. When you drag the "inspect at n" slider, you can see exactly how much influence the data vs. the prior has at any point. This is basically showing how we avoid chasing items that look like stars but just got lucky.
At current n: shrunk estimate = (12.0 × 8 + 3.5 × 20) / (8 + 20) = 5.9% (29% data · 71% prior)
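The same calculation as a function, using the panel's observed 12.0% ROI over n = 8 flips, a 3.5% global prior, and a prior strength of 20 pseudo-observations:

```python
def shrunk_roi(observed_roi: float, n: int,
               prior_mean: float = 3.5, prior_strength: float = 20.0) -> float:
    """Pseudo-count-weighted blend of the observed ROI and the global prior."""
    return (observed_roi * n + prior_mean * prior_strength) / (n + prior_strength)

print(shrunk_roi(12.0, 8))     # 5.9% -- the prior carries 20/28, about 71% of the weight
print(shrunk_roi(12.0, 200))   # 11.2% -- with 200 flips the data speaks for itself
```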

Bootstrap Confidence Interval

— would the result survive different luck?
Here we are looking at whether an improvement is real, or just a lucky streak. After the tuner finds a model that beats the baseline, we need to ask: "If the exact same trades had landed slightly differently, would we still see an improvement?" This simulator answers that by resampling the holdout results (with replacement) 1,000 times and measuring the delta each time. The histogram shows the spread of possible outcomes, and the green-shaded region is the 95% confidence interval. The key gate is simple: the lower bound of the CI must be above zero. If even the pessimistic resamples still show positive improvement, we know it's real. When you drag the sliders, notice how more data tightens the interval (easier to prove), while more noise widens it (harder to prove). Hit "Re-sample" to re-run with your settings.
Simulate · 95% CI: [?, ?] · Gate: CI lower bound must clear zero
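A minimal sketch of the gate itself, assuming the holdout is summarized as per-trade ROI deltas (candidate minus baseline); the input data below is simulated for illustration, not real holdout results:

```python
import random
import statistics

def bootstrap_ci(deltas: list[float], n_resamples: int = 1000,
                 seed: int = 0) -> tuple[float, float]:
    """95% bootstrap CI on the mean improvement: resample the deltas with
    replacement, recompute the mean each time, take the 2.5th/97.5th percentiles."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(deltas, k=len(deltas)))
        for _ in range(n_resamples)
    )
    return means[int(0.025 * n_resamples)], means[int(0.975 * n_resamples)]

# 500 simulated holdout trades with a small true improvement and lots of noise.
gen = random.Random(1)
deltas = [gen.gauss(0.4, 2.0) for _ in range(500)]

lo, hi = bootstrap_ci(deltas)
print(f"95% CI: [{lo:.2f}, {hi:.2f}]", "-> PROMOTE" if lo > 0 else "-> HOLD")
```

More trades shrink the interval and more noise widens it, which is exactly the behaviour the sliders above demonstrate.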