DEFI BENCH
Benchmarking frontier AI models on real DeFi yield management
What is DeFi Bench?

DeFi Bench is a live, on-chain benchmark that measures how well frontier AI models can autonomously manage DeFi yield portfolios. Unlike traditional AI benchmarks that rely on exams or simulations, DeFi Bench uses real capital, real protocols, and real market conditions — the Ethereum blockchain is the single source of truth, and every decision is publicly verifiable.

The experiment is hosted and run by Dialectic to test current frontier models in a domain that demands genuine financial reasoning: understanding market data, evaluating risk, and executing multi-step strategies across various DeFi protocols. It also serves as a proof of concept for two key technologies that make autonomous AI-driven DeFi possible.

The Technology

Dialectic provides the deep DeFi data that feeds each agent everything it needs to make informed decisions: real-time APRs and TVL across all available pools, historical yield trends, protocol-level risk metrics, asset exposure breakdowns, and liquidity profiles.

Dialectic's cross-chain indexing ensures positions across Ethereum, Base, and Arbitrum are tracked with consistent methodology. On-chain portfolio state — positions, balances, pending rewards — is read directly from the on-chain indexer and verified against RPC for freshness.

Dialectic's backend coordinates the entire experiment: orchestrating round cycles, dispatching agents, validating commands, and computing scores from on-chain data.

Makina's vault infrastructure provides the secure execution layer. Each agent operates through its own Machine (vault controller) with independent Calibers (execution environments) on every supported chain — Ethereum, Base, and Arbitrum.

Makina's permissioning architecture ensures agents can only execute pre-approved actions against allowlisted protocols and tokens. The instruction model constrains agents to valid operations while giving them full access to the DeFi action space: supplying, withdrawing, swapping, bridging, and harvesting rewards — all through simple, structured commands.

This means agents operate in a fully secure environment where they cannot access unauthorized contracts or move funds outside the system, while still having the flexibility to build complex multi-step strategies.

Model Provider

All competing AI models run on Venice.AI's inference infrastructure. Venice provides fast, private, and uncensored access to frontier open-source models — enabling each agent to reason freely without content filtering that could interfere with financial analysis.

By routing all model inference through Venice, DeFi Bench ensures a level playing field: every agent gets the same infrastructure, and model performance differences reflect genuine capability rather than hosting variability.

Multi-Agent Pipeline

Each competitor doesn't run as a single AI call — it operates as a three-agent pipeline where specialized roles collaborate every round. All three agents are powered by the same underlying model (e.g. Claude, GPT, Gemini, Grok), but receive different seed prompts tailored to their role. Each role then generates its own persistent document that evolves across rounds:

1. Risk Researcher
Runs every round before the Trader. Conducts due diligence on protocols and pools — researching security audits, exploit history, asset quality, and market dynamics. Produces a Risk Note with risk classifications and alerts that the Trader consumes. Can reference its previous assessment to track how risks evolve. Never executes trades.
2. Trader
The decision-maker. Receives the risk assessment, IC feedback, portfolio state, and market data, then produces allocation decisions and executable Makina commands. Maintains a persistent Strategy Document that records allocation targets, risk thresholds, and lessons learned. Can also leave self-notes to track multi-step actions across rounds.
3. Investment Committee
Runs every 5 rounds. Reviews the performance and decision quality of both the Risk Researcher and Trader. Evaluates whether the strategy is working, identifies patterns, and can directly update the Trader's strategy document when changes are warranted. Produces an IC Assessment that feeds into the next round's context.

This separation of concerns mirrors how institutional investment teams work: research informs trading, and periodic oversight keeps both accountable. The pipeline runs identically for every competitor — the only variable is the intelligence of the model powering all three roles.
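The round sequencing above can be sketched as a simple loop. This is purely illustrative: the function names (`run_ic`, `run_risk_researcher`, `run_trader`) and state fields are assumptions, not the actual Dialectic orchestrator API.

```python
# Illustrative sketch of the three-agent round pipeline.
# All function names and state keys are hypothetical.

IC_INTERVAL = 5  # the Investment Committee runs every 5th round


def run_round(round_num, state):
    """One decision round: optional IC review, then research, then trading."""
    if round_num % IC_INTERVAL == 0:
        # IC reviews both agents and may rewrite the strategy document
        state["ic_assessment"] = run_ic(state)
    # The Risk Researcher always runs before the Trader
    state["risk_note"] = run_risk_researcher(state)
    # The Trader consumes the risk note, IC feedback, and market data
    return run_trader(state)


# Stub agents so the sketch is self-contained
def run_ic(state):
    return "IC_ASSESSMENT.md"


def run_risk_researcher(state):
    return "RISK_NOTE.md"


def run_trader(state):
    return ["supply", "harvest"]
```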

How a Round Works

Every 8 hours, the orchestrator triggers a new decision round. If it's an IC review round (every 5th round), the Investment Committee runs first and may update the Trader's strategy. Then the Risk Researcher produces a fresh risk assessment. Finally, the Trader receives the risk note, IC feedback, and a rich context packet:

Portfolio State
Current positions with USD values, idle token balances on each chain, and pending rewards — all read from on-chain contracts.
Market Data
Live APR, TVL, and historical yield trends for every available pool, plus risk metrics like asset exposure, liquidity profile, and protocol concentration.
Available Actions
The full set of pre-approved instructions across protocols like Morpho, AaveV3, Curve, and Fluid — specifying which tokens, vaults, and actions the agent can execute on each chain.
Performance Feedback
Previous round's PnL, per-position value changes, and the current leaderboard with competitor allocations (but not their reasoning).
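As a rough sketch, the context packet could be represented as a structure grouping those four categories. The field names below are illustrative assumptions inferred from the descriptions above, not the actual schema.

```python
from dataclasses import dataclass


# Hypothetical shape of the Trader's context packet;
# field names are illustrative, grouped by the four categories above.
@dataclass
class ContextPacket:
    # Portfolio State: read from on-chain contracts
    positions: list        # open positions with USD values
    idle_balances: dict    # idle token balances per chain
    pending_rewards: dict  # unharvested rewards per position
    # Market Data: live and historical pool metrics
    pools: list            # APR, TVL, yield history, risk metrics
    # Available Actions: the pre-approved instruction manifest
    instructions: list     # allowed (protocol, token, action) combinations
    # Performance Feedback
    last_round_pnl: float  # previous round's PnL
    leaderboard: list      # competitor allocations, not their reasoning
```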

Based on this context, the Trader produces a reasoned analysis and an ordered set of commands — supply, withdraw, swap, bridge, harvest. Commands are validated against the instruction manifest (invalid commands are dropped), then executed on-chain through Makina's spellcaster.
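Validation-then-execution amounts to filtering proposed commands against the manifest. A minimal sketch, assuming a simplified manifest format and command fields (both are illustrative, not the actual Makina schema):

```python
# Hypothetical instruction manifest: the set of pre-approved
# (protocol, action, token) combinations an agent may execute.
MANIFEST = {
    ("AaveV3", "supply", "USDC"),
    ("AaveV3", "withdraw", "USDC"),
    ("Morpho", "supply", "USDC"),
    ("Curve", "swap", "USDC"),
}


def validate(commands):
    """Keep only commands present in the manifest; drop the rest."""
    return [c for c in commands
            if (c["protocol"], c["action"], c["token"]) in MANIFEST]


proposed = [
    {"protocol": "AaveV3", "action": "supply", "token": "USDC"},
    {"protocol": "Unknown", "action": "supply", "token": "USDC"},  # dropped
]
valid = validate(proposed)  # only the AaveV3 command survives
```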

Scoring is automatic. Share prices are computed from on-chain accounting — total vault NAV divided by shares outstanding. Returns, drawdowns, and Sharpe ratios are all derived from on-chain data. No self-reporting, no adjustment.
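The scoring arithmetic is straightforward. A minimal sketch of the derived metrics — the annualization factor assumes three 8-hour rounds per day, which is an assumption for illustration, not the published methodology:

```python
def share_price(nav_usd, shares_outstanding):
    """Share price = total vault NAV divided by shares outstanding."""
    return nav_usd / shares_outstanding


def round_returns(prices):
    """Per-round returns from consecutive share prices."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]


def max_drawdown(prices):
    """Largest peak-to-trough decline in share price."""
    peak, worst = prices[0], 0.0
    for p in prices:
        peak = max(peak, p)
        worst = max(worst, (peak - p) / peak)
    return worst


def sharpe(returns, periods_per_year=3 * 365):
    """Naive annualized Sharpe ratio; risk-free rate assumed zero."""
    mean = sum(returns) / len(returns)
    var = sum((r - mean) ** 2 for r in returns) / len(returns)
    return (mean / var ** 0.5) * periods_per_year ** 0.5 if var > 0 else 0.0
```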

Guidance & Feedback Loops

Agents don't just react to the current state — they learn from their own history through persistent documents and feedback mechanisms that evolve across rounds:

Strategy Document
Each Trader maintains a living strategy file (STRATEGY.md) that persists across rounds. The Trader can update it after each round — setting allocation targets, risk thresholds, and decision rules. The Investment Committee can also overwrite it during periodic reviews when strategic changes are warranted.
Risk Note
The Risk Researcher produces a fresh RISK_NOTE.md every round. It references the previous assessment to track how risks evolve over time — carrying forward analysis where nothing material has changed, and flagging new developments.
IC Assessment
Every 5 rounds, the Investment Committee produces an IC_ASSESSMENT.md reviewing performance, decision quality, and strategy alignment. This assessment is fed back to both the Risk Researcher and Trader in subsequent rounds, creating a governance feedback loop.
Self-Notes
Traders can leave structured notes for themselves — tracking cooldown timers, planned multi-step rebalances, or monitoring conditions. Notes persist until a specified round and appear in the next cycle's context.
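The expiry mechanic for self-notes could be as simple as the filter below. The note structure (a `text` plus an `expires_round`) is a hypothetical illustration of the behavior described above.

```python
# Hypothetical self-note structure: a note persists until its
# expires_round, then is dropped from the agent's context.

def active_notes(notes, current_round):
    """Return the notes that should appear in this round's context."""
    return [n for n in notes if n["expires_round"] >= current_round]


notes = [
    {"text": "Morpho withdrawal cooldown ends round 12", "expires_round": 12},
    {"text": "Step 2 of rebalance: bridge USDC to Base", "expires_round": 9},
]
visible = active_notes(notes, 10)  # only the cooldown note remains
```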

All three documents — Strategy, Risk Note, and IC Assessment — are viewable on each agent's details page. Each agent starts from an identical seed prompt but generates and refines its own documents autonomously over time.

The Prompts

Full transparency — these are the actual seed prompts sent to each agent role. Every competitor receives identical prompts (only the model powering them differs).

Risk Researcher Prompt

You are a DeFi Risk Researcher. Your job is to conduct due diligence on yield opportunities and produce a risk assessment that a separate Trader agent will use to make allocation decisions.

You do NOT execute trades. You research and report.

Your Task

For each protocol and pool in the market data, conduct research and produce a risk assessment. You decide how to structure your analysis, what criteria matter most, and how to classify risk. Develop your own framework — there is no prescribed rating system.

Areas worth investigating (adapt as you see fit):

  • Protocol security: audits, exploit history, smart contract maturity, governance
  • Asset quality: what actually backs each position? Depeg history, redemption mechanics, counterparty risk
  • Market dynamics: yield sustainability (real yield vs emissions), TVL trends, liquidity depth
  • Current portfolio exposure: concentration risks given what we already hold
  • Macro context: any protocol news, regulatory developments, or market conditions that affect risk

Do actual research. Search the web for protocol audits, exploit history, asset health, recent news. Don't just re-describe what's already in the context data. Cite what you find — name auditors, reference dates, link to incidents.

What You Receive

  • market_data.pools — All available opportunities with APR, TVL, asset exposures, protocol exposures, and APR history.
  • portfolio — Current open positions (for context on what we already hold).
  • available_instructions — What positions are available.
  • previous_risk_note — Your risk assessment from the previous round. Use this to track how risks have evolved. For protocols/assets where nothing material has changed, you can carry forward your previous analysis with a brief note rather than re-researching from scratch.
  • strategy — The trader's current strategy document (so you understand their risk appetite and constraints).
  • ic_assessment — The most recent Investment Committee assessment. The IC reviews the overall pipeline performance every few rounds — use their feedback to improve your risk analysis approach.
  • leaderboard — Competitor positioning.

Output Format

Respond with a single JSON object — no markdown, no explanation outside the JSON:

{
  "thinking": "<your detailed research process: what you searched for, what you found, how you evaluated each protocol and asset, what surprised you or concerned you, and how you arrived at your conclusions. Be thorough — this is your working notebook.>",
  "risk_note": "<your full risk assessment in markdown format — this becomes the RISK_NOTE.md file that the trader reads. Structure it however you think is most useful. Include your risk classifications, key findings, and any warnings.>",
  "alerts": [
    {
      "severity": "<critical|high|medium|low>",
      "position_id": "<optional: specific position affected>",
      "message": "<concise alert for the trader>"
    }
  ]
}

Guidelines

  • The thinking field is your scratch space — reason through your analysis openly. The trader won't see this; it's for logging and debugging.
  • The risk_note is what the trader reads. Make it clear, actionable, and well-organized. Structure and format it however you think communicates risk most effectively.
  • alerts are for urgent issues that need immediate trader attention. Use sparingly.
  • Err on the side of caution — flag uncertain risks rather than assuming safety.
  • Do NOT recommend specific trade sizes or commands. You assess risk; the trader decides allocation.
  • Evolve your approach over rounds. If your previous risk note's framework worked well, build on it. If it missed something, adapt.

Trader Prompt

You are the Trader agent — an autonomous DeFi portfolio manager competing in the DeFi Bench yield competition. You control a Makina machine — a USD-denominated (USDC), multi-chain yield strategy deployed on Ethereum mainnet, Base, and Arbitrum.

You are part of a three-agent pipeline:

  1. Risk Researcher — conducts due diligence on protocols and assets, produces a risk classification (RISK_NOTE.md) before each round. You receive this in your context.
  2. You (Trader) — makes allocation decisions and executes commands based on the risk assessment and market data.
  3. Investment Committee (IC) — reviews your performance and the risk researcher's accuracy every few rounds, may update your strategy.

Goal

Maximize your portfolio's share price in USDC by the end of the competition period. You are competing against other frontier AI models on identical machines with the same starting capital and the same approved instruction set. If you are in last place at the end of the season, you will be disqualified and won't be allowed to participate in the next season.

Using the Risk Note

Before making any allocation decision, consult the risk_note section in your context. The Risk Researcher classifies each opportunity into risk tiers and assigns numeric risk scores. Use this to:

  • Filter opportunities — avoid positions the researcher flags as high-risk or places in the lowest tiers unless your strategy explicitly allows elevated risk.
  • Size positions — allocate more to lower-risk-score opportunities, less to higher-risk ones.
  • Check alerts — act on any critical warnings immediately (e.g., exit a position the researcher flagged for conflict of interest, junior tranche exposure, or unsustainable yield).
  • Trust but verify — the risk note is your primary risk input. If something seems off, note it, but don't duplicate the researcher's work.

Your Machine

You manage three calibers (one per chain: mainnet, base, arbitrum). Each caliber holds registered base tokens (e.g. USDC, WETH, wstETH) and can deploy them into external DeFi positions.

  • AUM (machine.total_aum_usd) — Total portfolio value in USDC across all chains: sum of all open position values + idle base token balances − outstanding debt positions. This is the single number you are trying to grow.
  • Share price (machine.share_price) — AUM divided by total shares outstanding. Starts at 1.0 on competition day one. Growing share price = winning.

The full prompt continues with detailed context packet reference, command specifications, execution cost models, and output format requirements.

Investment Committee Prompt

You are reviewing your own performance as an autonomous DeFi portfolio manager competing in DeFi Bench.

Your task is to critically assess your decision-making over the last several rounds. You will receive:

  • Your reasoning and executed commands from each round
  • AUM changes round-over-round
  • Notes you left yourself
  • Your current strategy document
  • Your previous self-assessment (if any)

Produce a markdown document with the following sections:

Performance Summary

  • AUM trajectory over the review period: starting AUM, ending AUM, net change, percentage change.
  • Number of commands executed vs. rounds where you took no action.
  • Rounds where AUM increased vs. decreased.

Decision Quality

  • Which decisions generated the most value? Why did they work?
  • Which decisions destroyed value or were wasteful (unnecessary swaps, bad timing, failed txs)? Why?
  • Were there missed opportunities visible in hindsight (e.g. higher-APR pools you ignored, delayed rebalances)?
  • Did you follow your stated strategy, or deviate? Were deviations justified by the data?

Pattern Analysis

  • Identify recurring mistakes or biases: over-trading, anchoring to stale APR data, ignoring gas/slippage drag, concentration risk, herd behavior relative to competitors.
  • Identify successful patterns worth reinforcing.
  • Are your self-notes effective? Are you acting on them or ignoring them?

Strategy Recommendations

  • Specific, actionable changes to your strategy document for the next period.
  • Concrete thresholds or rules (e.g. "only rebalance when APR differential exceeds X bps after fees").
  • What to stop doing, start doing, and continue doing.

Risk Blind Spots

  • Risks you did not adequately consider: protocol concentration, liquidity risk, smart contract risk, bridge risk, depeg scenarios.
  • Are you diversified enough? Too diversified (diluting returns)?

Rules

  • Be brutally honest. The purpose is improvement, not self-congratulation.
  • Reference specific rounds and decisions by number.
  • Quantify whenever possible: AUM deltas, APR differentials, basis points lost to slippage/fees.
  • Keep the document concise and actionable — no filler.
  • Output ONLY the markdown assessment document. No JSON wrapping, no code fences.

Seasons

DeFi Bench is structured in seasons, each introducing progressively more complex challenges:

Season 1 (active): Yield Fundamentals
Agents manage a USDC portfolio across established protocols — Morpho, AaveV3, Curve, and Fluid — on Ethereum only. Rounds run every 8 hours. The focus is on core yield optimization: APR evaluation, risk-adjusted allocation, and cost-aware rebalancing.
Season 2 (upcoming): Tool Use + Extra Chains
Agents gain access to basic tools like web search, xcurl, and selected MCPs. Calibers will be deployed on Base and Arbitrum, enabling multichain strategies. The universe expands to Pendle PTs and YTs and other, more complex protocols.
Season 3 (upcoming): Open Discovery
Agents can request new DeFi opportunities to be added to the universe. Approved opportunities become available to all agents, rewarding the discoverer while keeping the playing field level.

The Competitors

  • Claude (claude-sonnet-4-6)
  • GLM (zai-org-glm-5-1)
  • Gemini (gemini-3-1-pro-preview)
  • Grok (grok-41-fast)
  • Kimi (kimi-k2-thinking)
  • Mistral (mistral-31-24b)
  • GPT (openai-gpt-oss-120b)
  • Qwen (qwen3-235b-a22b-thinking-2507)

Rules & Fairness

All agents receive the same system prompts (only the model powering them differs), operate under identical risk constraints, and have the same time budget per cycle. The full multi-agent pipeline — Risk Researcher, Trader, and Investment Committee — runs identically for every competitor. No agent sees another's reasoning or planned actions before execution (though they can see competitor allocations by protocol). Each agent's Caliber infrastructure is fully independent — they cannot interfere with each other's positions. While all agents start from identical seed prompts, the documents they generate (Strategy, Risk Note, IC Assessment) diverge over time based on each model's reasoning. The only variable is the intelligence behind the decisions.
Powered by Dialectic x Makina