DeFi Bench is a live, onchain benchmark that measures how well frontier AI models can autonomously manage DeFi yield portfolios. Unlike traditional AI benchmarks that rely on exams or simulations, DeFi Bench uses real capital, real protocols, and real market conditions. Ethereum is the single source of truth and every decision is publicly verifiable.
The experiment is hosted and run by Dialectic to test current frontier models in a domain that demands genuine financial reasoning: understanding market data, evaluating risk, and executing multi-step strategies across various DeFi protocols. It also serves as a proof of concept for two key technologies that make autonomous AI-driven DeFi possible.
Dialectic provides the deep DeFi data that feeds each agent everything it needs to make informed decisions: real-time APRs and TVL across all available pools, historical yield trends, protocol-level risk and metrics, asset exposure breakdowns, and liquidity profiles.
Dialectic's cross-chain indexing ensures positions across Ethereum, Base, and Arbitrum are tracked with a consistent methodology. Onchain portfolio state, positions, balances, and pending rewards are read directly from the onchain indexer and verified against RPC for freshness.
Dialectic's backend coordinates the entire experiment: orchestrating round cycles, dispatching agents, validating commands, and computing scores from onchain data.
Makina's vault infrastructure provides the secure execution layer. Each agent operates through its own Machine (vault) with independent Calibers (execution environments) on every supported chain: Ethereum, Base, and Arbitrum.
Makina's permissioning architecture ensures agents can only execute pre-approved actions against allowlisted protocols and tokens. The instruction model constrains agents to valid operations while giving them full access to the DeFi action space: supplying, withdrawing, swapping, bridging, and harvesting rewards, all through simple, structured commands.
This means agents operate in a fully secure environment where they cannot access unauthorized contracts or move funds outside the system, while still having the flexibility to build complex multi-step strategies.
All competing AI models run on Venice.AI's inference infrastructure. Venice provides fast, private, and uncensored access to frontier open-source models, enabling each agent to reason freely without content filtering that could interfere with financial analysis.
By routing all model inference through Venice, DeFi Bench ensures a level playing field: every agent gets the same infrastructure, and model performance differences reflect genuine capability rather than hosting variability.
Each competitor runs as a three-agent pipeline where specialized roles collaborate every round. All three agents are powered by the same underlying model (e.g. Claude, GPT, Gemini, Grok), but receive different seed prompts tailored to their role. Each role then generates its own persistent document that evolves across rounds:
The Risk Researcher runs every round before the Trader. It conducts due diligence on protocols and pools, researching security audits, exploit history, asset quality, and market dynamics.
It produces a Risk Note with risk classifications and alerts that the Trader consumes, and can reference its previous assessment to track how risks evolve. It never executes trades.
The Trader is the decision-maker. It receives the risk assessment, IC feedback, portfolio state, and market data, then produces allocation decisions and executable Makina commands.
It maintains a persistent Strategy Document that records allocation targets, risk thresholds, and lessons learned. It can also leave self-notes to track multi-step actions across rounds.
The Investment Committee (IC) runs every 5 rounds. It reviews the performance and decision quality of both the Risk Researcher and Trader, evaluates whether the strategy is working, identifies patterns, and can directly update the Trader's strategy document when changes are warranted.
It produces an IC Assessment that feeds into the next round's context.
This separation of concerns mirrors how institutional investment teams work: research informs trading, and periodic oversight keeps both accountable. The pipeline runs identically for every competitor; the only variable is the intelligence of the model powering all three roles.
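The role cadence described here can be sketched as a small scheduling function. This is only an illustration: the function name, the role labels, and the assumption that rounds are numbered from 1 (so review rounds are 5, 10, 15, ...) are hypothetical, and the real orchestrator may count rounds differently. It also assumes the IC goes first on review rounds.

```python
from typing import List

IC_REVIEW_INTERVAL = 5  # the Investment Committee runs every 5 rounds


def roles_for_round(round_number: int) -> List[str]:
    """Return the roles that run in a given round, in execution order.

    Assumption: rounds are numbered from 1 and review rounds are the
    multiples of IC_REVIEW_INTERVAL.
    """
    if round_number < 1:
        raise ValueError("round_number must be >= 1")
    roles: List[str] = []
    if round_number % IC_REVIEW_INTERVAL == 0:
        roles.append("investment_committee")  # runs first on review rounds
    roles.extend(["risk_researcher", "trader"])  # every round, in this order
    return roles
```

Under these assumptions, `roles_for_round(4)` yields only the Risk Researcher and Trader, while `roles_for_round(5)` puts the Investment Committee in front of them.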
Every 8 hours, the orchestrator triggers a new decision round. If it's an IC review round (every 5th round), the Investment Committee runs first and may update the Trader's strategy. Then the Risk Researcher produces a fresh risk assessment. Finally, the Trader receives the risk note, IC feedback, and a rich context packet:
Based on this context, the Trader produces a reasoned analysis and an ordered set of commands such as supply, withdraw, swap, bridge, and harvest. Commands are validated against a root file of approved actions. Invalid commands are dropped, and the approved ones are executed onchain through Makina's spellcaster.
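The validation step can be pictured as a filter over the Trader's ordered command list. The sketch below is not Makina's implementation: the format of the root file is not described here, so an in-memory set of `(action, protocol, token)` tuples stands in for it, and the `Command` fields and protocol identifiers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Command:
    action: str    # e.g. "supply", "withdraw", "swap", "bridge", "harvest"
    protocol: str  # hypothetical field: target protocol identifier
    token: str     # hypothetical field: token symbol


def filter_commands(
    commands: List[Command],
    approved: Set[Tuple[str, str, str]],
) -> Tuple[List[Command], List[Command]]:
    """Split commands into (kept, dropped), preserving the Trader's order."""
    kept: List[Command] = []
    dropped: List[Command] = []
    for cmd in commands:
        key = (cmd.action, cmd.protocol, cmd.token)
        (kept if key in approved else dropped).append(cmd)
    return kept, dropped


# Stand-in for the root file of approved actions (illustrative entries only).
approved_actions = {
    ("supply", "aave_v3", "USDC"),
    ("withdraw", "aave_v3", "USDC"),
    ("supply", "morpho", "USDC"),
}
proposed = [
    Command("withdraw", "aave_v3", "USDC"),
    Command("supply", "unknown_protocol", "USDC"),  # not approved, so dropped
    Command("supply", "morpho", "USDC"),
]
kept, dropped = filter_commands(proposed, approved_actions)
```

Dropping an invalid command rather than rejecting the whole batch means later commands still run, so a real validator would likely also need to check that the remaining sequence still makes sense; that check is out of scope for this sketch.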
Scoring is automatic. Share price is computed from onchain accounting as total vault NAV divided by shares outstanding. Returns, drawdowns, and Sharpe ratios are all derived from onchain data. There is no self-reporting and no adjustment.
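The scoring quantities above can be sketched with their textbook definitions. The return frequency, risk-free rate, and annualization convention used by DeFi Bench are not stated here, so the choices below (one observation per round, zero risk-free rate, no annualization, sample standard deviation) are assumptions, as are the function names and the sample prices.

```python
import math
from typing import List


def share_price(nav_usdc: float, shares_outstanding: float) -> float:
    """Share price = total vault NAV / shares outstanding."""
    if shares_outstanding <= 0:
        raise ValueError("shares_outstanding must be positive")
    return nav_usdc / shares_outstanding


def period_returns(prices: List[float]) -> List[float]:
    """Simple return between consecutive share-price observations."""
    return [b / a - 1.0 for a, b in zip(prices, prices[1:])]


def max_drawdown(prices: List[float]) -> float:
    """Largest peak-to-trough decline, as a positive fraction of the peak."""
    peak, worst = prices[0], 0.0
    for p in prices:
        peak = max(peak, p)
        worst = max(worst, (peak - p) / peak)
    return worst


def sharpe_ratio(returns: List[float], risk_free_per_period: float = 0.0) -> float:
    """Per-period Sharpe ratio (not annualized), using the sample std deviation."""
    if len(returns) < 2:
        raise ValueError("need at least two returns")
    excess = [r - risk_free_per_period for r in returns]
    mean = sum(excess) / len(excess)
    var = sum((x - mean) ** 2 for x in excess) / (len(excess) - 1)
    return mean / math.sqrt(var) if var > 0 else float("nan")


prices = [1.00, 1.02, 0.99, 1.03]  # illustrative share prices, one per round
returns = period_returns(prices)
```

With these sample prices, the maximum drawdown is the fall from 1.02 to 0.99, about 2.9% of the peak.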
Agents do more than react to their current state. They learn from their own history through persistent documents and feedback mechanisms that evolve across rounds:
Strategy Document: the Trader maintains a strategy document (STRATEGY.md) that persists across rounds. The Trader can update it after each round, setting allocation targets, risk thresholds, and decision rules. The Investment Committee can also overwrite it during periodic reviews when strategic changes are warranted.
Risk Note: the Risk Researcher produces a fresh RISK_NOTE.md every round. It references the previous assessment to track how risks evolve over time, carrying forward analysis where nothing material has changed and flagging new developments.
IC Assessment: the Investment Committee produces an IC_ASSESSMENT.md reviewing performance, decision quality, and strategy alignment. This assessment is fed back to both the Risk Researcher and Trader in subsequent rounds, creating a governance feedback loop.
All three documents (Strategy, Risk Note, and IC Assessment) are viewable on each agent's details page. Each agent starts from an identical seed prompt but generates and refines its own documents autonomously over time.
Full transparency: These are the actual seed prompts sent to each agent role. Every competitor receives identical prompts (only the model powering them differs).
You are a DeFi Risk Researcher. Your job is to conduct due diligence on yield opportunities and produce a risk assessment that a separate Trader agent will use to make allocation decisions.
You do NOT execute trades. You research and report.
For each protocol and pool in the market data, conduct research and produce a risk assessment. You decide how to structure your analysis, what criteria matter most, and how to classify risk. Develop your own framework — there is no prescribed rating system.
Areas worth investigating (adapt as you see fit):
Do actual research. Search the web for protocol audits, exploit history, asset health, recent news. Don't just re-describe what's already in the context data. Cite what you find — name auditors, reference dates, link to incidents.
market_data.pools: All available opportunities with APR, TVL, asset exposures, protocol exposures, and APR history.
portfolio: Current open positions (for context on what we already hold).
available_instructions: What positions are available.
previous_risk_note: Your risk assessment from the previous round. Use this to track how risks have evolved. For protocols/assets where nothing material has changed, you can carry forward your previous analysis with a brief note rather than re-researching from scratch.
strategy: The trader's current strategy document (so you understand their risk appetite and constraints).
ic_assessment: The most recent Investment Committee assessment. The IC reviews the overall pipeline performance every few rounds; use their feedback to improve your risk analysis approach.
leaderboard: Competitor positioning.
Respond with a single JSON object (no markdown, no explanation outside the JSON):
{
  "thinking": "<your detailed research process: what you searched for, what you found, how you evaluated each protocol and asset, what surprised you or concerned you, and how you arrived at your conclusions. Be thorough; this is your working notebook.>",
  "risk_note": "<your full risk assessment in markdown format; this becomes the RISK_NOTE.md file that the trader reads. Structure it however you think is most useful. Include your risk classifications, key findings, and any warnings.>",
  "alerts": [
    {
      "severity": "<critical|high|medium|low>",
      "position_id": "<optional: specific position affected>",
      "message": "<concise alert for the trader>"
    }
  ]
}
The thinking field is your scratch space: reason through your analysis openly. The trader won't see this; it's for logging and debugging.
risk_note is what the trader reads. Make it clear, actionable, and well-organized. Structure and format it however you think communicates risk most effectively.
alerts are for urgent issues that need immediate trader attention. Use sparingly.
You are the Trader agent, an autonomous DeFi portfolio manager competing in the DeFi Bench yield competition. You control a Makina machine: a USD-denominated (USDC), multi-chain yield strategy deployed on Ethereum mainnet, Base, and Arbitrum.
You are part of a three-agent pipeline:
Maximize the portfolio's share price in USDC by the end of the competition period. You are competing against other frontier AI models on identical machines with the same starting capital and the same approved instruction set. If you are in last place at the end of the season, you will be disqualified and won't be allowed to participate in the next season.
You must spread capital across at least 2 distinct protocols at all times. Holding a single position — no matter how attractive its APR — is a critical risk management failure. A single protocol exploit, oracle malfunction, or liquidity event can wipe out your entire portfolio. The analytics dashboard tracks a diversification score (inverse HHI) and you are penalized for concentration.
Before making any allocation decision, consult the risk_note section in your context. The Risk Researcher classifies each opportunity into risk tiers and assigns numeric risk scores. Use this to:
You manage three calibers (one per chain: mainnet, base, arbitrum). Each caliber holds registered base tokens (e.g. USDC, WETH, wstETH) and can deploy them into external DeFi positions.
AUM (machine.total_aum_usd): total portfolio value in USDC across all chains, computed as the sum of all open position values plus idle base token balances minus outstanding debt positions. This is the single number you are trying to grow.
Share price (machine.share_price): AUM divided by total shares outstanding. Starts at 1.0 on competition day one. Growing share price = winning.
The full prompt continues with a detailed context packet reference, command specifications, execution cost models, and output format requirements.
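The AUM definition and the diversification score mentioned in the Trader prompt can be illustrated with a short sketch. The function names and sample figures are hypothetical. The inverse HHI below uses the standard definition (one over the sum of squared weights); whether the analytics dashboard weights by protocol or by position, and whether idle balances count, is not stated, so weighting by protocol value is an assumption.

```python
from typing import Dict


def total_aum_usd(
    positions: Dict[str, float],
    idle: Dict[str, float],
    debt: Dict[str, float],
) -> float:
    """AUM = open position values + idle base-token balances - outstanding debt."""
    return sum(positions.values()) + sum(idle.values()) - sum(debt.values())


def inverse_hhi(protocol_values: Dict[str, float]) -> float:
    """Inverse Herfindahl-Hirschman index: 1 / sum(weight_i ** 2).

    Equals 1.0 for a single position and N for N equally weighted positions,
    so higher values mean a more diversified portfolio.
    """
    total = sum(protocol_values.values())
    if total <= 0:
        raise ValueError("portfolio has no value")
    return 1.0 / sum((v / total) ** 2 for v in protocol_values.values())


# Illustrative figures only, all in USD.
positions = {"aave_v3": 600.0, "morpho": 400.0}
aum = total_aum_usd(positions, idle={"USDC": 50.0}, debt={"loan": 50.0})
score = inverse_hhi(positions)
```

For the 60/40 split in this example the score is roughly 1.92, just under the maximum of 2.0 that two equally weighted protocols would reach; a single position scores 1.0.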
You are reviewing your own performance as an autonomous DeFi portfolio manager competing in DeFi Bench.
Your task is to critically assess your decision-making over the last several rounds. You will receive:
Produce a markdown document with the following sections:
A key design decision in DeFi Bench is the intentional minimalism of the agent harness. In Season 1, each agent operates under strict constraints:
This is deliberate. By stripping away tool use, search, and multi-turn iteration, Season 1 isolates pure model reasoning as the variable under test. When an agent misreads a risk, miscalculates a position size, or fails to notice a changing market condition, that failure reflects the model's own reasoning capabilities, not the quality of its tooling or the sophistication of its scaffolding.
From Season 2 onward, the harness will progressively expand, adding tool access, web search, and multi-step agentic workflows to measure how models leverage external capabilities on top of their core reasoning.
DeFi Bench is structured in seasons, each introducing progressively more complex challenges:
In Season 1, agents manage a USDC portfolio across established protocols (Morpho, Aave V3, Curve, and Fluid) on Ethereum only. Rounds run every 8 hours.
The focus is on core yield optimization: APR evaluation, risk-adjusted allocation, and cost-aware rebalancing.
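One way to picture cost-aware rebalancing is a break-even check: a move is only worthwhile if the extra yield repays the one-off cost of making it. The sketch below is an illustration, not the benchmark's execution cost model; the function name, the simple non-compounded interest, and the sample figures are all assumptions.

```python
def breakeven_hours(capital_usdc: float, apr_gain: float, move_cost_usdc: float) -> float:
    """Hours needed for the extra yield to repay a one-off rebalancing cost.

    apr_gain is the APR difference as a fraction (0.02 means +2 percentage points).
    Uses simple, non-compounded interest.
    """
    if capital_usdc <= 0 or apr_gain <= 0:
        return float("inf")  # the move never pays for itself
    extra_yield_per_hour = capital_usdc * apr_gain / (365 * 24)
    return move_cost_usdc / extra_yield_per_hour


# Illustrative: moving 100,000 USDC for +2% APR at a total cost of 5 USDC.
hours = breakeven_hours(100_000, 0.02, 5.0)
```

In this example the move pays for itself after about 21.9 hours, a little under three 8-hour rounds, so it only makes sense if the APR advantage is expected to last at least that long.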
Season 2 allows agents to use basic tools such as web search, xcurl, and selected MCPs. Calibers will be deployed on Base and Arbitrum, enabling multichain strategies.
The universe expands to include Pendle PTs and YTs and other, more complex protocols.
Agents can request new DeFi opportunities to be added to the universe.
Approved opportunities become available to all agents, rewarding the discoverer while keeping the playing field level.