LLM Integration for Web3 Products: A Practical Guide

A weekend demo with an LLM is easy. Shipping it to real users is not.

An LLM is a large language model. You give it text, it gives you text back. Wire one to a DeFi product and you can type a question about a position and read the answer in plain English. That part takes a Friday afternoon. The version you can put in front of paying users is most of the work, and almost none of it shows up in the demo.

The prototype is one fetch call to an API. The shippable version is everything that stops that call from embarrassing you, costing you a fortune, or moving someone's funds when it should not. This is a guide to that gap, with the Web3 parts nobody writes about.

Why the Prototype Lies to You

The prototype works because you tested it five times, with clean inputs, on a good day for the provider. It returns text, you read the text, it looks right. Ship that and here is week one:

The model returns broken JSON and your UI shows a white screen.
Long questions turn out to cost you real money per call, and someone scripts a loop.
The provider has a 20-minute outage and your whole feature is down.
The model invents a token balance that is just wrong, and now a user has seen bad financial data.

None of this is exotic. This is the baseline. A shippable integration is mostly the boring scaffolding that turns "it worked that one time" into "it behaves the same way 10,000 times a day."

Doing It Properly

Prompt design and versioning

A prompt is the instruction text you send the model. It is code. Treat it like code. The most common mistake I see: a prompt string written inline in a request handler, edited live, with no record of what changed when the output gets worse.

Pull prompts into versioned files. Give each one a version string. Log which version produced which output. When quality drops after you "just tweaked the wording," you need to know exactly what the wording was.

// prompts/summarize-position.ts
export const SUMMARIZE_POSITION = {
  version: 'v3',
  system: `You are a DeFi assistant. You explain on-chain positions in plain language.
Rules:
- Only use numbers present in the provided context. Never estimate or invent values.
- If a value is missing from context, say so explicitly.
- Never give financial advice or predict prices.`,
} as const;

The version string costs nothing and saves you a day of debugging the first time output quietly degrades.

Rate limiting and cost control

Every LLM call is a small invoice. With no limits, one motivated user, or one bug in a retry loop, can run up a bill that wrecks your month. Put a hard cap per user and a global circuit breaker in front of the model, not behind it.

Rate-limit by user identity (wallet address, session, API key) before the request reaches the provider. Set a per-request output ceiling with max_tokens (or your provider's equivalent parameter) so one call cannot balloon. And cache hard: if ten users ask the same question about the same protocol within a minute, that is one model call and nine cache hits.
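Here is a minimal sketch of that gate, assuming a single process and in-memory state. The class names, the limits, and the injected clock are my own illustration, not any library's API. In production you would likely back both pieces with a shared store so the limits hold across instances.

```typescript
// Illustrative sketch: in-memory, single-process. All names and limits are assumptions.
type Clock = () => number;

class FixedWindowLimiter {
  private windows = new Map<string, { start: number; count: number }>();
  private maxCalls: number;
  private windowMs: number;
  private now: Clock;

  constructor(maxCalls: number, windowMs: number, now: Clock = () => Date.now()) {
    this.maxCalls = maxCalls;
    this.windowMs = windowMs;
    this.now = now;
  }

  // Returns true and counts the call if the user is under the cap; false otherwise.
  tryAcquire(userId: string): boolean {
    const t = this.now();
    const w = this.windows.get(userId);
    if (w === undefined || t - w.start >= this.windowMs) {
      this.windows.set(userId, { start: t, count: 1 }); // start a fresh window
      return true;
    }
    if (w.count >= this.maxCalls) return false;
    w.count += 1;
    return true;
  }
}

class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: Clock;

  constructor(ttlMs: number, now: Clock = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

type Gate<V> = { kind: 'cached'; value: V } | { kind: 'allowed' } | { kind: 'limited' };

// Check the cache first so a cache hit never spends the user's budget.
function checkGate<V>(
  limiter: FixedWindowLimiter,
  cache: TtlCache<V>,
  userId: string,
  cacheKey: string,
): Gate<V> {
  const hit = cache.get(cacheKey);
  if (hit !== undefined) return { kind: 'cached', value: hit };
  return limiter.tryAcquire(userId) ? { kind: 'allowed' } : { kind: 'limited' };
}
```

Only an 'allowed' result should reach the provider. A 'limited' result gets an honest "slow down" message, and the global circuit breaker is the same limiter keyed on one constant ID instead of a wallet address.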

Fallbacks: model and provider failover

One provider is one point of failure, and providers go down. Plan for failover from day one: a primary model, a secondary on a different provider, and an honest message if both are unreachable.

const PROVIDERS = [
  { name: 'anthropic', call: callClaude },
  { name: 'openai', call: callGPT },
] as const;

async function generateWithFailover(input: PromptInput): Promise<LLMResult> {
  let lastError: unknown;
  for (const provider of PROVIDERS) {
    try {
      return await provider.call(input);
    } catch (err) {
      lastError = err;
      logger.warn('provider failed, falling back', { provider: provider.name, err });
    }
  }
  // Every provider failed: degrade honestly, don't fabricate
  throw new LLMUnavailableError('All providers unavailable', { cause: lastError });
}

The line that matters is the final throw. When everything fails, you show an honest "this feature is temporarily unavailable." You never let the UI invent a nice-looking answer to fill the gap.

Structured output validation

This is the one that separates a toy from a product. Structured output means the model returns data in a fixed shape, like JSON, that your code reads. If your code reads the model's output programmatically, and in Web3 it almost always does, you cannot trust freeform text. Use the provider's JSON mode and check the result against a schema. I use zod.

import { z } from 'zod';

const PositionSummary = z.object({
  healthStatus: z.enum(['healthy', 'at_risk', 'liquidatable']),
  summary: z.string().max(500),
  warnings: z.array(z.string()),
  // Model echoes back the numbers it used so the caller can verify them against on-chain truth
  collateralUsd: z.number().nonnegative(),
  debtUsd: z.number().nonnegative(),
});

type PositionSummary = z.infer<typeof PositionSummary>;

async function getSummary(context: PositionContext): Promise<PositionSummary> {
  const raw = await generateWithFailover({
    system: SUMMARIZE_POSITION.system,
    user: buildContext(context),
    responseFormat: 'json',
  });

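  // JSON.parse throws on malformed JSON. Catch it so broken JSON takes the
  // same logged failure path as a schema mismatch instead of crashing the caller.
  let candidate: unknown;
  try {
    candidate = JSON.parse(raw.text);
  } catch {
    candidate = undefined; // fails the schema check below
  }

  const parsed = PositionSummary.safeParse(candidate);
  if (!parsed.success) {
    logger.error('schema validation failed', { errors: parsed.error.issues, raw: raw.text });
    throw new InvalidLLMOutputError(parsed.error);
  }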
  return parsed.data;
}

A validation failure is a real event, not an edge case. Log it, count it, alert on it. A rising rate of schema failures is your first sign that a prompt change or a model update broke something.
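Schema validation only proves the shape. The comment in the schema promises a second check: the numbers the model echoes back should match what you read from the chain. A minimal sketch of that comparison, assuming the two USD fields from the schema above. The function name and the tolerance are my own choices, not part of zod or any provider SDK.

```typescript
// Illustrative sketch. Field names mirror the schema above; the tolerance is an assumption.
interface UsdFigures {
  collateralUsd: number;
  debtUsd: number;
}

// Returns the names of fields where the model's echoed value drifts from the verified one.
function findMismatches(
  echoed: UsdFigures,
  verified: UsdFigures,
  relTolerance: number = 1e-6,
): string[] {
  const fields: (keyof UsdFigures)[] = ['collateralUsd', 'debtUsd'];
  return fields.filter((field) => {
    const expected = verified[field];
    const actual = echoed[field];
    // Scale the tolerance by the expected magnitude, with a floor of 1 for values near zero.
    const allowed = relTolerance * Math.max(1, Math.abs(expected));
    return Math.abs(actual - expected) > allowed;
  });
}
```

Treat a non-empty result exactly like a schema failure: log it, count it, and do not show the summary.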

Streaming UX

Users will wait for a streamed response far longer than they wait for a spinner. For anything conversational, stream the tokens as they arrive. The catch: you cannot check a schema mid-stream. So split it. Stream the human-readable text so it feels fast, but hold any structured, action-driving data behind a complete, validated response. Never let a half-streamed object trigger logic.
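A rough sketch of that split. The event shape here is hypothetical (real provider SDKs deliver chunks in their own formats, usually as async iterables), and I use a plain array so the logic is easy to follow. Text deltas go straight to the UI. Data deltas are buffered and only reach your logic after the stream has finished and the payload has passed validation.

```typescript
// Hypothetical event shape for illustration; real provider SDKs differ.
type StreamEvent =
  | { type: 'text'; delta: string } // human-readable tokens, safe to render immediately
  | { type: 'data'; delta: string } // fragments of a JSON payload, never acted on mid-stream
  | { type: 'done' };

interface StreamHandlers<T> {
  onText: (delta: string) => void;
  onAction: (payload: T) => void;
  validate: (candidate: unknown) => T | null; // e.g. a thin wrapper around a schema check
}

function consumeStream<T>(events: StreamEvent[], handlers: StreamHandlers<T>): void {
  let buffer = '';
  let finished = false;

  for (const event of events) {
    if (event.type === 'text') {
      handlers.onText(event.delta);
    } else if (event.type === 'data') {
      buffer += event.delta;
    } else {
      finished = true;
      break;
    }
  }

  // Stream cut off before 'done': the text was shown, but no action fires.
  if (!finished) return;

  let candidate: unknown;
  try {
    candidate = JSON.parse(buffer);
  } catch {
    return; // incomplete or malformed payload
  }

  const payload = handlers.validate(candidate);
  if (payload !== null) handlers.onAction(payload);
}
```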

Observability and logging

You cannot improve what you cannot see. Log every call: prompt version, provider, token counts, latency, cost, and the validation result. The first time a user says "it gave me a weird answer," you want to pull the exact request and response, not shrug.

The cost field matters more than people expect. Per-call costs are tiny. Add them up across thousands of users and they become a line item, and you want to know which feature and which prompt version drives the spend before finance asks.
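As a sketch, the log record and the roll-up can be this small. The field names are my own, and the prices are placeholders you would replace with your provider's published per-token rates.

```typescript
// Illustrative sketch. Field names and pricing structure are assumptions.
interface LLMCallLog {
  feature: string;
  promptVersion: string;
  provider: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  validationOk: boolean;
  costUsd: number;
}

interface Pricing {
  inputPerMillionUsd: number;
  outputPerMillionUsd: number;
}

// Cost of one call from its token counts and a per-million-token price.
function estimateCostUsd(inputTokens: number, outputTokens: number, pricing: Pricing): number {
  const total =
    inputTokens * pricing.inputPerMillionUsd + outputTokens * pricing.outputPerMillionUsd;
  return total / 1_000_000;
}

// Sum spend per "feature@promptVersion" so you can see which one drives the bill.
function spendByFeatureAndVersion(logs: LLMCallLog[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const log of logs) {
    const key = `${log.feature}@${log.promptVersion}`;
    totals.set(key, (totals.get(key) ?? 0) + log.costUsd);
  }
  return totals;
}
```

Grouping by feature and prompt version together is deliberate: a prompt rewrite that doubles the output length shows up as a new key with a higher total, not as an unexplained bump in the monthly bill.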

The Web3 Angle: Where This Gets Interesting

Everything above applies to any LLM integration. What makes Web3 different, and what makes it actually useful instead of a chatbot bolted onto a dapp, is grounding the model in on-chain reality.

A language model knows nothing about your protocol's current state. Ask it about a user's position with no real data and it will confidently make something up. The fix is the same pattern I lean on for building AI agents with blockchain integration: read the real state first, then hand it to the model as context. The model explains and summarizes. It is never the source of truth for a number.

Grounding on on-chain data

Read contract state directly with viem (or pull from an indexer or subgraph for anything that needs history or aggregation), then put those verified values into the prompt.

import { createPublicClient, http, formatUnits, parseAbi } from 'viem';
import { mainnet } from 'viem/chains';

const client = createPublicClient({ chain: mainnet, transport: http(process.env.ETH_RPC_URL) });

const POOL_ABI = parseAbi(['function getPosition(address user) view returns (uint256 collateral, uint256 debt)']);

async function buildPositionContext(pool: `0x${string}`, user: `0x${string}`) {
  const [collateral, debt] = await client.readContract({
    address: pool,
    abi: POOL_ABI,
    functionName: 'getPosition',
    args: [user],
  });

  // Hand the model VERIFIED numbers. It explains them; it does not source them.
  // Assumes the pool reports both values in USD with 18 decimals; adjust for your protocol.
  return {
    collateralUsd: Number(formatUnits(collateral, 18)),
    debtUsd: Number(formatUnits(debt, 18)),
    healthRatio: debt === 0n ? null : Number(collateral) / Number(debt),
  };
}

The rule here: the model never computes the health ratio and never invents a balance. You compute it in your own code from values read on-chain, pass it in, and ask the model to turn it into a sentence a human can read. The numbers come from the chain. The words come from the LLM. Keep that line clear.

Summarizing transactions and protocol state

This is where the value shows up for real users. A raw transaction is a hex blob and a calldata dump. A position is a list of bigints. Most users can read neither. Decode the transaction or read the state yourself, feed the structured result to the model, and let it write "you supplied 2.5 ETH as collateral and borrowed 3,000 USDC; your position is healthy with a 1.8x ratio." That layer, verified data in and plain language out, is a real product, not a trick.

Letting an agent prepare, never blindly sign

Here is the hard line, and it does not move. An LLM can read on-chain state, reason about it, and prepare a transaction. It must never sign one on its own when funds are involved.

The model's output is, at most, a transaction request: a to, a value, encoded calldata, a plain description of intent. That request goes to a human, who reviews it and approves the signing. The model's text is untrusted input right up until a person looks at the decoded transaction and confirms it matches what they asked for.

// The agent PREPARES. A human SIGNS. These never collapse into one step.
interface PreparedTransaction {
  to: `0x${string}`;
  value: bigint;
  data: `0x${string}`;
  humanSummary: string; // what the user will read and approve
}

async function executeWithApproval(tx: PreparedTransaction, approver: HumanApprover) {
  const approved = await approver.confirm(tx.humanSummary, tx); // hard gate
  if (!approved) return { status: 'rejected' as const };
  return signAndSend(tx); // only reached after explicit human confirmation
}

I do not care how good the model is. Prompt injection is real (that is when hidden text in the input tricks the model into doing something you never asked for). Hallucinated parameters are real. A model that "decided" to approve unlimited token spend is a headline. The boundary between prepare and sign is the most important design decision in any Web3 LLM integration. Cross it for convenience and you will pay for it later with someone's money.
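One cheap guard you can put in front of the approval screen is a deterministic check on the calldata itself, so the warning never depends on the model's own summary. A sketch, assuming standard ERC-20 approve(address,uint256) calldata. The function name is mine, and it catches this one pattern only; it is an addition to human review, not a replacement for it.

```typescript
// Illustrative sketch: flags one known-dangerous pattern in prepared calldata.
const APPROVE_SELECTOR = '0x095ea7b3'; // ERC-20 approve(address,uint256)
const MAX_UINT256_HEX = 'f'.repeat(64); // 2^256 - 1, the usual "unlimited" allowance
const SELECTOR_END = 10; // '0x' plus 8 hex chars
const WORD = 64; // one 32-byte ABI word in hex chars

function flagRisks(data: string): string[] {
  const warnings: string[] = [];
  const hex = data.toLowerCase();

  const isApprove =
    hex.startsWith(APPROVE_SELECTOR) && hex.length === SELECTOR_END + 2 * WORD;
  if (isApprove) {
    // The address is the low 20 bytes (40 hex chars) of the first word.
    const spender = '0x' + hex.slice(SELECTOR_END + 24, SELECTOR_END + WORD);
    const amount = hex.slice(SELECTOR_END + WORD);
    if (amount === MAX_UINT256_HEX) {
      warnings.push(`Unlimited token approval to ${spender}`);
    }
  }
  return warnings;
}
```

Show any warning above the model's summary, in your own words, so the human approver sees it even if the summary is wrong.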

What Shippable Actually Means

A prototype answers a question once. A shippable LLM integration checks every output against a schema, fails over when a provider dies, caps what any user can spend, grounds every claim in on-chain truth, logs enough to debug a complaint three weeks later, and never lets the model touch funds without a human in the loop.

That is the part the weekend demo skips. None of it is exciting. All of it is the difference between a feature you can stand behind and one you spend your time apologizing for.

So before you ship: which of those six is missing from your build right now?

If you're building AI agents and LLM integration into a Web3 product and want the shippable version rather than the demo, hire me. This is the work I do.