

Why You Should Optimize Context, Not Prompts

13 min read

Prompt optimization is the obvious next step after prompt engineering.

If hand-written prompts are brittle, let an optimizer improve them. Let it search over instructions. Let it select examples. Let it compile a better prompt from data and metrics.

This is a reasonable idea. It has produced useful systems. DSPy is a strong representative of this direction: it frames itself as a framework for programming rather than prompting language models, and its optimizers can tune the prompts or LM weights of a DSPy program against a user-specified metric.

But for many agents, prompt optimization is not the most important optimization surface.

The problem is not that prompts do not matter.

The problem is that prompts are often the wrong thing to optimize first.

An agent prompt is not just an instruction. It is a mixture:

prompt =
    system_instruction
  + role
  + context_data
  + examples
  + user_instruction

Prompt optimization usually focuses on the instruction-shaped parts of this mixture: the wording, the demonstrations, the phrasing of the task, the examples that surround the call.

But agent behavior often depends more on context_data: retrieved documents, tool observations, memory records, intermediate state, previous generated values, summaries, citations, and outputs from other agents.

If the model sees the right evidence, a mediocre instruction often works.

If the model sees the wrong evidence, a beautiful instruction only makes the wrong answer more fluent.

This post is about a different optimization target:

Do not only optimize what you say to the model. Optimize what the model sees.

Figure: Selectable context slots make context optimization a source-level decision.

The Prompt Is Not the Unit of an Agent

For a single LLM call, the prompt feels like the obvious unit.

You write an instruction. You provide a few examples. You test the output. You adjust the wording.

That mental model breaks down in agents.

An agent does not just have a prompt. It has a changing context boundary.

A research agent may have:

  • the user question
  • search queries
  • search results
  • retrieved documents
  • summaries of those documents
  • memory records from previous runs
  • intermediate observations
  • failed attempts
  • tool errors
  • outputs from helper agents
  • final answer requirements

Some of that data should enter the next model call. Some should not.

The failure mode is often not:

The instruction was unclear.

It is:

The model saw the wrong context.

Or:

The right context existed in the program, but it never entered the prompt.

Or:

The context entered the prompt, but in the wrong form.

A tool result might be too large. A memory query might retrieve stale advice. A summary might drop the citation needed by the final answer. A reasoning step may need a compact digest, while the answer step needs source-preserving excerpts.

These are not mainly wording problems.

They are context selection problems.

Prompt Text Is a Soft Optimization Variable

Prompt optimization has an attractive promise: replace manual prompt tweaking with search.

DSPy's MIPROv2, for example, jointly optimizes instructions and few-shot examples. The MIPRO work frames the problem as optimizing free-form instructions and demonstrations for modules in multi-stage LM programs, using downstream metrics without gradients or module-level labels.

That is a serious and useful line of work.

But the optimized object is still soft.

Prompt wording has difficult properties:

high-dimensional
free-form
model-sensitive
hard to constrain
hard to compare semantically
easy to overfit
hard to debug when it works
hard to debug when it fails

A prompt diff can show that some words changed. It rarely explains why the system became more reliable.

Did the model improve because the new instruction is genuinely better?

Because one phrase happens to activate a behavior in this model?

Because a few-shot example resembles the validation set?

Because the optimizer found a local trick that will disappear with the next model version?

Because the downstream metric is incomplete?

These questions are not signs that prompt optimizers are useless. They are signs that free-form prompt text is a difficult optimization surface.

It is too close to model-specific behavior.

It is too far from the actual data-flow problem inside many agents.

Context Is a Better Optimization Surface

Context is different.

Context is not primarily a sentence.

Context is a set of choices.

Should the model see raw tool output or a summary?

Should it see the top three documents or the top ten?

Should it see recent memory, durable lessons, or no memory at all?

Should it see citation-preserving excerpts or compressed observations?

Should a final answer step receive the full evidence bundle, while an earlier planning step only receives a small digest?

These are structured decisions.

They are easier to enumerate:

none
compact digest
summary
top documents
source-preserving excerpts
combined memory + docs

They are easier to inspect:

This run used docs.top5 max 4k.
That run used scratch.digest max 500.
This run omitted memory entirely.

They are easier to trace:

Which context source entered the prompt?
How much token budget did it receive?
Was it clipped?
Was the source URL preserved?
Was this context slot omitted?

They are also more portable across models.

Different models may respond differently to the same phrase. But the question "should the model see the retrieved documents or only a summary?" is less tied to one model's prompt quirks. GPT, Claude, Gemini, Qwen, and local models may differ in style, but they all need the right evidence.

Context optimization is not magic. It can overfit too. A context policy selected on one dataset may fail on another distribution.

But when it fails, the failure is inspectable.

You can see the selected source. You can see what was omitted. You can see whether the model had access to the evidence. You can change a structured choice, not a spell.

From Prompt Engineering to Context Engineering

Prompt engineering asks:

What should I say to the model?

Context engineering asks:

What should the model see?

That shift matters.

In ordinary Python or TypeScript agents, context often appears as strings, message arrays, framework objects, or prompt templates. The boundary between local program data and prompt context is maintained by convention.

AgentScript starts from a stricter rule:

Local data is not prompt context.

Tool results, memory query results, intermediate observations, trace events, and outputs from other agents do not enter the prompt automatically.

If the model should see a value, the program must select it with use.

use input.question as "user question"
use docs.summary max 4k as "evidence"
use past max 2k as "past lessons"

This is the first step.

Context becomes visible.

A use declaration says:

source = docs.summary
label = "evidence"
budget = 4k

It is not just a string template. It is a context contract.
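
One way to picture that contract is as a small record that a renderer consumes. The Python sketch below is an illustration under stated assumptions, not AgentScript's implementation: `UseDecl` and `render_context` are hypothetical names, and the budget is counted in whitespace-separated words as a stand-in for tokens.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UseDecl:
    """Hypothetical model of one `use` declaration: source, label, budget."""
    source: str              # e.g. "docs.summary"
    label: str               # e.g. "evidence"
    budget: Optional[int]    # max words (stand-in for tokens); None = no limit


def render_context(decls, values):
    """Build labeled prompt sections from explicitly selected values only.

    `values` maps source names to text. A value without a matching
    declaration never reaches the prompt.
    """
    sections, trace = [], []
    for decl in decls:
        words = values[decl.source].split()
        clipped = decl.budget is not None and len(words) > decl.budget
        if clipped:
            words = words[: decl.budget]
        sections.append(f"[{decl.label}]\n{' '.join(words)}")
        trace.append({"source": decl.source, "label": decl.label, "clipped": clipped})
    return "\n\n".join(sections), trace


values = {
    "input.question": "What changed in the release?",
    "docs.summary": "alpha beta gamma delta epsilon",
    "tool.raw_output": "never selected, so never shown to the model",
}
decls = [
    UseDecl("input.question", "user question", None),
    UseDecl("docs.summary", "evidence", 3),
]
prompt, trace = render_context(decls, values)
print(prompt)
print(trace)
```

The point of the sketch is the asymmetry: `tool.raw_output` exists in the program but has no declaration, so it cannot appear in the prompt, and the trace records whether each selected value was clipped.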

The next step is natural:

If context can be explicit, context alternatives can be explicit too.

use one of: Context Search Space as Code

AgentScript can express a context choice directly:

use one of {
  none: empty
  compact: scratch.digest max 500
  verbose: scratch.summary max 4k
  grounded: docs.top5 max 4k selected
} as "evidence"

This says:

There is a context slot named "evidence".

It can be absent.

It can use a compact digest.

It can use a verbose summary.

It can use retrieved documents.

The current selected version is grounded.

This is not runtime magic.

There is no hidden optimizer running inside the agent.

There is no invisible learning state.

There is only a declaration of possible context sources, and a visible selected choice.

use one of declares the search space.

selected records the current specialization.

empty means the context slot is absent.

The model does not see the candidate list. It only sees the selected context source.

If grounded is selected, the prompt receives an evidence section built from docs.top5.

If none: empty is selected, the prompt receives no evidence section at all.

The source remains auditable.

The runtime remains simple.

Why empty Belongs in the Same Mechanism

It is tempting to invent a separate syntax for optional context.

For example:

use? docs.top5 max 4k as "evidence"

But that makes optionality a separate language mechanism.

A simpler model is to treat absence as one candidate in the same context choice:

use one of {
  none: empty
  docs: docs.top5 max 4k
} as "evidence"

This is easier to read and easier to optimize.

The search space is explicit:

evidence ∈ { none, docs }

The default is explicit too. If none is first, the unoptimized program omits the context slot by default. If docs is first, the unoptimized program includes it by default.

No extra optional-use syntax is needed.

Absence is just another context choice.
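
Because absence is an ordinary candidate, the whole search space can be enumerated mechanically. The Python sketch below assumes two slots; the second slot ("past lessons") and its candidate names are hypothetical, added only to show how the space grows as slots combine.

```python
from itertools import product

# Candidate names per slot. The first entry is the default when nothing is
# marked selected. The "past lessons" slot is a hypothetical second site.
slots = {
    "evidence": ["none", "docs"],
    "past lessons": ["none", "lessons"],
}

# Default configuration: first candidate of every slot.
default = {label: names[0] for label, names in slots.items()}

# Full search space: one choice per slot, all combinations.
space = [dict(zip(slots, combo)) for combo in product(*slots.values())]

print(default)
print(len(space))
for config in space:
    print(config)
```

Two slots with two candidates each give four configurations, and "omit everything" is simply one of them rather than a special case.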

selected Is a Source-Level Decision

The important part of this design is where the optimization result lives.

It lives in source code.

use one of {
  none: empty
  compact: scratch.digest max 500
  verbose: scratch.summary max 4k
  grounded: docs.top5 max 4k selected
} as "evidence"

The current choice is not hidden in a database.

It is not stored in a runtime profile.

It is not controlled by an ambient optimizer.

It is written next to the alternatives.

This makes the program self-describing.

A human can review it.

A tool can parse it.

A trace can point back to it.

A version control diff can show exactly what changed:

use one of {
none: empty
compact: scratch.digest max 500
- verbose: scratch.summary max 4k selected
- grounded: docs.top5 max 4k
+ verbose: scratch.summary max 4k
+ grounded: docs.top5 max 4k selected
} as "evidence"

This is a very different kind of optimization artifact from a rewritten prompt.

The output is not a new incantation.

The output is a changed context decision.

Optimization as Source-to-Source Specialization

The cleanest optimization model for AgentScript is source-to-source specialization.

An optimizer does not need to mutate the runtime.

It does not need to inject hidden state.

It does not need to rewrite the prompt text.

It can do something much simpler:

read AgentScript source
find use one of sites
evaluate candidates
move the selected marker
write new AgentScript source

The input is AgentScript.

The output is AgentScript.

The runtime still executes ordinary AgentScript.
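
The "move the selected marker" step can be sketched as a plain text rewrite. This Python example is a rough illustration, not an AgentScript tool: `move_selected` is a hypothetical helper, it uses a regular expression rather than a real parser, and it assumes sites are laid out one candidate per line as in the examples in this post.

```python
import re


def move_selected(source: str, label: str, winner: str) -> str:
    """Rewrite one `use one of { ... } as "<label>"` site so that only
    `winner` carries the `selected` marker. All other text is untouched."""
    site = re.compile(
        r'(use one of \{\n)([^}]*?)(\n\} as "' + re.escape(label) + r'")'
    )
    match = site.search(source)
    if match is None:
        raise ValueError(f"no `use one of` site labeled {label!r}")

    lines, found = [], False
    for line in match.group(2).split("\n"):
        line = re.sub(r"\s+selected\s*$", "", line)  # clear any old marker
        name = line.split(":", 1)[0].strip()
        if name == winner:
            line += " selected"
            found = True
        lines.append(line)
    if not found:
        raise ValueError(f"no candidate named {winner!r}")

    return source[: match.start(2)] + "\n".join(lines) + source[match.end(2):]


SOURCE = '''use one of {
  none: empty
  compact: scratch.digest max 500
  verbose: scratch.summary max 4k selected
  grounded: docs.top5 max 4k
} as "evidence"'''

updated = move_selected(SOURCE, "evidence", "grounded")
print(updated)
```

The input and output are both source text, so the result can be committed and reviewed like any other change. A production tool would more plausibly work on the parsed program than on raw text.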

There are two useful output forms.

The development form keeps the search space:

use one of {
  none: empty
  compact: scratch.digest max 500
  verbose: scratch.summary max 4k
  grounded: docs.top5 max 4k selected
} as "evidence"

This form is good for review, auditing, and future optimization.

The production form can be flattened:

use docs.top5 max 4k as "evidence"

If the selected variant is empty, the production form can remove the context slot entirely:

// evidence omitted

Both forms are understandable.

Neither requires a special runtime learning mechanism.

Runtime Should Be Boring

This is the main design constraint.

The runtime should not be where learning hides.

At runtime, a use one of site resolves to exactly one candidate.

The rule is deterministic:

if a candidate is marked selected:
    use that candidate
else:
    use the first candidate

Then the selected candidate behaves like ordinary use.

If the candidate is non-empty:

grounded: docs.top5 max 4k selected

the runtime behaves as if the program had written:

use docs.top5 max 4k as "evidence"

If the candidate is empty:

none: empty selected

the runtime behaves as if there were no use for that context slot.

That is all.
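
The resolution rule is small enough to write out in full. The Python sketch below mirrors it under a few assumptions: `Candidate`, `resolve`, and `as_plain_use` are hypothetical names, `source=None` stands in for `empty`, and rejecting more than one `selected` marker is a guess at sensible behavior rather than documented AgentScript semantics.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    source: Optional[str]          # None models `empty`
    budget: Optional[str] = None   # e.g. "4k"
    selected: bool = False


def resolve(candidates):
    """Return the candidate marked selected, else the first candidate."""
    if not candidates:
        raise ValueError("a `use one of` site needs at least one candidate")
    marked = [c for c in candidates if c.selected]
    if len(marked) > 1:  # assumption: at most one marker per site
        raise ValueError("at most one candidate may be marked selected")
    return marked[0] if marked else candidates[0]


def as_plain_use(label, candidate):
    """Return the ordinary `use` line the site is equivalent to, or None
    when the chosen candidate is `empty` and the slot is omitted."""
    if candidate.source is None:
        return None
    budget = f" max {candidate.budget}" if candidate.budget else ""
    return f'use {candidate.source}{budget} as "{label}"'


site = [
    Candidate("none", None),
    Candidate("compact", "scratch.digest", "500"),
    Candidate("grounded", "docs.top5", "4k", selected=True),
]
chosen = resolve(site)
print(as_plain_use("evidence", chosen))

unmarked = [Candidate("none", None), Candidate("docs", "docs.top5", "4k")]
default = resolve(unmarked)
print(default.name, as_plain_use("evidence", default))
```

Nothing in `resolve` depends on run history or external state, which is what keeps the trace and the source in agreement.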

No prompt rewriting.

No hidden profile overlay.

No adaptive runtime state.

No agent self-modification during execution.

This keeps the trace honest. When a trace says the model saw docs.top5 max 4k as evidence, the source code contains the corresponding selected candidate.

Why This Is More Auditable Than Prompt Optimization

Prompt optimization often produces an artifact that looks like text:

You are a careful and precise assistant. Use the provided context...

The optimizer may have improved it, but the result is still hard to inspect as an engineering artifact.

A reviewer sees changed wording.

They may not know what behavior changed.

They may not know whether the improvement will transfer to another model.

A context optimization diff is different.

- verbose: scratch.summary max 4k selected
+ grounded: docs.top5 max 4k selected

This says something concrete:

The model now sees retrieved documents instead of a summary.

Or:

- lessons: Lessons.relevant(input.question) max 1k selected
+ none: empty selected

This says:

The model no longer receives memory in this step.

These are engineering decisions.

They can be reviewed.

They can be traced.

They can be tested.

They can be reverted.

They can be discussed without guessing which phrase happened to work.

Context Optimization Still Needs Evaluation

None of this removes the need for evaluation.

An optimizer still needs a signal.

That signal may come from fixtures, tests, validation output, user feedback, downstream task success, or human review.

Context optimization can still overfit.

A candidate that works on one fixture set may fail elsewhere. A memory policy that improves one workflow may pollute another. A larger evidence bundle may improve factuality but hurt latency and cost.
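
The trade-off between quality and cost can be made explicit in the selection step. The Python sketch below is a toy evaluation loop, not a description of any AgentScript optimizer: `pick_candidate`, the fixtures, and the `run` and `score` stand-ins are all hypothetical, and no model is called.

```python
from statistics import mean


def pick_candidate(candidates, fixtures, run, score, cost_weight=0.0):
    """Score every candidate on every fixture and return (best_name, report).

    `candidates` maps a candidate name to its context budget. The report
    holds mean quality minus a linear cost penalty on that budget.
    """
    report = {}
    for name, budget in candidates.items():
        quality = mean(score(run(name, fx), fx) for fx in fixtures)
        report[name] = quality - cost_weight * budget
    best = max(report, key=report.get)
    return best, report


# Toy fixtures: each records which candidates would expose the needed fact.
fixtures = [
    {"expected": "v2.1", "visible_in": {"grounded", "verbose"}},
    {"expected": "2024-05-01", "visible_in": {"grounded"}},
]
candidates = {"none": 0, "compact": 500, "verbose": 4000, "grounded": 4000}


def run(name, fx):
    # Stand-in for a real agent run: correct only if the evidence was visible.
    return fx["expected"] if name in fx["visible_in"] else "unknown"


def score(answer, fx):
    return 1.0 if answer == fx["expected"] else 0.0


best, report = pick_candidate(candidates, fixtures, run, score, cost_weight=1e-5)
print(best, report)

frugal, _ = pick_candidate(candidates, fixtures, run, score, cost_weight=1e-3)
print(frugal)
```

With a light cost penalty the evidence-rich candidate wins; with a heavy one the empty candidate wins. Either way, the winner is a named context choice that can be checked against a different fixture set.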

The difference is not that context optimization is automatically correct.

The difference is that its variables are visible.

When it fails, you can inspect the context choices directly.

When it improves, you can often explain why.

That matters for agents because agents are not just single model calls. They are programs. Programs need reviewable state, explicit boundaries, and understandable changes.

What This Means for AgentScript

AgentScript's first bet was:

Agent context should be code.

That is what use provides.

use input.question as "user question"
use docs.summary max 4k as "evidence"

The next bet is:

Agent context search space should be code too.

That is what use one of provides.

use one of {
  none: empty
  compact: scratch.digest max 500
  verbose: scratch.summary max 4k
  grounded: docs.top5 max 4k selected
} as "evidence"

This is the bridge from context engineering to context optimization.

The language does not need to make prompt text adaptive.

It does not need to optimize role descriptions.

It does not need to let agents rewrite their control flow.

It can keep the core model small:

use declares what the model can see
generate declares where the model is called
one of declares context alternatives
selected records the current context choice
empty records deliberate absence

That is enough to make context optimization possible without hiding it in the runtime.

The Real Question

The old prompt engineering question was:

What should I say to the model?

AgentScript's first question was:

What did the model actually see?

Context optimization adds a second question:

What could the model have seen, and why did we choose this version?

That is the difference between prompt optimization and context optimization.

Prompt optimization searches over language.

Context optimization searches over evidence, memory, summaries, tool results, budgets, and absence.

For agents, that is often the more important surface.

Not because prompts do not matter.

Because in real agent programs, the model's next output is usually constrained less by the beauty of the instruction and more by the quality of the information placed in front of it.

Reliable agents need more than better prompts.

They need context choices that are explicit, scoped, auditable, and eventually optimizable.

That is the direction AgentScript is exploring:

Agent context as code.
Context search space as code.
Learning results as code.

What Did the Model Actually See?

11 min read

Introducing AgentScript, a small language for explicit, scoped, auditable LLM context.

Many programmers of my generation first learned computing through a simple model: input, processing, output.

A program receives data, transforms it, and produces a result. It is an old mental model, but still a useful one. It made programs feel understandable because the boundaries were visible.

LLM agents stretch that model.

The input is no longer just a file, a request, or a record with a known schema. It is prompt context: user intent, tool observations, retrieved documents, memory records, intermediate state, retry messages, and outputs from other agents.

The output is no longer just a return value. It is generated text or JSON that may need to satisfy a contract before the next step can trust it.

Most agent programs do not fail because calling a model is hard.

They fail because nobody can tell, with confidence, what the model actually saw before it generated the next value.

After a few iterations, an agent has local variables, tool results, memory records, intermediate observations, retry messages, and outputs from other agents. Some of that data should reach the next model call. Some should not. In most Python or TypeScript agents, that boundary is maintained by convention.

That works for small demos. It becomes fragile in real workflows.

Figure: Traditional append-only chat versus AgentScript scoped context boundaries.

What did the model actually see? Which tool result was included in the prompt, and which one was only local data? Was memory clipped? Did another agent's output enter as evidence or as prior assistant text? What exactly must the model return before that value flows into the next step?

AgentScript is an experiment in making those questions answerable from the program itself.

What AgentScript Is

AgentScript is a small language for building LLM agents where prompt context is explicit, scoped, typed, traceable, and auditable.

It is aimed at developers building multi-step agents where tool output, memory, intermediate state, and generated values must be controlled and audited.

It is not a prompt template format. It is not YAML configuration. It is not a general-purpose agent framework.

Its core idea is simple:

Agent context should be code.

The two most important language features are use and generate.

use content max 8k as "file content"

generate({
  input: "Summarize the file for a busy teammate",
  max_output: 1000
}) -> {
  title
  summary
  key_points: list[string]
  action_items: list[string]
}

use declares what the model is allowed to see. generate declares where the model is called and, when needed, what contract its output must satisfy. In the old input/process/output framing, AgentScript puts language-level attention on the two unstable edges of LLM programs: prompt input and generated output.

The important part is not that AgentScript can call an LLM. The important part is that the prompt boundary is visible in the code.

Everything else in the language exists to support that workflow: variables, functions, agents, imports, loops, tools, memory, and trace output.

Why Not Just Use Python or TypeScript?

Python and TypeScript are excellent general-purpose languages, and AgentScript is not trying to replace them.

The problem is that they do not have a native concept of prompt context. Context usually appears as strings, arrays, objects, templates, framework calls, or message lists. The program can be correct, but the intent is scattered across ordinary code:

const messages = [
  system("You are a reviewer"),
  user(`Question: ${input.question}`),
  user(`Search results: ${JSON.stringify(results)}`),
  user(`Memory: ${memory.map((item) => item.text).join("\n")}`),
];

const answer = await model.generate(messages);

Which fields from results are included? Was raw tool output included? Was memory clipped? Did another agent's output enter as evidence or as prior assistant text? What schema must answer satisfy?

AgentScript makes context selection a first-class operation:

use input.question as "user question"
use results.summary max 4k as "search results"
use past max 2k as "past lessons"

generate({
  input: "Answer using only the selected context",
  max_output: 800,
  strict: true
}) -> {
  answer
  citations: list[string]
}

Labels can be simple identifiers or quoted strings.

Local variables do not enter prompts automatically. Tool results do not enter prompts automatically. Memory query results do not enter prompts automatically. Trace events do not enter prompts automatically.

If data should be visible to the model, it must be selected with use.

That one rule changes the contract of agent development. The prompt is no longer a side effect of arbitrary string assembly. It is a scoped contract.

A Minimal Example

Here is a complete file summarizer:

import llm Qwen from "ollama://localhost:11434/qwen3.6"
import tool File from "file://workspace"

main agent FileSummarizer {
  model Qwen
  role "Technical Writer"
  description "Read one local file and produce a useful structured summary."

  main func(input { path: string }) {
    file = File.read({
      path: input.path
    })

    use input.path as "source path"
    use file.content max 8k as "file content"

    generate({
      input: "Summarize the file for a busy teammate",
      max_output: 1000
    }) -> {
      title
      summary
      key_points: list[string]
      action_items: list[string]
    }
  }
}

The file tool can read from the workspace, but the tool result does not implicitly become prompt context. The program explicitly selects the path and file content, labels them, gives the content a budget, and then asks the model for a structured result.

Run it with a real model:

agentscript recipes/summarize-file.as --input '{"path":"README.md"}'

Or install the CLI and try it immediately with deterministic mock output and a trace:

npm install -g @rong/agentscript
agentscript recipes/summarize-file.as --input '{"path":"README.md"}' --mock --trace

Run it with a deterministic mock model:

agentscript recipes/summarize-file.as --input '{"path":"README.md"}' --mock

Inspect the prompt and trace without calling a model:

agentscript recipes/summarize-file.as --input '{"path":"README.md"}' --dry-run

Print an auditable trace:

agentscript recipes/summarize-file.as --input '{"path":"README.md"}' --trace

The trace can show which context sources were selected, which budgets were applied, what was clipped, which instruction was used, what output contract was requested, and whether validation passed. That trace is for debugging and audit. It is not itself prompt context.

For example, the useful part of a trace is not just that a model was called. It is the boundary around that call:

Generate #1
  Agent: FileSummarizer / Technical Writer
  Selected context:
    [source path] input.path
    [file content] content, budget=8k, clipped=false
  Instruction:
    Summarize the file for a busy teammate
  Output contract:
    title: string
    summary: string
    key_points: list[string]
    action_items: list[string]
  Validation: ok

generate Is the Only LLM Call Site

In AgentScript, ordinary code can compute values, call tools, query memory, call other agents, and organize intermediate state. Only generate asks a model to produce new output.

answer = generate({
  input: "Answer using only the selected context.",
  max_output: 800,
  attempts: 3,
  strict: true
}) -> {
  ok: boolean
  answer
  citations: list[string]
}

The contract after -> is an output contract. AgentScript can ask providers for structured output when possible, validate the returned value, and retry when the model returns invalid JSON or a mismatched contract. Downstream code can then depend on the returned contract instead of parsing prose.
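
The validate-and-retry loop can be sketched in a few lines of Python. This is an illustration under assumptions, not AgentScript's validator: `matches`, `validate`, and `generate_with_retries` are hypothetical names, the contract is a flat field-to-type mapping, untyped fields such as `answer` are treated as strings, and passing earlier errors back to the model is a guess at how a retry might be informed.

```python
import json

# Mirrors the contract above; [str] stands for list[string].
CONTRACT = {"ok": bool, "answer": str, "citations": [str]}


def matches(value, kind):
    """Check a value against a type, where [T] means a list of T."""
    if isinstance(kind, list):
        return isinstance(value, list) and all(matches(v, kind[0]) for v in value)
    return isinstance(value, kind)


def validate(raw, contract):
    """Parse JSON text and check it against the contract.

    Returns (value, None) on success or (None, error_message) on failure.
    """
    try:
        value = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(value, dict):
        return None, "expected a JSON object"
    for field, kind in contract.items():
        if field not in value:
            return None, f"missing field: {field}"
        if not matches(value[field], kind):
            return None, f"wrong type for field: {field}"
    return value, None


def generate_with_retries(call_model, contract, attempts=3):
    """Call the model until its output satisfies the contract or attempts run out."""
    errors = []
    for _ in range(attempts):
        value, error = validate(call_model(errors), contract)
        if error is None:
            return value
        errors.append(error)
    raise ValueError(f"contract not satisfied after {attempts} attempts: {errors}")


# Scripted stand-in for a model: two bad replies, then a valid one.
replies = iter([
    "not json",
    '{"ok": true, "answer": "42"}',
    '{"ok": true, "answer": "42", "citations": ["doc-1"]}',
])
result = generate_with_retries(lambda errors: next(replies), CONTRACT)
print(result)
```

Downstream code only ever receives a value that passed `validate`, which is the property that lets it depend on fields instead of parsing prose.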

This gives each model call a visible boundary:

  • the current agent identity
  • the selected context from visible use declarations
  • the local instruction in generate({ input: ... })
  • the optional output contract after ->

That boundary is the unit you can review, debug, and trace.

Scope Is the Context Boundary

AgentScript uses scope to control prompt visibility.

This is also a way to avoid the pressure of long conversations. In a traditional chat loop, each step tends to append more messages to the same history. The context grows heavier over time, and the next model call inherits whatever the conversation happened to accumulate.

AgentScript treats each generation differently. Before a generate, the program selects the visible context deliberately with use: the specific values, labels, and budgets that matter for this step. It is closer to precise sampling than to endless appending.

A use declaration is visible to later generate calls in the same scope and child scopes. It does not leak upward. Function calls and agent calls create independent context boundaries.

func caller(input) {
  use input.goal as goal
  helper(input)
}

func helper(input) {
  use input.detail as detail

  generate({ input: "Work on the detail" }) -> {
    ok: boolean
  }
}

The generate inside helper sees input.detail. It does not automatically inherit caller's selected goal context.

Agent calls are isolated in the same way. A called agent sees the input value passed to it and the context selected inside its own functions. It does not inherit the caller's prompt context.

That makes multi-agent composition easier to audit. Each agent has its own prompt contract instead of sharing an ambient conversation buffer.
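
The visibility rules can be modeled with a small scope chain. The Python sketch below is a simplified mental model, not the AgentScript runtime: `Scope` and `call_boundary` are hypothetical names, and nested blocks are assumed to be the only way a scope gains a parent.

```python
class Scope:
    """One lexical scope holding the `use` selections declared in it."""

    def __init__(self, parent=None):
        self.parent = parent
        self.uses = []

    def use(self, label, value):
        self.uses.append((label, value))

    def child(self):
        """Nested block: inherits the enclosing scope's selections."""
        return Scope(parent=self)

    def visible(self):
        """Selections a generate call in this scope would see."""
        inherited = self.parent.visible() if self.parent else []
        return inherited + self.uses


def call_boundary():
    """Function or agent call: starts with no inherited prompt context."""
    return Scope(parent=None)


caller = call_boundary()
caller.use("goal", "ship v2")

loop_body = caller.child()          # child scope inside caller
loop_body.use("step", "write tests")

helper = call_boundary()            # separate function call
helper.use("detail", "fix flaky test")

print(loop_body.visible())  # goal and step
print(caller.visible())     # goal only: nothing leaks upward
print(helper.visible())     # detail only: caller context is not inherited
```

Selections flow down into child scopes, never up, and a call starts from an empty boundary. That is why the helper's generate sees only its own detail.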

Tool Results Are Data, Not Prompt

One of AgentScript's most important rules is that tool results are local program data. They are not prompt context until selected.

This matters in repository review, research, code analysis, and any workflow where tools can return much more data than the model should see.

A repository review can collect a file tree, TODO matches, package metadata, and CI configuration. The review step can then choose only the relevant pieces, shown here in shorthand rather than full AgentScript syntax:

use "file tree" budget=8k
use "todo findings" budget=4k
use "package metadata" budget=4k
use "ci configuration" budget=4k
generate blockers, risks, quick_wins, next_steps

The distinction is deliberate.

Tools expand what the program can do. use controls what the model can see.

Memory Is Explicit Too

AgentScript includes file-based JSONL and SQLite memory backends, but memory follows the same rule as everything else.

The memory handle is a capability, not prompt data:

import memory Lessons from "file://./.agentscript/lessons.jsonl"

The agent must query memory, receive ordinary data, and then explicitly select that data if it should influence the next generation:

past = Lessons.query({
  text: input.goal,
  kind: "lesson",
  limit: 5
})

use input.goal as goal
use past max 2k as "past lessons"

Writing to memory is also explicit:

Lessons.add({
  kind: "lesson",
  text: reflection.insight,
  goal: input.goal
})

This supports reflection and self-improvement without automatic context growth. A future run can use durable lessons, but only through a visible query and a visible use.
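
To show how little machinery explicit memory needs, here is a Python sketch of an append-only JSONL store. It is not AgentScript's backend: `JsonlMemory` is a hypothetical class, and the relevance measure (counting shared words) is a deliberately naive assumption standing in for whatever the real query does.

```python
import json
import os
import tempfile


class JsonlMemory:
    """Append-only memory: one JSON record per line."""

    def __init__(self, path):
        self.path = path

    def add(self, record):
        with open(self.path, "a", encoding="utf-8") as handle:
            handle.write(json.dumps(record) + "\n")

    def query(self, text, kind, limit):
        """Return up to `limit` records of `kind`, ranked by shared words."""
        if not os.path.exists(self.path):
            return []
        wanted = set(text.lower().split())
        scored = []
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                record = json.loads(line)
                if record.get("kind") != kind:
                    continue
                overlap = len(wanted & set(record["text"].lower().split()))
                if overlap:
                    scored.append((overlap, record))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [record for _, record in scored[:limit]]


path = os.path.join(tempfile.mkdtemp(), "lessons.jsonl")
lessons = JsonlMemory(path)
lessons.add({"kind": "lesson", "text": "pin dependency versions before release", "goal": "release"})
lessons.add({"kind": "note", "text": "release went fine", "goal": "release"})

# The query returns ordinary data. Showing it to a model is a separate step.
past = lessons.query(text="prepare the release", kind="lesson", limit=5)
print(past)
```

The query result is just a list. Nothing about reading memory changes what the next model call sees until the program selects it.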

Agent Patterns as Composable Primitives

AgentScript does not hardcode agent patterns as keywords.

There is no special planner keyword. No special executor keyword. No special reflect keyword. Those names are just agents, functions, or ordinary data in your program.

That is intentional. ReAct, plan-and-execute, evaluator-optimizer, reflection, self-improvement, and multi-agent workflows can all be built from the same small set of primitives:

  • agents and functions for boundaries
  • tools for external capabilities
  • memory for durable explicit state
  • use for prompt context selection
  • generate for model calls and output contracts
  • trace for auditability

For independent bounded work, AgentScript also provides parallel for:

results = parallel for step in plan.steps max 10 {
  Executor({
    goal: input.goal,
    step: step
  })
}

parallel for targets multi-agent and multi-generate bottlenecks without exposing async/await.

The result is still local data. It enters a later prompt only if selected:

use results.summary max 6k as execution_results
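
For comparison, bounded parallel work in Python might look like the sketch below. It only illustrates the general pattern of bounded concurrency with ordered results. The exact meaning of `max 10` in AgentScript is not spelled out here, so the `concurrency` parameter, `parallel_for`, and the `executor` stand-in are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_for(items, body, concurrency=10):
    """Run `body` on each item with at most `concurrency` workers.

    Results are returned in input order, regardless of completion order.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(body, items))


def executor(step):
    # Stand-in for calling an Executor agent on one plan step.
    return {"step": step, "summary": f"done: {step}"}


steps = ["collect files", "run linters", "draft report"]
results = parallel_for(steps, executor, concurrency=10)

# Results are ordinary local data. Turning them into prompt text is a
# separate, explicit step.
selected = "\n".join(r["summary"] for r in results)
print(selected)
```

The caller sees a plain list in input order, with no futures or awaitables to manage.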

Current Status

AgentScript is experimental, but the core language design is now in place.

Currently implemented:

  • parser
  • semantic checker
  • mock runtime
  • OpenAI, Anthropic, and Ollama LLM adapters
  • file, environment, HTTP, and shell-style host tools
  • JSONL and SQLite memory backends
  • structured output validation
  • trace output
  • arithmetic and comparison operators
  • compound assignment
  • parallel for
  • runtime concurrency control for parallel for
  • CLI support for --mock, --dry-run, --trace, --trace-file, --check, and --concurrency

The implementation is usable for experimentation, examples, and local workflows, but the language is still pre-1.0 and may change.

Planned work includes a stable IR, richer diagnostics, and VS Code syntax support.

The project is still early. The goal right now is not to claim that AgentScript is a mature production framework. The goal is to test a sharper language idea:

What if the most important part of an agent program is not the framework around the model call, but the context contract before it?

Try It

Install the CLI:

npm install -g @rong/agentscript

Run a recipe:

agentscript recipes/summarize-file.as --input '{"path":"README.md"}'

Run without installing:

npx @rong/agentscript recipes/code-review.as --input '{"path":"src"}'

Use mock mode for deterministic local checks:

agentscript recipes/summarize-file.as --input '{"path":"README.md"}' --mock

Use trace mode when you want to inspect what happened:

agentscript recipes/summarize-file.as --input '{"path":"README.md"}' --trace


Closing Thought

LLM agents are often described in terms of tools, memory, planning, and autonomy. Those things matter, but they all depend on a more basic question:

What exactly did the model see before it generated the next value?

AgentScript is built around that question. It treats prompt context as something you declare, scope, budget, label, validate, and trace.

That is the language's bet: reliable agents need context engineering to be a programming model, not a pile of conventions.