Four important lessons about context engineering

Thursday, November 27, 2025, 10:00, by InfoWorld
Context engineering has emerged as one of the most critical skills in working with large language models (LLMs). While much attention has been paid to prompt engineering, the art and science of managing context—i.e., the information the model has access to when generating responses—often determines the difference between mediocre and exceptional AI applications.

After years of building with LLMs, we’ve learned that context isn’t just about stuffing as much information as possible into a prompt. It’s about strategic information architecture that maximizes model performance within technical constraints.

The technical reality of context windows

Modern LLMs operate with context windows ranging from 8K to 200K+ tokens, with some models claiming even larger windows. However, several technical realities shape how we should think about context:

Lost in the middle effect: Research has consistently shown that LLMs experience attention degradation in the middle portions of long contexts. Models perform best with information placed at the beginning or end of the context window. This isn’t a bug. It’s an artifact of how transformer architectures process sequences.

Effective vs. theoretical capacity: A model with a 128K context window doesn’t process all 128K tokens with equal fidelity. Beyond certain thresholds (often around 32K to 64K tokens), accuracy degrades measurably. Think of it like human working memory. We can technically hold many things in mind, but we work best with a focused subset.

Computational costs: Context length impacts latency and cost quadratically in many architectures. A 100K token context doesn’t cost 10x what a 10K context costs; it can cost 100x in compute terms, even if providers don’t pass all of those costs on to users.

What we learned about context engineering  

Our experience building an AI CRM taught us four important lessons about context engineering:

Recency and relevance trump volume

Structure matters as much as content

Context hierarchy creates better retrieval

Stateless is a feature, not a bug

We’ll unpack each of those below, then share some practical tips, useful patterns, and common antipatterns to avoid.

Lesson 1: Recency and relevance trump volume

The most important insight: more context isn’t better context. In production systems, we’ve seen dramatic improvements by reducing context size and increasing relevance.

Example: When extracting deal details from Gmail, sending every email with a contact performs worse than sending only emails semantically related to the active opportunity. We’ve seen models hallucinate close dates by pulling information from a different, unrelated deal mentioned six months ago because they couldn’t distinguish signal from noise.
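
As a rough sketch of that filtering (not our production code), the Python below keeps only the emails semantically related to the active opportunity. The embed() helper is a placeholder for whatever embedding API you use, the email fields and the 0.75 threshold are illustrative, and the dummy vector exists only to keep the example runnable.

import math

def embed(text: str) -> list[float]:
    # Placeholder for whatever embedding model you call in production.
    # Returning a deterministic dummy vector keeps the sketch runnable.
    return [float(ord(c) % 7) + 1.0 for c in text[:16].ljust(16)]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def relevant_emails(emails: list[dict], opportunity: str, threshold: float = 0.75) -> list[dict]:
    # Keep only emails semantically related to the active opportunity,
    # most relevant first, instead of sending every email with the contact.
    target = embed(opportunity)
    scored = [(cosine(embed(e["body"]), target), e) for e in emails]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [e for score, e in scored if score >= threshold]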

Lesson 2: Structure matters as much as content

LLMs respond better to structured context than unstructured dumps. XML tags, markdown headers, and clear delimiters help models parse and attend to the right information.

Poor context structure:

Here’s some info about the user: John Smith, age 35, from New York, likes pizza, works at Acme Corp, signed up in 2020, last login yesterday…

Better context structure:

<user_profile>
  <name>John Smith</name>
  <age>35</age>
  <location>New York</location>
  <account>
    <signup_date>2020-03-15</signup_date>
    <last_login>2024-10-16</last_login>
  </account>
  <preferences>
    <favorite_food>Pizza</favorite_food>
  </preferences>
</user_profile>

The structured version helps the model quickly locate relevant information without parsing natural language descriptions.

Lesson 3: Context hierarchy creates better retrieval

Organize context by importance and relevance, not chronologically or alphabetically. Place critical information early and late in the context window.

Optimal ordering:

System instructions (beginning)

Current user query (beginning)

Most relevant retrieved information (beginning)

Supporting context (middle)

Examples and edge cases (middle-end)

Final instructions or constraints (end)
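
A minimal sketch of that ordering in Python, assuming each piece of context is already available as a string; the parameter names and separators are illustrative.

def build_context(system: str, query: str, top_chunks: list[str],
                  supporting: list[str], examples: list[str],
                  final_instructions: str) -> str:
    # Assemble context so the critical pieces sit at the start and end of the window.
    parts = [
        system,                      # beginning: system instructions
        f"User query: {query}",      # beginning: current user query
        "\n".join(top_chunks),       # beginning: most relevant retrieved information
        "\n".join(supporting),       # middle: supporting context
        "\n".join(examples),         # middle-end: examples and edge cases
        final_instructions,          # end: final instructions or constraints
    ]
    return "\n\n".join(p for p in parts if p)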

Lesson 4: Stateless is a feature, not a bug

Each LLM call is stateless. This isn’t a limitation to overcome, but an architectural choice to embrace. Rather than trying to maintain massive conversation histories, implement smart context management:

Store full conversation state in your application

Send only relevant history per request

Use semantic chunking to identify what matters

Implement conversation summarization for long interactions

Practical tips for production systems

Tip 1: Implement semantic chunking

Don’t send entire documents. Chunk content semantically (by topic, section, or concept) and use embeddings to retrieve only relevant chunks.

Implementation pattern:

Query → Generate embedding →
Similarity search → Retrieve top-k chunks →
Rerank if needed → Construct context → LLM call
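
In Python, that pattern might look like the sketch below. Here embed, vector_store, rerank, and llm are injected placeholders for whatever embedding model, vector database, reranker, and LLM client you use; the k and keep values, and the chunks’ text attribute, are assumptions for illustration.

def answer(query: str, embed, vector_store, rerank, llm, k: int = 8, keep: int = 4) -> str:
    # Query -> embedding -> similarity search -> top-k -> rerank -> context -> LLM call.
    query_vec = embed(query)                           # embedding model of your choice
    candidates = vector_store.search(query_vec, k=k)   # k most similar chunks
    chunks = rerank(query, candidates)[:keep]          # optional reranking step
    context = "\n\n".join(c.text for c in chunks)      # only the chunks that survived
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return llm(prompt)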

Typical improvement: 60% to 80% reduction in context size with 20% to 30% improvement in response quality.

Tip 2: Use progressive context loading

For complex queries, start with minimal context and progressively add more if needed:

First attempt: core instructions plus query

If uncertain: add relevant documentation

If still uncertain: add examples and edge cases

This reduces average latency and cost while maintaining quality for complex queries.
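
One way to express that escalation is sketched below; llm is a placeholder callable, and the literal UNSURE marker is our own convention for detecting uncertainty, not a standard model behavior.

CORE_INSTRUCTIONS = (
    "Answer the question concisely. "
    "If you do not have enough information, reply with the single word UNSURE."
)

def progressive_answer(query: str, llm, docs: str, examples: str) -> str:
    # Escalate context only when the model signals uncertainty.
    attempts = [
        f"{CORE_INSTRUCTIONS}\n\n{query}",                           # core instructions + query
        f"{CORE_INSTRUCTIONS}\n\n{docs}\n\n{query}",                 # + relevant documentation
        f"{CORE_INSTRUCTIONS}\n\n{docs}\n\n{examples}\n\n{query}",   # + examples and edge cases
    ]
    reply = ""
    for prompt in attempts:
        reply = llm(prompt)
        if "UNSURE" not in reply:
            return reply
    return reply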

Tip 3: Context compression techniques

Three techniques can compress context without losing information:

Entity extraction: Instead of full documents, extract and send key entities, relationships, and facts.

Summarization: For historical conversations, summarize older messages rather than sending verbatim text. Use LLMs themselves to create these summaries.

Schema enforcement: Use structured formats (JSON, XML) to minimize token usage compared to natural language.
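
As an example of the summarization technique, here is a minimal sketch assuming chat-style message dicts with role and content keys, an llm callable wrapping your model of choice, and an illustrative keep_verbatim cutoff.

def compress_history(messages: list[dict], llm, keep_verbatim: int = 5) -> list[dict]:
    # Summarize older messages with the LLM itself; keep the most recent turns verbatim.
    if len(messages) <= keep_verbatim:
        return messages
    older, recent = messages[:-keep_verbatim], messages[-keep_verbatim:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    summary = llm("Summarize the key facts, decisions, and open questions in this conversation:\n"
                  + transcript)
    return [{"role": "system", "content": f"Summary of earlier conversation: {summary}"}] + recent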

Tip 4: Implement context windows

For conversation systems, maintain sliding windows of different sizes:

Immediate window (last three to five turns): Full verbatim messages

Recent window (last 10 to 20 turns): Summarized key points

Historical window (older): High-level summary of topics discussed
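
A sketch of those tiers, assuming turns is an ordered list of message strings and summarize is a placeholder for an LLM-backed summarizer; the detail argument is our own convention.

def windowed_history(turns: list[str], summarize) -> list[str]:
    # Three-tier sliding window over conversation turns.
    immediate = turns[-5:]       # immediate window: last 3-5 turns, verbatim
    recent = turns[-20:-5]       # recent window: last 10-20 turns, key points only
    historical = turns[:-20]     # historical window: everything older
    context: list[str] = []
    if historical:
        context.append("Topics discussed so far: " + summarize(historical, detail="topics"))
    if recent:
        context.append("Recent key points: " + summarize(recent, detail="key_points"))
    context.extend(immediate)
    return context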

Tip 5: Cache smartly

Many LLM providers now offer prompt caching. Structure your context so stable portions (system instructions, reference documents) appear first and can be cached, while dynamic portions (user queries, retrieved context) come after the cache boundary.

Typical savings: 50% to 90% reduction in input token costs for repeated contexts.

Tip 6: Measure context utilization

Instrument your system to track:

Average context size per request

Cache hit rates

Retrieval relevance scores

Response quality vs. context size

This data reveals optimization opportunities. We’ve found that many production systems use 2x to 3x more context than optimal.
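
A minimal instrumentation sketch; the field names and the print sink are illustrative stand-ins for whatever logging or metrics pipeline you already run.

from dataclasses import dataclass, field

@dataclass
class ContextMetrics:
    request_id: str
    context_tokens: int                                    # context size for this request
    cache_hit: bool                                        # whether the cached prefix was reused
    retrieval_scores: list = field(default_factory=list)   # relevance scores of retrieved chunks
    quality_score: float = 0.0                             # response quality from evals or feedback

def log_metrics(metrics: ContextMetrics, sink=print) -> None:
    # Emit one record per request; aggregate offline to spot oversized contexts.
    sink(metrics)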

Tip 7: Handle context overflow gracefully

When context exceeds limits:

Prioritize user query and critical instructions

Truncate middle sections first

Implement automatic summarization

Return clear errors rather than silently truncating
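
A sketch of that priority order; the word-count token estimate and the hard truncation of the middle are simplifications (a real system would use the provider’s tokenizer and summarize rather than chop).

class ContextOverflowError(Exception):
    pass

def fit_context(critical: str, middle: str, query: str, max_tokens: int) -> str:
    # Truncate middle sections first; fail loudly if even the essentials do not fit.
    def tokens(text: str) -> int:
        return len(text.split())   # rough word count standing in for a real tokenizer

    essentials = tokens(critical) + tokens(query)
    if essentials > max_tokens:
        raise ContextOverflowError("Critical instructions and query alone exceed the context window")
    budget = max_tokens - essentials
    trimmed_middle = " ".join(middle.split()[:budget])
    return f"{critical}\n\n{trimmed_middle}\n\n{query}"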

Advanced patterns

Multi-turn context management

For agentic systems that make multiple LLM calls:

Pattern: Maintain a context accumulator that grows with each turn, but implement smart summarization after N turns to prevent unbounded growth.

Turn 1: Full context
Turn 2: Full context + Turn 1 result
Turn 3: Full context + Summarized(Turns 1-2) + Turn 3
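
A sketch of such an accumulator; summarize is a placeholder for an LLM-backed summarizer, and max_raw_turns is the illustrative N after which older results collapse into a summary.

class ContextAccumulator:
    # Grows with each agent turn, then collapses older results into a summary
    # to prevent unbounded growth.
    def __init__(self, base_context: str, summarize, max_raw_turns: int = 2):
        self.base = base_context
        self.summarize = summarize           # placeholder LLM-backed summarizer
        self.max_raw_turns = max_raw_turns   # N turns before we summarize
        self.turn_results: list[str] = []

    def add_result(self, result: str) -> None:
        self.turn_results.append(result)
        if len(self.turn_results) > self.max_raw_turns:
            self.turn_results = [self.summarize(self.turn_results)]

    def context_for_next_turn(self) -> str:
        return "\n\n".join([self.base, *self.turn_results])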

Hierarchical context retrieval

For retrieval-augmented generation (RAG) systems, implement multi-level retrieval:

Retrieve relevant documents

Within documents, retrieve relevant sections

Within sections, retrieve relevant paragraphs

Each level narrows focus and improves relevance.
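
A sketch of that narrowing; retrieve stands in for a similarity search over whatever granularity it is handed, and the documents are assumed to expose sections and paragraphs attributes (our assumption, not a standard interface).

def hierarchical_retrieve(query: str, corpus, retrieve,
                          k_docs: int = 5, k_sections: int = 3, k_paragraphs: int = 5) -> list:
    # Narrow from documents to sections to paragraphs; each level improves relevance.
    docs = retrieve(query, corpus, k=k_docs)                                              # level 1
    sections = [s for d in docs for s in retrieve(query, d.sections, k=k_sections)]       # level 2
    paragraphs = [p for s in sections for p in retrieve(query, s.paragraphs, k=k_paragraphs)]  # level 3
    return paragraphs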

Context-aware prompt templates

Create templates that adapt based on available context:

if context_size < 4000:
    template = detailed_template   # Room for examples
elif context_size < 8000:
    template = standard_template   # Concise instructions
else:
    template = minimal_template    # Just essentials

Common antipatterns to avoid

Antipattern 1: Sending entire conversation histories verbatim. This wastes tokens on greetings, acknowledgments, and off-topic banter.

Antipattern 2: Dumping database records without filtering. Send only fields relevant to the query.

Antipattern 3: Repeating instructions in every message. Use system prompts or cached prefixes instead.

Antipattern 4: Ignoring the lost-in-the-middle effect. Don’t bury critical information in long contexts.

Antipattern 5: Over-relying on maximum context windows. Just because you can use 128K tokens doesn’t mean you should.

Looking forward

Context engineering will remain critical as models evolve. Emerging patterns include:

Infinite context models: Techniques for handling arbitrarily long contexts through retrieval augmentation

Context compression models: Specialized models that compress context for primary LLMs

Learned context selection: Machine learning models that predict optimal context for queries

Multi-modal context: Integrating images, audio, and structured data seamlessly

Effective context engineering requires understanding both the technical constraints of LLMs and the information architecture of your application. The goal isn’t to maximize context. It’s to provide the right information, in the right format, at the right position.

Start by measuring your current context utilization, implement semantic retrieval, structure your context clearly, and iterate based on quality metrics. The systems that win aren’t those that send the most context. They’re those that send the most relevant context.

The future of LLM applications is less about bigger context windows and more about smarter context engineering.



New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.
https://www.infoworld.com/article/4085355/four-important-lessons-about-context-engineering.html
