Deep Research Agent in Practice (Part 1): Guide to Architecture Design and Evaluation System Building
Hey everyone! Have you been bombarded with all kinds of powerful Agent applications lately? Have you ever wondered how a 'Deep Research' Agent is actually built?
Today, let's deconstruct the high-quality course "Deep Research from Scratch" officially released by the LangChain team (built on LangGraph). Forget about code for now — let's understand the architecture design and evaluation system of a top-tier Research Agent!
This course is extremely high quality. Relying on just one external tool — Tavily (a search tool) — it implements a powerful deep research Agent. Even better, what it teaches us is not just code, but the prompt design principles and Agent evaluation system behind it. The content is incredibly valuable.
In this article, I'll distill the essence of the course and guide you step by step in building a research assistant named "Fairy." I hope Fairy can become your flexible, customizable partner — like a more open, more native Claude Code.
Ready? Let's get started!
Overall Blueprint for a Deep Research Agent
A reliable deep research Agent's workflow can be divided into two core phases:
- Phase 1: Scoping. The Agent acts like a patient consultant, engaging in multiple rounds of dialogue with users until their intent is completely clear, ultimately producing a clear "research outline."
- Phase 2: Execution. Once it has the outline, the Agent begins formal work. It deeply searches the internet, self-validates the relevance of information, and finally compiles a structured report.
You might ask: Why do we need this layered approach?
Because in real "deep research" scenarios, the hardest part is often not "searching" but "defining the problem." Users' initial requests are usually vague (like "help me research AI"), and diving straight into search only leads to information overload or going off track. This layered design acts like an intelligent filter, ensuring we stay on the right path every step of the way.
The overall Agent workflow design is as follows:
- User onboarding: User submits an initial research intent.
- Research Scoping:
- Agent analyzes user requirements.
- If requirements are vague, Agent asks clarifying questions.
- After user responds, Agent updates understanding of requirements.
- Loop until a clear, boundary-defined Research Brief is generated.
- Research Execution:
- Agent breaks down specific search queries based on the research brief.
- Calls search tools (like Tavily) to gather information.
- Information compression and summarization: Summarize search results, extract key information, avoid context overflow.
- Decision and routing: determine whether the current information is sufficient; if not, continue searching; if so, proceed to report writing.
- Report generation: Write the final deep research report based on all collected structured information.
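The workflow above can be sketched as a small two-phase pipeline. The code below is a framework-free sketch under my own assumptions — function names like `scope` and `execute`, and the brief's fields, are illustrative, not the course's actual code:

```python
# Framework-free sketch of the two-phase pipeline: scoping produces a
# bounded brief, execution runs a search-and-compress loop over it.
# All names and fields here are illustrative assumptions.

def scope(request: str) -> dict:
    """Phase 1: turn a vague request into a bounded research brief."""
    # A real agent would loop with the user; here we just attach defaults.
    return {
        "topic": request,
        "audience": "general technical readers",
        "key_questions": [f"What is the state of the art in {request}?"],
    }

def execute(brief: dict, search, max_rounds: int = 3) -> str:
    """Phase 2: search -> compress -> report, bounded by max_rounds."""
    notes = []
    for question in brief["key_questions"][:max_rounds]:
        raw = search(question)      # in practice, a Tavily call
        notes.append(raw[:80])      # crude stand-in for LLM compression
    return f"# Report on {brief['topic']}\n" + "\n".join(f"- {n}" for n in notes)

# Usage with a stubbed search tool:
fake_search = lambda q: f"search results for: {q}"
report = execute(scope("AI agents"), fake_search)
print(report.splitlines()[0])
```

In a real build each function would be a graph node with an LLM inside; the point here is only the shape of the two phases and the data handed between them.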
Defining Research Boundaries (Scoping)
In the deep research workflow, one of the biggest challenges is that users rarely provide sufficient information in their initial requests. Requests often lack important details, so executing them directly usually means twice the effort for half the result.
In this module, our core goal is to clarify the following points:
- Scope and boundaries: What should the research include? What should be explicitly excluded? (e.g., only study data after 2023, or only focus on specific regions).
- Audience and purpose: Who is this research for? Is the purpose for investment decisions, academic research, or popular science?
- Specific requirements: Are there specific source preferences, time frames, or format limitations?
- Terminology clarification: What do specific domain terms or abbreviations mentioned by the user mean?
At this stage, the Agent's strategy is: don't make blind assumptions.
We'll build a graph containing nodes such as ClarifyWithUser (clarify with user) and Summary (generate summary).
- Input: User's original research requirements.
- Processing: LLM determines if information is complete. If incomplete, generate clarifying questions; if complete, generate research brief.
- Output: A structured Research Brief containing research topic, target audience, list of key questions, etc.
Only after this research brief is established will the Agent enter the research execution phase, which consumes more tokens.
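The scoping decision — "ask a clarifying question" versus "emit the brief" — is naturally modeled as a structured output. Here is a minimal sketch; in the course an LLM fills this structure via structured output, while the rule below is a deliberately dumb heuristic stand-in, and all names are my own assumptions:

```python
from dataclasses import dataclass

# Sketch of the scoping step's structured output: either a clarifying
# question or a finished research brief, never both. The "is the request
# clear?" rule below is a toy stand-in for an LLM judgment.

@dataclass
class ScopingDecision:
    need_clarification: bool
    question: str = ""          # filled only when clarification is needed
    research_brief: str = ""    # filled only when the request is clear

def scope_step(request: str) -> ScopingDecision:
    # Illustrative rule: treat a request as "clear" once it names a year.
    if "202" not in request:
        return ScopingDecision(True, "Which time range should the research cover?")
    return ScopingDecision(False, research_brief=f"Brief: {request}")

print(scope_step("research AI").need_clarification)   # vague request -> ask
print(scope_step("research AI agents since 2023").research_brief)
```

Routing on a single boolean like `need_clarification` keeps the graph's conditional edge trivial: the node either loops back to the user or advances to execution.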
Research Execution & Report Writing (Execution)
Once the research brief is determined, we enter the research execution phase. This phase is a dynamic loop process, mainly consisting of the following nodes:
- LLM decision node: This is the Agent's brain. It analyzes the currently collected information (Context) and research objectives, deciding whether to continue searching for new content or if enough information has been collected to conclude the research.
- Tool execution node: When the LLM decides more information is needed, it generates search keywords and calls search tools (like Tavily).
- Research compression node (Context Management): As searching proceeds, web content becomes very abundant. Directly piling it into Context leads to Token explosion and interferes with model judgment. Therefore, we need a "compression node":
- Web content summarization: Immediately summarize crawled web content (Web Summary).
- Research result compression: Compress information discovered across the entire research chain, removing duplicate and irrelevant information.
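The two compression steps above can be sketched as ordinary functions: summarize each page on arrival, then deduplicate across the whole research chain. The summarizer here is a crude truncation standing in for an LLM call, and the function names are my own:

```python
# Sketch of the compression node: per-page summaries plus cross-chain
# deduplication, so the context stays small as searches accumulate.

def summarize_page(text: str, limit: int = 100) -> str:
    # Stand-in for an LLM "Web Summary" call; real code would summarize,
    # not truncate.
    return text.strip()[:limit]

def compress_notes(pages: list[str]) -> list[str]:
    seen, notes = set(), []
    for page in pages:
        summary = summarize_page(page)
        if summary not in seen:      # drop duplicate findings across rounds
            seen.add(summary)
            notes.append(summary)
    return notes

pages = [
    "LangGraph supports cyclic graphs. " * 20,
    "LangGraph supports cyclic graphs. " * 20,   # duplicate source
    "Tavily returns ranked web results.",
]
print(len(compress_notes(pages)))   # 2 — the duplicate collapses
```

The key property is that compression happens inside the loop, not after it, so each round's decision node sees a bounded context rather than raw pages.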
About prompt design challenges: At this stage, the hardest part is not getting the LLM to search, but controlling the number of tool calls.
- Too few: Research lacks depth, report is hollow.
- Too many: Fall into a rabbit hole, wasting large amounts of Tokens with diminishing marginal returns.
Therefore, we need to embed heuristic strategies in the Prompt, such as forcing the model to "show your thinking process" (Chain of Thought) and setting clear stopping conditions (e.g., when all key questions are answered and supported by data).
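One robust pattern is to pair the prompt-level stopping conditions with a hard budget in code, so a runaway model cannot burn tokens indefinitely. A sketch, where `search` and `is_sufficient` stand in for the tool call and the LLM's "are all key questions answered?" judgment (both names are assumptions):

```python
# Sketch of bounding tool calls: an explicit sufficiency check (the
# prompt-level stopping condition) backed by a hard call budget.

def is_sufficient(notes, questions):
    # Illustrative rule: stop once every key question has produced a note.
    # A real agent would ask the LLM to justify this with data coverage.
    return len(notes) >= len(questions)

def research_loop(questions, search, max_calls: int = 5):
    notes, calls = [], 0
    pending = list(questions)
    while pending and calls < max_calls:     # hard budget: no rabbit holes
        q = pending.pop(0)
        notes.append(search(q))
        calls += 1
        if is_sufficient(notes, questions):  # explicit stopping condition
            break
    return notes, calls

notes, calls = research_loop(["q1", "q2"], lambda q: f"answer to {q}")
print(calls)  # stops as soon as both questions are answered
```

The budget and the sufficiency check fail in opposite directions — one guards against infinite loops, the other against hollow reports — which is exactly the too-few/too-many tension described above.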
Agent Module Evaluation System
Building an Agent isn't hard — evaluating whether an Agent is good is. For deep research Agents, we need to establish an LLM-as-a-Judge evaluation system.
We'll break this evaluation into two parts:
- Scoping evaluation: Are user requirements clear? Do clarifying questions hit the mark?
- Execution evaluation: Is the search path reasonable? Does it avoid too deep or too shallow exploration?
According to course recommendations, building high-quality LLM evaluators requires following these principles:
- Role Definition with Expertise Context
- For example: "You are a senior research brief auditor, specifically responsible for reviewing the completeness and feasibility of research outlines."
- Clear Task Specification
- Try to use binary judgments (Pass/Fail), because models are often not stable enough in scoring, while Pass/Fail boundaries are clearer.
- Rich Contextual Background
- Tell the evaluator what a "good" research brief looks like. For example: a good brief must contain a clear time range and target audience.
- Structured XML Organization
- Use XML tags (like <criteria>, <input>, <evaluation>) to isolate context, helping the model parse complex instructions.
- Comprehensive Guidelines with Examples
- Provide 3-4 specific examples for each evaluation criterion, covering positive (Pass) and negative (Fail) cases as well as edge cases.
- Explicit Output Instructions
- Force the model to output "reasoning process" first, then "conclusion."
- Bias Reduction Techniques
- Set a "strict but fair" tone. Tell the model: "When in doubt, tend to rule as Fail," to ensure that passing content is truly high quality.
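Putting the principles above together, a judge prompt can be assembled mechanically. The sketch below shows the shape — expert role, binary verdict, XML-isolated context, reasoning before conclusion, strict-by-default tone. The exact wording is illustrative, not the course's actual prompt:

```python
# Sketch of an LLM-as-judge prompt assembled from the principles above.
# Wording is illustrative; a real version would add 3-4 Pass/Fail examples
# per criterion.

def build_judge_prompt(criteria: str, brief: str) -> str:
    return (
        "You are a senior research brief auditor.\n"             # expert role
        "Judge strictly; when in doubt, rule FAIL.\n"            # bias reduction
        f"<criteria>\n{criteria}\n</criteria>\n"                 # XML isolation
        f"<input>\n{brief}\n</input>\n"
        "<evaluation>\n"
        "First write your reasoning, then output exactly PASS or FAIL.\n"
        "</evaluation>"
    )

prompt = build_judge_prompt(
    "A good brief states a time range and a target audience.",
    "Brief: research AI agents since 2023, for investors.",
)
print("<criteria>" in prompt)
```

This string would then be sent to a judge model whose PASS/FAIL token is parsed out of the reply; the reasoning-first ordering makes the verdict auditable.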
Below, let's use actual examples to illustrate what kind of evaluator should be defined.
Evaluating Scoping Phase Agent Execution: Are Research Boundaries Clear?
For the "defining research boundaries" phase, we construct an evaluator specifically to check the generated "research brief."
The evaluation goal is to see if the generated brief is sufficient to guide subsequent deep search.
The test set consists of user–Agent dialogue transcripts; the research brief the Agent generates from each transcript is the test output.
The evaluator then scores each test output against the test set's reference answers.
Here there are two key metrics:
- Completeness: Did it miss key constraints from user intent?
- Clarity: Are there ambiguous terms?
If the Agent generates a vague brief at this step (e.g., "research AI"), the evaluator should directly rule it as Fail — the Agent is unqualified.
Evaluating Execution Phase Agent Tool Calling: Is It Reasonable?
For the "research execution" phase, we mainly evaluate the Agent's Tool Calling Behavior.
Common failure modes include:
- Premature Stop: Agent only searches once, looks at the summary and feels "that's about it," prematurely concluding, resulting in a report lacking depth.
- Infinite Loop: Agent is never satisfied with existing information, repeatedly searching similar keywords, falling into an endless loop.
The evaluator needs to check the Agent's operation trajectory:
- Did it try new keywords when hitting dead ends?
- Did it effectively use previous search results (rather than repeating searches)?
- Is its reason for stopping search sufficient (e.g., "found latest sources for all key data")?
Accordingly, the test input is the Agent's tool-call trajectory, and the test output is the Agent's decision on whether to continue researching.
By comparing that decision against the reference answers, we can evaluate whether the Agent stops searching at just the right point.
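The two failure modes above can even be caught with cheap deterministic checks before any LLM judge runs. A sketch, where the failure-mode labels and thresholds are my own illustrative assumptions:

```python
# Sketch of a deterministic trajectory check for the execution-phase
# evaluator: flag near-duplicate queries (loop risk) and trajectories
# that are too short (premature stop). Thresholds are illustrative.

def check_trajectory(queries: list[str], min_calls: int = 2) -> list[str]:
    issues = []
    if len(queries) < min_calls:
        issues.append("premature_stop")      # gave up after too few searches
    normalized = [q.lower().strip() for q in queries]
    if len(set(normalized)) < len(normalized):
        issues.append("repeated_query")      # looping on near-identical searches
    return issues

print(check_trajectory(["AI agents"]))                   # ['premature_stop']
print(check_trajectory(["AI agents", "ai agents "]))     # ['repeated_query']
```

Checks like this make a good first filter: only trajectories that pass them need the more expensive LLM-as-judge review of whether the stopping reason was actually justified.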
Next Steps
This time, we first covered the overall concept of building a deep research Agent, including business workflow breakdown, prompt design needed for each Agent module, and evaluation principles for each module. But we haven't yet covered specific implementation methods.
Actually, I've already written a deep research Demo project at:
https://github.com/codemilestones/Fairy
The code inside basically implements the two stages of Scope and Research from the tutorial. The difference is that I've rewritten all prompts and code comments in Chinese for easier reading.
Later, we'll continue diving into this project and actually see the implementation effects of the deep research Agent.
If you're also interested in actually building an industrial-grade Agent, follow me — I'll continue sharing more practical Agent experience.
References
The following articles are all cited in the course and are very inspiring. I recommend reading them.
[1] Deep Research from Scratch: https://github.com/langchain-ai/deep_research_from_scratch
[2] LLM Judge: https://hamel.dev/blog/posts/llm-judge/index.html
[3] Workflow and Agents: https://docs.langchain.com/oss/python/langgraph/workflows-agents#agent
[4] The "think" tool: https://www.anthropic.com/engineering/claude-think-tool
[5] Context engineering for agents: https://blog.langchain.com/context-engineering-for-agents/