
In Polly, we have multiple services such as Polly Pipelines, Atlas, and Knowledge Graphs. We wanted a single, unified, intelligent chat interface through which a user could seamlessly communicate with all of these services. That requirement led us to build a multi-agent communication system.
Our idea was a hierarchical system in which the chat system internally has a supervisor agent that communicates with different specialized agents. By specialized agent, we mean an individual agent that specializes in one service. For example, we built a KG agent (Knowledge Graph agent), an Atlas agent that specializes in Atlas services, and a Pipeline agent for pipelines.

Now, before we dive deep into it, it is important to understand what an agent is. It is the simplest unit here. The anatomy of an agent mainly contains three things:
(figure: anatomy of an agent)
These components are usually combined using agentic frameworks which help glue everything together.
We chose LangGraph as our agentic framework. Before finalizing it, we evaluated several frameworks available in the industry, such as LangGraph, Pydantic AI, and CrewAI. We built POCs with each, and LangGraph emerged as the right choice for us: it has been around for some time, provides rich out-of-the-box features, is flexible, and is widely used, making it reasonably battle-tested.

To create such a hierarchical system, the traditional approach was a hand-coded graph structure. The problem with these traditional systems was that they required significant manual effort, had rigid structures, and lacked intelligence: each node had to be explicitly coded, making the system harder to scale and evolve.

We chose to approach this differently. The turning point was the improvement in LLM capabilities over the last few years. Models like GPT and Claude have become very strong at reasoning and decision-making. They can now understand user prompts, interpret them, decompose them, and route them to the correct specialized agent. This level of intelligent routing was not realistically possible earlier.
LangGraph provides something called a supervisor agent, and when combined with the latest GPT models, it showed promising results. It was able to reason well, route intelligently, and integrate seamlessly with specialized agents while providing a rich set of features.
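Conceptually, the supervisor pattern is a routing step in front of the specialized agents: inspect the prompt, delegate, and relay the answer back. The sketch below illustrates that flow with a keyword-based router standing in for the LLM-driven routing that LangGraph's supervisor performs; the agent names and stub responses are illustrative, not our production code.

```python
# Illustrative supervisor-style routing. A keyword stub replaces the
# LLM reasoning step that decides which specialized agent should act.
def route(prompt: str) -> str:
    """Pick the specialized agent for a user prompt (stand-in for an LLM)."""
    p = prompt.lower()
    if "pipeline" in p:
        return "pipeline_agent"
    if "atlas" in p:
        return "atlas_agent"
    return "kg_agent"  # graph questions go to the Knowledge Graph agent

# Stub agents; in the real system these are full LangGraph agents.
AGENTS = {
    "kg_agent": lambda q: f"[KG] answered: {q}",
    "atlas_agent": lambda q: f"[Atlas] answered: {q}",
    "pipeline_agent": lambda q: f"[Pipeline] answered: {q}",
}

def supervise(prompt: str) -> str:
    """Supervisor loop: route the prompt, delegate, relay the response."""
    return AGENTS[route(prompt)](prompt)
```

The real supervisor also decomposes multi-part prompts and can call several agents in sequence, which this single-hop sketch omits.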
When we actually implemented this system (chat → supervisor → specialized agents → supervisor → user), we encountered a few challenges.
One issue was information inconsistency. Sometimes, even after explicitly mentioning in the system prompt not to omit information, the supervisor agent would occasionally drop information coming from specialized agents. We were able to improve this to some extent through system prompt refinement.
Another issue was format alterations. Since multiple agents are involved, format consistency becomes important. Sometimes agents did not strictly follow the expected format. We improved this by tightening prompts and enforcing stricter formatting expectations so responses matched what the frontend expected.
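Beyond prompt tightening, a validation step at the boundary helps catch format drift before it reaches the frontend. Below is a minimal sketch of that idea; the `ChatResponse` fields are hypothetical, not our actual response contract.

```python
# Sketch: validate an agent's raw output against the format the frontend
# expects, instead of passing malformed responses downstream.
from dataclasses import dataclass

@dataclass
class ChatResponse:
    text: str   # hypothetical field names for illustration
    agent: str

def validate(raw: dict) -> ChatResponse:
    """Raise early if required fields are missing; coerce the rest."""
    missing = {"text", "agent"} - raw.keys()
    if missing:
        raise ValueError(f"agent response missing fields: {sorted(missing)}")
    return ChatResponse(text=str(raw["text"]), agent=str(raw["agent"]))
```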
The third issue was slow response time. Since multiple LLM operations are involved (user → supervisor → specialized agent → supervisor → user), these multiple hops increased latency.
Another aspect we had to handle was conversation history. LangGraph checkpointers helped us manage this. They act as connectors between the agent and PostgreSQL. The conversations are stored in PostgreSQL while LangGraph manages the conversational state.
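The checkpointer's role can be pictured as a thin store keyed by conversation thread. In the sketch below an in-memory dict stands in for PostgreSQL; LangGraph's Postgres checkpointer plays this role in our real setup, so treat this as a conceptual model rather than its API.

```python
# Conceptual checkpointer: per-thread conversation history, with a dict
# standing in for the PostgreSQL table LangGraph writes to.
class Checkpointer:
    def __init__(self):
        self._store = {}  # thread_id -> list of message dicts

    def save(self, thread_id: str, message: dict) -> None:
        """Append one message to a thread's history."""
        self._store.setdefault(thread_id, []).append(message)

    def load(self, thread_id: str) -> list:
        """Return the full history for a thread (empty if unknown)."""
        return list(self._store.get(thread_id, []))
```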
For monitoring and debugging, we used LangSmith, which worked well for tracing and observing agent behavior.
From an infrastructure perspective, we decided early on that we wanted a low-maintenance and extensible setup. We did not want something that required heavy active maintenance. After some discussion around server vs serverless approaches, we chose AWS serverless since it gave us scalability with lower operational overhead.
Our setup included:
We built an event-driven system where prompts flow through SQS into Lambda. Since we already have multiple services and expect more in the future, we standardized the agent structure. We decided on a single monorepo containing all agentic code and defined a common structure that must be followed to create and integrate agents. We also wrote documentation so individual agent developers could quickly build and integrate their agents.
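The SQS-to-Lambda flow can be sketched as a handler that walks the standard SQS record batch. The message body shape (`thread_id`, `prompt`) and the inline processing are illustrative assumptions; in our system the handler invokes the supervisor agent.

```python
import json

def handler(event, context=None):
    """AWS Lambda entry point: each SQS record carries one user prompt.
    The body schema here ({"thread_id", "prompt"}) is illustrative."""
    replies = []
    for record in event["Records"]:        # standard SQS event envelope
        body = json.loads(record["body"])
        # In production this would invoke the supervisor agent.
        replies.append({"thread_id": body["thread_id"],
                        "reply": f"processed: {body['prompt']}"})
    return {"batch_size": len(replies), "replies": replies}
```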
We also built shared utilities so specialized agent developers would not need to worry about common concerns like:
This reduced their workload significantly.

Another practical problem was handling long-running processes. For example, when a user submits a pipeline through the pipeline agent, it can take time for results to be generated. To handle this, we built an asynchronous event-driven setup. Using EventBridge, we capture events generated later and inject them back into the conversation so users can receive updates asynchronously.
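The injection step can be sketched as an EventBridge-triggered handler that appends the late result to the originating conversation. The `detail` payload fields below are illustrative, not our actual event schema.

```python
def on_pipeline_event(event: dict, conversations: dict) -> None:
    """Inject a late pipeline result back into the conversation it came from.
    `event` mirrors an EventBridge payload ("detail" holds the custom body);
    `conversations` stands in for the persisted conversation store."""
    detail = event["detail"]  # e.g. {"thread_id", "status", "result"}
    conversations.setdefault(detail["thread_id"], []).append(
        {"role": "system",
         "text": f"Pipeline {detail['status']}: {detail['result']}"}
    )
```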
After deploying the system to production, it worked reasonably well overall, but we did face several issues. One of the most heavily used parts of the system was the Knowledge Graph (KG) agent, and we encountered several challenges there. The following section describes those learnings.
Knowledge Graph (KG) agents allow users to query complex graph data using natural language. However, building a reliable KG agent requires solving challenges around intent understanding, entity normalization, query generation, and result processing.
The first step in a KG agent pipeline is understanding the user’s query. The agent first identifies whether the query requires query execution, summarization, or general explanation, since not every question requires a database query.
It then maps the intent to the graph schema by identifying relevant node types, relationships, and properties so that the generated query uses the correct schema structure.
The agent also extracts entities from the question. These may include gene names, diseases, proteins, cohorts, or biological processes. However, user-provided entities often do not exactly match graph identifiers, which introduces another challenge.
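The first two steps (intent detection and entity extraction) are LLM calls in practice; the keyword stubs below only illustrate their inputs and outputs. The intent labels match the ones above, while the entity vocabulary is a made-up sample.

```python
# Keyword stubs standing in for two LLM steps of the KG agent:
# intent classification and entity extraction.
KNOWN_ENTITIES = {"TP53", "BRCA1", "melanoma", "apoptosis"}  # illustrative

def classify_intent(question: str) -> str:
    """Decide whether the question needs a query, a summary, or an explanation."""
    q = question.lower()
    if q.startswith(("summarize", "summarise")):
        return "summarization"
    if any(w in q for w in ("which", "list", "how many", "find")):
        return "query_execution"
    return "general_explanation"

def extract_entities(question: str) -> list:
    """Pull known biological entities out of the question text."""
    tokens = question.replace("?", "").replace(",", "").split()
    return [t for t in tokens if t in KNOWN_ENTITIES or t.upper() in KNOWN_ENTITIES]
```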
To solve this, the agent performs entity normalization using strategies such as:
Candidate entities are then ranked to select the best match.
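A minimal version of the matching-and-ranking step, assuming three strategies in decreasing priority (exact, case-insensitive, fuzzy similarity); the candidate identifiers and score weights are illustrative.

```python
import difflib

GRAPH_IDS = ["Melanoma", "Breast Cancer", "TP53", "BRCA1"]  # sample identifiers

def normalize_entity(user_term: str, candidates=GRAPH_IDS) -> list:
    """Rank candidate graph identifiers for a user-supplied entity:
    exact match > case-insensitive match > fuzzy string similarity."""
    ranked = []
    for cand in candidates:
        if cand == user_term:
            score = 1.0
        elif cand.lower() == user_term.lower():
            score = 0.9
        else:
            score = difflib.SequenceMatcher(
                None, user_term.lower(), cand.lower()).ratio()
        ranked.append((cand, round(score, 3)))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked
```

The top-ranked candidate is what gets substituted into the generated Cypher query.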
Once entities and schema mappings are identified, the agent constructs an execution plan describing the graph traversal strategy, node types, relationship directions, and filters. Based on this, the Cypher query is generated. Optionally, this plan can also be shown to users before execution for transparency.
Once finalized, the query is executed against the knowledge graph. This typically involves executing the query, polling status, retrieving results, and then processing and summarizing them into a human-readable response.
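To make the plan-to-query step concrete, here is a sketch that renders a one-hop execution plan into Cypher. The plan fields (`source`, `rel`, `target`, `filters`, `limit`) are a hypothetical subset; real plans cover multi-hop traversals and direction choices, and the generation is LLM-driven rather than templated.

```python
def plan_to_cypher(plan: dict) -> str:
    """Render a simple one-hop execution plan into a Cypher query.
    Plan shape is illustrative: source/rel/target node and relationship
    labels, optional property filters, optional row limit."""
    where = ""
    if plan.get("filters"):
        conds = " AND ".join(f"{k} = '{v}'" for k, v in plan["filters"].items())
        where = f" WHERE {conds}"
    return (f"MATCH (a:{plan['source']})-[:{plan['rel']}]->(b:{plan['target']})"
            f"{where} RETURN a, b LIMIT {plan.get('limit', 25)}")
```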
Query failures are inevitable due to incorrect schema assumptions, relationship direction errors, or Cypher syntax issues. To address this, we implemented automatic query repair. If execution fails, the system analyzes the error, corrects the query internally, and retries execution. The failure reason can also be reported for transparency.
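The repair loop boils down to execute, catch, correct, retry. In this sketch the `execute` and `repair` callables are injected; in our system the repair step is LLM-backed, analyzing the error message to rewrite the query.

```python
def run_with_repair(query: str, execute, repair, max_attempts: int = 3):
    """Execute a query; on failure, ask the repair step for a corrected
    query and retry, up to max_attempts. Surfaces the last error if all
    attempts fail so the reason can be reported to the user."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return execute(query)
        except Exception as exc:               # capture the failure reason
            last_error = str(exc)
            query = repair(query, last_error)  # corrected query for retry
    raise RuntimeError(f"query failed after {max_attempts} attempts: {last_error}")
```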
One major challenge was the initial hierarchical agent architecture, which caused context loss, missing query IDs, formatting issues, and sometimes dropped results during inter-agent communication. This required redesigning how communication and state management worked.
Another challenge was schema awareness. Large knowledge graphs have complex schemas, making it difficult for the agent to understand relationships, properties, and directionality. We addressed this by compacting the schema and enriching it with descriptions and examples. This also reduced token usage and improved query generation accuracy.
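Schema compaction can be as simple as rendering nodes and relationships into a terse text block the model can scan cheaply. The schema shape below is a hypothetical example; our real compacted schema also carries property descriptions and usage examples.

```python
def compact_schema(schema: dict) -> str:
    """Render a node/relationship schema into a compact, LLM-friendly
    text block. Input shape (nodes: {label: [props]}, relationships:
    [(src, rel, dst)]) is illustrative."""
    lines = []
    for node, props in schema["nodes"].items():
        lines.append(f"(:{node} {{{', '.join(props)}}})")
    for src, rel, dst in schema["relationships"]:
        lines.append(f"(:{src})-[:{rel}]->(:{dst})")  # encodes directionality
    return "\n".join(lines)
```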
We also faced entity detection conflicts due to inconsistent user inputs like "disease", "Disease", or "DISEASE". This was handled through normalization and case-insensitive matching.
Another related challenge was entity normalization, since direct Cypher queries often failed due to identifier mismatches. We introduced exploratory queries with multiple matching strategies and candidate ranking. Even failures in these exploratory queries were analyzed and retried.
Handling large results was another challenge. Sometimes, even 100 rows exceeded the token limits. We addressed this using row sampling, column pruning, and recursive summarization.
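The first two tactics (column pruning and row sampling) can be sketched directly; the thresholds below are illustrative, and the recursive summarization step is an LLM call that this sketch omits.

```python
def shrink_results(rows: list, keep_columns: set, max_rows: int = 20) -> list:
    """Fit large query results into the model's context window:
    drop unneeded columns, then sample rows evenly across the result set.
    max_rows=20 is an illustrative threshold."""
    pruned = [{k: v for k, v in row.items() if k in keep_columns} for row in rows]
    if len(pruned) <= max_rows:
        return pruned
    step = len(pruned) / max_rows
    return [pruned[int(i * step)] for i in range(max_rows)]  # even sampling
```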
Several architectural improvements increased system accuracy. These mainly came from better schema awareness, stronger normalization, improved planning, and automatic query repair.
Some areas we plan to improve include: