
Polly Co-Scientist: A Multi-Agent AI System for Scientific Workflows

High-Level Architecture for CDMO Capacity Modeling

In Polly, we have multiple services like Polly Pipelines, Atlas, and Knowledge Graphs, and we wanted a single chat interface through which a user could communicate with all of them. The requirement, then, was a unified, intelligent chat interface that can seamlessly talk to different services. This led us to build a multi-agent communication system.

Our idea was to build a hierarchical system in which the chat system internally has a supervisor agent that communicates with different specialized agents. By specialized agent, we mean an agent that specializes in one service. For example, we built a KG agent (Knowledge Graph agent), an Atlas agent that specializes in Atlas services, and a Pipeline agent for pipelines.

Polly Co-Scientist as a supervisor agent routing tasks to Atlas, KG, and Pipelines agents.

Before we dive deeper, it is important to understand what an agent is: it is the simplest unit in this system. The anatomy of an agent mainly contains three things:

Anatomy of an agent

Agent Anatomy: model, system prompt, and tools for interacting with Atlas APIs.
  • Model – which LLM we use. There are various options like OpenAI GPT models, Anthropic Claude, and open-source models like Llama. When we initially built this system, o3 was among the most intelligent models available. Now we are using the latest GPT-5 models.
  • System prompt – instructions defining what the agent does, its rules, and its boundaries. For example, the KG agent system prompt has details about knowledge graphs.
  • Tools – functions that give access to their respective services or APIs. For example, the Atlas agent tools provide access to Atlas APIs.
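The three-part anatomy above can be captured in a minimal sketch. The `Agent` class and the `atlas_search` tool below are illustrative stand-ins, not Polly's actual code:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    model: str                    # which LLM backs the agent, e.g. "gpt-5"
    system_prompt: str            # instructions, rules, and boundaries
    tools: list[Callable] = field(default_factory=list)  # service/API access

def atlas_search(query: str) -> str:
    """Hypothetical tool wrapping an Atlas API call."""
    return f"atlas results for: {query}"

atlas_agent = Agent(
    model="gpt-5",
    system_prompt="You are the Atlas agent. Only answer Atlas-related queries.",
    tools=[atlas_search],
)
```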

These components are usually combined using agentic frameworks which help glue everything together.

We chose LangGraph as our agentic framework. Before finalizing it, we evaluated various frameworks available in the industry, such as LangGraph, Pydantic AI, and CrewAI. We built POCs with these frameworks, and afterwards felt LangGraph was the right choice for us: it has been around for some time, provides rich out-of-the-box features, is flexible, and is widely used, making it reasonably battle-tested.

Evaluated agentic frameworks

To create such a hierarchical system, a graph-based structure was traditionally used. The problem with these traditional systems was that they required significant manual effort, had rigid structures, and lacked intelligence: each node had to be explicitly coded, making the system harder to scale and evolve.

An agent flow deciding to continue actions or end the process.

We chose to approach this differently. The turning point was the improvement in LLM capabilities over the last few years. Models like GPT and Claude have become very strong at reasoning and decision-making. They can now understand user prompts, interpret them, decompose them, and route them to the correct specialized agent. This level of intelligent routing was not realistically possible earlier.

LangGraph provides something called a supervisor agent, and when combined with the latest GPT models, it showed promising results. It was able to reason well, route intelligently, and integrate seamlessly with specialized agents while providing a rich set of features.
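The routing decision at the heart of the supervisor can be illustrated with a toy sketch. Here a keyword-based scorer stands in for the LLM's reasoning (in the real system the supervisor is an LLM call, and the agent names and keywords below are illustrative):

```python
# Keyword-based stand-in for the supervisor's LLM routing decision.
SPECIALIZED_AGENTS = {
    "kg": ["knowledge graph", "gene", "disease", "cypher"],
    "atlas": ["atlas", "dataset", "cohort"],
    "pipelines": ["pipeline", "submit", "job"],
}

def route(prompt: str) -> str:
    """Pick the specialized agent whose keywords best match the prompt."""
    text = prompt.lower()
    scores = {
        agent: sum(keyword in text for keyword in keywords)
        for agent, keywords in SPECIALIZED_AGENTS.items()
    }
    best = max(scores, key=scores.get)
    # If nothing matches, the supervisor answers directly.
    return best if scores[best] > 0 else "supervisor"
```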

When we actually implemented this system (chat → supervisor → specialized agents → supervisor → user), we encountered a few challenges.

One issue was information inconsistency. Sometimes, even after explicitly mentioning in the system prompt not to omit information, the supervisor agent would occasionally drop information coming from specialized agents. We were able to improve this to some extent through system prompt refinement.

Another issue was format alterations. Since multiple agents are involved, format consistency becomes important. Sometimes agents did not strictly follow the expected format. We improved this by tightening prompts and enforcing stricter formatting expectations so responses matched what the frontend expected.

The third issue was slow response time. Since multiple LLM operations are involved (user → supervisor → specialized agent → supervisor → user), these multiple hops increased latency.

Another aspect we had to handle was conversation history. LangGraph checkpointers helped us manage this. They act as connectors between the agent and PostgreSQL. The conversations are stored in PostgreSQL while LangGraph manages the conversational state.
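Conceptually, a checkpointer persists per-conversation state keyed by a thread id. In production LangGraph's Postgres checkpointer does this against RDS; the in-memory sketch below just illustrates the contract:

```python
# In-memory sketch of the checkpointer contract: store and reload
# conversation history per thread id. Production uses PostgreSQL.
class InMemoryCheckpointer:
    def __init__(self):
        self._store: dict[str, list[dict]] = {}

    def put(self, thread_id: str, message: dict) -> None:
        """Append a message to this thread's conversation history."""
        self._store.setdefault(thread_id, []).append(message)

    def get(self, thread_id: str) -> list[dict]:
        """Load the full history so the agent can resume the conversation."""
        return self._store.get(thread_id, [])
```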

For monitoring and debugging, we used LangSmith, which worked well for tracing and observing agent behavior.

From an infrastructure perspective, we decided early on that we wanted a low-maintenance and extensible setup. We did not want something that required heavy active maintenance. After some discussion around server vs serverless approaches, we chose AWS serverless since it gave us scalability with lower operational overhead.

Our setup included:

  • AWS Lambda for compute
  • SQS for event flow
  • Amazon RDS for storage
  • Other supporting AWS services

We built an event-driven system where prompts flow through SQS into Lambda. Since we already have multiple services and expect more in the future, we standardized the agent structure. We decided on a single monorepo containing all agentic code and defined a common structure that must be followed to create and integrate agents. We also wrote documentation so individual agent developers could quickly build and integrate their agents.
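The SQS-to-Lambda leg of this flow looks roughly like the hypothetical handler below. Each SQS record body carries the conversation id and prompt (the field names are illustrative; in production the handler would invoke the supervisor agent rather than just echo the payload):

```python
import json

def handler(event: dict, context=None) -> dict:
    """Process a batch of SQS records containing user prompts."""
    processed = []
    for record in event.get("Records", []):
        body = json.loads(record["body"])   # SQS delivers the body as a string
        # Production code would hand this to the supervisor agent here.
        processed.append({
            "conversation_id": body["conversation_id"],
            "prompt": body["prompt"],
        })
    return {"batchSize": len(processed), "processed": processed}
```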

We also built shared utilities so specialized agent developers would not need to worry about common concerns like:

  • Sending messages to users
  • Conversation streaming
  • WebSocket handling

This reduced their workload significantly.

Standardized agent folder structure and code format for integrating agents.

Another practical problem was handling long-running processes. For example, when a user submits a pipeline through the pipeline agent, it can take time for results to be generated. To handle this, we built an asynchronous event-driven setup. Using EventBridge, we capture events generated later and inject them back into the conversation so users can receive updates asynchronously.
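The injection step can be sketched as a handler for a pipeline-completion event. EventBridge events carry their payload under a `detail` key; the other field names and the `CONVERSATIONS` store are illustrative:

```python
# Sketch: inject a late pipeline result back into its conversation.
CONVERSATIONS: dict[str, list[str]] = {}

def on_pipeline_completed(event: dict) -> None:
    detail = event["detail"]
    conversation_id = detail["conversation_id"]
    update = f"Pipeline {detail['pipeline_id']} finished: {detail['status']}"
    # Append the update so the user receives it asynchronously in the chat.
    CONVERSATIONS.setdefault(conversation_id, []).append(update)
```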

After deploying the system to production, it worked reasonably well overall, but we did face several issues. One of the most heavily used parts of the system was the Knowledge Graph (KG) agent, and we encountered several challenges there. The following section describes those learnings.

Anatomy of a Knowledge Graph Agent

Knowledge Graph (KG) agents allow users to query complex graph data using natural language. However, building a reliable KG agent requires solving challenges around intent understanding, entity normalization, query generation, and result processing.

Intent Parsing

The first step in a KG agent pipeline is understanding the user’s query. The agent first identifies whether the query requires query execution, summarization, or general explanation, since not every question requires a database query.

It then maps the intent to the graph schema by identifying relevant node types, relationships, and properties so that the generated query uses the correct schema structure.

The agent also extracts entities from the question. These may include gene names, diseases, proteins, cohorts, or biological processes. However, user-provided entities often do not exactly match graph identifiers, which introduces another challenge.

To solve this, the agent performs entity normalization using strategies such as:

  • Exact matching
  • Fuzzy matching
  • Contains matching
  • Abbreviation expansion

Candidate entities are then ranked to select the best match.

Execution Plan Generation

Once entities and schema mappings are identified, the agent constructs an execution plan describing the graph traversal strategy, node types, relationship directions, and filters. Based on this, the Cypher query is generated. Optionally, this plan can also be shown to users before execution for transparency.
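A simplified execution plan and the Cypher it yields might look like this. The plan fields mirror the description above (traversal, node types, direction, filters); real plans and schemas are far richer:

```python
# Hypothetical plan-to-Cypher step. Field names are illustrative.
def plan_to_cypher(plan: dict) -> str:
    src, rel, dst = plan["source"], plan["relationship"], plan["target"]
    where = " AND ".join(f"{k} = '{v}'" for k, v in plan.get("filters", {}).items())
    query = f"MATCH (a:{src})-[:{rel}]->(b:{dst})"
    if where:
        query += f" WHERE {where}"
    return query + " RETURN a, b LIMIT 25"

plan = {
    "source": "Gene",
    "relationship": "ASSOCIATED_WITH",
    "target": "Disease",
    "filters": {"b.name": "Alzheimer disease"},
}
```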

Query Execution

Once finalized, the query is executed against the knowledge graph. This typically involves executing the query, polling status, retrieving results, and then processing and summarizing them into a human-readable response.
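The execute-poll-fetch loop can be sketched like this; the client object and its methods are stand-ins for the real knowledge-graph API (a toy client is included so the sketch runs end to end):

```python
import time

def run_query(client, cypher: str, poll_interval: float = 0.01, max_polls: int = 100):
    query_id = client.submit(cypher)
    for _ in range(max_polls):                 # poll until the query finishes
        status = client.status(query_id)
        if status == "SUCCEEDED":
            return client.results(query_id)    # fetch rows for summarization
        if status == "FAILED":
            raise RuntimeError(f"query {query_id} failed")
        time.sleep(poll_interval)
    raise TimeoutError(f"query {query_id} did not finish")

class FakeClient:
    """Toy client that succeeds after two status polls."""
    def __init__(self):
        self._polls = 0
    def submit(self, cypher):
        return "q1"
    def status(self, query_id):
        self._polls += 1
        return "SUCCEEDED" if self._polls >= 2 else "RUNNING"
    def results(self, query_id):
        return [{"gene": "TP53"}]
```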

Failure Handling

Query failures are inevitable due to incorrect schema assumptions, relationship direction errors, or Cypher syntax issues. To address this, we implemented automatic query repair. If execution fails, the system analyzes the error, corrects the query internally, and retries execution. The failure reason can also be reported for transparency.
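The repair loop boils down to a bounded retry where the error message is fed back to a repair step (an LLM call in production; a stub here, with a toy executor that fails on a wrong relationship direction):

```python
def execute_with_repair(execute, repair, query: str, max_attempts: int = 3):
    last_error = None
    for _ in range(max_attempts):
        try:
            return execute(query)
        except Exception as err:
            last_error = err
            query = repair(query, str(err))   # LLM-corrected query in production
    raise RuntimeError(f"still failing after {max_attempts} attempts: {last_error}")

def flaky_execute(q):
    """Toy executor: rejects the wrong relationship direction."""
    if "-[:ASSOC]->" in q:
        raise ValueError("relationship direction error")
    return ["row"]

def stub_repair(q, error):
    """Stand-in for the LLM repair step: flip the relationship direction."""
    return q.replace("-[:ASSOC]->", "<-[:ASSOC]-")
```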

Key Challenges

One major challenge was the initial hierarchical agent architecture, which caused context loss, missing query IDs, formatting issues, and sometimes dropped results during inter-agent communication. This required redesigning how communication and state management worked.

Another challenge was schema awareness. Large knowledge graphs have complex schemas, making it difficult for the agent to understand relationships, properties, and directionality. We addressed this by compacting the schema and enriching it with descriptions and examples. This also reduced token usage and improved query generation accuracy.
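Schema compaction amounts to collapsing a verbose schema dump into a short, example-enriched description that fits in the prompt. The field names below are illustrative:

```python
# Sketch: compact a schema into one annotated line per relationship.
def compact_schema(schema: dict) -> str:
    lines = []
    for rel in schema["relationships"]:
        lines.append(
            f"({rel['from']})-[:{rel['type']}]->({rel['to']})"
            f"  # {rel['description']}"
        )
    return "\n".join(lines)

schema = {
    "relationships": [
        {"from": "Gene", "type": "ASSOCIATED_WITH", "to": "Disease",
         "description": "gene implicated in a disease, e.g. TP53 -> cancer"},
    ]
}
```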

We also faced entity detection conflicts due to inconsistent user inputs like disease, Disease, or DISEASE. This was handled through normalization and case-insensitive matching.

Another related challenge was entity normalization, since direct Cypher queries often failed due to identifier mismatches. We introduced exploratory queries with multiple matching strategies and candidate ranking. Even failures in these exploratory queries were analyzed and retried.

Handling large results was another challenge. Sometimes, even 100 rows exceeded the token limits. We addressed this using row sampling, column pruning, and recursive summarization.
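The three strategies can be sketched together: sample rows, prune columns, then summarize chunks recursively until a single summary remains. The `summarize` callable is a stub standing in for an LLM call:

```python
def shrink_results(rows: list[dict], keep_columns: list[str], sample_size: int = 10):
    sampled = rows[:sample_size]                                # row sampling
    return [{k: r[k] for k in keep_columns if k in r}           # column pruning
            for r in sampled]

def recursive_summarize(items: list, summarize, chunk_size: int = 5) -> str:
    if len(items) <= chunk_size:
        return summarize(items)
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    partials = [summarize(chunk) for chunk in chunks]           # summarize each chunk
    return recursive_summarize(partials, summarize, chunk_size) # then the summaries
```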

Accuracy Improvements

Several architectural improvements increased system accuracy:

  • With the supervisor architecture: ~70% accuracy
  • After removing the supervisor: ~91% accuracy

These improvements mainly came from better schema awareness, stronger normalization, improved planning, and automatic query repair.

Future Improvements

Some areas we plan to improve include:

  • Supporting long-term multi-turn conversations
  • Improving result summarization
  • Improving query speed and latency
  • Better entity resolution
  • Improving overall reasoning capability
