
Whether you are looking for an autonomous scientist to architect a CRISPR screen or a reliable digital analyst to run a spatial transcriptomics pipeline, the rise of "BioAgents" is rapidly reshaping drug discovery. No longer just simple LLMs, these agents now integrate massive biological data lakes with specialized toolsets to perform complex reasoning. But as the field crowds with contenders, a critical question remains: which of these agents can actually do the math, and which ones are just "hallucinating" a PhD? In this deep dive, we benchmark three of the industry’s most talked-about AI agents: Stanford’s broad-spectrum Biomni, Genentech’s precision-focused SpatialAgent, and the workflow-centric Polly BioAgent. We look at how each one handles real-world spatial analysis and literature mining tasks.

To benchmark these agents, we designed two tasks that push the boundaries of their capabilities: one built around spatial transcriptomics analysis and one around literature mining. Together they test both mathematical execution and biological reasoning.
Prompt: Download the 10x Genomics Visium spatial transcriptomics dataset for the adult mouse brain (sample_id='V1_Adult_Mouse_Brain'). Once loaded, perform spatial clustering to identify distinct anatomical tissue domains. Generate a spatial scatter plot of the tissue coordinates colored by these identified clusters. Next, calculate spatially variable genes and infer ligand-receptor cell-cell communication networks between the distinct spatial domains using an appropriate Python spatial library (like Squidpy or CellPhoneDB). Output a summary report of the top 3 most active signaling pathways and save the interaction matrix to cci_matrix.csv.
Rationale: This task was selected to test the agents' capacity to handle specialized, multimodal data. Spatial transcriptomics requires integrating gene expression matrices with physical tissue coordinates, so the task exposes whether an agent actually applies domain-specific bioinformatics methods (e.g., spatial autocorrelation metrics such as Moran's I) or lazily falls back on generic data science approximations (e.g., a standard coefficient of variation) that ignore where each measurement sits in the tissue.
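
To make the Moran's I versus coefficient of variation distinction concrete, here is a minimal sketch in plain Python. It is not taken from any of the benchmarked agents. It assumes a toy 6x6 square grid with binary rook (4-neighbour) weights, whereas real Visium spots sit on a hexagonal lattice and a library such as Squidpy would build the neighbour graph for you. The grids `split` and `checker` are invented for illustration: they contain exactly the same values, arranged differently.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean; ignores position."""
    return pstdev(values) / mean(values)


def morans_i(grid):
    """Moran's I on a rectangular grid with binary rook (4-neighbour) weights."""
    n_rows, n_cols = len(grid), len(grid[0])
    flat = [v for row in grid for v in row]
    mu = mean(flat)
    cross = 0.0        # sum over neighbour pairs of (x_i - mu) * (x_j - mu)
    total_weight = 0   # number of directed neighbour pairs (sum of all w_ij)
    for r in range(n_rows):
        for c in range(n_cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    cross += (grid[r][c] - mu) * (grid[rr][cc] - mu)
                    total_weight += 1
    variance_sum = sum((v - mu) ** 2 for v in flat)
    return (len(flat) / total_weight) * (cross / variance_sum)


SIZE = 6
# Toy "gene" expressed in the left half of the tissue only (spatially structured).
split = [[10 if c < SIZE // 2 else 0 for c in range(SIZE)] for r in range(SIZE)]
# Same values rearranged into a checkerboard (no contiguous domain).
checker = [[10 if (r + c) % 2 == 0 else 0 for c in range(SIZE)] for r in range(SIZE)]

for name, grid in (("split", split), ("checker", checker)):
    flat = [v for row in grid for v in row]
    print(name, round(coefficient_of_variation(flat), 3), round(morans_i(grid), 3))
# split   -> CV 1.0, Moran's I  0.8
# checker -> CV 1.0, Moran's I -1.0
```

Both toy genes have an identical coefficient of variation, so a CV-based ranking cannot tell them apart, while Moran's I separates the contiguous domain from the checkerboard. That is the gap the rationale above is probing.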
Prompt: Take the following list of upregulated genes in Alzheimer's disease: APOE, TREM2, CD33, and CLU. Programmatically query biological databases to identify enriched GO terms and KEGG pathways for this specific gene set. Based on the identified pathways, suggest 3 FDA-approved drugs that could potentially be repurposed to target this signaling cascade. Provide a detailed markdown report of your reasoning, including database IDs or literature references to support your hypothesis.
Rationale: Bioinformatics requires synthesizing results into actionable biological insights. This task was selected to isolate the agents' Retrieval-Augmented Generation (RAG) and reasoning capabilities. Because the agents must cite specific literature and database IDs, the test acts as a strict evaluation of an agent's factual grounding versus its susceptibility to hallucinating scientific citations.
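
For readers unfamiliar with what "enriched" means in the prompt above, the sketch below shows one common approach: over-representation analysis with a one-sided hypergeometric test. The source does not say which method any agent used, so treat this as an illustration only. The background size of 20,000 genes, the term names `TERM_A` and `TERM_B`, and their memberships (including the `GENE_…` placeholders) are all invented for the example and are not real GO or KEGG annotations; a real run would pull annotations from those databases and apply a multiple-testing correction.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) when drawing `draws` genes from `population`, of which
    `successes` belong to the term (one-sided over-representation test)."""
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return tail / total


# Query gene set from the prompt.
query = {"APOE", "TREM2", "CD33", "CLU"}

# ASSUMPTION: size of the annotated background gene universe.
BACKGROUND_SIZE = 20_000

# HYPOTHETICAL annotations, invented for illustration only.
toy_terms = {
    "TERM_A": {"APOE", "TREM2", "CLU"} | {f"GENE_A{i:03d}" for i in range(47)},
    "TERM_B": {"CD33"} | {f"GENE_B{i:03d}" for i in range(499)},
}

results = {}
for term, members in toy_terms.items():
    overlap = len(query & members)
    results[term] = hypergeom_sf(overlap, BACKGROUND_SIZE, len(members), len(query))

# Smallest p-value first: the most over-represented toy term leads the list.
for term, p_value in sorted(results.items(), key=lambda item: item[1]):
    print(f"{term}: overlap p-value = {p_value:.3g}")
```

In this toy setup, three of four query genes landing in a 50-gene term is far less likely by chance than one of four landing in a 500-gene term, so `TERM_A` ranks first. With only four input genes, real enrichment results are sensitive to the choice of background and annotation source, which is part of why the task asks for database IDs.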
