At the 2023 QBI Hackathon at UCSF, our team built something we had long wished existed: a way to automatically track biomolecular structures mentioned in preprints and surface them to the RCSB Protein Data Bank (PDB). In just 36 hours, we built a fully working prototype, and it won us first place 🏆.

Why We Built StructHunt

Structural biology data is the backbone of modern life sciences. The success of AlphaFold2 and the urgency of the COVID-19 pandemic showed us how critical fast access to structural data is — whether for basic science or drug development.

But there’s a bottleneck: preprints on bioRxiv and medRxiv often describe integrative biomolecular structures that don’t get deposited into the PDB or PDB-Dev archive right away. That lag slows down discovery. StructHunt is designed to close that gap.

How It Works

StructHunt is a cloud-native, LLM-driven workflow that turns fresh preprints into actionable insights for the structural biology community. New papers from bioRxiv and medRxiv are automatically ingested and converted into vector embeddings with an LLM-based embedding model. These embeddings are stored in Lantern, a scalable vector database, and searched with FAISS to quickly isolate passages that mention integrative biomolecular structures. The retrieved text is then processed by ChatGPT-4.0, which extracts metadata, identifies structural details, and generates concise, human-readable summaries. These outputs flow into Google Docs for collaborative review and are sent as automated email notifications to the RCSB PDB biocuration team, so curators can prioritize new findings faster. The entire system is hosted on AWS, which gives us scalability, reliability, and a clear path to production-level deployment.
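To make the retrieval step concrete, here is a minimal sketch of the "embed, search, then summarize" idea. It is not the StructHunt code: a bag-of-words count stands in for the LLM embeddings, a brute-force cosine similarity stands in for FAISS and Lantern, and the query string, the function names (`embed`, `top_passages`, `build_prompt`), and the prompt wording are all hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical query describing what curators care about (an assumption).
QUERY = "integrative structure modeling cryo-EM crosslinking"


def embed(text: str) -> Counter:
    """Toy stand-in for an LLM embedding: lowercase term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_passages(passages: list[str], query: str = QUERY, k: int = 2) -> list[tuple[float, str]]:
    """Return up to k passages most similar to the query, best first.

    Brute-force search; a real system would delegate this to a vector index.
    Passages with zero similarity are dropped.
    """
    q = embed(query)
    scored = [(cosine(embed(p), q), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(score, p) for score, p in scored[:k] if score > 0]


def build_prompt(passages: list[str]) -> str:
    """Assemble a hypothetical extraction prompt for the LLM step."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "From the preprint passages below, list any integrative biomolecular "
        "structures mentioned, the methods used, and a two-sentence summary.\n\n"
        + numbered
    )


if __name__ == "__main__":
    docs = [
        "Patients were enrolled between 2019 and 2021 in a randomized trial.",
        "Integrative modeling combined SAXS restraints with crosslinking mass spectrometry.",
        "We determined an integrative structure of the complex using cryo-EM and crosslinking data.",
    ]
    hits = top_passages(docs, k=3)
    for score, passage in hits:
        print(f"{score:.2f}  {passage}")
    print(build_prompt([p for _, p in hits]))
```

Only the passages that score above zero reach the prompt, which mirrors the point of the retrieval stage: the LLM sees a handful of relevant passages rather than the whole preprint.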

StructHunt started as a 36-hour sprint, but it points toward a bigger vision: AI-driven infrastructure that keeps scientific data as fresh as the discoveries themselves.

GitHub repo

StructHunt team at the QBI Hackathon at UCSF, 2023.