Overview
Software Infrastructure for Life Sciences
by Shelby Newsad
Digitizing the stack of biotech
Digitization of biotech is happening on many layers - via new instrumentation (liquid handlers, robotic arms), better experimental modalities (sequencing and imaging vs. biochemical assays), and through expedited literature scanning and coding tools (Perplexity, ChatGPT, Copilot). The collective force of these developments is making science more effective and impactful as a practice. We break this down into two themes: 1) bringing scientists up to a modern tech stack and 2) enabling new research opportunities for scientists.
With the first theme of (bio)tech modernization, we discuss data management tools and how these tie into ELN and LIMS systems and workflow management for the large datasets being produced by genomics and structural biology.
On the second front, new technologies have produced vast datasets in new data modalities (cell painting, sequencing). These modalities provide new ways to probe mechanisms but demand new skills, and the rate at which scientists can pick up those skills has increased drastically with LLMs; here the scientist is as much a hardware and software engineer as a biologist. New modes of research also become possible when LLMs are linked to lab automation, enabling closed-loop, autonomous experimental systems. Beyond that, AI-generated data (such as predicted protein structures) has added depth to academic investigations of health and disease.
Taken together, we believe this confluence of innovation will drastically shorten research iteration cycles and allow for new disciplines and modes of experimentation.
Paper one
Considerations for implementing electronic laboratory notebooks
by Stuart G. Higgins, Akemi A. Nogiwa-Valdez & Molly M. Stevens
Impact
This manuscript discusses how research data have become increasingly digitized. Even as they make novel discoveries, scientists have struggled to transition to modern methods of data collection, and tools like electronic laboratory notebooks (ELNs) have become a necessity given the volume and growing complexity of data produced in the lab. Despite the benefits of digital tools, research organizations still need to choose the right system, and there are many practical and business considerations involved. This paper summarizes the motivations behind digitizing lab records and how doing so can pave the way for future benefits like automation and easier use of LLMs.
Methods and results
The authors did a thorough review of the motivations and personas involved in setting up a digital system for experimental data.
Although this publication was originally written for academic labs, several parallels can be drawn to the requirements of small and mid-sized biotech and biopharma companies. The key points covered in this article include:
- Importance of user requirements and ways to keep digitization from becoming a bottleneck for scientists
- Improving research reproducibility and long term data access
- Considerations for data integrity, security and compliance with regulations
- The financial and time investments required to digitize a lab, balanced against future benefits such as using blockchain and peer-to-peer networking to aid accountability or reduce reliance on a single repository for long-term data storage, or integrating an ELN with computational semantic technologies that allow the meaning of human language to be automatically inferred
Paper two
Self-driving laboratories to autonomously navigate the protein fitness landscape
by Jacob T. Rapp, Bennett J. Bremer & Philip A. Romero
Impact
Removing humans from the direct execution of experiments can increase reproducibility and throughput through higher levels of automation. Automatically feeding results into models that synthesize the information and generate hypotheses for next steps further speeds up experimental iteration cycles. This paper brings those capabilities together to optimize protein stability, an important property and a taste of what the future of scientific research looks like.
Methods and results
Self-driving Autonomous Machines for Protein Landscape Exploration (SAMPLE) is a landmark study that integrates lab automation, machine learning, and reasoning to create closed-loop protein optimization. An AI agent built around Bayesian optimization proposes new protein variants, which are tested directly in the experimental environment by assembling pre-synthesized DNA fragments, expressing the proteins, and measuring their thermostability. The functional data - gene assembly success and thermostability - are fed back to the agent, automatically analyzed, and integrated into a database. This collective information builds a higher-granularity map of the protein fitness landscape and identifies mutations with higher potential for improved thermostability.
Thermostability of the protein of interest (a glycoside hydrolase) was increased by 12°C. Future work could use this same platform to optimize other proteins or even other properties (protein function, binding, substrate specificity, and others).
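To make the closed loop concrete, here is a minimal sketch of how a Bayesian-optimization agent could drive a propose-build-test cycle. This is not the SAMPLE codebase: the Gaussian-process surrogate, the candidate encodings, and the placeholder "lab" function are all illustrative assumptions.

```python
# Minimal sketch of a closed-loop Bayesian optimization agent (illustrative only).
# Assumptions: candidate sequences are pre-encoded as feature vectors, and the
# "lab" step is simulated by a placeholder standing in for DNA assembly,
# protein expression, and thermostability measurement.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
candidates = rng.normal(size=(500, 8))          # hypothetical sequence encodings

def run_lab_experiment(x):
    """Placeholder for assemble -> express -> measure thermostability (degC)."""
    return 60 + 5 * np.sin(x).sum() + rng.normal(scale=0.5)

# Seed the database with a few random measurements.
tested_idx = list(rng.choice(len(candidates), size=5, replace=False))
measurements = [run_lab_experiment(candidates[i]) for i in tested_idx]

for round_ in range(10):                         # ten design-build-test cycles
    gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True)
    gp.fit(candidates[tested_idx], measurements)

    mu, sigma = gp.predict(candidates, return_std=True)
    ucb = mu + 2.0 * sigma                       # upper-confidence-bound acquisition
    ucb[tested_idx] = -np.inf                    # never re-propose tested variants

    next_idx = int(np.argmax(ucb))               # agent proposes the next variant
    result = run_lab_experiment(candidates[next_idx])

    tested_idx.append(next_idx)                  # feed the result back into the loop
    measurements.append(result)
    print(f"round {round_}: Tm = {result:.1f} C, best so far = {max(measurements):.1f} C")
```

The key design point is that the model's uncertainty, not just its predicted thermostability, decides which variant to build next, which is what lets the loop explore the fitness landscape rather than exploit a single peak.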
Paper three
Emergent autonomous scientific research capabilities of LLMs
by Daniil A. Boiko, Robert MacKnight, and Gabe Gomes
Impact
LLMs have been shown to be helpful in summarizing literature, expanding on possibilities for next experiments, and pressure-testing the validity of certain hypotheses. LLM-based systems are now emerging that take the field a step further: an intelligent agent that designs, plans, and executes scientific experiments. This is important because it shifts part of the 'thinking' work of science from the scientist to the machine.
Methods and results
An AI agent was built on GPT-3.5 and GPT-4 backends with the ability to plan experiments, write code, search the web, and drive lab hardware to complete experiments. Pertinent webpages were retrieved using vector embeddings. Once the relevant information was aggregated, the agent turned natural language prompts into protocols, which were then executed through direct pipetting by the system. In this way, two different chemical reactions (Suzuki and Sonogashira couplings) were carried out. The upshot is that an AI agent was able to autonomously reason, code, synthesize information, and control hardware to test its hypotheses. The system's potential for misuse (dual use) was also probed and could be contained, but as these systems become more sophisticated (with better reasoning and more reactions), safety checks will need to keep improving.
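As a rough illustration of the retrieve-then-plan pattern described above (not the authors' actual system), the sketch below embeds a small corpus of webpage snippets, retrieves the most relevant ones for a task, and asks an LLM to draft a protocol grounded in that context. It assumes the OpenAI Python client (v1 interface); the corpus, model names, and prompts are invented for illustration.

```python
# Illustrative retrieve-then-plan sketch; not the paper's codebase.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

docs = [
    "Suzuki coupling joins aryl halides and boronic acids with a Pd catalyst.",
    "Sonogashira coupling links terminal alkynes to aryl or vinyl halides.",
    "Liquid handler deck layout: plate A1 holds reagents, B1 holds solvent.",
]

def embed(texts):
    """Embed a list of strings into vectors (model name is an assumption)."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vecs = embed(docs)

def retrieve(query, k=2):
    """Return the k documents most similar to the query by cosine similarity."""
    q = embed([query])[0]
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(sims)[::-1][:k]]

query = "Plan a Suzuki coupling run on the liquid handler."
context = "\n".join(retrieve(query))

# Planner step: the LLM drafts a protocol grounded in the retrieved context.
plan = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You write step-by-step liquid-handler protocols."},
        {"role": "user", "content": f"Context:\n{context}\n\nTask: {query}"},
    ],
)
print(plan.choices[0].message.content)
```

In the paper the resulting protocol is then handed to hardware-control code; here the sketch stops at the planning step.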
Paper four
Scipion3: A workflow engine for cryo-electron microscopy image processing and structural biology
by Pablo Conesa, Yunior C. Fonseca, Jorge Jiménez de la Morena, Grigory Sharov, Jose Miguel de la Rosa-Trevín, Ana Cuervo, Alberto García Mena, Borja Rodríguez de Francisco, Daniel del Hoyo, David Herreros, Daniel Marchan, David Strelak, Estrella Fernández-Giménez, Erney Ramírez-Aportela, Federico Pedro de Isidro-Gómez, Irene Sánchez, James Krieger, José Luis Vilas, Laura del Cano, Marcos Gragera, Mikel Iceta, Marta Martínez, Patricia Losana, Roberto Melero, Roberto Marabini, José María Carazo and Carlos Oscar Sánchez Sorzano
Impact
The software developed in this paper is the synthesis point for integrating different imaging software packages (structural biology and microscopy). Moreover, its benefits cover important axes not typically optimized for in an academic context, such as data traceability, reproducibility of analyses and results, and versatility through interoperability.
Methods and results
The Scipion workflow engine, initially developed in 2013, was recently updated to encompass many domains in computational image analysis. These domains include single-particle analysis by cryo-EM as well as atomic modeling, tomography, microED, virtual drug screening, and microscopy. In addition to the wide set of use cases, Scipion's feature set is vast, including interoperability, data provenance tracking, reproducibility, and handling of large datasets.
Interoperability is achieved through universal data objects that standardize metadata and allow over 1000 methods from 60 software packages to be combined. In addition, Scipion3 provides HPC integration, user authentication, scripting capabilities, and flexible visualization. Taken together, this workflow engine is an aggregating force that engenders best practices among scientists in structural biology.
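As a rough illustration of the universal-data-object idea (not Scipion3's actual API), a workflow engine can wrap package-specific outputs in a common object that carries standardized metadata and provenance, so the output of one tool can be handed to any other:

```python
# Minimal sketch of a "universal data object" pattern for interoperability.
# Class, field, and function names are illustrative assumptions, not Scipion3's API.
from dataclasses import dataclass, field

@dataclass
class ParticleSet:
    """Package-agnostic container for a set of picked particles."""
    files: list[str]                      # paths to the underlying image stacks
    pixel_size_angstrom: float            # standardized metadata shared by all tools
    source_package: str                   # e.g. "relion" or "cryosparc" (provenance)
    history: list[str] = field(default_factory=list)

    def record(self, step: str) -> "ParticleSet":
        """Append a provenance entry so every result can be traced back."""
        self.history.append(step)
        return self

def import_from_relion(star_path: str) -> ParticleSet:
    # A converter per package maps native metadata onto the shared object.
    return ParticleSet(files=[star_path], pixel_size_angstrom=1.1,
                       source_package="relion").record(f"imported {star_path}")

def run_2d_classification(particles: ParticleSet) -> ParticleSet:
    # Downstream protocols only need the shared interface, not the source tool.
    return particles.record("2D classification")

result = run_2d_classification(import_from_relion("particles.star"))
print(result.source_package, result.history)
```

The provenance list is what makes traceability and reproducibility fall out of the same design: every object carries the steps that produced it.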
Paper five
Nextflow in Bioinformatics: Executors Performance Comparison Using Genomics Data
by Viktória Spišaková, Lukáš Hejtmánek, Jakub Hynšt
Impact
Many modern bioinformatics workflows have complex dependencies that need to be coordinated with high-performance computing (HPC) clusters. There is tension between the life sciences' move toward containers (which take care of workflow dependencies) and the need to marry them with strong performance. This paper compares workflows on various infrastructures and discusses the pros and cons of each depending on the context.
Methods and results
Real-world data from the Czech Genome Project were analyzed on HPC systems using the batch scheduling system OpenPBS and the container platform Kubernetes, and then compared against local compute. Historically there has been debate about best practices: standard scheduling systems like OpenPBS offer a poor user experience for many scientists and struggle with reproducibility, while Kubernetes was designed for web applications, not HPC, and can suffer from high failure rates.
The results from running the Nextflow bioinformatics pipeline Sarek on the Czech Genome Project data showed that Kubernetes and OpenPBS performed similarly for large datasets, with Kubernetes being more stable and slightly faster. For this reason, the authors see containerized systems such as Kubernetes as the next generation of computing infrastructure. For smaller datasets, local machines offered similar computation time and user experience, with no significant benefit from using large infrastructure.
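For readers who want to reproduce this kind of comparison, here is a minimal sketch of timing the same Nextflow pipeline under different executors. The config contents, profile names, and executor mappings are assumptions based on Nextflow's documented executor settings, not the authors' benchmarking harness.

```python
# Rough sketch of benchmarking one Nextflow pipeline across executors (illustrative).
import subprocess
import time
from pathlib import Path

# One small config per backend; Nextflow selects the executor via process.executor.
CONFIGS = {
    "local": "process.executor = 'local'\n",
    "openpbs": "process.executor = 'pbspro'\n",   # assumed mapping for OpenPBS
    "kubernetes": "process.executor = 'k8s'\n",
}

PIPELINE = "nf-core/sarek"        # the Sarek pipeline used in the paper

def run_benchmark(name: str, config_text: str) -> float:
    """Write a per-executor config, run the pipeline, and return wall-clock time."""
    cfg = Path(f"{name}.config")
    cfg.write_text(config_text)
    start = time.monotonic()
    subprocess.run(
        ["nextflow", "run", PIPELINE, "-c", str(cfg), "-profile", "test,docker"],
        check=True,
    )
    return time.monotonic() - start

if __name__ == "__main__":
    for name, cfg_text in CONFIGS.items():
        elapsed = run_benchmark(name, cfg_text)
        print(f"{name}: {elapsed / 60:.1f} min")
```

Wall-clock time is only one axis; the paper also weighs stability and user experience, which a timing wrapper like this does not capture.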
Paper six
The impact of AlphaFold Protein Structure Database on the fields of life sciences
by Mihaly Varadi & Sameer Velankar
Impact
As more protein folding models come online, there will be a growing number of repositories, such as the AlphaFold Protein Structure Database and the ESM Metagenomic Atlas, that serve as a public good for the scientific community. Visibility of these datasets and examples of downstream research are important for reinforcing continued funding and investigation to further bioscience research.
Methods and results
214 million highly accurate protein structure predictions have been successively added to the AlphaFold Protein Structure Database. Mainstream protein resources such as UniProt, InterPro, and the Protein Data Bank (PDB) have integrated these predicted structures where experimental data have not yet been collected. The impact of this profound corpus has been extended by bioinformaticians, who have mined it for novel binding sites that could be used for ligand binding and to further sequence annotation. Structures from the database have also been combined with experimental data to further characterize difficult complexes such as the human nuclear pore complex, and virtual screens and docking studies have been performed for proteins relevant to tuberculosis, colitis, and vaccine development. A centralized, accessible database of state-of-the-art predictions is an important and ever-growing resource for the bioscience community.
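For researchers who want to pull these predictions programmatically, here is a minimal sketch of downloading one model file from the database. The file-naming pattern and the "_v4" model version suffix are assumptions based on the database's public download links and may change over time; the example accession is arbitrary.

```python
# Illustrative sketch of fetching a predicted structure from the AlphaFold
# Protein Structure Database (URL pattern and version suffix are assumptions).
import urllib.request

def fetch_alphafold_model(uniprot_accession: str, version: int = 4) -> str:
    """Download the predicted PDB file for a UniProt accession and return its text."""
    url = (
        "https://alphafold.ebi.ac.uk/files/"
        f"AF-{uniprot_accession}-F1-model_v{version}.pdb"
    )
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

if __name__ == "__main__":
    # P69905: human hemoglobin subunit alpha, used purely as an example accession.
    pdb_text = fetch_alphafold_model("P69905")
    # Per-residue confidence (pLDDT) is stored in the B-factor column of ATOM records.
    plddt = [float(line[60:66]) for line in pdb_text.splitlines() if line.startswith("ATOM")]
    print(f"{len(plddt)} atoms, mean pLDDT = {sum(plddt) / len(plddt):.1f}")
```

Checking the pLDDT confidence values before any downstream docking or annotation work is a common first step, since low-confidence regions of a predicted structure should be treated cautiously.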