Evo 2: Open Source AI Trained on 8.8 Trillion DNA Bases

By alex2404

Evo 2, an open source AI model trained on 8.8 trillion base pairs of DNA, can identify complex genomic features across all three domains of life, its developers announced. The system builds directly on its predecessor, Evo, which was limited to bacterial genomes, and extends that capability to eukaryotes, including organisms with genomes as large and structurally complex as the human genome.

The original Evo worked well with bacteria partly because bacterial genomes follow predictable rules. Related genes cluster together. Coding sequences run uninterrupted. Regulatory systems are compact. Eukaryotic genomes operate on no such tidy logic, and the question of whether the same approach could scale was left open. The team behind Evo decided to find out.

Why Eukaryotic Genomes Are Harder to Read

In eukaryotes, coding sections of genes are broken up by introns, stretches of DNA that encode nothing but must still be correctly identified and removed during gene expression. Regulatory sequences can be scattered across hundreds of thousands of base pairs. The molecular signals that define splice sites or protein binding locations are statistically fuzzy, not absolute, meaning a given position might only favor a particular base 45 percent of the time rather than always.
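That kind of fuzzy signal is usually described with a position frequency matrix: for each position in a motif, the fraction of observed sequences carrying each base. The toy sketch below uses made-up frequencies (not real splice-site data) to show why a "45 percent" preference makes recognition a statistical scoring problem rather than a simple lookup.

```python
# Toy position frequency matrix for a hypothetical three-base motif.
# Each row gives the fraction of observed sequences with each base at
# that position. The numbers are illustrative, not real splice-site data.
motif = [
    {"A": 0.45, "C": 0.20, "G": 0.25, "T": 0.10},  # weakly favors A (45%)
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},  # strongly favors G
    {"A": 0.10, "C": 0.10, "G": 0.10, "T": 0.70},  # favors T
]

def motif_score(seq: str) -> float:
    """Probability of seq under the motif, assuming independent positions."""
    score = 1.0
    for base, freqs in zip(seq, motif):
        score *= freqs[base]
    return score

# The "best" sequence is only modestly more likely than alternatives:
print(motif_score("AGT"))  # 0.45 * 0.85 * 0.70 ≈ 0.268
print(motif_score("GGT"))  # 0.25 * 0.85 * 0.70 ≈ 0.149
```

Even the consensus sequence here scores well under 0.3, which is why a tool scanning billions of positions for such motifs inevitably produces both false positives and false negatives.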

Layered on top of this is the vast amount of so-called junk DNA: remnants of inactive viruses, broken genes, and other genomic debris. Together, these features make eukaryotic genomes genuinely difficult to interpret, even with specialized tools that have been built specifically for the task.

Neural networks are naturally suited to extracting weak statistical signals from enormous datasets. The challenge was assembling training infrastructure capable of processing the sheer volume of sequence data required to make that approach work at genome scale.

How Evo 2 Was Built

At the core of Evo 2 is StripedHyena 2, a convolutional multi-hybrid architecture. Training proceeded in two stages. The first stage fed the model sequences in chunks of roughly 8,000 bases, focusing on local genomic features. The second stage extended the context window to one million bases at a time, allowing the model to learn large-scale structural patterns.
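The chunking idea behind those two stages can be sketched in a few lines. This is a minimal illustration of splitting a long sequence into fixed-size training windows; the actual Evo 2 data pipeline (tokenization, sampling, and how the context extension was implemented) is not described here.

```python
def chunk_sequence(seq: str, window: int) -> list[str]:
    """Split a long DNA sequence into fixed-size training windows.

    A minimal sketch of the idea only; real training pipelines also
    handle shuffling, overlap, and special tokens.
    """
    return [seq[i:i + window] for i in range(0, len(seq), window)]

genome = "ACGT" * 6_000            # 24,000-base toy sequence
stage1 = chunk_sequence(genome, 8_000)      # stage 1: ~8,000-base local context
stage2 = chunk_sequence(genome, 1_000_000)  # stage 2: up to 1M bases per window

print(len(stage1))  # 3 windows of 8,000 bases
print(len(stage2))  # 1 window (the toy genome fits in a single long context)
```

The point of the second stage is visible even in the toy example: at a one-million-base window, an entire region that stage one saw as three unrelated chunks becomes a single context the model can attend over.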

The training dataset, called OpenGenome2, contains 8.8 trillion bases drawn from bacteria, archaea, eukaryotes, and bacteriophages. The team deliberately excluded viruses that infect eukaryotes, citing concern that a model capable of designing novel eukaryotic-targeting viral sequences could be misused.

Two model versions were produced:

  • A 7 billion parameter model trained on 2.4 trillion bases
  • A 40 billion parameter model trained on the full OpenGenome2 dataset

What the Model Learned

After training, Evo 2 developed internal representations of genomic features that are notoriously difficult to identify computationally, including regulatory DNA regions and splice sites. These are elements that even purpose-built bioinformatics tools identify with meaningful error rates, which becomes a serious problem at the scale of a 3-billion-base genome.

The underlying logic is that evolutionary conservation does much of the work. If a sequence element matters functionally, it will appear across many species in many genomic contexts. A model trained on enough examples will begin to recognize those patterns, even without being explicitly told what to look for.
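The conservation signal itself is easy to illustrate, even though this is not how Evo 2 internally works: given the same element aligned across several species, positions that matter functionally tend to show the same base in most or all of them. The alignment below is invented for illustration.

```python
from collections import Counter

# Hypothetical aligned copies of the same short element from five species.
alignment = [
    "ATGCGT",
    "ATGCGA",
    "ATGCGT",
    "ATGAGT",
    "ATGCGT",
]

def conservation_scores(seqs: list[str]) -> list[float]:
    """Fraction of sequences sharing the most common base at each position."""
    n = len(seqs)
    return [
        Counter(column).most_common(1)[0][1] / n
        for column in zip(*seqs)
    ]

print(conservation_scores(alignment))
# Positions 1-3 and 5 are fully conserved (1.0); positions 4 and 6
# tolerate variation (0.8), hinting they matter less functionally.
```

A model trained on trillions of bases never sees an explicit alignment like this, but with enough examples the same statistical regularity is available for it to pick up implicitly.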

By releasing Evo 2 as open source, the team is making the 40 billion parameter model and its training dataset publicly available, giving researchers direct access to a tool trained at a scale that few institutions could replicate independently.


This article is a curated summary based on third-party sources.
