Tokenization to Transfer: Do Genomic Foundation Models Learn Good Representations?
ADIA Lab Research Paper Series
Authors:
Kirill Vishniakov, Karthik Viswanathan, Aleksandr Medvedev, Praveenkumar Kanithi, Marco AF Pimentel, Ronnie Rajan, Shadab Khan
Date Published: March 25, 2026
ADIA Lab, together with collaborators, presents a paper accepted to the ICLR 2026 main conference: Tokenization to Transfer: Do Genomic Foundation Models Learn Good Representations?
Our Health AI research lead, Shadab Khan, is the senior author of the study.
Genomic foundation models are meant to learn from large-scale DNA data and then serve as better starting points for new tasks, or as useful feature extractors. This paper tests that assumption directly.
Across 52 tasks and multiple evaluation settings, the study found that current pretraining methods often yielded only modest gains over identical models initialized with random weights.
The results also showed that performance depends strongly on how DNA sequences are tokenized, and that current models remain limited in their sensitivity to clinically relevant genetic variants.
Why this matters: if genomic foundation models are to become reliable tools for biology and medicine, the field may need to rethink how these models are built, trained, and evaluated.
