
Imageomics Foundation Model

photo of a bird next to an image of a taxonomy pyramid, next to a photo of a cluster of bird images

Project Members

Samuel Stevens, Jiaman Wu, Matthew J Thompson, Elizabeth G Campolongo, Chan Hee Song, David Edward Carlyn, Wasila M Dahdul, Charles Stewart, Tanya Berger-Wolf, Wei-Lun Chao, Yu Su

Project Goals

To develop a foundation model for image-based biology tasks. Foundation models are large-scale neural networks pre-trained on a large amount of data (e.g., millions of animal images), which can then be used in a wide range of downstream tasks. Potential applications include species categorization, trait segmentation, individual identification, video analysis, and more.

Project Overview

The team introduced a novel learning objective that integrates biological taxonomies into the training of new foundation models for biology, for use in future studies. We are conducting extrinsic evaluation on a wide range of tasks at the intersection of biology and machine learning, such as image classification, object detection, and segmentation. The foundation model team has focused on three main threads: incorporating biological taxonomy into foundation model training, new architectures for foundation models, and model interpretation.

We developed BioCLIP, a vision foundation model for the tree of life. Trained on over 10 million organism images (the TreeOfLife-10M dataset) with the tree of life taxonomy integrated in a novel way, BioCLIP has captured an internal representation that conforms to the tree of life, and proves to be a strong foundation model for a wide range of organisms. We are continuing to develop stronger foundation models for biology and exploring interesting applications to answer various biology questions.

Diagram of the process of using a foundation model to classify species from images, which is explained in more detail in the text.
(a) Two taxa, or taxonomic labels, for two different plants, Onoclea sensibilis (d) and Onoclea hintonii (e). These taxa are identical except for the species. (b) The autoregressive text encoder naturally encodes the hierarchical structure of the taxonomy. Note how the Order token(s) (Polypodiales) can incorporate information from the Kingdom, Phylum and Class tokens, but nothing later in the hierarchy. This helps align the visual representations to this same hierarchical structure. (c) These hierarchical representations of taxonomic labels are fed into the standard contrastive pre-training objective and are matched with image representations (d) and (e).
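
The two ideas in this caption, causal visibility over taxonomic ranks and a contrastive objective that matches images to taxonomic labels, can be illustrated with a small sketch. This is not the BioCLIP training code. It is a toy, pure-Python illustration under stated assumptions: the helper names (visible_ranks, shared_prefix_length, clip_style_loss), the two-dimensional embeddings, and the temperature value are all hypothetical, and the rank values other than the order, genus, and species named in the caption are filled in from standard plant taxonomy for illustration.

```python
import math

# Taxonomic ranks, ordered from most general to most specific.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

# Assumed full taxa for the two plants in the figure (only the order, genus,
# and species appear in the caption; the rest is illustrative).
SENSIBILIS = ["Plantae", "Tracheophyta", "Polypodiopsida", "Polypodiales",
              "Onocleaceae", "Onoclea", "sensibilis"]
HINTONII = SENSIBILIS[:-1] + ["hintonii"]


def visible_ranks(rank):
    """Ranks a causal (autoregressive) encoder can draw on at `rank`:
    the rank itself and everything before it, never anything after."""
    return RANKS[: RANKS.index(rank) + 1]


def shared_prefix_length(taxon_a, taxon_b):
    """Number of leading ranks on which two taxa agree."""
    count = 0
    for a, b in zip(taxon_a, taxon_b):
        if a != b:
            break
        count += 1
    return count


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def _cross_entropy(logits, target):
    """Cross-entropy of one row of logits against a target index."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: image i should match text i and vice
    versa. The temperature is an assumed, fixed value for illustration."""
    images = [_normalize(v) for v in image_embs]
    texts = [_normalize(v) for v in text_embs]
    n = len(images)
    logits = [[_dot(i, t) / temperature for t in texts] for i in images]
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    image_to_text = sum(_cross_entropy(logits[k], k) for k in range(n)) / n
    text_to_image = sum(_cross_entropy(columns[k], k) for k in range(n)) / n
    return (image_to_text + text_to_image) / 2


if __name__ == "__main__":
    print("Order token can see:", visible_ranks("order"))
    print("Shared ranks:", shared_prefix_length(SENSIBILIS, HINTONII))
    # Toy embeddings: matched image/text pairs versus swapped pairs.
    images = [[1.0, 0.0], [0.0, 1.0]]
    print("Loss, matched:", clip_style_loss(images, [[1.0, 0.0], [0.0, 1.0]]))
    print("Loss, swapped:", clip_style_loss(images, [[0.0, 1.0], [1.0, 0.0]]))
```

In this toy setup the loss is near zero when each image embedding points the same way as its own label embedding and grows large when the pairs are swapped. In a real model the label embeddings would come from the text encoder, so two taxa that share a long prefix of ranks would also share most of the context that encoder sees.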

We also developed a novel interpretable machine learning model named INterpretable TRansformer (INTR). By visualizing its attention maps in a way that is faithful to its predictions, we show that INTR can successfully identify and localize biologically meaningful traits for species recognition. We hope to continue expanding the use of INTR and to further develop the technology.

Related Publications

Samuel Stevens*, Jiaman Wu*, Matthew J Thompson, Elizabeth G Campolongo, Chan Hee Song, David Edward Carlyn, Li Dong, Wasila M Dahdul, Charles Stewart, Tanya Berger-Wolf, Wei-Lun Chao, Yu Su. BioCLIP: A Vision Foundation Model for the Tree of Life. arXiv preprint.

Dipanjyoti Paul, Arpita Chowdhury, Xinqi Xiong, Feng-Ju Chang, David Carlyn, Samuel Stevens, Kaiya Provost, Anuj Karpatne, Bryan Carstens, Daniel Rubenstein, Charles Stewart, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao. A Simple Interpretable Transformer for Fine-Grained Image Classification and Analysis. International Conference on Learning Representations, 2024.

Tools & Repositories

Overview of BioCLIP


BioCLIP (model and code)