Leveraging Large Language Models for Metagenomic Analysis

Model Overview: This model builds on the RoBERTa architecture with a similar approach to our paper titled "Leveraging Large Language Models for Metagenomic Analysis." The model was trained for one epoch on V100 GPUs.

Model Architecture:

  • Base Model: RoBERTa transformer architecture
  • Tokenizer: Custom K-mer Tokenizer with k-mer length of 6 and overlapping tokens
  • Training: Trained on a diverse dataset of 220 million 400bp fragments from 18k genomes (Bacteria and Archaea))

Steps to Use the Model:

  1. Install KmerTokenizer:

  2. pip install git+https://github.com/MsAlEhR/KmerTokenizer.git
    
  3. Example Code:

     from KmerTokenizer import KmerTokenizer
     from transformers import AutoModel
     import torch
     
     # Example gene sequence
     seq = "ATTTTTTTTTTTCCCCCCCCCCCGGGGGGGGATCGATGC"
     
     # Initialize the tokenizer
     tokenizer = KmerTokenizer(kmerlen=6, overlapping=True, maxlen=400)
     tokenized_output = tokenizer.kmer_tokenize(seq)
     pad_token_id = 2  # Set pad token ID
     
     # Create attention mask (1 for tokens, 0 for padding)
     attention_mask = torch.tensor([1 if token != pad_token_id else 0 for token in tokenized_output], dtype=torch.long).unsqueeze(0)
     
     # Convert tokenized output to LongTensor and add batch dimension
     inputs = torch.tensor([tokenized_output], dtype=torch.long)
     
     # Load the pre-trained BigBird model
     model = AutoModel.from_pretrained("MsAlEhR/MetaBerta-400-fragments-18k-genome", output_hidden_states=True)
     
     # Generate hidden states
     outputs = model(input_ids=inputs, attention_mask=attention_mask)
     
     # Get embeddings from the last hidden state
     embeddings = outputs.hidden_states[-1]  
     
     # Expand attention mask to match the embedding dimensions
     expanded_attention_mask = attention_mask.unsqueeze(-1) 
     
     # Compute mean sequence embeddings
     mean_sequence_embeddings = torch.sum(expanded_attention_mask * embeddings, dim=1) / torch.sum(expanded_attention_mask, dim=1)
    

Citation: For a detailed overview of leveraging large language models for metagenomic analysis, refer to our paper:

Refahi, M.S., Sokhansanj, B.A., & Rosen, G.L. (2023). Leveraging Large Language Models for Metagenomic Analysis. IEEE SPMB.

Refahi, M., Sokhansanj, B.A., Mell, J.C., Brown, J., Yoo, H., Hearne, G. and Rosen, G., 2024. Scorpio: Enhancing Embeddings to Improve Downstream Analysis of DNA sequences. bioRxiv, pp.2024-07.

Downloads last month
3
Inference API
Unable to determine this model's library. Check the docs .