How to read a FASTA file
2 min read · Dec 7, 2023
A Rust function that reads a FASTA file and returns a HashMap with sequence IDs (headers) as keys and DNA sequences as values.
Arguments
- file_path: a string slice that holds the path to the text file (in FASTA format)
Example
>Header_000
ACTG
>Header_001
AAAT
// Assume the text file above is named "sample_file.fa"
let data = read_fasta("sample_file.fa");
// data holds
// {"Header_001": "AAAT", "Header_000": "ACTG"}
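Note that the two headers in the example output are not in file order: a HashMap makes no promise about iteration order. If a stable order matters (for printing or for comparing results), one option is to sort the records by header after reading. The helper below is a minimal sketch; the name sorted_records is hypothetical and not part of the original function.

```rust
use std::collections::HashMap;

// Hypothetical helper: returns (header, sequence) pairs sorted by header,
// because HashMap iteration order is unspecified.
fn sorted_records(data: &HashMap<String, String>) -> Vec<(&str, &str)> {
    let mut records: Vec<(&str, &str)> = data
        .iter()
        .map(|(id, seq)| (id.as_str(), seq.as_str()))
        .collect();
    // Tuples compare by their first field first, and headers are unique keys,
    // so this orders the records by header.
    records.sort();
    records
}
```

Calling sorted_records(&data) on the map from the example should list Header_000 before Header_001.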
Code
use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader};

fn read_fasta(file_path: &str) -> HashMap<String, String> {
    let mut data = HashMap::new();
    let file = File::open(file_path).expect("Invalid filepath");
    let reader = BufReader::new(file);
    let mut seq_id = String::new();
    for line in reader.lines() {
        let line = line.unwrap();
        // Check if the line starts with '>' (indicating a sequence ID or header)
        if line.starts_with('>') {
            seq_id = line.trim_start_matches('>').to_string();
        } else {
            // If it's a DNA sequence line, insert or update the HashMap entry
            // If seq_id is not present, insert a new entry with an empty String
            // Then append the current line to the existing DNA sequence
            data.entry(seq_id.clone()).or_insert_with(String::new).push_str(&line);
        }
    }
    data
}
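The function above panics if the file cannot be opened or a line cannot be read. As a possible alternative, the sketch below keeps the same parsing logic but returns an io::Result and accepts any buffered reader, which also makes it easy to try on in-memory text. The name read_fasta_from is an assumption made for illustration, not part of the original post.

```rust
use std::collections::HashMap;
use std::io::{self, BufRead};

// Hypothetical variant of read_fasta: parses any buffered reader and hands
// read errors back to the caller instead of panicking.
fn read_fasta_from<R: BufRead>(reader: R) -> io::Result<HashMap<String, String>> {
    let mut data = HashMap::new();
    let mut seq_id = String::new();
    for line in reader.lines() {
        let line = line?; // propagate the I/O error, if any
        if line.starts_with('>') {
            // Header line: remember the ID for the sequence lines that follow
            seq_id = line.trim_start_matches('>').to_string();
        } else {
            // Sequence line: append it to the entry of the current header,
            // so sequences wrapped over several lines are joined
            data.entry(seq_id.clone())
                .or_insert_with(String::new)
                .push_str(&line);
        }
    }
    Ok(data)
}
```

To read from disk with this variant, you could pass BufReader::new(File::open(file_path)?) from a function that itself returns io::Result.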
Some notes on the code
- trim_start_matches('>') removes every leading > from the header line; the result is stored in seq_id
- BufRead looks unused, but reader.lines() is a method of that trait, so the trait must be in scope to create the line iterator
- push_str() takes a &str, so to append one String to another you pass a reference to it (&line); a String cannot be passed to push_str() by value
- data.entry() takes its key by value, so passing seq_id directly would move it and make it unusable on the next loop iteration; cloning it avoids that
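Three of the notes above can be checked in isolation. The small functions below are only a sketch for experimenting; their names (header_id, append_owned, add_two_lines) are hypothetical and do not appear in read_fasta.

```rust
use std::collections::HashMap;

// trim_start_matches strips every leading '>', not just the first one.
fn header_id(line: &str) -> String {
    line.trim_start_matches('>').to_string()
}

// push_str expects a &str, so an owned String is appended by borrowing it;
// &String coerces to &str automatically.
fn append_owned(mut seq: String, more: String) -> String {
    seq.push_str(&more);
    seq
}

// entry() takes its key by value. Cloning seq_id for the first call keeps
// the original usable for the second call.
fn add_two_lines(seq_id: String, first: &str, second: &str) -> HashMap<String, String> {
    let mut data: HashMap<String, String> = HashMap::new();
    data.entry(seq_id.clone()).or_default().push_str(first);
    data.entry(seq_id).or_default().push_str(second); // last use, so moving is fine
    data
}
```

Removing the .clone() in add_two_lines should make the compiler reject the second entry() call, because seq_id would already have been moved.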
Next Steps
- You can set up Rust in your JupyterLab environment with this article
- You can download the Jupyter notebook for this post and play with the code
- You can download the sample FASTA file too