ChatGPT for bioinformatics

Matthew Moore
8 min read · Feb 14, 2023


There has been a lot of excitement surrounding ChatGPT, OpenAI’s latest public release. The language model is also capable of generating code, so let’s investigate how well it does at generating solutions to bioinformatics problems.

First, what it does well. Anecdotally, I’ve had much better results with Python than with other languages, due to the wealth of examples the model has been trained on, the extensive bioinformatics packages available in Python and, possibly, the ease of the language. As far as I can tell, it performs better for interpreted languages than for compiled ones in general.

ChatGPT performs routine bioinformatics with ease

Human bioinformatician: “In python translate DNA into protein”

from Bio.Seq import Seq

# Define a DNA sequence
dna_seq = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Translate the DNA sequence into protein
protein_seq = dna_seq.translate()

# Print the protein sequence
print(protein_seq)
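
Worth noting: the example sequence ChatGPT picked contains stop codons, so the translated protein includes “*” characters. If you only want the translation up to the first stop codon, Biopython’s translate method takes a to_stop argument:

# Translate only up to the first stop codon
protein_to_first_stop = dna_seq.translate(to_stop=True)
print(protein_to_first_stop)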

Human bioinformatician: “Now BLAST the protein against a local database”

from Bio.Blast.Applications import NcbiblastpCommandline

# Define the path to the local BLAST database
db_path = "/path/to/local/blast/db"

# Define the path to the BLAST executable
blastp_path = "/path/to/blastp"

# Define the protein sequence to be queried
query_seq = ">query\n" + str(protein_seq)

# Define the BLAST command-line arguments
blastp_cline = NcbiblastpCommandline(
    cmd=blastp_path,
    query="-",
    db=db_path,
    evalue=0.001,
    outfmt=6,
    out="blast_results.txt"
)

# Execute the BLAST search
stdout, stderr = blastp_cline(stdin=query_seq)
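
If you would rather not depend on the Biopython command-line wrappers, roughly the same search can be run directly with subprocess. This is a sketch of my own (not ChatGPT output), reusing the placeholder paths above and assuming blastp reads the query from stdin when -query is “-”:

import subprocess

# Run blastp directly; the query is passed on stdin and tabular (outfmt 6) results go to blast_results.txt
result = subprocess.run(
    [blastp_path, "-query", "-", "-db", db_path,
     "-evalue", "0.001", "-outfmt", "6", "-out", "blast_results.txt"],
    input=query_seq, text=True, check=True
)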

Ok, it’s (they’re?) finding this a little too easy. Let’s try: “Now generate a network of all BLAST results, connecting components with >90% identity”

import networkx as nx

# Define the path to the BLAST output file
blast_results_file = "blast_results.txt"

# Define the identity threshold
identity_threshold = 90

# Create a new undirected graph
G = nx.Graph()

# Parse the BLAST results and add edges to the graph
with open(blast_results_file, "r") as f:
    for line in f:
        fields = line.strip().split("\t")
        query = fields[0]
        subject = fields[1]
        identity = float(fields[2])
        if identity >= identity_threshold:
            G.add_edge(query, subject)

# Print the number of nodes and edges in the graph
print(f"Nodes: {G.number_of_nodes()}, Edges: {G.number_of_edges()}")

Very impressive so far. For someone starting out, I think this would be a useful supplement to their work, and it’s quicker and easier than reading manuals or searching Stack Overflow. Whether it’s a good idea for bioinformaticians to rely on it for development, I don’t know, but then we don’t deprive ourselves of calculators or spell-checkers either.

Code explanations

Human bioinformatician: “Generate k-mers in python. Use argparse to specify the size of k and the input multi fasta file”

from Bio import SeqIO
import argparse

# Define the command line arguments
parser = argparse.ArgumentParser(description='Generate k-mers from a multi-fasta file')
parser.add_argument('-k', '--kmer-size', type=int, default=3, help='Size of k-mers')
parser.add_argument('-i', '--input-file', required=True, help='Input multi-fasta file')
args = parser.parse_args()

# Define the k-mer size and input file
kmer_size = args.kmer_size
input_file = args.input_file

# Open the input file and generate k-mers for each sequence
with open(input_file, 'r') as fasta_file:
    for record in SeqIO.parse(fasta_file, 'fasta'):
        sequence = str(record.seq)
        for i in range(len(sequence) - kmer_size + 1):
            kmer = sequence[i:i+kmer_size]
            print(kmer)
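
Assuming the script is saved as kmers.py (a file name I’m choosing for illustration), it would be run along the lines of:

python kmers.py -k 4 -i sequences.fasta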

Nicely done. The explanation of the code by chatGPT was also useful: “In this example, we first import the SeqIO class from the Biopython library and the argparse module. We then define two command line arguments using the argparse.ArgumentParser class: kmer-size to specify the size of k, and input-file to specify the input multi-fasta file. We then parse the command line arguments using argparse.parse_args(). We then define the kmer_size and input_file variables based on the command line arguments. We then open the input file using open() and generate k-mers for each sequence using a nested for loop. The outer loop iterates over each sequence in the input file, and the inner loop generates k-mers for each sequence using string slicing. We then print each k-mer to the console using print().”

Similarly, you can provide ChatGPT with your code and it will return an explanation. This function extracts flexible core sites from an alignment of bacterial genomes, excluding positions where there are gaps or ambiguous bases in >5% of the genomes:

'''
https://github.com/moorembioinfo/FlexCore
'''

def get_core(filename, percentcore, popsize):
    indexdict = {}
    chunklist = list(range(0, 2 * popsize, 2))

    for coord in chunklist:
        with open(filename) as filehandle:
            key = filehandle.readlines()[coord].rstrip()
            filehandle.seek(0)
            seq = list(filehandle.readlines()[coord + 1].rstrip())

        allpos = [i for i, val in enumerate(seq) if val in ("-", "N")]

        for pos in allpos:
            if pos in indexdict:
                indexdict[pos] = indexdict.get(pos) + 1
            else:
                indexdict[pos] = 1

        print(f"Sequence {key} added to core.")

    threshold = round((popsize / 100) * percentcore)
    cutoff = popsize - threshold
    thresholdgapindex = [int(key) for key in indexdict if indexdict.get(key) > cutoff]

    indexdict.clear()
    return thresholdgapindex

ChatGPT responded to this code with a clear conception of what it does and some suggestions for improvement:

“This code defines a function called get_core that takes three arguments: filename (a file name), percentcore (an integer representing the percentage of the core genome), and popsize (an integer representing the population size).

The function reads in the data from the file, which is assumed to be in a specific format (alternating lines of sequence IDs and sequences). It then identifies positions in the sequences that are gaps or Ns and tallies the frequency of these positions across all sequences. The function then determines a cutoff based on the percentage of the core genome specified in percentcore and the population size specified in popsize. Positions with a frequency above the cutoff are added to a list, which is then returned by the function.

One potential issue with this function is that it reads in the entire file for each iteration of the outer loop. This could be inefficient for very large files or if the function is called multiple times with the same file. It would be more efficient to read in the file once and then iterate over the lines using a loop or a generator.

Another potential issue is that the function assumes that the input file is in a specific format. If the file format changes, the function may not work correctly. It would be good to add some error handling to check that the file format is as expected, and to raise an error if it is not.

Overall, the function appears to be designed to identify the core genome of a population of sequences based on a certain percentage cutoff. With some improvements, it could be a useful tool for this purpose.”
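
To illustrate the first suggestion, reading the alignment once rather than re-reading it for every sequence might look something like this. A sketch of my own (not ChatGPT output), assuming the same alternating ID/sequence line format:

def get_core_single_pass(filename, percentcore, popsize):
    indexdict = {}
    with open(filename) as filehandle:
        lines = [line.rstrip() for line in filehandle]

    # Alternating ID/sequence lines, as in the original function
    for key, seq in zip(lines[0::2], lines[1::2]):
        for pos, val in enumerate(seq):
            if val in ("-", "N"):
                indexdict[pos] = indexdict.get(pos, 0) + 1
        print(f"Sequence {key} added to core.")

    # Same threshold logic as the original
    threshold = round((popsize / 100) * percentcore)
    cutoff = popsize - threshold
    return [pos for pos, count in indexdict.items() if count > cutoff]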

I really like this feature. This could definitely be useful to help articulate what your code does to other researchers.

Generating educational tools?

To quickly develop apps or figures for educational purposes, ChatGPT does an excellent job:

Human bioinformatician: “Write a shiny app to simulate population genetics, taking selection and population size as parameters”

The first ChatGPT response used the R package ‘popG’. As I didn’t have this installed and wanted a quick example, I said “Regenerate without using popG”. This returned an operational app, but it cut off after a few generations. Next I tried: “Write a shiny app that simulates population genetics according to selection, drift and population size” and this was the result:

Not bad at all. I would definitely use this to explain selection, drift and the impact of population size.

Next I wanted to change the code to plot multiple alleles at the same time. I got a variety of errors, including basic coding errors, with: “Write a shiny app that simulates population genetics according to selection, drift and population size for multiple alleles at once”, “Simulate allele frequencies over time shiny app, with multiple alleles”, “Simulate population genetics with multiple alleles in shiny”, “Shiny app that simulates population genetics for multiple alleles”, “Write R code with shiny that simulates population genetics for multiple alleles” and a number of other variations. I spent around 45 minutes trying different prompts, waiting for the ChatGPT response and testing the apps. This is definitely a major downside, because unlike coding, and even debugging, this was not fun.

Conclusions, for now…

ChatGPT can definitely supplement your coding, but I would not spend much time tediously rewording requests in the hope of eventual success. It’s great for most small coding tasks, but so are Google and Stack Overflow.

Update: building a workflow for microbial genomics

I asked ChatGPT to “Build an end to end bacterial genome analysis workflow in python” and this was the response:

“Building an end-to-end bacterial genome analysis workflow in Python involves multiple steps and can be quite complex, but here is an overview of the main steps involved:

  1. Data preprocessing: This involves quality control of raw sequencing data, trimming of adapter sequences, and filtering of low-quality reads. This can be done using tools such as Trimmomatic or FastQC.
  2. Genome assembly: This involves assembling the high-quality reads into contigs or scaffolds that represent the bacterial genome. This can be done using tools such as SPAdes or Velvet.
  3. Genome annotation: This involves identifying and annotating the genes, coding regions, and functional elements in the assembled genome. This can be done using tools such as Prokka or RAST.
  4. Comparative genomics: This involves comparing the annotated genome to other related genomes to identify differences and similarities, and to infer evolutionary relationships. This can be done using tools such as Roary or PhyloPhlAn.
  5. Visualization: This involves visualizing the results of the analysis in a meaningful way. This can be done using tools such as Circos or Gephi.

Here are some Python packages that can be used for each step:

  1. Data preprocessing: FastQC, Trimmomatic, Cutadapt, BBDuk.
  2. Genome assembly: SPAdes, Velvet, Canu.
  3. Genome annotation: Prokka, RAST, PATRIC.
  4. Comparative genomics: Roary, PhyloPhlAn, OrthoFinder.
  5. Visualization: Circos, Gephi, Biopython, Matplotlib.

Overall, building an end-to-end bacterial genome analysis workflow in Python requires a strong understanding of both bioinformatics and Python programming. There are many resources available online to help you learn the necessary skills and tools, such as online courses, tutorials, and documentation.”

I skipped step 5 and changed step 4 to Panaroo, asking “In python quality filter bacterial genomes with trimmomatic, assembly them with spades, annotate them with prokka and generate a pangenome with panaroo”. I followed up by asking ChatGPT to also “Update to iterate over a directory of fastq files and pass each through the workflow” and finally “Add an argparse option for fastqdir and for number of threads, an integer to pass to the threads option of each software”, yielding this:

import os
import subprocess
import argparse

# Parse command-line arguments
parser = argparse.ArgumentParser(description='Bacterial genome analysis workflow')
parser.add_argument('fastq_dir', type=str, help='Path to directory containing input FASTQ files')
parser.add_argument('-t', '--num_threads', type=int, default=4, help='Number of threads for Trimmomatic, SPAdes, and Panaroo (default: 4)')
args = parser.parse_args()

# Loop through all FASTQ files in the directory
for fastq_file in os.listdir(args.fastq_dir):
    if fastq_file.endswith('.fastq.gz'):
        # Quality filter reads with Trimmomatic
        sample_name = fastq_file.split('.')[0]
        trimmomatic_output_prefix = os.path.join(args.fastq_dir, sample_name)
        subprocess.run(['trimmomatic', 'PE', os.path.join(args.fastq_dir, fastq_file), trimmomatic_output_prefix + '_R1.trimmed.fastq.gz', trimmomatic_output_prefix + '_R1.unpaired.fastq.gz', trimmomatic_output_prefix + '_R2.trimmed.fastq.gz', trimmomatic_output_prefix + '_R2.unpaired.fastq.gz', 'SLIDINGWINDOW:4:20', 'MINLEN:50', '-threads', str(args.num_threads)])

        # Assemble genome with SPAdes
        spades_output_dir = os.path.join(args.fastq_dir, sample_name + '_assembly_output')
        subprocess.run(['spades.py', '--careful', '--only-assembler', '-t', str(args.num_threads), '-o', spades_output_dir, '-1', trimmomatic_output_prefix + '_R1.trimmed.fastq.gz', '-2', trimmomatic_output_prefix + '_R2.trimmed.fastq.gz'])

        # Annotate genome with Prokka
        prokka_output_dir = os.path.join(args.fastq_dir, sample_name + '_annotation_output')
        subprocess.run(['prokka', '--outdir', prokka_output_dir, '--prefix', sample_name, os.path.join(spades_output_dir, 'contigs.fasta')])

        # Generate pangenome with Panaroo
        panaroo_output_dir = os.path.join(args.fastq_dir, sample_name + '_panaroo_output')
        subprocess.run(['panaroo', '-i', prokka_output_dir, '-o', panaroo_output_dir, '-t', str(args.num_threads)])

Certainly, when I was starting in bioinformatics this would have taken me a lot longer than the handful of minutes it took with ChatGPT! Lastly, I have not tested this, but you can also ask ChatGPT to generate a conda environment file to install all the dependencies for this workflow. I doubt it checks whether they can actually be installed with conda, but it’s worth a try!


Matthew Moore

Researcher in pathogen genomics, science enthusiast, amateur social historian