What is it like to work with data in the field of bioinformatics? (English)

Jéssica Costa
A Garota do TI
Mar 2, 2023

One of the first questions that comes to mind when someone asks me what to study in order to work with data is which area or business they will work in. Understanding where the techniques will be applied is as important as knowing the techniques themselves. Each area has its own characteristics and difficulties, and it is important to be aware of them. I’m currently doing master’s research applied to bioinformatics, and I’m going to talk a little about what it’s like to work in this field.

To begin with, you need to study biology, especially molecular biology. Concepts such as gene, genome, DNA, RNA, protein and molecule will be very present in your readings. Especially for those who come from Computing, it can be difficult to understand so much biology, and at first it seems that you cannot absorb it all. But once you define your problem, it becomes easier to focus on what needs to be learned. The field is very broad, and it is not possible to learn everything.

Working with data in bioinformatics means having a multitude of publicly available data, from broad data covering many subjects to data specific to a single organism. Databases such as NCBI, Ensembl, UniProt and PDB are widely used. The data comes from experiments, from text mining, from other databases and from computational analyses. Given this diversity, when starting an analysis, always pay attention to the origin of the data. Has it been curated? Is there a publication that describes the data and how it was generated? Was it produced by an algorithm? Your results will depend a lot on the source.

Some databases are very user-friendly, and some even provide APIs to make searching and extracting data easier. But in many cases usability doesn’t help much. Finding the data is not trivial, and you may need to run several searches within the same database. Many databases make their data available via FTP, and you will need to navigate through the folders to extract what you need.

Another characteristic of the field is that the databases are not linked to each other, so data integration can be very laborious. It is often possible to use an intermediate database to integrate two others. Most of the time the data is also not stored in a relational model. There are several formats, depending on the subject. Since my work deals with the primary structure of proteins, text formats are always present, including the famous FASTA.
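
To give an idea of what a text format like FASTA looks like in practice, here is a minimal sketch of a parser in plain Python. It assumes the usual layout: a header line starting with `>` followed by one or more lines of sequence. The function name `parse_fasta` and the example records are made up for illustration; in real projects a library such as Biopython (`Bio.SeqIO`) is the more common choice.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record identifier to its full sequence.

    Assumption: the identifier is the first token after '>' in the header.
    """
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Header line: keep only the first token as the identifier
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequences are often wrapped over several lines, so join them
            records[current_id] += line
    return records


# Hypothetical records, invented for this example
example = """>seq1 hypothetical protein A
MKTAYIAKQR
QISFVKSHFS
>seq2 hypothetical protein B
MADEEKLPPG
"""

sequences = parse_fasta(example.splitlines())
for seq_id, seq in sequences.items():
    print(seq_id, len(seq))
```

Because the function accepts any iterable of lines, the same sketch can be fed an open file handle instead of a string.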

Still on the subject of text formats, running scripts is quite common, especially to generate new data. Languages such as Perl and Python are therefore very present in bioinformatics. That doesn’t mean you will only use them, but they are quite frequent. These scripts are often run from the command line, and a Linux distribution will make this job much easier. Nowadays, with Jupyter, Google Colab, Kaggle and cloud environments, it is easier to work.

As already mentioned, the data comes from several sources, so there may be missing data, outliers or other kinds of errors. Set aside time for exploratory analysis and pre-processing. Imbalanced datasets are very frequent, so this problem will probably appear when you apply Machine Learning models, and you will need some technique to balance the classes, since in most cases it is not easy to obtain more real data.
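
One of the simplest balancing techniques is random oversampling, which repeats examples from the smaller classes until every class is as large as the biggest one. The sketch below uses only the Python standard library. The function name `random_oversample` and the toy data are assumptions made for illustration, and this is just one option among several (undersampling and class weights are others).

```python
import random
from collections import Counter
from typing import Any, Dict, List, Sequence, Tuple


def random_oversample(
    features: Sequence[Any], labels: Sequence[Any], seed: int = 0
) -> Tuple[List[Any], List[Any]]:
    """Duplicate randomly chosen minority examples until classes are balanced."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible

    # Group the examples by class label
    by_class: Dict[Any, List[Any]] = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)

    # Every class is grown to the size of the largest one
    target = max(len(items) for items in by_class.values())

    out_x: List[Any] = []
    out_y: List[Any] = []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for x in items + extra:
            out_x.append(x)
            out_y.append(label)
    return out_x, out_y


# Toy data, invented for this example: four negatives and one positive
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = ["negative", "negative", "negative", "negative", "positive"]

X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

If you try something like this, apply it only to the training split, so that repeated examples do not leak into the data used for evaluation.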

As for Machine Learning, it should be noted that algorithm results need biological validation to be considered correct. Many of the results are therefore suggestive and serve mainly to help biologists in their validation experiments. A very interesting application is that predictions can help define what will be validated, reducing the number of tests and consequently the time spent. Imagine having to test every possible hypothesis!
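
A very small sketch of that idea: given a score per candidate from some model, keep only the highest-scoring ones for validation in the lab. The candidate names, the scores and the `top_candidates` helper are all hypothetical, and in practice the cut-off would be decided together with the biologists.

```python
from typing import Dict, List, Tuple


def top_candidates(scores: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Return the k candidates with the highest predicted score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical model scores, invented for this example
predicted = {
    "candidate_A": 0.91,
    "candidate_B": 0.15,
    "candidate_C": 0.78,
    "candidate_D": 0.42,
}

shortlist = top_candidates(predicted, k=2)
for name, score in shortlist:
    print(name, score)
```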

There are many more tips, and I could share much more. But it is important to highlight that techniques only make sense in the context of a problem. You can study statistics, machine learning algorithms, data visualization, pre-processing techniques and exploratory analysis, but deciding what to apply will depend on the area where it will be applied, because the way something is analyzed in one problem can be totally different in another. The field of bioinformatics is huge, fascinating and has great potential. It takes time, a lot of research and focus, and there is no ready-made formula.

Jéssica Costa
A Garota do TI

Master in Computer Science, Google Developer Expert (GDE) in Machine Learning and Data Scientist