I analyzed a breast cancer samples dataset and here’s what I found

Jan 6, 2024

It’s great that scientists found a way to read the human genome, but the problem is that the data they collect is extremely large and complex. How are they even able to study this data? How can they know they have come to an accurate conclusion? These and other questions drove me to begin my journey in bioinformatics, a field that combines biology, computer science, and statistics to analyze and interpret biological data, particularly large datasets such as those generated by genomics. I wanted to understand clearly how researchers navigate this information and what impact it could have.

Gene expression data analysis:
Is gene expression data even useful in biological research?

Now, we know that a gene is a stretch of DNA that codes for a specific protein. Before the protein can be made, the gene must first be transcribed into messenger RNA (mRNA). Gene expression data is the information that shows which genes are actively producing mRNA in a particular cell, tissue, or organism at a specific point in time. It is crucial for understanding the mechanisms underlying normal development, disease progression, and responses to environmental stimuli.

This kind of information can be obtained by three different methods: microarray technology, quantitative PCR, and RNA sequencing (RNA-seq), which is the one we will focus on. RNA-seq involves sequencing the entire transcriptome of a sample, providing a comprehensive and quantitative view of gene expression. The transcriptome is the complete set of RNA molecules in a sample, including messenger RNA (mRNA).

RNA-seq data is obtained through a number of lab steps, the final one being sequencing with a next-generation sequencing (NGS) method such as Illumina sequencing. The resulting data must then be analyzed to understand gene expression, identify differentially expressed genes, and gain insights into biological processes and regulatory mechanisms. This is the task I took on.

Computer Science X Biology

I used the programming language R to manipulate and visualize the data, specifically the GSE183947 dataset, which contains RNA sequencing (RNA-seq) data from breast cancer samples, including normal, primary, and metastatic tumor samples.

The first step was to download the data from the NCBI Gene Expression Omnibus (GEO) website, a public repository of high-throughput gene expression data and other functional genomics datasets, where I also found a more detailed description of the dataset.

I selected the http option under download, and the file came in .gz format. I converted it to a CSV file so I could view the data in tabular form, with rows representing different genes and columns representing different samples. The first column contains the gene names, and the remaining columns contain the gene expression levels in the different samples. The expression levels are measured in FPKM (Fragments Per Kilobase of transcript per Million mapped reads) and TPM (Transcripts Per Million) units, which are normalized measures of gene expression.
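As a rough sketch of this step: R can read a gzip-compressed CSV directly, so the manual conversion is optional. The example below writes a tiny made-up table to a temporary .gz file and reads it back. The gene names, sample names, and values are invented for illustration and are not taken from GSE183947.

```r
# Toy expression table: first column holds gene names, remaining columns hold samples.
# All values here are invented for illustration.
toy <- data.frame(
  gene = c("BRCA1", "BRCA2", "TP53"),
  sample_1 = c(10.2, 3.1, 25.0),
  sample_2 = c(8.7, 2.9, 30.4)
)

# Write the table to a temporary gzip-compressed CSV, standing in for the GEO download.
path <- tempfile(fileext = ".csv.gz")
con <- gzfile(path, "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() decompresses .gz files transparently, so no manual conversion is needed.
expr <- read.csv(path)

dim(expr)   # number of genes, then number of columns (gene name + samples)
head(expr)
```

With the real download, only the `read.csv()` call would be needed, pointed at wherever the .gz file was saved.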

Gene expression data manipulation

Just having the data isn’t enough for scientists to understand it and come to a conclusion. With the dataset I used, I would not be able to tell which gene has the highest expression values in cancerous tissue samples and which doesn’t, since the raw table is so broad.

So, I had to manipulate the data in R, taking on tasks such as loading the data, performing quality control (basically removing samples with abnormal expression patterns), normalizing the data, and conducting statistical analyses to identify genes that are differentially expressed between the different sample types.

Normalizing the data refers to the process of adjusting for differences in library size and composition between samples. This was an important step in my manipulation because raw gene expression counts can be influenced by factors such as the total number of reads obtained for each sample, which can vary for technical reasons rather than biological ones. Normalization makes the gene expression values comparable across samples, which allows more accurate comparisons of expression levels between them.
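To make the idea concrete, here is a minimal sketch of how FPKM and TPM can be computed from raw counts using their standard definitions. The counts and gene lengths are toy values chosen so the arithmetic is easy to follow, and `sample_2` is simply `sample_1` sequenced twice as deeply. This illustrates the concept and is not the pipeline that produced the GSE183947 values.

```r
# Toy raw counts (genes x samples) and gene lengths in base pairs; all values invented.
counts <- matrix(
  c(100, 200,
    300, 600,
    600, 1200),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("sample_1", "sample_2"))
)
gene_length_bp <- c(geneA = 1000, geneB = 2000, geneC = 3000)

# FPKM: scale by library size (per million fragments), then by gene length (per kilobase).
library_size_millions <- colSums(counts) / 1e6
fpkm <- sweep(counts, 2, library_size_millions, "/") / (gene_length_bp / 1e3)

# TPM: scale by gene length first, then rescale each sample so it sums to one million.
rate <- counts / (gene_length_bp / 1e3)
tpm <- sweep(rate, 2, colSums(rate), "/") * 1e6

round(fpkm)
round(tpm)
colSums(tpm)   # each sample sums to 1e6 by construction
```

Although `sample_2` has twice as many reads, both samples end up with identical normalized values, which is exactly the technical difference normalization is meant to remove.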

Since I was working with a very large dataset, I relied on three R packages: GEOquery, tidyverse, and dplyr (which is itself part of the tidyverse).

I used the GEOquery package to retrieve gene expression data from the Gene Expression Omnibus (GEO) database. Specifically, I wanted the metadata, which included the supplementary file, donor, metastasis (the spread of cancer cells from the primary tumor site to other parts of the body through the lymphatic system or bloodstream), and tissue, for accurate interpretation and meaningful analysis.
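A sketch of what this retrieval and cleanup might look like is below. The helper `fetch_metadata()` is hypothetical and is only defined, not called, because it needs the Bioconductor packages GEOquery and Biobase plus network access. The toy metadata that follows uses assumed column names and values in the style GEO typically provides; they are not copied from the real series.

```r
library(dplyr)

# Hypothetical helper: downloads the series and returns the sample metadata table.
# It needs GEOquery, Biobase, and a network connection, so it is not called here.
fetch_metadata <- function(accession = "GSE183947") {
  gse <- GEOquery::getGEO(GEO = accession, GSEMatrix = TRUE)
  Biobase::pData(gse[[1]])
}

# Toy stand-in for the metadata; column names and values are assumed, not copied from GEO.
metadata <- data.frame(
  title = c("tumor rep1", "normal rep1"),
  characteristics_ch1 = c("tissue: breast tumor", "tissue: normal breast tissue"),
  characteristics_ch1.1 = c("metastasis: yes", "metastasis: no")
)

# Keep the columns of interest, give them readable names, and strip the "key: " prefixes.
metadata_clean <- metadata %>%
  select(title, tissue = characteristics_ch1, metastasis = characteristics_ch1.1) %>%
  mutate(
    tissue = gsub("tissue: ", "", tissue),
    metastasis = gsub("metastasis: ", "", metastasis)
  )

metadata_clean
```

On the real data, `metadata <- fetch_metadata()` would replace the toy table, and the column names in `select()` would need to be checked against what GEO actually returns.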

Once the data was obtained using GEOquery, I used the dplyr and tidyverse packages for data manipulation, cleaning, and transformation, as well as for downstream analyses such as differential expression analysis, gene enrichment analysis, and data visualization. In other words: identifying genes that behave differently between groups, understanding the biological functions associated with a group of genes, and presenting data visually to make it easier to understand and interpret.
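The kind of manipulation described here can be sketched with a toy example: reshape the wide expression table into long format, attach the sample metadata, and summarise expression per group. The table contents, sample names, and the `fpkm` column name are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Toy wide table (one row per gene, one column per sample); values invented.
expr_wide <- data.frame(
  gene = c("BRCA1", "BRCA2"),
  s1 = c(10, 4),
  s2 = c(14, 6),
  s3 = c(2, 1)
)

# Toy metadata describing each sample.
metadata <- data.frame(
  sample = c("s1", "s2", "s3"),
  tissue = c("tumor", "tumor", "normal")
)

# Reshape to long format (one row per gene-sample pair) and attach the metadata.
expr_long <- expr_wide %>%
  pivot_longer(cols = -gene, names_to = "sample", values_to = "fpkm") %>%
  left_join(metadata, by = "sample")

# Summarise expression per gene and tissue type.
summary_tbl <- expr_long %>%
  group_by(gene, tissue) %>%
  summarise(mean_fpkm = mean(fpkm), n_samples = n(), .groups = "drop") %>%
  arrange(gene, tissue)

summary_tbl
```

The long format is worth the extra step because ggplot2 and grouped dplyr summaries both expect one observation per row.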

Data manipulation was an important step because it affected the accuracy and reliability of the analysis. By cleaning and transforming the data, I could make sure the results of the analysis were valid and meaningful. It also helped me identify patterns and trends that were not apparent in the raw data, and it let me communicate my findings more effectively by presenting the data in a clear and concise manner.

Data Visualization with ggplot2

Understanding the behavior of cancer cells and the impact of genetic mutations on gene expression can provide valuable insights into tumor development and progression. To get at these insights, I used ggplot2 to visualize the gene expression data. ggplot2 is a data visualization package, part of the tidyverse collection in R, that allows users to create a wide range of plots.

With it I was able to create visually appealing, easy-to-interpret plots that accurately represent the data and reveal patterns and trends that were not apparent in its raw form.

I covered a number of plots, from bar plots to scatter plots, so let’s get into each one and understand its significance.

1. Bar plot
Bar plots are a type of graph that presents categorical data with rectangular bars, where the length of each bar is proportional to the value it represents. In the context of the gene expression data I was working with, the bar plots were used to visualize the FPKM values of specific genes.
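A minimal sketch of such a bar plot, assuming a long-format table with one FPKM value per sample for a single gene. The sample names and values are invented.

```r
library(ggplot2)

# Toy FPKM values for one gene across four samples; values invented.
brca1 <- data.frame(
  sample = c("s1", "s2", "s3", "s4"),
  tissue = c("tumor", "tumor", "normal", "normal"),
  fpkm = c(12.5, 9.8, 3.1, 2.7)
)

# One bar per sample, bar height = FPKM, fill colour = tissue type.
bar_plot <- ggplot(brca1, aes(x = sample, y = fpkm, fill = tissue)) +
  geom_col() +
  labs(title = "BRCA1 expression per sample", x = "Sample", y = "FPKM")

# In an interactive session, run print(bar_plot) to draw the figure.
```

`geom_col()` is used rather than `geom_bar()` because the bar heights are already in the data and do not need to be counted.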

2. Density plot
A density plot is a data visualization tool that is used to display the distribution of a continuous variable. The y-axis represents the probability density of the variable, and the area under the curve is normalized to be equal to 1. I used the density plot to visualize the distribution of expression levels for a particular gene across different samples. This was helpful for identifying genes that are differentially expressed, as well as understanding the overall distribution of gene expression within the dataset.
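A density plot along these lines could be sketched as follows. The expression values are simulated with `rnorm()` purely for illustration, so the separation between the two groups is built in and is not a result from the dataset.

```r
library(ggplot2)

set.seed(1)  # reproducible toy data
# Simulated FPKM values for one gene in two tissue types; not real measurements.
gene_df <- data.frame(
  tissue = rep(c("tumor", "normal"), each = 30),
  fpkm = c(rnorm(30, mean = 12, sd = 2), rnorm(30, mean = 6, sd = 2))
)

# One density curve per tissue type; alpha makes overlapping curves visible.
density_plot <- ggplot(gene_df, aes(x = fpkm, fill = tissue)) +
  geom_density(alpha = 0.4) +
  labs(title = "Distribution of expression for one gene", x = "FPKM", y = "Density")

# The area under a density curve is approximately 1; checked here for the tumor samples.
d <- density(gene_df$fpkm[gene_df$tissue == "tumor"])
area <- sum(d$y) * diff(d$x[1:2])
area

# In an interactive session, run print(density_plot) to draw the figure.
```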

3. Boxplot
A box plot is a type of chart that summarizes a group of numerical values using their quartiles, the cut points that divide the sorted data into four equal parts. I used box plots to visualize the distribution of a continuous variable, including the whiskers that show the range of the data, excluding outliers. Outliers are data points that fall below the first quartile minus 1.5 times the interquartile range (IQR) or above the third quartile plus 1.5 times the IQR. Box plots enabled me to compare the distributions of gene expression between different groups, such as samples with and without metastasis.
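The 1.5 × IQR rule can be worked through on a small example. In this sketch the values are invented and one of them (40) is deliberately extreme, so the fences and the flagged outlier can be checked by hand.

```r
library(ggplot2)

# Toy FPKM values for one gene; the value 40 is deliberately extreme. All values invented.
box_df <- data.frame(
  metastasis = rep(c("yes", "no"), each = 6),
  fpkm = c(10, 11, 12, 13, 14, 40, 4, 5, 6, 7, 8, 9)
)

# Outlier fences for the "yes" group: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR.
yes_values <- box_df$fpkm[box_df$metastasis == "yes"]
q <- quantile(yes_values, c(0.25, 0.75))
fences <- c(
  lower = q[[1]] - 1.5 * IQR(yes_values),
  upper = q[[2]] + 1.5 * IQR(yes_values)
)
outliers <- yes_values[yes_values < fences["lower"] | yes_values > fences["upper"]]

# One box per group; ggplot2 draws points beyond the fences individually.
box_plot <- ggplot(box_df, aes(x = metastasis, y = fpkm)) +
  geom_boxplot() +
  labs(title = "Expression by metastasis status", x = "Metastasis", y = "FPKM")

fences     # lower = 7.5, upper = 17.5
outliers   # 40

# In an interactive session, run print(box_plot) to draw the figure.
```

Here Q1 is 11.25 and Q3 is 13.75, so the IQR is 2.5 and anything outside 7.5 to 17.5 is drawn as a separate point.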

4. Scatter plot
A scatter plot is a type of data visualization that is used to display the relationship between two continuous variables. In a scatter plot, each data point represents the values of the two variables, and the position of the data point on the graph is determined by its x and y values, which in this case were the expression levels of two genes. This let me visualize the pattern, direction, and strength of the relationship between the two genes, BRCA1 and BRCA2.
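A scatter plot of this kind might be sketched as below, assuming a wide table with one column per gene and one row per sample. The values are simulated, and the positive relationship is built into the toy data, so it should not be read as a finding about BRCA1 and BRCA2 in GSE183947.

```r
library(ggplot2)

set.seed(42)  # reproducible toy data
# Simulated FPKM values for two genes across 40 samples; the positive
# relationship is built in for illustration only.
brca1 <- runif(40, min = 2, max = 20)
scatter_df <- data.frame(
  BRCA1 = brca1,
  BRCA2 = 0.5 * brca1 + rnorm(40, sd = 1)
)

# Each point is one sample; the fitted line summarises the direction of the trend.
scatter_plot <- ggplot(scatter_df, aes(x = BRCA1, y = BRCA2)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x) +
  labs(title = "BRCA1 vs BRCA2 expression", x = "BRCA1 (FPKM)", y = "BRCA2 (FPKM)")

# Pearson correlation as a numeric summary of the strength of the relationship.
r <- cor(scatter_df$BRCA1, scatter_df$BRCA2)
r

# In an interactive session, run print(scatter_plot) to draw the figure.
```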

So what though?

Gene expression data analysis, manipulation, and visualization are essential in biological research for understanding disease mechanisms, biomarker discovery, drug development, personalized medicine, and biological pathway analysis. Researchers can use these skills to analyze and interpret gene expression data, identify differentially expressed genes, and understand the underlying biological processes. By analyzing the data critically, they can gain insights into the molecular basis of diseases, identify potential therapeutic targets, and advance personalized medicine.

In conclusion, this journey into bioinformatics has not only shown me which genes in these samples have a direct correlation to breast cancer but also the great importance of being able to work with this data. It leaves me wondering how much more we could achieve with this field.


Written by Muhwezi Emily Karen