Cancer Genomics Research — Resources and Databases

Nov 29, 2023

One of the many fascinating things I’ve learnt is that genomics research is one of the largest global contributors to big data. The picture below says it all. I mean, social media data has nothing on genomics.

The extent of genomic data by Tyrone Chen

In cancer research, researchers and clinicians depend on many databases to make sense of their data and generate insights. These resources help identify genetic variants that may be associated with risk, patient outcomes, treatment efficacy, and disease progression.

The first cancer genome was sequenced in 2008, and the data keeps growing. As a cancer genomics researcher myself, I’ve been overwhelmed by the number of platforms available for gaining insights. So, I figured I’d make a brief list of the cancer genome databases I use for bioinformatics analyses.

Cancer Bioinformatics Resources I’ve Used

Here’s what I know about my research as a first-year PhD student. I’ll be using computational methods–programming, data analysis, statistics, and whatever else arises–to analyse cancer genome data generated from long-read sequencing technology. That means the typical variant calling analysis and more. Some resources I’ve used so far:

1. The Catalogue of Somatic Mutations in Cancer (COSMIC)

COSMIC is a comprehensive database of somatic mutations in human cancer, that is, mutations acquired during a person’s lifetime rather than inherited. It contains data manually curated by experts and data integrated from other resources, like The Cancer Genome Atlas (TCGA). The data is free for academic use as long as you register with an academic email address; commercial use requires a licence.

COSMIC provides information on the frequency, location, and functional impact of somatic mutations across various cancer types. It has about 10 embedded tools that allow users to explore the available data. I’ve used the Cancer Gene Census (CGC) for my research. I downloaded a list of genes and their associated roles in cancer, which I used to annotate the genome data from my analysis. It is a valuable resource for giving context to a collection of supposedly interesting chromosomal regions.
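For anyone curious what that annotation step can look like, here is a minimal sketch in Python. It is not my actual pipeline: the two column names ("Gene Symbol" and "Role in Cancer") are assumptions about the CGC export, the CSV content is a tiny stand-in, and the regions and the gene GENE_X are made up for illustration.

```python
import csv
import io

# Tiny stand-in for a Cancer Gene Census export. The real file has many more
# columns; the two column names used here are assumptions.
CGC_CSV = """Gene Symbol,Role in Cancer
TP53,"oncogene, TSG"
KRAS,oncogene
BRCA1,TSG
"""


def load_cgc_roles(handle):
    """Map gene symbol -> role in cancer from a CGC-style CSV."""
    reader = csv.DictReader(handle)
    return {row["Gene Symbol"]: row["Role in Cancer"] for row in reader}


def annotate_regions(regions, roles):
    """Attach a CGC role to each region; None means the gene is not listed."""
    return [dict(region, cgc_role=roles.get(region["gene"])) for region in regions]


roles = load_cgc_roles(io.StringIO(CGC_CSV))

# Hypothetical regions of interest, already labelled with the gene they overlap.
regions = [
    {"chrom": "chr17", "start": 7_668_000, "end": 7_688_000, "gene": "TP53"},
    {"chrom": "chr1", "start": 1_000_000, "end": 1_010_000, "gene": "GENE_X"},
]

annotated = annotate_regions(regions, roles)
for region in annotated:
    print(region["gene"], region["cgc_role"])
```

With a real download, you would swap the in-memory CSV for an open file handle and check the header row first, since the column names may differ between releases.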

2. Integrative Genomics Viewer (IGV)

IGV is an interactive visualisation tool developed at the Broad Institute to explore large, integrated genomic datasets, such as those produced by TCGA. I’ll talk about TCGA in a minute, as it keeps appearing. IGV has a web version, which is great for general accessibility, and a desktop version, which I use.

It works with most bioinformatics file formats, like BAM, VCF, and BED files. IGV gives users a visual representation of their data. For instance, after calling variants from my cancer genome data, I’ll have a VCF file with a bunch of text telling me which locations in my genome differ from the reference genome. If I load this file into IGV, along with the alignments it was called from, I can zoom in on specific chromosome locations and confirm the presence of said variant.

It’s especially great if you’re working with new tools and you’re not sure how well they perform. The phrase “pictures don’t lie” isn’t true anymore, thanks to AI, but in this case, pictures really don’t lie. If there’s a variant, you’ll see it. For slightly advanced tasks, it takes reading the user guide and playing around with the menus to get the image you want.
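If you have more than a handful of variants to check, clicking through them one by one gets old. Here is a rough Python sketch that turns VCF records into an IGV batch script, which the desktop version can run from Tools > Run Batch Script. The batch commands (new, genome, load, snapshotDirectory, goto, snapshot, exit) are IGV's own; the VCF content, file names, genome ID, and the 50 bp flank are placeholders I made up.

```python
# Minimal VCF content for illustration; positions and file names are made up.
VCF_TEXT = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr7\t55191822\t.\tT\tG\t60\tPASS\t.\n"
    "chr17\t7674220\t.\tC\tT\t55\tPASS\t.\n"
)


def variant_loci(vcf_lines, flank=50):
    """Return an IGV locus string (chrom:start-end) around each VCF record."""
    loci = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos = line.split("\t")[:2]
        start = max(1, int(pos) - flank)
        end = int(pos) + flank
        loci.append(f"{chrom}:{start}-{end}")
    return loci


def igv_batch_script(loci, genome, tracks, snapshot_dir):
    """Build an IGV batch script that saves one snapshot per locus."""
    commands = ["new", f"genome {genome}"]
    commands += [f"load {track}" for track in tracks]
    commands.append(f"snapshotDirectory {snapshot_dir}")
    for locus in loci:
        name = locus.replace(":", "_").replace("-", "_")
        commands += [f"goto {locus}", f"snapshot {name}.png"]
    commands.append("exit")
    return "\n".join(commands)


loci = variant_loci(VCF_TEXT.splitlines())
script = igv_batch_script(
    loci,
    genome="hg38",
    tracks=["sample.bam", "sample.vcf"],  # hypothetical paths
    snapshot_dir="igv_snapshots",
)
print(script)
```

Save the printed text to a file and point IGV at it. The flank controls how much context you see around each variant, so widen it for larger events.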

3. University of California Santa Cruz (UCSC) Genome Browser

The UCSC genome browser is used for visualising and analysing genomic data. It’s similar to IGV, but you can interact with a wider variety of data types, including genomic sequences, annotations (e.g., genes, transcripts, regulatory elements), and variation data (e.g., SNPs, indels, CNVs).

The UCSC Genome Browser has data tracks explaining different genomic features, customisable views for each track, and search and annotation tools. You can see all the genes present in a region and pick one to find out whether it’s associated with any disease or which cells express it the most. I’ve only used it to view regions of interest in the reference genome it holds, but it allows users to upload their data, too.
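I haven't uploaded my own data yet, but one common route is a custom track in BED format. Below is a small Python sketch of how regions could be written out that way. The track line and the BED coordinate convention (0-based start, exclusive end) are standard; the region coordinates, labels, and track name are invented for illustration.

```python
def to_bed_track(regions, track_name, description):
    """Format 1-based inclusive regions as a BED custom track."""
    lines = [f'track name="{track_name}" description="{description}"']
    for chrom, start, end, label in regions:
        # BED starts are 0-based and ends are exclusive, so a 1-based
        # inclusive interval [start, end] becomes [start - 1, end).
        lines.append(f"{chrom}\t{start - 1}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


# Hypothetical regions of interest as (chrom, start, end, label).
regions_of_interest = [
    ("chr17", 7668402, 7687550, "region_A"),
    ("chr7", 55019017, 55211628, "region_B"),
]

bed_text = to_bed_track(regions_of_interest, "my_regions", "Regions from my analysis")
print(bed_text)
```

The off-by-one conversion is the part that usually bites: VCF positions are 1-based, while BED starts are 0-based.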

In addition to a general genome view, UCSC also has a dedicated cancer genomics browser, which I just found out. I haven’t tried it yet.

4. gnomAD

The Genome Aggregation Database (gnomAD) is a resource containing genetic variation data from different human populations. It has information on common and rare variants, allele frequency of single nucleotide variants (SNVs), small insertions and deletions (indels), and variant annotation, including their functional impact and disease association.

With gnomAD data, you can filter out known variants from your data and retain the rare ones, which may have functional implications relevant to your research question. This filtration is usually done based on the value of allele frequency. That means you pick a threshold and drop all variants that exceed that threshold. You can also use it to annotate your variant data, so you’ll know if, for instance, a variant is missense or silent.
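Here is a minimal sketch of that thresholding step in Python. The variants, the allele frequencies, and the 1% cut-off are all made up for illustration, and treating a variant that is absent from gnomAD as having a frequency of 0 is an assumption you should check against your own research question.

```python
def filter_rare(variants, gnomad_af, max_af=0.01):
    """Keep variants whose gnomAD allele frequency does not exceed max_af."""
    rare = []
    for variant in variants:
        key = (variant["chrom"], variant["pos"], variant["ref"], variant["alt"])
        # Assumption: a variant absent from gnomAD is treated as AF = 0.
        af = gnomad_af.get(key, 0.0)
        if af <= max_af:
            rare.append(dict(variant, gnomad_af=af))
    return rare


# Hypothetical allele frequencies keyed by (chrom, pos, ref, alt).
gnomad_af = {
    ("chr1", 12345, "A", "G"): 0.23,    # common, will be dropped
    ("chr2", 67890, "C", "T"): 0.0004,  # rare, will be kept
}

# Hypothetical variant calls from a tumour sample.
variants = [
    {"chrom": "chr1", "pos": 12345, "ref": "A", "alt": "G"},
    {"chrom": "chr2", "pos": 67890, "ref": "C", "alt": "T"},
    {"chrom": "chr3", "pos": 11111, "ref": "G", "alt": "A"},  # not in gnomAD
]

rare_variants = filter_rare(variants, gnomad_af, max_af=0.01)
for variant in rare_variants:
    print(variant["chrom"], variant["pos"], variant["gnomad_af"])
```

In practice the lookup would come from a gnomAD VCF or an annotation tool rather than a hand-written dictionary, but the filtering logic stays the same.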

Other Cancer Genome Resources I Interact With Directly/Indirectly

This includes some resources I’ve encountered during literature research or going down the rabbit hole before I remember to pull my head out.

1. The Cancer Genome Atlas (TCGA)

You can’t be a cancer researcher and not know about it. TCGA is a large-scale 12-year project launched in 2006 by the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI), both part of the US National Institutes of Health (NIH), to create a comprehensive catalogue of different cancers and their genomic profiles. Today, it holds a gigantic amount of genomic, epigenomic, transcriptomic, and proteomic data.

Some key points to note: It holds data for over 30 types of cancer. It has improved the ability to diagnose, predict, and treat some cancers. It helps the development of computational tools for cancer research because its data is available for testing them. It’s a collaborative project involving thousands of researchers and more than 11,000 patients who donated samples for data generation.

2. International Cancer Genome Consortium (ICGC)

ICGC is similar to TCGA but is more global because the data it holds comes from cancer patients worldwide. It was launched after TCGA and currently holds tumor genome data for over 50 different cancer types. The ICGC data portal provides slightly limited access, depending on what you need. For instance, somatic variant data is freely accessible, but clinical data would require some extra steps before it’s released.

The website says the portal is set to retire in 2024, and I’m not sure what that means for future research.

3. National Cancer Institute (NCI) GDC

The Genomic Data Commons (GDC) is a public repository of genomic and clinical data for cancer patients. The GDC provides access to data from many sources, including TCGA, ICGC, and other large-scale cancer genomics projects.

It enables data sharing across cancer genomics studies and allows researchers to download and submit data for free. It processes the data with the same bioinformatics pipelines, which makes it easy to compare data from different sources and promotes unified data management.

4. cBioPortal

cBioPortal is a web-based resource that provides access to various cancer-related data sets, including genomic, expression, methylation, clinical, and phenotypic data. It was developed by TCGA researchers at Memorial Sloan Kettering Cancer Center to analyse cancer genomics data.

cBioPortal allows users to query and analyse these data sets to identify patterns and associations between genetic alterations and clinical outcomes. Users can explore biological pathways and perform survival analysis. It complements other tools like IGV and the UCSC Genome Browser and can access data from different programs.

5. Pan-Cancer Analysis of Whole Genomes (PCAWG)

The PCAWG Consortium is an extension of TCGA and ICGC that set out to analyse the whole genomes of over 2,800 cancer patients. The difference here is that whole-genome data provides more insight into regulatory and non-coding genomic regions than other projects that mainly targeted the coding regions.

The PCAWG data is used to identify novel mutations, determine the prevalence of known mutations, and assess the clinical significance of these mutations across a wide range of cancer types.

Why Are There So Many? Wouldn’t It Be Confusing?

Technology keeps evolving every day, and by default, biomedical research does too. That means all over the world, people identify new things–genes, relationships, associations–and publish them. So, these large databases store and organise a vast amount of information.

Other reasons include:

The complexity of cancer: Cancer is a complex disease with many genetic and clinical variations. This complexity necessitates using multiple databases and knowledge bases to capture the full spectrum of cancer-related information.
The need for specialised databases: Different types of cancer research require different types of data. For example, researchers studying drug resistance need access to data on genetic variants and their associated drug responses. Researchers studying cancer epidemiology need access to data on patient demographics, tumor characteristics, and survival outcomes.

What These Resources Have in Common

Free access to data that is easy to download.
Tutorials for usage–videos and/or text.
Slightly complex interfaces, which were intended to be user-friendly. By the way, this speaks more to the complexity of genome data than the actual UI design. The designs are great.
They can be used in parallel. You’ll find TCGA data in almost every resource, and there are many existing initiatives to integrate data from different sources for a better experience.
Regular updates. We can’t overstate the importance of this one here.

P.S. The initial idea was to list all cancer genomics resources with a brief description. But, boy, there are too many of them. So, I stuck to familiar ones only.

P.P.S. Thank you so much for reading, especially when you read multiple times. I see you. As much as this looks like a documentation/work diary for me, I’d like you to know I take requests, too. So, feel free to reach out if you’re here and have questions/jumping thoughts or ideas on bioinformatics, genomics analysis, biotech, or science writing. I can’t promise I’ll definitely write about it, but I know for sure I’ll respond to you.