Behind the scenes — Interview with a female data scientist who is studying data of breast cancer cells.

Maryam Soleimani Dodaran is a PhD student within the European research consortium EpiPredict and works at the Swammerdam Institute for Life Sciences in the Netherlands. EpiPredict is a collaboration between 15 research institutes and companies that aims to understand and predict why some breast cancer patients develop resistance to hormone therapy.

Maryam Soleimani Dodaran

To start off, can you tell us what your role is in this ambitious international research effort?

I am a data scientist for EpiPredict, which means I am dealing with a lot of DNA data. We want to understand why patients could become resistant to a certain breast cancer therapy and we believe the answer lies in studying differences in genetics (DNA), but also epigenetics (for example switches on the DNA that turn genes on and off).

Ok, wait a minute. Epigenetics, can you explain that a bit more?

Epigenetics is the layer on top of the DNA that is involved in the functioning of the molecular code of the letters A, T, C and G. Part of its known function is turning genes on and off. Genes are the parts of the DNA that code for proteins. These proteins perform all the functions in living cells. You have to imagine that not all DNA is constantly active. Some genes are only active under certain conditions; for example, they could be activated in the presence of a drug. If you want to learn more about epigenetics, some EpiPredict fellows wrote a nice blog post about this topic.

Alright, back to your data. How much data are we talking about?

We are talking about huge amounts of data. Data measured at the level of a whole genome is always big. To give you an idea, one human genome sequence is ~3 gigabases long and contains ~20,000 genes. Even if we only saved the activity of these 20,000 genes, we would have to store 20,000 values (one for each gene) for every patient sample. As you can imagine, this makes the database grow very fast. If we then add even more data concerning epigenetics, we end up with an even bigger data set.
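To make that growth concrete, here is a rough back-of-the-envelope sketch in Python. It only covers the gene-activity table described above (one value per gene per sample). The 8 bytes per value and the cohort sizes are assumptions chosen for illustration, not figures from EpiPredict.

```python
# Rough storage estimate for a gene-activity table: one value per gene per sample.
# Assumptions (not from the interview): 8 bytes per value and the cohort sizes below.

GENES_PER_SAMPLE = 20_000  # ~20,000 genes, as mentioned in the interview
BYTES_PER_VALUE = 8        # assumed: one 64-bit floating-point number per gene


def table_size_bytes(n_samples, n_genes=GENES_PER_SAMPLE, bytes_per_value=BYTES_PER_VALUE):
    """Return the raw size in bytes of an n_samples x n_genes table of values."""
    return n_samples * n_genes * bytes_per_value


def human_readable(n_bytes):
    """Format a byte count using decimal units (1 kB = 1000 bytes)."""
    size = float(n_bytes)
    for unit in ("bytes", "kB", "MB", "GB", "TB"):
        if size < 1000 or unit == "TB":
            return f"{size:.1f} {unit}"
        size /= 1000


for n_samples in (1, 100, 10_000):  # hypothetical cohort sizes
    print(n_samples, "samples:", human_readable(table_size_bytes(n_samples)))
```

Under these assumptions the gene-activity table alone stays manageable. The real pressure on storage tends to come from the additional layers, such as epigenetic measurements, which can hold many more values per sample.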

Where is all of that data actually coming from?

Big international consortia are one valuable source of data. Such big data sets are expensive to produce and maintain, so there are special organizations that handle this task.

Let me give you an idea of what I am talking about. In an attempt to study different types of cancer, The Cancer Genome Atlas (TCGA) profiled 7 different data types (e.g. DNA sequence, gene measurements, protein measurements) for 33 different tumor types (e.g. breast, liver, lung) from 11,000 patients. The data from this study would fill no fewer than 212,000 DVDs! But it is a very valuable asset for studying cancer.

Our EpiPredict consortium is also generating a data set for breast cancer by profiling epigenetic data types. For example, we map specific changes on the DNA that are known to influence the activity of genes. This data will hopefully help us answer more specific questions.

Why is it interesting to search through all of that data? What questions are you trying to answer?

There are several questions we can try to answer with this data. For example: could we identify the cause of resistance to therapy in breast cancer? Resistance may be the result of a change in the DNA or a change in the activity of a gene. But finding the changes that cause cancer is not an easy task, because our genetics and especially our epigenetics change all the time. Not all of these changes cause cancer, so we have to try to find the changes that matter. We now have new technologies that provide us with tools to edit our DNA and epigenetics. When we find a potentially harmful change in the data, we can mimic it in the lab by editing the (epi)genome in a similar manner. In this way, we can verify whether the identified change really caused the cancer. The dream is that one day we can edit the (epi)genome of a patient back to normal as a treatment.

So, this could potentially help individual patients?

Yes, that is possible. The prognosis for two cancer patients could be the same, but in reality each of these patients could have a unique (epi)genetic profile and may therefore benefit from a different therapy. Using the (epi)genetic data that has been measured for each patient could hopefully lead us to a therapy tailored to each patient's individual (epi)genetic profile.

I can imagine it’s quite challenging to make sense of all that data. What is your biggest challenge?

The sheer size of the data sets makes them challenging to download and store. Often these data sets contain fields that you don’t use, and the structure of the stored data is not necessarily efficient for the purpose of your study. Curating the data set alone already takes quite some time. Digging through these big data sets also relies on powerful servers to run the time-consuming, repetitive processes on the data.
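One common way to deal with unused fields like those mentioned above is to keep only the columns a study needs while reading the file one row at a time, so the full data set never has to sit in memory. The sketch below is a minimal Python illustration, assuming the data arrives as a CSV file. The column names and values are made up and do not come from EpiPredict or TCGA data.

```python
import csv
import io


def keep_columns(rows, wanted):
    """Yield each row reduced to the wanted fields, one row at a time."""
    for row in rows:
        yield {name: row[name] for name in wanted}


# Hypothetical raw export with two fields this imaginary study does not need.
raw = io.StringIO(
    "sample_id,gene,expression,lab_notes,batch\n"
    "S1,ESR1,12.5,ok,A\n"
    "S2,ESR1,3.1,rerun,B\n"
)

reader = csv.DictReader(raw)  # reads the header, then streams the rows
curated = list(keep_columns(reader, ["sample_id", "gene", "expression"]))

for row in curated:
    print(row)
```

With a real file, the same function could be fed from `open(path, newline="")` instead of the in-memory string, and the reduced rows written straight to a new, smaller file.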

Is it cool to be a scientist?

It is cool when we find new leads in our data! In every profession, there are moments of repetitiveness, boredom or hardship. There are definitely times when I cannot find any substantial result and a solution seems to be out of reach. However, any finding, despite all of these problems, makes me feel that I am cracking the code hidden in the big matrices of data. That makes me excited about data science!

This blog was established in collaboration with EpiPredict Partner Organisation Science Matters. The EpiPredict consortium receives funding from the EU Horizon 2020 program under Marie Skłodowska-Curie grant agreement no. 642691.