NOTICE: Challenge 1 ENT data description

Marc Fournier
Published Sep 24, 2021


The challenge is open to all participants, specialists and non-specialists alike. For this purpose, a lexicon of medical terms is provided at the end of the document.

ENT cancers are mainly due to alcohol and tobacco. In almost all developed countries, a new entity is emerging in a quasi-epidemic fashion: ENT cancers linked to the human papillomavirus (HPV). Its involvement is widely known in the uterine cervix, where it is subject to systematic screening from the beginning of sexual life, but its role at the ENT level is still little known. Over the past few years, the proportion of ENT cancers due to HPV has continued to grow, to the point of becoming the majority in the United States. This cancer mainly affects the oropharynx and originates from a sexually transmitted infection. It affects younger individuals and is a major public health problem. It is essential for researchers and clinicians to keep improving their understanding of the characteristics of these cancers and of these patients, at the clinical level but also at the protein and molecular level, as technology now allows.

One of the keys to future cancer therapies is based on knowledge of the tumor microenvironment, the site of interactions between pathological cells and immune cells.

The pathologist studies this microenvironment using immunostaining techniques (specific recognition of proteins on the surface of cells), quantification of virus expression (RNAscope (R)), and many other techniques whose possibilities are increased tenfold by technology.

1 / What data do I have access to?

The data provided for Challenge 1 “ENT” come from the “KORL” cohort of 61 patients provided by the cytopathology department of the Georges Pompidou hospital (APHP platform shared with Inserm U970). The data sets are composed of immunostaining images of histological sections of ENT cancer tissue (otorhinolaryngology, i.e. head and neck tumors), from cancers induced by the human papillomavirus (HPV), taken during biopsies or surgery. These images are provided with the clinical characteristics of the corresponding patients.

All of the Challenge 1 data made available as part of the Epidemium challenge has been anonymized by the company Octopize using “Avatar” technology and validated by the CNIL (more details on request).

Access to the data is possible only on the platform, after validation of the T&Cs (in case of problems, contact us on the Slack channel data / dataiku).

2 / Medical problem


- Can the positive stainings corresponding to the cells composing the tumor microenvironment predict patient survival? Survival here means overall survival (duration in months between diagnosis and death of the patient, or the date of the latest news if the patient is alive).

Judgment criteria:

- Determine a statistical link between the presence of one or more markers, in the tumor, in the stroma or in the microenvironment as a whole and better overall survival.

Consideration of confounding data:

- Stratification of results according to WHO status, TNM stage, RNAscope, and tobacco and alcohol consumption

3 / Medical data production technique

- The cytological examination is carried out by optical microscopy in white light

- The tissue, taken from a surgical specimen or a biopsy, is cut and placed on a slide

- HES staining (Hemalum-Eosin-Saffron) allows the study of tissues (histology), their architecture, and cells (cytology)

- The HES slides of the project patients are scanned. They are whole, not cut into stamps (portions of the original slide) as for the immunostaining.

- On a blank slide (without HES), an immunostaining is carried out (using an antibody specific for a membrane protein of interest) and revealed by optical microscopy in fluorescent light, by exciting a specific fluorochrome.

These are the images we have provided, one marker at a time (one folder corresponds to one marker). This is in situ immunofluorescence (i.e. on unaltered tissue).

4 / What are the expectations?

- The challengers can easily reconstitute multi-stainings by superimposing the images provided. The goal is to identify correlations between better or poorer survival (overall survival, i.e. clinical data) and the stainings (i.e. immunofluorescence data), taken individually or combined, which leaves a considerable number of scenarios (phenotypes) to explore. A phenotype is a combination of markers (signature); each folder corresponds to a marker.

- A simple approach is to split the patients at the median, according to the richness of their samples in a given cellular phenotype (i.e. expression of one or more markers on the same cell), into a “high” group and a “low” group, and to test whether one of these groups has a statistically significantly better survival than the other, in an analysis adjusted for the confounding factors provided (WHO, RNAscope (R), TNM, age, tobacco, etc.).

- Another approach is to start from survival and analyze whether it correlates with certain phenotypes. The challenge is open: participants can start either from survival or from the composition of the tumor microenvironment.

- Keep in mind that we do not know a priori, in this challenge, whether a phenotype (or a signature) is associated with (and therefore predictive of) better survival. Note that some cells can be stained by several markers.

- Note for challengers incorporating patient data: each patient / avatar corresponds to a certain number of stamps (min 1, max 4), for which you have stained images.
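
The median split described above can be sketched as follows. This is a minimal illustration, not the organizers' method: the patient IDs, the `density` values (share of cells carrying a given phenotype) and the survival figures are all invented, and the Kaplan-Meier estimator is used here only as one common way to compare two groups. An adjusted analysis (for WHO, TNM, etc.) would need a regression model and is not covered.

```python
from statistics import median

def median_split(values):
    """Label each patient 'high' or 'low' relative to the cohort median."""
    cut = median(values.values())
    # Assumption: values equal to the median are placed in the 'low' group.
    return {pid: ("high" if v > cut else "low") for pid, v in values.items()}

def kaplan_meier(durations, events):
    """Return [(time, survival probability)] at each observed death time.

    durations: follow-up in months (the OS column)
    events: True if the patient died, False if censored (still alive)
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        leaving = sum(1 for d in durations if d == t)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Hypothetical data: phenotype density, OS in months, death indicator.
density = {"p1": 0.10, "p2": 0.40, "p3": 0.25, "p4": 0.80, "p5": 0.05, "p6": 0.60}
os_months = {"p1": 12, "p2": 40, "p3": 8, "p4": 55, "p5": 20, "p6": 30}
died = {"p1": True, "p2": False, "p3": True, "p4": False, "p5": True, "p6": True}

groups = median_split(density)
curves = {}
for label in ("high", "low"):
    ids = [pid for pid, g in groups.items() if g == label]
    curves[label] = kaplan_meier([os_months[i] for i in ids],
                                 [died[i] for i in ids])
    print(label, curves[label])
```

With real data, the two curves would then be compared with a statistical test before drawing any conclusion; the choice of test is left to the participants.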

5 / Description of clinical data (tabular)

Patient_ID: avatar’s ID

WHO: WHO score, general condition of the patient assessed according to the following table


0: Capable of an activity identical to that preceding the disease, without any restriction


1: Reduced physical activity but ambulatory and able to carry out work


2: Ambulatory and able to take care of oneself, but unable to work. Bedridden less than 50% of the time


3: Capable of only limited personal care. Bedridden or in a chair more than 50% of the time


4: Unable to take care of oneself, bedridden or in a chair at all times

Gender: man or woman

Age: date of birth

Date_biopsy: date of the biopsy

Age_diag: age at the time of diagnosis

Last_date_nouvelles: date of last contact

Death: deceased (yes / no)

OS: overall survival in months evaluated from the date of diagnosis to the date of the last news / death, variable to predict

Recurrence: yes / no

Location: localization of the cancer

RNA-scope (in situ hybridization): quantitative evaluation, scored from 0 to 2, of the transcriptional activity of the HPV E6 and E7 oncoproteins. It basically corresponds to a quantitative evaluation of viral DNA transcription in tumor cells.

T: T for tumor, 1–2 (small), 3 or 4 (large), according to the strict criteria of the UICC 8th edition classification for HNSCC

N: N for node, cervical lymph node metastases: N0 (no lymph node invaded), N1, N2, N3 (lymph node metastases). N3 is more invaded than N1.

M: M for distant metastasis to another organ: M0 (no distant metastasis) and M1 (metastasis)

TNM stage: assesses clinical aggressiveness and tumor extension; survival is known to correlate with TNM stage

Tobacco: tobacco smoker (yes / no)
The numbers 0, 1, 2, 3 correspond to a “score”:
number of packets smoked daily x number of years (e.g. 1 packet per day for ten years = 10; 3 packets per day for 3 years = 9).
0 means non-smoker, 1 a score between 0 and 10, 2 a score between 10 and 20, 3 a score above 30.
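
As an illustration of the tobacco score, here is a small sketch of the calculation and of the mapping to the 0 to 3 code. Two points are assumptions, since the text does not settle them: boundary values (a score of exactly 10 or 20) are placed in the upper band, and scores between 20 and 30, which the description leaves undefined, return `None` rather than a guessed category.

```python
def pack_years(packs_per_day, years):
    """Score from the description: packets smoked daily x number of years."""
    return packs_per_day * years

def tobacco_category(score):
    """Map a score to the 0-3 code.

    Assumptions (not stated in the source): a score of exactly 10 goes to
    category 2, and the 20-30 range, which the text does not cover,
    returns None.
    """
    if score == 0:
        return 0          # non-smoker
    if score < 10:
        return 1
    if score < 20:
        return 2
    if score > 30:
        return 3
    return None           # undefined in the description

# The two examples given in the text.
print(pack_years(1, 10))  # 1 packet per day for ten years
print(pack_years(3, 3))   # 3 packets per day for 3 years
```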

Alcohol: alcohol drinker (yes / no)

To extract disease-free survival:
time from the date of biopsy until recurrence, last news, or death.

6 / Description of the image data

1 folder = 1 marker

Component data: Scanned slide stamps

Composite image (in each folder except Tissue segmentation):

Immunolabeling / in situ fluorescence (on tissue): an antibody identifies a protein, and the fluorochrome that reveals this antibody makes it possible to detect the presence of that protein.

The areas in red correspond to the marker, and those in light blue (cyan) represent the tumor (DAPI)

Tissue segmentation:

Stamp on which tissue segmentation has been performed. The areas without tissue are in blue, those containing the stroma in green, and the tumor areas in red.
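
One possible way to combine a composite image with its tissue segmentation is to measure the share of marker-positive pixels inside the tumor and inside the stroma. The sketch below is only an illustration under several assumptions: both images are already loaded as RGB arrays of the same size, the segmentation uses pure red, green and blue, and the `threshold` on the red channel is an arbitrary value to be tuned. The tiny arrays are synthetic, not challenge data.

```python
import numpy as np

def marker_fraction_by_compartment(composite, segmentation, threshold=128):
    """Fraction of marker-positive pixels in the tumor and stroma areas.

    composite, segmentation: uint8 RGB arrays of shape (H, W, 3) for the
    same stamp. threshold is a hypothetical cut-off on the red channel.
    """
    # Marker signal is red in the composite image; cyan has no red component.
    marker = composite[..., 0] >= threshold

    # Segmentation class = dominant channel (0 red, 1 green, 2 blue).
    # Assumption: every pixel is one of the three pure colors.
    dominant = segmentation.argmax(axis=-1)
    masks = {"tumor": dominant == 0, "stroma": dominant == 1}

    fractions = {}
    for name, mask in masks.items():
        n = mask.sum()
        fractions[name] = float(marker[mask].sum() / n) if n else float("nan")
    return fractions

# Synthetic 2 x 4 stamp.
RED, GREEN, BLUE = (255, 0, 0), (0, 255, 0), (0, 0, 255)
CYAN, WHITE, BLACK = (0, 255, 255), (255, 255, 255), (0, 0, 0)
segmentation = np.array([[RED, RED, GREEN, GREEN],
                         [RED, RED, BLUE, BLUE]], dtype=np.uint8)
composite = np.array([[RED, CYAN, RED, BLACK],
                      [WHITE, RED, BLACK, BLACK]], dtype=np.uint8)

fractions = marker_fraction_by_compartment(composite, segmentation)
print(fractions)
```

Pixel fractions are a crude proxy for cell counts; a per-cell analysis would require a cell segmentation step that is outside the scope of this sketch.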

Poor quality data:

Cell 1: 8aab52, 8d1247, 91de1b, 832c5f, 93114f

Cell 2: 8aab52, 841343

Cell 4: 8a11b8, 8aab52, 8af7fa, 8b44cd, 8c2bf2, 8d1247, 8e4552, 8fc4c1, 92c47e, 8545ea, 8592f4, 93f80d, 841343, 829316

Cell 5 (no DAPI): very weak marking

Cell 6: very strong DAPI
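
The list above can be turned into a simple exclusion filter. This sketch assumes that each "Cell N" label names a marker folder and that the six-character codes are avatar IDs; neither is stated explicitly in the text. The helper name `usable` is hypothetical.

```python
# Avatars flagged as poor quality, per marker folder (copied from the list).
POOR_QUALITY = {
    "Cell 1": {"8aab52", "8d1247", "91de1b", "832c5f", "93114f"},
    "Cell 2": {"8aab52", "841343"},
    "Cell 4": {"8a11b8", "8aab52", "8af7fa", "8b44cd", "8c2bf2", "8d1247",
               "8e4552", "8fc4c1", "92c47e", "8545ea", "8592f4", "93f80d",
               "841343", "829316"},
}

def usable(marker, avatar_id):
    """True if the image for this marker and avatar is not flagged."""
    return avatar_id not in POOR_QUALITY.get(marker, set())

print(usable("Cell 1", "8aab52"))  # flagged for Cell 1
print(usable("Cell 2", "8d1247"))  # flagged for Cell 1 and 4, not Cell 2
```

Cell 5 and Cell 6 are flagged as a whole in the list, so how to handle them (drop the marker or renormalize) is left as a separate decision.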

For the following avatars:
- 83c65f (the difference between biopsy and last visit is 5 months, but the OS is 1 month): possible typo.
- 8592f4 (the biopsy took place after the patient's last encounter with a physician).
The medical team is investigating whether this is a typo on their side or a problem introduced during the anonymisation process.
- 8b44cd (the difference between biopsy and last visit is 3 months, but the OS is 7 months): the medical team's record showed 3 months since the last visit, but a national database confirmed that the patient died 7 months after. The latter value is the correct one.
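
Inconsistencies of the kind listed above can be screened for by comparing the interval between `Date_biopsy` and `Last_date_nouvelles` with the `OS` column. The sketch below is one way to do it, with assumptions: dates are already parsed into `datetime.date` objects, a month is approximated as 30.44 days, and the `tolerance` of 1.5 months is arbitrary. The example dates are invented and do not belong to any avatar.

```python
from datetime import date

def months_between(start, end):
    """Approximate number of months between two dates (30.44-day months)."""
    return (end - start).days / 30.44

def check_follow_up(biopsy, last_news, os_months, tolerance=1.5):
    """Return a list of warnings for one patient record."""
    warnings = []
    if last_news < biopsy:
        warnings.append("last news precedes biopsy")
    elif abs(months_between(biopsy, last_news) - os_months) > tolerance:
        warnings.append("OS does not match the biopsy-to-last-news interval")
    return warnings

# Hypothetical records: (Date_biopsy, Last_date_nouvelles, OS).
print(check_follow_up(date(2015, 1, 10), date(2015, 6, 12), 1))
print(check_follow_up(date(2016, 5, 1), date(2016, 3, 1), 4))
print(check_follow_up(date(2014, 2, 1), date(2015, 2, 1), 12))
```

A flagged record is not necessarily wrong: as the 8b44cd case shows, the OS may come from a more reliable source than the last visit date.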

7 / Evaluation criteria


Currently being defined by Medhi Benchoufi and the scientific / ethics committees.

8 / Lexicon

- Stroma: tissue surrounding the tumor

- Tumor: identified by a positive cytokeratin labeling

- Microenvironment: stroma + tumor

- Tissue segmentation: distinction of the microenvironment between stroma and tumor

- WHO score: general condition of the patient, assessed according to the table in section 5

- Cytology (from the Greek cytos + logos: study of cells) is the study of isolated cells, whether normal or pathological (cytopathology), and of their morphological or biochemical appearance.

- Histology is the morphological study of biological tissues.

- A phenotype is a combination of markers; each folder corresponds to a marker.

- Stamp: portion of the original slide

This document was co-authored by the Epidemium / APHP teams

