Cholangiocarcinoma: CDKN2A [2]

Simon Tse
Learn about Cancer with Code
3 min read · Feb 1, 2023


Credit: https://en.wikipedia.org/wiki/Cholangiocarcinoma

Background

In the last post, I covered how to retrieve text from KEGG entries and Wikipedia. I also mentioned that the terms used in the two sources are inconsistent and need some standardisation.

In this post, I am going to talk about how to pre-process the raw text.

Approach

3. Pool texts collected from KEGG and Wikipedia together for NLP analysis (Cont’d)

In this post, I present how to group and clean up the text for the next step: NLP. I ran the following script to collate the raw texts.

Created by author
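As a rough illustration of the collation step (a sketch, not the script pictured above), the pooling could look like the code below. It assumes the KEGG and Wikipedia texts are already held in memory as strings. The names `clean_text`, `pool_texts` and `raw_texts` are hypothetical, and the two sample sentences are placeholders rather than actual retrieved text.

```python
import re


def clean_text(raw):
    """Drop bracketed citation markers such as [12] and collapse whitespace."""
    no_citations = re.sub(r"\[\d+\]", "", raw)
    return re.sub(r"\s+", " ", no_citations).strip()


def pool_texts(texts_by_source):
    """Merge texts from several sources into one list, skipping empty and duplicate passages."""
    pooled = []
    seen = set()
    for source, texts in texts_by_source.items():
        for raw in texts:
            cleaned = clean_text(raw)
            if cleaned and cleaned not in seen:
                seen.add(cleaned)
                # Keep the source label so each passage can be traced back later.
                pooled.append({"source": source, "text": cleaned})
    return pooled


# Placeholder inputs (assumption: one list of raw strings per source).
raw_texts = {
    "kegg": ["CDKN2A  encodes p16INK4a\nand p14ARF."],
    "wikipedia": ["CDKN2A is a gene located on chromosome 9, band p21.3.[5]"],
}

corpus = pool_texts(raw_texts)
for passage in corpus:
    print(passage["source"], "|", passage["text"])
```

Keeping the source label next to each passage makes it easier to check later which source a given term variant came from.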

Then I use Stanza, an NLP library, to extract and collect biomedical entities.

Created by author

Running the above script outputs a list of entities.

['melanoma', 'reactive oxygen species', 'alpha', 'DNA', 'tumour cell lines', 'coronary artery', 'p21', 'E2F', 'pancreatic adenocarcinoma', 'non-small cell lung carcinoma', 'red hair', 'exon 2', 'CDK6', 'cancer cell lines', 'p19ARF', 'esophageal squamous cell carcinoma', 'D-type cyclins', 'oral cancer', 'skin', 'INK4a', 'p14arf', 'Rb', 'skin cancer', 'p16', 'human', "Burkitt's lymphoma", 'tumors', 'p14', 'tumour', 'proliferating cell nuclear antigen', 'People', 'E3', 'myocardial', 'chromosome 9', 'exon 1α', 'ARF from exon 1β', 'cutaneous malignant melanoma', 'Rb.', '6', 'Melanoma', 'cancer', 'eukaryotic cells', 'retinoblastoma', 'cellular', 'humans', 'CDK4', 'D-type cyclin', 'M(r) 16K', 'ankyrin', 'cell lines', 'human papilloma virus E6', 'mitochondrial', 'cyclin dependent kinases 4', 'gastric cancer', 'cancers', 'glioblastoma', 'cells', 'p14ARF', 'CDKN2 cyclin-dependent kinase', 'CDKs', 'surface', 'INK4', 'cyclin-dependent kinase inhibitor 2A', 'p16 from exon 1α', 'eyes', 'mouse', 'MDM2', 'retinoblastoma protein', 'CIP1', 'P14ARF', 'P53', 'epithelial ovarian carcinoma', 'inner wall', 'CDKN2A', 'head & neck squamous cell carcinoma', 'gastric lymphoma', 'cyclin-dependent kinases', 'tissues', 'cyclin dependent kinases', 'cell', 'hereditary melanoma', 'CDKN2A beta', 'senescent cells', 'exons – exon 1β', 'chromosome 9p21', 'p53', 'Cancer', 'tumor', 'CDK4/6', 'band 9p21', 'P16', 'amino acid', 'people', 'E2F1', 'protein kinase C delta', 'cyclin D enzymes', 'p16INK4a', 'statin', 'ARF', 'E2F-1', 'prostate cancer', 'Rb. P16', 'colorectal cancer', 'pancreatic cancer']
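For readers who want to try the entity extraction step, here is a minimal sketch of how Stanza's biomedical models can be used to build such a list. The model choice is an assumption: the sketch uses the CRAFT pipeline with the BioNLP13CG NER model, which may differ from the one used for the output above. The helper names `unique_in_order` and `extract_entities` are hypothetical.

```python
def unique_in_order(mentions):
    """Remove repeated mentions while keeping first-seen order."""
    seen = set()
    unique = []
    for mention in mentions:
        if mention not in seen:
            seen.add(mention)
            unique.append(mention)
    return unique


def extract_entities(texts, ner_model="bionlp13cg"):
    """Run a Stanza biomedical pipeline over the texts and return unique entity strings.

    Assumption: the CRAFT package with the given NER model; swap in another
    biomedical NER model if it suits the text better.
    """
    import stanza  # imported here so the helper above works without Stanza installed

    # Downloads the models on first use (requires network access).
    stanza.download("en", package="craft", processors={"ner": ner_model})
    nlp = stanza.Pipeline("en", package="craft", processors={"ner": ner_model})

    mentions = []
    for text in texts:
        doc = nlp(text)
        # doc.entities holds the recognised spans; ent.type carries the label.
        mentions.extend(ent.text for ent in doc.entities)
    return unique_in_order(mentions)


# Example call (not executed here because it downloads models):
# entities = extract_entities(["CDKN2A encodes p16INK4a and p14ARF."])
```

The deduplication is case-sensitive, which is why variants such as 'p53' and 'P53' would both survive in a list like the one above.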

As mentioned in the last post, the nomenclature of biomedical entities is not unique. Sometimes several different abbreviations refer to the same entity, and usage varies across texts, so some standardisation is required. After a very rough examination, I decided to adopt the following conversion scheme.

Created by author
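To give a feel for how such a conversion scheme can be applied in code, here is a minimal sketch. The mapping is illustrative only and is not the scheme pictured above: it covers a handful of variants from the entity list, and the names `SYNONYMS`, `VARIANT_TO_CANONICAL` and `standardise` are hypothetical.

```python
import re

# Illustrative mapping (assumption): canonical term -> variants seen in the entity list.
SYNONYMS = {
    "p16INK4a": ["p16", "P16", "INK4a"],
    "p14ARF": ["p14", "p14arf", "P14ARF", "ARF"],
    "p53": ["P53"],
    "Rb": ["retinoblastoma protein"],
    "tumor": ["tumour"],
}

# Invert the mapping so each variant points at its canonical form.
VARIANT_TO_CANONICAL = {
    variant: canonical
    for canonical, variants in SYNONYMS.items()
    for variant in variants
}

# One pattern over all variants, longest first, matched only as whole tokens
# so that "p16" is not replaced inside "p16INK4a".
_PATTERN = re.compile(
    r"(?<!\w)("
    + "|".join(re.escape(v) for v in sorted(VARIANT_TO_CANONICAL, key=len, reverse=True))
    + r")(?!\w)"
)


def standardise(text):
    """Replace every known variant with its canonical form in a single pass."""
    return _PATTERN.sub(lambda match: VARIANT_TO_CANONICAL[match.group(1)], text)


sample = "P16 and ARF are encoded by CDKN2A; P53 is stabilised by p14arf."
print(standardise(sample))
```

Replacing in a single pass matters here: a canonical form such as p14ARF contains the variant ARF, and a second pass over already-converted text could otherwise rewrite it again.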

Intermission

In this post, I have covered my approach to pre-processing text for the next task: relationship extraction between entities. In the next post, I will introduce my approach to that task: graph/network analysis.

A word of warning: I am still experimenting with this approach and am far from having a concrete answer! I may take another detour to dive deeper into that area.

Stay tuned.

Simon Tse
Learn about Cancer with Code

Trying to apply my ML/NLP knowledge to problems I am interested in and to create a narrative with the data. Current interest: cancer biology.