Cholangiocarcinoma: CDKN2A [2]
Background
In the last post, I covered how to retrieve text from KEGG entries and Wikipedia. I also mentioned that the terms used in the two sources are inconsistent and need some standardisation.
In this post, I am going to talk about how to pre-process the raw text.
Approach
3. Pool texts collected from KEGG and Wikipedia together for NLP analysis (Cont’d)
In this post, I am going to present how to group and clean up the text for the next step: NLP. I ran the following script to collate the raw texts.
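A minimal sketch of what such a collation step could look like. The cleaning rules (dropping Wikipedia-style citation markers such as [5] and collapsing whitespace), the function names `clean_text` and `collate_texts`, and the two sample strings are all assumptions made for illustration, not the exact script.

```python
import re
from typing import Mapping

CITATION = re.compile(r"\[\d+\]")  # assumed: Wikipedia-style markers such as [12]
WHITESPACE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Remove citation markers and collapse runs of whitespace."""
    without_citations = CITATION.sub("", raw)
    return WHITESPACE.sub(" ", without_citations).strip()


def collate_texts(texts_by_source: Mapping[str, str]) -> str:
    """Clean each source's text and join the non-empty results into one string."""
    cleaned = (clean_text(raw) for raw in texts_by_source.values())
    return " ".join(text for text in cleaned if text)


# Made-up placeholder snippets standing in for the retrieved raw texts.
raw_texts = {
    "kegg": "CDKN2A;  cyclin-dependent kinase inhibitor 2A\n",
    "wikipedia": "p16 is a protein that slows cell division.[5]\n\n",
}
corpus = collate_texts(raw_texts)
print(corpus)
```

Keeping the per-source cleaning in its own function makes it easy to add source-specific rules later without touching the pooling step.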
Then I use Stanza, an NLP library, to extract and collect biomedical entities.
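A rough sketch of the extraction step with Stanza. Which biomedical models were actually used is not stated here, so the CRAFT package with the BioNLP13CG NER model is an assumption, and `build_pipeline` and `collect_entities` are hypothetical helper names.

```python
def build_pipeline():
    """Build a Stanza pipeline with a biomedical NER model.

    Assumption: the CRAFT package with the BioNLP13CG NER model. Fetch the
    models once beforehand with
    stanza.download("en", package="craft", processors={"ner": "bionlp13cg"}).
    """
    import stanza  # imported lazily so collect_entities works without Stanza installed

    return stanza.Pipeline("en", package="craft", processors={"ner": "bionlp13cg"})


def collect_entities(doc):
    """Return the unique entity strings of an annotated document, in first-seen order."""
    return list(dict.fromkeys(ent.text for ent in doc.entities))
```

With the models downloaded, `collect_entities(build_pipeline()(text))` would return the entity list, where `text` is the collated raw text. Only the surface strings are kept, so 'Cancer' and 'cancer' still count as two different entities at this stage.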
Running the above script outputs the following list of entities.
['melanoma', 'reactive oxygen species', 'alpha', 'DNA', 'tumour cell lines', 'coronary artery', 'p21', 'E2F', 'pancreatic adenocarcinoma', 'non-small cell lung carcinoma', 'red hair', 'exon 2', 'CDK6', 'cancer cell lines', 'p19ARF', 'esophageal squamous cell carcinoma', 'D-type cyclins', 'oral cancer', 'skin', 'INK4a', 'p14arf', 'Rb', 'skin cancer', 'p16', 'human', "Burkitt's lymphoma", 'tumors', 'p14', 'tumour', 'proliferating cell nuclear antigen', 'People', 'E3', 'myocardial', 'chromosome 9', 'exon 1α', 'ARF from exon 1β', 'cutaneous malignant melanoma', 'Rb.', '6', 'Melanoma', 'cancer', 'eukaryotic cells', 'retinoblastoma', 'cellular', 'humans', 'CDK4', 'D-type cyclin', 'M(r) 16K', 'ankyrin', 'cell lines', 'human papilloma virus E6', 'mitochondrial', 'cyclin dependent kinases 4', 'gastric cancer', 'cancers', 'glioblastoma', 'cells', 'p14ARF', 'CDKN2 cyclin-dependent kinase', 'CDKs', 'surface', 'INK4', 'cyclin-dependent kinase inhibitor 2A', 'p16 from exon 1α', 'eyes', 'mouse', 'MDM2', 'retinoblastoma protein', 'CIP1', 'P14ARF', 'P53', 'epithelial ovarian carcinoma', 'inner wall', 'CDKN2A', 'head & neck squamous cell carcinoma', 'gastric lymphoma', 'cyclin-dependent kinases', 'tissues', 'cyclin dependent kinases', 'cell', 'hereditary melanoma', 'CDKN2A beta', 'senescent cells', 'exons – exon 1β', 'chromosome 9p21', 'p53', 'Cancer', 'tumor', 'CDK4/6', 'band 9p21', 'P16', 'amino acid', 'people', 'E2F1', 'protein kinase C delta', 'cyclin D enzymes', 'p16INK4a', 'statin', 'ARF', 'E2F-1', 'prostate cancer', 'Rb. P16', 'colorectal cancer', 'pancreatic cancer']
As mentioned in the last post, the nomenclature of biomedical entities is not unique: several different abbreviations can refer to the same entity, and usage varies across texts, so some standardisation is required. After a very rough examination, I decided to adopt the following conversion scheme.
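To show how a conversion scheme of this kind can be applied in code, here is a minimal sketch. The alias pairs in `ALIASES` are a small illustrative subset chosen for the example and should not be read as the full scheme; `normalise` and `standardise` are hypothetical helper names.

```python
# Illustrative alias table only: lower-cased variant -> canonical name.
# These pairs are assumptions for the sketch, not the complete scheme.
ALIASES = {
    "p16": "p16INK4a",
    "ink4a": "p16INK4a",
    "p16ink4a": "p16INK4a",
    "p14": "p14ARF",
    "p14arf": "p14ARF",
    "rb": "Rb",
    "rb.": "Rb",
    "retinoblastoma protein": "Rb",
    "p53": "p53",
    "tumour": "tumor",
    "tumors": "tumor",
}


def normalise(entity):
    """Map one entity to its canonical name; unknown entities pass through unchanged."""
    stripped = entity.strip()
    return ALIASES.get(stripped.lower(), stripped)


def standardise(entities):
    """Normalise every entity, dropping duplicates while keeping first-seen order."""
    return list(dict.fromkeys(normalise(e) for e in entities))


sample = ["P16", "p16INK4a", "p14arf", "P14ARF", "Rb.", "P53", "p53", "tumour", "CDK4"]
print(standardise(sample))
# → ['p16INK4a', 'p14ARF', 'Rb', 'p53', 'tumor', 'CDK4']
```

The lookup is case-insensitive, but entities missing from the table keep their original spelling, so pairs such as 'Melanoma' and 'melanoma' would need their own entries (or a separate lower-casing pass) to be merged.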
Intermission
In this post, I have covered my approach to pre-processing text for the next task: relationship extraction between entities. In the next post, I will introduce my approach to that task: graph/network analysis.
A word of warning: I am still experimenting with this approach and am far from having a concrete answer! I may take another detour to dive deeper into that area.
Stay tuned.