Cancer Transcriptomics ML

Machine-learning classification of tumour vs. normal tissue across five TCGA cancer types, with the predictive genes filtered through a two-scale evolutionary analysis (germline and somatic selection) to identify candidate cancer-maintaining gene dependencies.

163 candidate genes identified → 15 cross-validated across cancers → Established drivers confirmed (TP53, PIK3CA, PTEN)
💡 Central hypothesis: ML-predictive genes under both strong germline purifying selection (dN/dS < 0.3) AND somatic positive selection (dN/dS ≥ 1.5, FDR < 0.05) are candidate cancer-maintaining dependencies.
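As a rough illustration of the hypothesis above, the two-scale filter can be sketched as a single predicate over per-gene statistics. The `GeneStats` record, its field names, and the example genes and values are hypothetical and not taken from the project; only the three thresholds come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneStats:
    """Per-gene summary; field names are illustrative assumptions."""
    symbol: str
    ml_predictive: bool   # gene was selected by the tumour-vs-normal classifier
    germline_dnds: float  # dN/dS across species or populations
    somatic_dnds: float   # dN/dS from tumour somatic mutations
    somatic_fdr: float    # multiple-testing-adjusted p-value for somatic selection


def is_candidate(g, germline_max=0.3, somatic_min=1.5, fdr_max=0.05):
    """True if the gene passes the ML filter and both selection filters."""
    return (
        g.ml_predictive
        and g.germline_dnds < germline_max   # strong germline purifying selection
        and g.somatic_dnds >= somatic_min    # somatic positive selection
        and g.somatic_fdr < fdr_max          # and that signal is significant
    )


# Made-up example genes, for illustration only.
genes = [
    GeneStats("GENE_A", True, 0.12, 2.4, 0.01),   # passes every filter
    GeneStats("GENE_B", True, 0.45, 3.0, 0.001),  # germline dN/dS too high
    GeneStats("GENE_C", True, 0.10, 1.8, 0.20),   # somatic signal not significant
    GeneStats("GENE_D", False, 0.05, 2.0, 0.01),  # not ML-predictive
]

candidates = [g.symbol for g in genes if is_candidate(g)]
print(candidates)
```

Keeping the thresholds as keyword arguments makes it easy to check how sensitive the candidate list is to each cutoff.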
🧬 Cancer Type Overview
Metrics reported for each cancer type: balanced accuracy, specificity, AUC, sample count, and number of genes tested.
  • BRCA: Breast
  • BLCA: Bladder
  • PRAD: Prostate
  • LUAD: Lung adenocarcinoma
  • UCEC: Uterine
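For reference, balanced accuracy and specificity follow the standard confusion-matrix definitions, with tumour treated as the positive class. The sketch below uses made-up counts (not project results) to show how a small normal-tissue group can pull specificity down while plain accuracy stays high.

```python
def sensitivity(tp, fn):
    """Fraction of tumour samples correctly called tumour."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of normal samples correctly called normal."""
    return tn / (tn + fp)


def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity and specificity; insensitive to class imbalance."""
    return 0.5 * (sensitivity(tp, fn) + specificity(tn, fp))


# Hypothetical counts: 100 tumour samples, only 10 normal samples.
tp, fn, tn, fp = 95, 5, 8, 2

plain_accuracy = (tp + tn) / (tp + fn + tn + fp)
print(f"accuracy          = {plain_accuracy:.3f}")
print(f"specificity       = {specificity(tn, fp):.3f}")
print(f"balanced accuracy = {balanced_accuracy(tp, fn, tn, fp):.3f}")
```

With so few normal samples, each misclassified normal shifts specificity by a large step, which is why balanced accuracy is usually the more informative headline number for imbalanced tumour/normal cohorts.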
⚠️ Limitations & Caveats
  • Somatic dN/dS method: Uses a per-gene binomial exact test rather than dNdScv, which models trinucleotide mutational context and gene-level covariates. Results should therefore be interpreted as exploratory.
  • UCEC candidate count: Elevated candidate numbers in uterine cancer likely reflect microsatellite-instability-driven hypermutation rather than a proportionally larger set of true dependencies.
  • PRAD statistical power: Prostate cancer has the lowest specificity (73.5%), driven by smaller normal-tissue sample size and adjacent-normal heterogeneity.
  • Near-perfect AUC values: High AUC scores reflect the intrinsic separability of tumour vs. normal transcriptomes (thousands of DE genes) rather than the specificity of the final gene signatures.
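To make the first caveat concrete, here is a minimal sketch of what a per-gene binomial exact test for somatic dN/dS could look like. Several details are assumptions rather than the project's documented method: the neutral expectation that about 75% of coding substitutions are nonsynonymous (`P_NONSYN_NULL`), the one-sided alternative, the Benjamini-Hochberg adjustment for the FDR, and the example mutation counts. The sketch ignores mutational context and background mutation rate, which is exactly why such results are exploratory.

```python
from math import comb

# Assumed neutral probability that a coding substitution is nonsynonymous.
P_NONSYN_NULL = 0.75


def binom_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def somatic_dnds(nonsyn, syn, p_null=P_NONSYN_NULL):
    """Observed N:S ratio divided by the N:S ratio expected under neutrality."""
    if syn == 0:
        return float("inf")
    return (nonsyn / syn) / (p_null / (1 - p_null))


# Hypothetical per-gene (nonsynonymous, synonymous) somatic mutation counts.
counts = {"GENE_A": (30, 2), "GENE_B": (15, 5), "GENE_C": (8, 4)}

names = list(counts)
pvals = [binom_upper_tail(n, n + s, P_NONSYN_NULL) for n, s in counts.values()]
fdrs = benjamini_hochberg(pvals)

selected = [
    g for g, q in zip(names, fdrs)
    if somatic_dnds(*counts[g]) >= 1.5 and q < 0.05
]
print(selected)
```

Because the null proportion is a single genome-wide constant here, genes in hypermutated tumours or with unusual sequence composition can appear selected for purely mutational reasons, which ties in with the UCEC caveat above.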