Cancer Transcriptomics ML
Machine learning classification of tumour vs. normal tissue across five TCGA cancer types, filtered through a two-scale (germline and somatic) evolutionary analysis to identify candidate cancer-maintaining gene dependencies.
163 candidate genes identified
→ 15 cross-validated across cancers
→ Established drivers confirmed (TP53, PIK3CA, PTEN)
Central hypothesis: ML-predictive genes under both strong germline purifying selection (dN/dS < 0.3) AND somatic positive selection (dN/dS ≥ 1.5, FDR < 0.05) are candidate cancer-maintaining dependencies.
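The hypothesis amounts to a conjunction of four per-gene conditions, which can be sketched as a simple filter. This is a minimal illustration, not the project's pipeline: the `GeneStats` record, its field names, and the example genes and values are all hypothetical. Only the three thresholds come from the statement above.

```python
from dataclasses import dataclass

# Thresholds quoted in the central hypothesis.
GERMLINE_DNDS_MAX = 0.3   # strong germline purifying selection: dN/dS < 0.3
SOMATIC_DNDS_MIN = 1.5    # somatic positive selection: dN/dS >= 1.5
FDR_MAX = 0.05            # somatic test must pass FDR < 0.05


@dataclass(frozen=True)
class GeneStats:
    """Hypothetical per-gene record; field names are assumptions."""
    symbol: str
    ml_predictive: bool    # selected by the tumour-vs-normal classifier
    germline_dnds: float
    somatic_dnds: float
    somatic_fdr: float


def is_candidate(g: GeneStats) -> bool:
    """True only if the gene passes the ML filter and both selection scales."""
    return (
        g.ml_predictive
        and g.germline_dnds < GERMLINE_DNDS_MAX
        and g.somatic_dnds >= SOMATIC_DNDS_MIN
        and g.somatic_fdr < FDR_MAX
    )


def filter_candidates(genes):
    """Return the symbols of genes that satisfy every condition."""
    return [g.symbol for g in genes if is_candidate(g)]


# Made-up values, for illustration only.
example = [
    GeneStats("GENE_A", True, 0.12, 2.4, 0.01),    # passes everything
    GeneStats("GENE_B", True, 0.45, 2.0, 0.01),    # fails germline threshold
    GeneStats("GENE_C", True, 0.10, 1.5, 0.20),    # fails FDR
    GeneStats("GENE_D", False, 0.10, 3.0, 0.001),  # not ML-predictive
]
print(filter_candidates(example))
```

Note the asymmetric boundaries: the germline cutoff is strict (`<`), while the somatic cutoff is inclusive (`>=`), matching the inequality signs in the hypothesis.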
🧬 Cancer Type Overview
| Cancer type | Bal. Accuracy | Specificity | AUC | Samples | Genes Tested |
|---|---|---|---|---|---|
| BRCA (Breast) | — | — | — | — | — |
| BLCA (Bladder) | — | — | — | — | — |
| PRAD (Prostate) | — | — | — | — | — |
| LUAD (Lung Adeno.) | — | — | — | — | — |
| UCEC (Uterine) | — | — | — | — | — |
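For readers unfamiliar with the metrics named above, the sketch below shows the standard definitions of specificity and balanced accuracy computed from binary labels. It assumes tumour is coded as 1 and normal as 0. The labels in the example are invented and are not taken from any of the cohorts.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fn, tn, fp) for binary labels, 1 = tumour, 0 = normal."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp, fn, tn, fp


def sensitivity(y_true, y_pred):
    """Fraction of tumour samples correctly called tumour."""
    tp, fn, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)


def specificity(y_true, y_pred):
    """Fraction of normal samples correctly called normal."""
    _, _, tn, fp = confusion_counts(y_true, y_pred)
    return tn / (tn + fp)


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity."""
    return 0.5 * (sensitivity(y_true, y_pred) + specificity(y_true, y_pred))


# Invented labels: 6 tumour samples, 2 normal samples.
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 1]
print(specificity(y_true, y_pred))        # 0.5
print(balanced_accuracy(y_true, y_pred))
```

In this toy case plain accuracy is 6/8 = 0.75, while balanced accuracy is about 0.67, because the single misclassified normal sample halves the specificity. That sensitivity to the smaller class is why balanced accuracy is a common choice when normal samples are scarce relative to tumour samples.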
⚠️ Limitations & Caveats
- Somatic dN/dS method: Uses a per-gene binomial exact test, not the site-level dNdScv model. Results should be interpreted as exploratory.
- UCEC candidate count: Elevated candidate numbers in uterine cancer likely reflect microsatellite-instability-driven hypermutation rather than a proportionally larger set of true dependencies.
- PRAD statistical power: Prostate cancer has the lowest specificity (73.5%), driven by a smaller normal-tissue sample size and heterogeneity among adjacent-normal samples.
- Near-perfect AUC values: High AUC scores reflect the intrinsic separability of tumour vs. normal transcriptomes (thousands of DE genes) rather than the specificity of the final gene signatures.
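To make the first caveat concrete, here is one way a per-gene binomial exact test for somatic dN/dS could look. It is a sketch under stated assumptions, not the project's actual implementation: the one-sided alternative, the `expected_nonsyn_fraction` parameter (the neutral expectation for the share of nonsynonymous mutations), the Benjamini-Hochberg correction, and all counts in the example are assumptions introduced for illustration. The source states only that a per-gene binomial exact test and an FDR threshold were used.

```python
from math import comb, inf


def binom_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def somatic_dnds_test(n_nonsyn, n_syn, expected_nonsyn_fraction):
    """Return (dN/dS estimate, one-sided p-value) for a single gene.

    dN/dS is taken as the observed nonsynonymous:synonymous ratio divided
    by the ratio expected under neutrality (an assumed simplification that
    ignores mutational context).
    """
    n = n_nonsyn + n_syn
    p_value = binom_upper_tail(n_nonsyn, n, expected_nonsyn_fraction)
    expected_ratio = expected_nonsyn_fraction / (1 - expected_nonsyn_fraction)
    dnds = (n_nonsyn / n_syn) / expected_ratio if n_syn > 0 else inf
    return dnds, p_value


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented counts: 9 nonsynonymous and 1 synonymous mutation, with an
# assumed neutral expectation of 75% nonsynonymous.
dnds, p = somatic_dnds_test(9, 1, 0.75)
print(dnds)          # 3.0
print(round(p, 4))   # 0.244
```

The example shows why such results are best read as exploratory: a gene with an apparent dN/dS of 3.0 is far from significant when it carries only ten mutations. A test of this kind also applies one neutral expectation per gene and does not model trinucleotide context or per-gene mutation-rate variation, which is what dNdScv is designed to account for.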