Machine Learning–Guided Identification of Cancer-Maintaining Gene Dependencies Through Two-Scale Evolutionary Filtering of TCGA Transcriptomes
Figures and Legends
Figure 1. Analytical pipeline overview. RNA-seq counts and MAF files for — TCGA cohorts (—) were obtained via the GDC API. DESeq2 pre-filtering retained genes with |log₂FC| > 1.5 and BH-adjusted p < 0.05. Three classifiers (LR, RF, MLP) were trained with 5-fold stratified CV; union feature-importance signatures were passed through two evolutionary filters: germline purifying selection (Ensembl Compara dN/dS < 0.3) and somatic positive selection (binomial test, dN/dS ≥ 1.5, FDR < 0.05). Cross-cancer genes were defined as candidates in ≥ 2 cohorts.
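The two evolutionary filters in Figure 1 amount to a conjunction of three threshold checks per gene. The following is a minimal sketch of that logic, not the authors' implementation: the `GeneStats` record, the `is_candidate` helper, and the gene names and values are hypothetical and serve only to illustrate the stated thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneStats:
    """Per-gene inputs to the filter (hypothetical record layout)."""
    gene: str
    germline_dnds: float  # germline dN/dS, e.g. from Ensembl Compara
    somatic_dnds: float   # somatic dN/dS estimated from the cohort MAF
    somatic_fdr: float    # FDR-adjusted p-value of the somatic test


def is_candidate(g, germline_max=0.3, somatic_min=1.5, fdr_max=0.05):
    """Return True if a gene passes both filters stated in the legend."""
    purifying = g.germline_dnds < germline_max
    positive = g.somatic_dnds >= somatic_min and g.somatic_fdr < fdr_max
    return purifying and positive


# Toy values for illustration only; these are not results from the study.
toy = [
    GeneStats("GENE_A", 0.10, 2.4, 0.010),  # passes both filters
    GeneStats("GENE_B", 0.45, 3.0, 0.001),  # fails the germline filter
    GeneStats("GENE_C", 0.20, 1.2, 0.030),  # fails the somatic dN/dS threshold
    GeneStats("GENE_D", 0.05, 1.5, 0.200),  # fails the FDR threshold
]
candidates = [g.gene for g in toy if is_candidate(g)]
print(candidates)
```

A somatic dN/dS of infinity (no synonymous mutations observed) can be represented as `float("inf")` and still satisfies the `>=` comparison.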
Figure 2. TCGA cohort composition and biological context. (A) Sample counts across — cancer types. SMOTE oversampling was applied to PRAD and BLCA normal classes within CV folds; ComBat-seq batch correction was used for PRAD adjacent-normal samples. (B) Protein-coding genes retained after DESeq2 filtering (|log₂FC| > 1.5, BH-adjusted p < 0.05) and pseudogene removal. (C) Cohort-level summary statistics. (D) Cancer-type biological context.
Figure 3. ML classifier performance across — cohorts. (A) MLP classification metrics (5-fold stratified CV). (B) Specificity gain from baseline to optimised MLP (FocalLoss, α = 0.25, γ = 2.0). (C) MLP architecture and sample sizes per cohort.
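For reference, the focal loss named in panel (B) down-weights well-classified samples by a factor (1 − p_t)^γ. The sketch below shows the standard binary form with the α and γ values given in the legend. It assumes that α weights the class labelled 1 and 1 − α the other class; the legend does not state which class (tumour or normal) carries the α weight, and `binary_focal_loss` is an illustrative helper rather than the authors' code.

```python
import math


def binary_focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Focal loss for one sample.

    p is the predicted probability of class 1 and y is the true label (0 or 1).
    Assumption: alpha weights class 1 and (1 - alpha) weights class 0.
    """
    p_t = p if y == 1 else 1.0 - p            # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)             # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct prediction contributes far less than an uncertain one.
easy = binary_focal_loss(0.95, 1)
hard = binary_focal_loss(0.30, 1)
print(easy < hard)
```

With `gamma=0.0` and `alpha=1.0` the expression reduces to ordinary cross-entropy for class 1, which is a convenient sanity check.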
Figure 4. Confusion matrices and filtering funnels. (A) Normalised confusion matrices per cohort (rows = true class, columns = predicted). PRAD normal-class recall is —%; UCEC achieves full specificity with — samples. (B) Gene filtering funnel across all five cohorts.
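Row normalisation, as used in panel (A), divides each row of the count matrix by its total, so that each diagonal entry is the recall of the corresponding true class. A minimal sketch follows; the counts are made up for illustration and `row_normalise` is a hypothetical helper.

```python
def row_normalise(matrix):
    """Divide each row by its sum so that rows (true classes) sum to 1."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([c / total if total else 0.0 for c in row])
    return out


# Toy counts, not study data: rows = true class, columns = predicted class,
# in the order (normal, tumour).
counts = [[8, 2],
          [5, 95]]
norm = row_normalise(counts)
normal_recall = norm[0][0]  # equals specificity when tumour is the positive class
tumour_recall = norm[1][1]
print(normal_recall, tumour_recall)
```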
Figure 5. Candidate gene biology and pathway context. (A) Germline (blue) and somatic (red) dN/dS for — BRCA candidates. TP53: germline —, somatic —; PTEN: germline —, somatic ∞. (B) Pathway grouping of — cross-cancer validated genes. (C) Known vs. novel candidate composition per cancer type.
Figure 6. Cross-cancer validated genes. (A) — genes identified in ≥ 2 cohorts. Bubble size reflects total non-synonymous mutations; TP53 appears in all five cohorts. (B) Binary co-occurrence matrix (gene × cancer). (C) Germline dN/dS of cross-cancer genes, ranked by conservation.
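The "≥ 2 cohorts" rule and the binary gene × cancer matrix in panel (B) can be derived from per-cohort candidate sets. The sketch below is one way to do this under assumed inputs: the cohort and gene names are placeholders, and both helper functions are hypothetical.

```python
from collections import Counter


def cross_cancer_genes(candidates_by_cohort, min_cohorts=2):
    """Genes that are candidates in at least `min_cohorts` cohorts."""
    counts = Counter(
        gene
        for genes in candidates_by_cohort.values()
        for gene in set(genes)  # count each gene once per cohort
    )
    return sorted(g for g, n in counts.items() if n >= min_cohorts)


def cooccurrence_matrix(candidates_by_cohort, genes):
    """Binary matrix with one row per gene and one column per cohort."""
    cohorts = sorted(candidates_by_cohort)
    rows = [[int(g in candidates_by_cohort[c]) for c in cohorts] for g in genes]
    return cohorts, rows


# Placeholder inputs for illustration only.
toy = {
    "COHORT_1": {"GENE_A", "GENE_B"},
    "COHORT_2": {"GENE_A", "GENE_C"},
    "COHORT_3": {"GENE_A", "GENE_B", "GENE_D"},
}
shared = cross_cancer_genes(toy)
cohorts, matrix = cooccurrence_matrix(toy, shared)
print(shared)
print(cohorts, matrix)
```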
Figure 7. Two-scale evolutionary landscape. (A) Germline dN/dS (x) vs. somatic dN/dS (y). Shaded quadrant marks the candidate region (x < 0.3, y ≥ 1.5). Note: because the binomial test over-estimates positive selection, thresholds were raised to dN/dS ≥ 1.5 (CI lower bound > 1.0). (B) Aggregate filtering funnel: ~— DESeq2 genes → — candidates across — cohorts.
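The legend does not state how the binomial test was parameterised. As one possible formulation, a one-sided exact test can ask whether the observed number of non-synonymous mutations exceeds what a neutral expectation would produce. In the sketch below the neutral non-synonymous fraction (`expected_nonsyn_fraction`), the mutation counts, and both function names are assumptions made for illustration only.

```python
from math import comb


def binomial_sf(k, n, p):
    """Exact upper tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def somatic_selection_pvalue(n_nonsyn, n_syn, expected_nonsyn_fraction):
    """One-sided p-value for an excess of non-synonymous mutations.

    expected_nonsyn_fraction is the assumed share of non-synonymous
    mutations under neutrality; it is a hypothetical input here.
    """
    n = n_nonsyn + n_syn
    return binomial_sf(n_nonsyn, n, expected_nonsyn_fraction)


# Toy counts, not study data: 30 non-synonymous vs. 2 synonymous mutations,
# tested against an assumed neutral fraction of 0.75.
p_excess = somatic_selection_pvalue(30, 2, 0.75)
p_neutral = somatic_selection_pvalue(24, 8, 0.75)
print(p_excess < 0.05, p_neutral < 0.05)
```

The resulting p-values would still need multiple-testing correction across genes before applying the FDR < 0.05 cut-off.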
Figure 8. Candidate portfolio and clinical context. (A) Candidates per cohort; UCEC count (n = —) reflects MSI-driven hypermutation. (B) Somatic dN/dS matrix for — genes in ≥ 2 cohorts. (C) Druggability of cross-cancer candidates. (D) GSEA pathway enrichment for BRCA candidates.