1. Data Sources & Preprocessing

RNA-seq HTSeq counts were obtained from the TCGA GDC portal for five cancer types:

Cancer Type                   Total    Tumor    Normal   Tumor:Normal Ratio
BRCA (Breast)                 1,218    1,104    114      9.7:1
BLCA (Bladder)                426      407      19       21.4:1
PRAD (Prostate)               550      498      52       9.6:1
LUAD (Lung Adenocarcinoma)    576      517      59       8.8:1
UCEC (Uterine)                201      177      24       7.4:1

DESeq2 pre-filtering: Genes were retained only if |log2FC| > 1.5 with Benjamini–Hochberg adjusted p < 0.05, leaving ~13,660 genes for downstream modelling.
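The pre-filter rule can be sketched in a few lines of Python. This is a minimal illustration, assuming DESeq2 results have already been exported as (gene, log2FoldChange, padj) rows; the gene names and values are hypothetical placeholders, and treating a missing padj as a failure is an assumption rather than documented pipeline behaviour.

```python
LOG2FC_MIN = 1.5   # |log2FC| threshold from the text
PADJ_MAX = 0.05    # Benjamini-Hochberg adjusted p-value threshold from the text


def passes_prefilter(log2fc, padj):
    """Return True if a gene meets both pre-filter thresholds."""
    # DESeq2 can report padj as NA; here a missing value fails the filter (assumption).
    if padj is None:
        return False
    return abs(log2fc) > LOG2FC_MIN and padj < PADJ_MAX


# Hypothetical rows, for illustration only.
results = [
    ("GENE_A", 2.3, 0.001),    # kept: large effect, significant
    ("GENE_B", -1.9, 0.020),   # kept: down-regulated genes also pass on |log2FC|
    ("GENE_C", 0.8, 0.0001),   # dropped: effect size too small
    ("GENE_D", 3.1, 0.20),     # dropped: not significant
    ("GENE_E", -2.0, None),    # dropped: padj missing
]

kept = [gene for gene, log2fc, padj in results if passes_prefilter(log2fc, padj)]
print(kept)
```

Both conditions must hold together, so a gene with a very small p-value but a modest fold change is still excluded.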

Class balancing: SMOTE oversampling is applied within each CV fold for cancer types with severe class imbalance (PRAD: 9.6:1, BLCA: 21.4:1). For other cancers, class-weight balancing is used instead.
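For the class-weight branch, one common scheme is the "balanced" heuristic, in which each class receives a weight of n_samples / (n_classes × class_count). The sketch below assumes that heuristic is the one used here, which the text does not confirm; the LUAD counts are taken from the table above.

```python
def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_total = sum(counts.values())
    n_classes = len(counts)
    return {label: n_total / (n_classes * n) for label, n in counts.items()}


# LUAD sample counts from the cohort table.
luad_counts = {"tumor": 517, "normal": 59}
weights = balanced_class_weights(luad_counts)

for label, weight in weights.items():
    # Each class ends up contributing the same total weight to the loss.
    print(label, round(weight, 3), round(weight * luad_counts[label], 1))
```

Under this scheme the minority class is up-weighted in proportion to the imbalance ratio, without generating synthetic samples as SMOTE does.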

Batch correction: ComBat batch correction was applied for PRAD to address adjacent-normal heterogeneity between sequencing batches.

2. ML Models

Three complementary model types are trained per cancer type using 5-fold stratified cross-validation:

Model                       Hyperparameters                     Feature Importance Method
Logistic Regression (L2)    L2 penalty, C=1.0                   |coefficients|
Random Forest               100–500 trees, max_depth=None       Gini importance
MLP Neural Network          Dynamic architecture (see below)    gradient × input saliency

Dynamic MLP architecture:

  • 512 → 256 → 128 neurons when n > 600 samples
  • 256 → 128 neurons when n ≤ 600 samples

BatchNorm1d is applied between each hidden layer.
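The size rule can be expressed as a small helper. This is a sketch only: the function name is hypothetical, and it assumes n refers to the total number of samples in the cohort (the text does not say whether n is the full cohort or the training fold).

```python
def hidden_layer_sizes(n_samples):
    """Choose hidden-layer widths from cohort size, following the rule above."""
    if n_samples > 600:
        return [512, 256, 128]
    return [256, 128]


# Total sample counts from the cohort table.
cohort_sizes = {"BRCA": 1218, "BLCA": 426, "PRAD": 550, "LUAD": 576, "UCEC": 201}

for cancer, n in cohort_sizes.items():
    print(cancer, n, hidden_layer_sizes(n))
```

With these totals, only BRCA exceeds 600 samples and would receive the three-layer network; the other four cohorts would use the two-layer variant.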

FocalLoss (α=0.25, γ=2.0) replaces BCEWithLogitsLoss to focus training on hard-to-classify normal samples, improving specificity for imbalanced cohorts.
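As an illustration, the standard binary focal loss for a single sample is FL = -α_t (1 - p_t)^γ log(p_t), where p_t is the predicted probability of the true class. The pure-Python sketch below uses the α and γ values quoted above; the label convention (1 = positive class, weighted by α; 0 = negative class, weighted by 1 - α) is an assumption, and the pipeline's own FocalLoss class may differ in detail.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def focal_loss(logit, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one sample, computed from a raw logit."""
    # Flip the sign for the negative class so that p_t = sigmoid(signed).
    signed = logit if target == 1 else -logit
    log_pt = log_sigmoid(signed)
    pt = math.exp(log_pt)
    alpha_t = alpha if target == 1 else 1.0 - alpha
    # (1 - p_t)^gamma shrinks the loss of confidently correct samples.
    return -alpha_t * (1.0 - pt) ** gamma * log_pt


easy = focal_loss(4.0, 1)    # confident and correct
hard = focal_loss(-4.0, 1)   # confident and wrong
print(easy, hard)
```

Setting γ = 0 recovers α-weighted binary cross-entropy, so the modulating factor (1 - p_t)^γ is what shifts the training signal towards misclassified samples.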

3. Gene Signature Extraction

For each cancer type the gene signature is constructed as the union of top-N genes across all three models (LR, RF, MLP).

  • Pseudogene blacklist filter: Genes annotated as pseudogenes in Ensembl (biotype filtering) are removed before ranking.
  • Importance renormalisation: After filtering, importance scores are renormalised so they sum to 1.0 within each model.
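The filtering, renormalisation, and union steps can be sketched as follows. Function names, gene names, and scores are hypothetical placeholders, and N = 2 is chosen only to keep the example small.

```python
def filter_and_renormalise(importances, blacklist):
    """Drop blacklisted genes, then rescale the remaining scores to sum to 1.0."""
    kept = {g: s for g, s in importances.items() if g not in blacklist}
    total = sum(kept.values())
    if total == 0:
        return {g: 0.0 for g in kept}
    return {g: s / total for g, s in kept.items()}


def top_n(importances, n):
    """Return the n highest-scoring genes as a set."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return set(ranked[:n])


def build_signature(model_importances, blacklist, n):
    """Union of the top-n genes from each model after filtering."""
    signature = set()
    for scores in model_importances.values():
        signature |= top_n(filter_and_renormalise(scores, blacklist), n)
    return signature


# Hypothetical per-model importance scores.
models = {
    "LR": {"GENE_A": 0.5, "PSEUDO_1": 0.3, "GENE_B": 0.2},
    "RF": {"GENE_C": 0.6, "GENE_A": 0.3, "PSEUDO_1": 0.1},
    "MLP": {"GENE_B": 0.7, "GENE_D": 0.2, "GENE_C": 0.1},
}
signature = build_signature(models, blacklist={"PSEUDO_1"}, n=2)
print(sorted(signature))
```

Because the blacklist is applied before ranking, a pseudogene cannot occupy a top-N slot that would otherwise go to a protein-coding gene.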

4. Germline dN/dS (Conservation)

Cross-species comparison spanning ~90–400 Myr of divergence is used to quantify purifying selection on protein-coding genes.

Species panel: mouse, rat, dog, cow, opossum, zebrafish.

A weighted mean dN/dS is computed across species, weighted by divergence time. Genes with dN/dS < 0.3 are classified as under purifying selection, indicating they are functionally constrained and likely essential.
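A minimal sketch of the weighted mean follows. It assumes the weights are directly proportional to divergence time, which the text does not specify, and that species without an ortholog are skipped. The divergence times and per-species dN/dS values are illustrative placeholders, not the values used by the pipeline.

```python
PURIFYING_MAX = 0.3  # dN/dS threshold from the text

# Illustrative divergence times from human in Myr (placeholders).
DIVERGENCE_MYR = {
    "mouse": 90, "rat": 90, "dog": 95, "cow": 95, "opossum": 160, "zebrafish": 400,
}


def weighted_mean_dnds(dnds_by_species, divergence_myr=DIVERGENCE_MYR):
    """Divergence-time-weighted mean dN/dS; species with no estimate are skipped."""
    pairs = [
        (value, divergence_myr[species])
        for species, value in dnds_by_species.items()
        if value is not None
    ]
    total_weight = sum(weight for _, weight in pairs)
    if total_weight == 0:
        return None
    return sum(value * weight for value, weight in pairs) / total_weight


# Hypothetical per-species estimates for one gene (no zebrafish ortholog).
gene_dnds = {"mouse": 0.10, "rat": 0.12, "dog": 0.08, "cow": 0.09,
             "opossum": 0.15, "zebrafish": None}
mean_dnds = weighted_mean_dnds(gene_dnds)
print(mean_dnds, mean_dnds < PURIFYING_MAX)
```

Under this weighting, distant species such as zebrafish dominate the mean whenever an ortholog is available.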

5. Somatic dN/dS (Selection)

Somatic dN/dS is tested with a binomial exact test that compares the observed number of nonsynonymous mutations with the number expected under neutral evolution (expected nonsynonymous proportion = 2.85/(1+2.85) ≈ 0.74). FDR correction (Benjamini–Hochberg) is applied to genes with dN/dS > 1. This is a simplified approach compared to the dNdScv method (Martincorena et al., 2017), which accounts for gene-specific covariates.
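The test can be sketched with the standard library alone. Two details here are assumptions rather than statements from the text: the test is taken to be one-sided (an excess of nonsynonymous mutations), and the point estimate is taken to be the observed nonsynonymous:synonymous ratio divided by the neutral expectation of 2.85.

```python
import math

NEUTRAL_RATIO = 2.85                           # expected nonsyn:syn ratio under neutrality
P_NONSYN = NEUTRAL_RATIO / (1 + NEUTRAL_RATIO)  # about 0.74


def binomial_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def somatic_dnds(n_nonsyn, n_syn):
    """Observed nonsyn:syn ratio relative to the neutral expectation."""
    if n_syn == 0:
        # No synonymous mutations: the ratio is unbounded.
        return math.inf if n_nonsyn > 0 else math.nan
    return (n_nonsyn / n_syn) / NEUTRAL_RATIO


def selection_p_value(n_nonsyn, n_syn):
    """One-sided exact p-value for an excess of nonsynonymous mutations."""
    return binomial_upper_tail(n_nonsyn, n_nonsyn + n_syn, P_NONSYN)


# Counts quoted in the Limitations section for TP53 in PRAD.
print(somatic_dnds(57, 0), selection_p_value(57, 0))
```

The p-value depends on the mutation counts rather than on the ratio itself, so a gene with zero synonymous mutations can still be tested even though its point estimate is infinite.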

Genes under positive somatic selection must satisfy all three criteria:

  • dN/dS ≥ 1.5
  • 95% CI lower bound > 1.0
  • FDR q < 0.05 (TMB-adaptive: < 0.01 for hypermutated cancers)

Threshold         Old Value    New Value                   Rationale
dN/dS minimum     1.0          1.5                         Reduces false positives from near-neutral genes
CI lower bound    (none)       > 1.0                       Ensures statistical robustness
FDR threshold     0.05         0.05 (0.01 for high-TMB)    TMB-adaptive: stricter threshold for hypermutated cancers (e.g., UCEC)
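The three criteria combine into a single predicate. The sketch below uses the new threshold values from the table; the function names and the boolean hypermutated flag are hypothetical, and how a cohort is classified as hypermutated is not specified in the text.

```python
DNDS_MIN = 1.5
CI_LOWER_MIN = 1.0
FDR_DEFAULT = 0.05
FDR_HIGH_TMB = 0.01


def fdr_threshold(hypermutated):
    """TMB-adaptive q-value cut-off."""
    return FDR_HIGH_TMB if hypermutated else FDR_DEFAULT


def is_positively_selected(dnds, ci_lower, q_value, hypermutated=False):
    """True only when all three criteria hold."""
    return (
        dnds >= DNDS_MIN
        and ci_lower > CI_LOWER_MIN
        and q_value < fdr_threshold(hypermutated)
    )


# Hypothetical gene: passes in a typical cohort, fails under the stricter high-TMB cut-off.
print(is_positively_selected(2.0, 1.2, 0.03))
print(is_positively_selected(2.0, 1.2, 0.03, hypermutated=True))
```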

6. Integration & Candidate Identification

Candidate cancer dependencies are identified at the intersection of three evidence layers:

  • ML-predictive — gene appears in the top-N signature
  • Germline conserved — dN/dS < 0.3 across species
  • Somatic selected — dN/dS ≥ 1.5, CI > 1.0, FDR < 0.05 (FDR < 0.01 for high-TMB cancers)

Cross-cancer validation: Genes appearing in ≥ 2 cancer types receive higher confidence. Priority scoring is based on multi-criteria ranking across all three layers.
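The intersection and the cross-cancer count can be sketched with plain sets. Gene names and set contents are hypothetical, and the priority-scoring step is not shown because its formula is not given in the text.

```python
from collections import Counter


def candidate_genes(ml_signature, germline_conserved, somatic_selected):
    """Genes supported by all three evidence layers."""
    return set(ml_signature) & set(germline_conserved) & set(somatic_selected)


def cross_cancer_support(candidates_by_cancer, min_cancers=2):
    """Genes that are candidates in at least min_cancers cancer types."""
    counts = Counter(g for genes in candidates_by_cancer.values() for g in genes)
    return {g: n for g, n in counts.items() if n >= min_cancers}


# Hypothetical evidence layers for two cohorts.
by_cancer = {
    "BRCA": candidate_genes(
        {"GENE_A", "GENE_B", "GENE_C"}, {"GENE_A", "GENE_B"}, {"GENE_A", "GENE_C"}
    ),
    "LUAD": candidate_genes(
        {"GENE_A", "GENE_D"}, {"GENE_A", "GENE_D"}, {"GENE_A", "GENE_D"}
    ),
}
print(by_cancer)
print(cross_cancer_support(by_cancer))
```

A gene that is missing from any one layer is excluded, so the candidate list can only be as large as the smallest of the three sets.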

7. Statistical Framework

  • Balanced accuracy — primary classification metric (handles class imbalance by averaging per-class recall).
  • MCC (Matthews Correlation Coefficient) — single-number measure of binary classification quality that accounts for all four confusion-matrix cells.
  • Benjamini–Hochberg FDR correction applied to all multiple-testing scenarios (DESeq2, somatic dN/dS).
  • 95% confidence intervals for somatic dN/dS estimates, computed via profile likelihood.
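The two classification metrics and the Benjamini–Hochberg adjustment can be written out from their standard definitions. This is a stdlib-only sketch for illustration; the pipeline presumably relies on library implementations. Returning 0 for MCC when a marginal total is zero is a common convention and is assumed here.

```python
import math


def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity (tumour recall) and specificity (normal recall)."""
    return (tp / (tp + fn) + tn / (tn + fp)) / 2


def mcc(tp, fn, tn, fp):
    """Matthews correlation coefficient from the four confusion-matrix cells."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# A classifier that labels every LUAD sample as tumour (517 tumour, 59 normal).
print(balanced_accuracy(tp=517, fn=0, tn=0, fp=59), mcc(tp=517, fn=0, tn=0, fp=59))
print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

The always-tumour classifier reaches a plain accuracy of about 0.90 on LUAD yet scores 0.5 on balanced accuracy and 0.0 on MCC, which is why these two metrics are preferred for imbalanced cohorts.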

8. Reproducibility

  • Random seed = 42 for all stochastic operations (train/test splits, model initialisation, SMOTE).
  • All thresholds centralised in config.py — no magic numbers in pipeline code.
  • Results are namespaced by cancer type (e.g. results/TCGA-BRCA/), enabling independent re-runs per cohort.
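A centralised configuration of this kind might look like the sketch below. The class, field, and method names are hypothetical and are not taken from the actual config.py; only the numeric values and the results/TCGA-BRCA/ layout come from this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    """Single source of truth for thresholds; frozen to prevent accidental edits."""
    random_seed: int = 42
    cv_folds: int = 5
    log2fc_min: float = 1.5
    padj_max: float = 0.05
    germline_dnds_max: float = 0.3
    somatic_dnds_min: float = 1.5
    somatic_ci_lower_min: float = 1.0
    fdr_default: float = 0.05
    fdr_high_tmb: float = 0.01
    results_root: str = "results"

    def results_dir(self, project_id: str) -> str:
        """Per-cohort output directory, e.g. results/TCGA-BRCA."""
        return f"{self.results_root}/{project_id}"


CONFIG = PipelineConfig()
print(CONFIG.results_dir("TCGA-BRCA"))
```

Freezing the dataclass means an attempt to change a threshold mid-run raises an error instead of silently altering downstream results.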

9. Limitations

  • Bulk RNA-seq only — does not capture single-cell heterogeneity within tumour or stromal compartments.
  • Limited normal samples for some cancer types (BLCA: 19 normals, UCEC: 24 normals), mitigated by SMOTE (BLCA) or class-weight balancing (UCEC) but not eliminated.
  • Somatic dN/dS depends on mutation count — low-mutation genes produce wide confidence intervals and may be missed.
  • Cross-species dN/dS may miss lineage-specific functional constraints that arose after the last common ancestor.
  • PRAD under-powered — prostate cancer has the lowest TMB in our cohort (median 2 nonsyn/gene), yielding only 1 candidate (TP53). Adjacent-normal tissue contamination also reduces classifier specificity.
  • Near-perfect AUC — AUC ≥ 0.999 for several cancers reflects the fundamental transcriptomic difference between tumour and normal tissue, not overfitting. 5-fold stratified CV with SMOTE applied only within folds prevents data leakage.
  • UCEC hypermutation — elevated TMB (median 37 nonsyn/gene) inflates the number of genes reaching statistical significance. TMB-adaptive FDR (q < 0.01) partially addresses this but 116 candidates should be interpreted cautiously.
  • Infinite dN/dS — genes with zero synonymous mutations yield dN/dS = ∞. These are retained when FDR is significant (e.g., TP53 in PRAD: 57 nonsyn, 0 syn), as the statistical test accounts for mutation counts.