Pipeline Overview

Our data pipeline transforms raw DNA methylation array data into actionable diagnostic insights through a series of carefully designed processing steps. The pipeline ensures data quality, removes technical artifacts, and extracts the most relevant epigenetic features for our transformer model.

Sample Collection → DNA Extraction → Array Processing → QC & Preprocessing → AI Analysis → Clinical Report

Figure 1: End-to-end pipeline from sample collection to clinical report generation

Data Acquisition

Our research utilized DNA methylation data from multiple sources to ensure robust model training and validation:

Public Repositories

We curated DNA methylation data from multiple Illumina 450K/EPIC array studies available on NCBI GEO, focusing on datasets with ME/CFS, Long COVID, and healthy control samples.

GEO accessions: GSE156063, GSE161450
120+ ME/CFS samples, 80+ Long COVID samples, 150+ healthy controls
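As an illustration, series metadata and raw supplementary files (IDAT files, where submitters provide them) can be pulled from GEO with the GEOquery package; the accessions are those listed above and the download directory is arbitrary. This is a sketch of the retrieval step, not our exact download script.

# Sketch: fetch series metadata and raw supplementary files for each GEO series
library(GEOquery)

for (gse in c("GSE156063", "GSE161450")) {
  eset <- getGEO(gse, GSEMatrix = TRUE)        # processed series matrix plus phenotype data
  getGEOSuppFiles(gse, baseDir = "data/geo")   # raw supplementary files, including any IDATs
}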

Clinical Collaborations

We partnered with specialized ME/CFS and Long COVID clinics to collect additional samples with detailed clinical metadata, including symptom severity, duration, and comorbidities.

50+ ME/CFS samples with clinical data, 40+ Long COVID samples with clinical data, 60+ matched controls

Validation Cohort

We collected an independent validation cohort to test the final model performance, ensuring generalizability across different patient populations and technical conditions.

30 ME/CFS samples, 30 Long COVID samples, and 30 healthy controls, processed at a different laboratory facility

Quality Control & Preprocessing

Raw methylation data requires extensive preprocessing to remove technical artifacts and ensure reliable downstream analysis. Our pipeline implements a comprehensive QC workflow:

Step 01: Raw Data Import

Import IDAT files from Illumina 450K/EPIC arrays using the minfi R package, which provides robust methods for handling raw methylation data.

# Import IDAT files from Illumina 450K/EPIC arrays with minfi
library(minfi)

# Sample sheet describing each array; adjust the path to your IDAT folder layout
targets <- read.metharray.sheet("data/idats")

rgSet <- read.metharray.exp(targets = targets)   # raw red/green channel intensities
mSet  <- preprocessRaw(rgSet)                    # methylated/unmethylated signals, no normalization
Step 02: Sample Quality Assessment

Evaluate sample quality metrics including detection p-values, bisulfite conversion efficiency, and overall signal intensity. Samples failing QC thresholds are removed.

Detection p-value: < 0.01 for >99% of probes
Bisulfite conversion: >98% efficiency
Median intensity: >10.5 (log2 scale)
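A minimal sketch of this assessment with minfi, reusing rgSet and mSet from the import step and the thresholds listed above; the bisulfite-conversion check (for example with wateRmelon's bscon) is run separately.

# Per-probe, per-sample detection p-values against background signal
detP <- detectionP(rgSet)
fracDetected <- colMeans(detP < 0.01)      # fraction of confidently detected probes per sample

# Median methylated/unmethylated intensities (log2 scale)
qc <- getQC(mSet)
medIntensity <- (qc$mMed + qc$uMed) / 2

# Keep only samples passing both thresholds
keepSamples <- fracDetected > 0.99 & medIntensity > 10.5
rgSet <- rgSet[, keepSamples]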
Step 03: Probe Filtering

Remove unreliable probes including those with high detection p-values, cross-reactive probes, and probes containing SNPs that could affect methylation measurements.

~40,000 probes removed due to cross-reactivity
~20,000 probes removed due to overlap with SNPs
~10,000 probes removed due to detection p-value > 0.01
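A sketch of how the exclusion lists might be assembled; snpProbes and crossReactiveProbes are illustrative probe-ID vectors (taken from the array manifest and published cross-reactivity lists), not objects created above, and the resulting keepProbes set is applied to the normalized data in the following steps.

# Probes confidently detected (p < 0.01) in at least 99% of the remaining samples
detP <- detectionP(rgSet)
keepDetected <- rownames(detP)[rowSums(detP < 0.01) >= 0.99 * ncol(detP)]

# snpProbes / crossReactiveProbes: probe IDs overlapping SNPs or known to cross-hybridize
keepProbes <- setdiff(keepDetected, c(snpProbes, crossReactiveProbes))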
Step 04: Normalization

Apply normalization to correct for technical biases in the methylation data, including probe type bias (Infinium I vs II) and batch effects.

# Functional normalization removes technical variation using the array control probes
library(minfi)
library(wateRmelon)

mSetNorm <- preprocessFunnorm(rgSet)

# BMIQ then corrects the Infinium type I vs type II probe bias on the beta values
betaNorm  <- getBeta(mSetNorm)
probeType <- ifelse(getAnnotation(mSetNorm)$Type == "I", 1, 2)
betaBMIQ  <- apply(betaNorm, 2, function(b) BMIQ(b, probeType)$nbeta)
Step 05: Batch Correction

Apply ComBat to remove batch effects while preserving biological variation related to disease status. This step is crucial when integrating data from multiple sources.
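A minimal sketch using ComBat from the sva package, assuming mvals is a CpG-by-sample matrix of M-values (logit-transformed betas, see Data Transformation below) restricted to the retained probes, and pheno is the sample sheet with batch and diagnosis columns; both object names are illustrative.

library(sva)

# The model matrix protects the biological signal (diagnosis) while batch effects are estimated
mod <- model.matrix(~ group, data = pheno)

mvalsCombat <- ComBat(dat = mvals, batch = pheno$batch, mod = mod)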

Figure 2: PCA plot showing sample clustering before (left) and after (right) batch correction
Step 06: Cell Type Composition Estimation

Estimate cell type proportions in blood samples using reference-based deconvolution methods, as cell type heterogeneity can confound methylation analyses.

Estimated fractions: CD8+ T cells, CD4+ T cells, NK cells, B cells, monocytes, granulocytes
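A sketch of this step using minfi's reference-based (Houseman) deconvolution, which requires the FlowSorted.Blood.450k reference package and the sample-filtered rgSet from the QC step.

library(minfi)
library(FlowSorted.Blood.450k)   # reference methylation profiles of sorted blood cells

cellCounts <- estimateCellCounts(rgSet,
                                 compositeCellType = "Blood",
                                 cellTypes = c("CD8T", "CD4T", "NK", "Bcell", "Mono", "Gran"))

# cellCounts: samples x cell types matrix, used later as covariates in the linear models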

Feature Selection

From the ~450,000 CpG sites measured on the arrays, we selected a refined feature set of approximately 1,280 CpG sites based on differential methylation analyses and biological relevance.

Differential Methylation Analysis

We identified differentially methylated positions (DMPs) between ME/CFS, Long COVID, and control samples using linear models with empirical Bayes moderation, adjusting for age, sex, and estimated cell type proportions.

~5,000 DMPs at FDR < 0.05
~2,500 DMPs with |Δβ| > 0.05
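A sketch of the limma workflow, assuming mvalsCombat is the batch-corrected M-value matrix and pheno carries group, age, sex, and the estimated cell proportions; the group labels in the contrast are illustrative.

library(limma)

# Granulocytes are omitted from the covariates to avoid collinearity among the proportions
design <- model.matrix(~ 0 + group + age + sex + CD8T + CD4T + NK + Bcell + Mono, data = pheno)

fit <- lmFit(mvalsCombat, design)

# Example contrast: ME/CFS versus healthy controls
contr <- makeContrasts(contrasts = "groupMECFS - groupControl", levels = design)
fit2  <- eBayes(contrasts.fit(fit, contr))

dmps <- topTable(fit2, coef = 1, number = Inf, adjust.method = "BH")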

Biological Pathway Enrichment

We prioritized CpG sites in genes involved in pathways relevant to ME/CFS and Long COVID pathophysiology, including immune function, energy metabolism, and stress response.

Immune system regulation: HLA-DRB1, IFNG, IL10, FOXP3
Energy metabolism: PDK2, AMPK, PGC1A
Stress response: NR3C1, FKBP5, POMC
Viral response: IFITM3, OAS1, MX1
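One way to run such an enrichment is with gometh from the missMethyl package, which accounts for the varying number of probes per gene; the significance cutoff and background set below are illustrative.

library(missMethyl)

sigCpGs <- rownames(dmps)[dmps$adj.P.Val < 0.05]

goRes <- gometh(sig.cpg = sigCpGs,
                all.cpg = rownames(mvalsCombat),
                collection = "GO",
                array.type = "450K")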

Machine Learning Feature Importance

We used feature importance scores from preliminary random forest models to identify CpG sites with high discriminative power for classification tasks.

Figure 3: Top 20 CpG sites ranked by feature importance
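A sketch of this preliminary ranking, assuming X is a samples-by-CpGs matrix of batch-corrected M-values for the candidate sites and y is the diagnosis factor; both names are illustrative.

library(randomForest)

rf <- randomForest(x = X, y = y, ntree = 1000, importance = TRUE)

imp <- importance(rf, type = 1)                               # mean decrease in accuracy per CpG
topCpGs <- rownames(imp)[order(imp, decreasing = TRUE)][1:20]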

The final feature set of 1,280 CpG sites was selected to balance model performance, biological interpretability, and computational efficiency. This feature set captures the most relevant epigenetic signals while reducing noise and redundancy.

Data Transformation

Before feeding the data into our transformer model, we apply several transformations to optimize model performance:

Beta to M-Value Conversion

Convert beta-values (0-1 scale) to M-values via the logit transform. M-values are approximately homoscedastic across the methylation range, which gives them better statistical properties for differential methylation analysis and downstream modeling.

M = log2(β/(1-β))

Z-Score Standardization

Standardize M-values to have zero mean and unit variance across samples, which helps with model convergence and makes features comparable in scale.

Z = (M - μ) / σ
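A compact sketch of both transformations, using the BMIQ-corrected betas from the normalization step; the clipping constant is an illustrative guard against infinite logits at beta values of exactly 0 or 1.

# Beta to M-value conversion (logit transform)
betaClipped <- pmin(pmax(betaBMIQ, 0.001), 0.999)
mvals <- log2(betaClipped / (1 - betaClipped))

# Z-score each CpG (row) across samples to zero mean and unit variance
zvals <- t(scale(t(mvals)))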

Tokenization

Group CpG sites into tokens based on genomic proximity or functional relationships, creating a structured input format for the transformer model.

Token 1: CpGs 1-32
Token 2: CpGs 33-64
Token 3: CpGs 65-96
...
Token 40: CpGs 1249-1280
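A sketch of the grouping for a single sample, assuming selectedCpGs is the vector of 1,280 chosen probe IDs ordered by genomic position (an illustrative name, not an object defined above).

# One sample's 1,280 standardized values, reshaped into 40 tokens of 32 CpGs each
sampleVec   <- zvals[selectedCpGs, 1]
tokenMatrix <- matrix(sampleVec, nrow = 40, ncol = 32, byrow = TRUE)

# Row i of tokenMatrix is token i, covering CpGs 32*(i-1)+1 through 32*i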

Data Augmentation

For training data, we apply subtle augmentations to improve model robustness, including small random noise addition and simulated batch effects.

Gaussian noise: σ = 0.05
Missing value simulation: 5% random masking
Batch effect simulation: random shifts up to ±0.1
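A sketch of the three augmentations with the parameters listed above; augmentSample is an illustrative helper applied to a training sample's feature vector.

augmentSample <- function(x, noiseSd = 0.05, maskFrac = 0.05, shiftMax = 0.1) {
  x <- x + rnorm(length(x), sd = noiseSd)                   # small Gaussian noise
  x[sample(length(x), round(maskFrac * length(x)))] <- NA   # simulate missing values
  x + runif(1, -shiftMax, shiftMax)                         # global shift mimicking a batch effect
}

augmented <- augmentSample(sampleVec)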

Model Training & Validation

Our transformer model is trained using a rigorous cross-validation approach to ensure robust performance and generalizability:

Self-Supervised Pretraining

The transformer is first pretrained using masked value prediction on the entire dataset (without using diagnostic labels). This allows the model to learn the inherent structure of methylation data.

Masking rate: 15%
Learning rate: 1e-4
Batch size: 32
Epochs: 100
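The masking scheme itself is simple to illustrate; below is a sketch of how inputs and reconstruction targets can be prepared for one sample (the transformer that consumes them is implemented separately, and maskInputs is an illustrative helper).

# Mask 15% of values; the model is trained to reconstruct the originals at the masked positions
maskInputs <- function(x, maskRate = 0.15, maskValue = 0) {
  idx <- sample(length(x), round(maskRate * length(x)))
  list(input   = replace(x, idx, maskValue),   # what the model sees
       target  = x[idx],                       # what it must reconstruct
       indices = idx)
}

masked <- maskInputs(as.vector(tokenMatrix))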

Fine-tuning with Cross-Validation

The pretrained model is fine-tuned for the diagnostic classification task using 10-fold stratified cross-validation to ensure robust evaluation.

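A sketch of how the stratified splits can be generated with the caret package, where y is the diagnosis factor for the training samples.

library(caret)

set.seed(42)
folds <- createFolds(y, k = 10, list = TRUE, returnTrain = FALSE)

# folds[[i]] holds the held-out sample indices for fold i of the fine-tuning runs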

Hyperparameter Optimization

We use Bayesian optimization to tune key hyperparameters, including learning rate, dropout rate, and model architecture parameters.

Learning rate: 1e-5 to 1e-3
Dropout rate: 0.1 to 0.3
Attention heads: 4, 8, or 12
Expert networks: 2, 4, or 8

Independent Validation

The final model is evaluated on a completely independent validation cohort to assess real-world performance and generalizability.

Accuracy: 97.06%
Macro F1-Score: 0.96
Macro AUROC: 0.98

Explore Our Technology

Learn more about the transformer architecture that powers our diagnostic system.