Pipeline Overview
Our data pipeline transforms raw DNA methylation array data into actionable diagnostic insights through a series of carefully designed processing steps. The pipeline ensures data quality, removes technical artifacts, and extracts the most relevant epigenetic features for our transformer model.
Data Acquisition
Our research utilized DNA methylation data from multiple sources to ensure robust model training and validation:
Public Repositories
We curated DNA methylation data from multiple Illumina 450K/EPIC array studies available on NCBI GEO, focusing on datasets with ME/CFS, Long COVID, and healthy control samples.
Clinical Collaborations
We partnered with specialized ME/CFS and Long COVID clinics to collect additional samples with detailed clinical metadata, including symptom severity, duration, and comorbidities.
Validation Cohort
We collected an independent validation cohort to test the final model performance, ensuring generalizability across different patient populations and technical conditions.
Quality Control & Preprocessing
Raw methylation data requires extensive preprocessing to remove technical artifacts and ensure reliable downstream analysis. Our pipeline implements a comprehensive QC workflow:
Raw Data Import
Import IDAT files from Illumina 450K/EPIC arrays using the minfi R package, which provides robust methods for handling raw methylation data.
# R code for importing IDAT files with minfi
library(minfi)
rgSet <- read.metharray.exp(targets = targets)  # targets: sample sheet data frame
mSet <- preprocessRaw(rgSet)                    # raw methylated/unmethylated signals
Sample Quality Assessment
Evaluate sample quality metrics including detection p-values, bisulfite conversion efficiency, and overall signal intensity. Samples failing QC thresholds are removed.
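The detection p-value check described above can be sketched as follows. This is a minimal illustration, not the pipeline's exact implementation; the 0.01 per-probe cutoff and 5% per-sample failure threshold are assumed example values.

```python
# Hedged sketch: flag samples whose fraction of failed probes
# (detection p-value above a cutoff) is too high. Thresholds are
# illustrative, not the study's exact settings.
import numpy as np

rng = np.random.default_rng(0)
# det_p: probes x samples matrix of detection p-values (simulated here)
det_p = rng.uniform(0, 0.008, size=(1000, 4))
det_p[:, 3] = rng.uniform(0, 0.5, size=1000)  # simulate one low-quality sample

P_THRESH = 0.01      # per-probe detection p-value cutoff (assumed)
SAMPLE_FAIL = 0.05   # drop samples with >5% failed probes (assumed)

fail_frac = (det_p > P_THRESH).mean(axis=0)   # fraction of failed probes per sample
keep = fail_frac <= SAMPLE_FAIL
print(keep)
```

In this simulation only the fourth (low-quality) sample is flagged for removal.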
Probe Filtering
Remove unreliable probes including those with high detection p-values, cross-reactive probes, and probes containing SNPs that could affect methylation measurements.
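The probe-level filters combine naturally as boolean masks; a minimal sketch, assuming a precomputed exclusion list for cross-reactive and SNP-overlapping probes (indices here are made up for illustration):

```python
# Hedged sketch of probe filtering: drop probes that fail detection in
# any sample, plus probes on an exclusion list (cross-reactive / SNP-
# overlapping). Cutoffs and indices are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n_probes, n_samples = 8, 3
det_p = rng.uniform(0, 0.008, size=(n_probes, n_samples))
det_p[2, 1] = 0.2                       # probe 2 fails in one sample
exclude = np.zeros(n_probes, dtype=bool)
exclude[[5, 6]] = True                  # e.g. known cross-reactive probes

ok_detect = (det_p < 0.01).all(axis=1)  # detected in every sample
keep = ok_detect & ~exclude
print(np.flatnonzero(keep))             # indices of retained probes
```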
Normalization
Apply normalization to correct for technical biases in the methylation data, in particular probe-type bias (Infinium I vs II); residual batch effects are addressed in a dedicated downstream step.
# Functional normalization, then BMIQ probe-type (Infinium I vs II) correction
library(minfi)
library(wateRmelon)
mSetNorm <- preprocessFunnorm(rgSet)
mSetBMIQ <- BMIQ(mSetNorm)
Batch Correction
Apply ComBat to remove batch effects while preserving biological variation related to disease status. This step is crucial when integrating data from multiple sources.
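ComBat itself applies empirical Bayes shrinkage to batch parameters and can protect biological covariates; the stripped-down sketch below illustrates only the core location/scale idea of a batch adjustment, and is not a substitute for the actual ComBat procedure:

```python
# Hedged sketch: a simple location/scale batch adjustment. Real ComBat
# additionally shrinks batch parameters with empirical Bayes and
# preserves specified biological covariates; neither is done here.
import numpy as np

def adjust_batches(X, batch):
    """X: samples x features matrix; batch: per-sample batch labels."""
    Xc = X.astype(float).copy()
    grand_mean = X.mean(axis=0)
    pooled_sd = X.std(axis=0)
    for b in np.unique(batch):
        idx = batch == b
        mu = X[idx].mean(axis=0)
        sd = X[idx].std(axis=0)
        sd[sd == 0] = 1.0                       # guard against constant features
        Xc[idx] = (X[idx] - mu) / sd * pooled_sd + grand_mean
    return Xc

rng = np.random.default_rng(8)
X = rng.normal(size=(20, 5))
batch = np.array([0] * 10 + [1] * 10)
X[batch == 1] += 2.0                            # simulate an additive batch effect
X_adj = adjust_batches(X, batch)
```

After adjustment, the per-batch feature means coincide, removing the simulated shift.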
Cell Type Composition Estimation
Estimate cell type proportions in blood samples using reference-based deconvolution methods, as cell type heterogeneity can confound methylation analyses.
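Reference-based deconvolution amounts to finding non-negative cell-type weights whose mixture of reference profiles best reconstructs each sample; a minimal sketch using non-negative least squares (the reference matrix here is simulated, not a real blood reference panel):

```python
# Hedged sketch of reference-based deconvolution (Houseman-style idea):
# solve for non-negative cell-type proportions that best reconstruct a
# sample's methylation profile from reference profiles, then renormalise.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(2)
n_cpgs, n_types = 200, 3
ref = rng.uniform(0, 1, size=(n_cpgs, n_types))    # simulated reference profiles
true_w = np.array([0.6, 0.3, 0.1])                 # ground-truth proportions
sample = ref @ true_w + rng.normal(0, 0.01, n_cpgs)

w, _ = nnls(ref, sample)   # non-negative least squares fit
w = w / w.sum()            # proportions sum to 1
print(np.round(w, 2))
```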
Feature Selection
From the ~450,000 (450K) or ~850,000 (EPIC) CpG sites measured on the arrays, we selected a refined feature set of approximately 1,280 CpG sites based on differential methylation analyses and biological relevance.
Differential Methylation Analysis
We identified differentially methylated positions (DMPs) between ME/CFS, Long COVID, and control samples using linear models with empirical Bayes moderation, adjusting for age, sex, and estimated cell type proportions.
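The empirical Bayes moderation (as in limma) is omitted from the sketch below; it shows only the underlying per-CpG linear model with disease status and covariates fitted by ordinary least squares on simulated data:

```python
# Hedged sketch: per-CpG linear model with disease status plus covariates
# (age, sex; cell proportions would enter the same way). Empirical Bayes
# variance moderation, as used in the actual analysis, is not shown.
import numpy as np

rng = np.random.default_rng(3)
n = 60
status = np.repeat([0, 1], n // 2)           # control vs case
age = rng.uniform(20, 70, n)
sex = rng.integers(0, 2, n)
X = np.column_stack([np.ones(n), status, age, sex])

m_vals = rng.normal(0, 1, size=(5, n))       # 5 simulated CpGs
m_vals[0] += 2.0 * status                    # one truly differential CpG

betas = []
for y in m_vals:
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    betas.append(coef[1])                    # disease-status effect size
print(np.round(betas, 2))
```

The truly differential CpG stands out with the largest disease-status coefficient.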
Biological Pathway Enrichment
We prioritized CpG sites in genes involved in pathways relevant to ME/CFS and Long COVID pathophysiology, including immune function, energy metabolism, and stress response.
Machine Learning Feature Importance
We used feature importance scores from preliminary random forest models to identify CpG sites with high discriminative power for classification tasks.
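A minimal sketch of this ranking step, using scikit-learn's random forest on simulated data (in the pipeline this would run on real M-values with diagnostic labels):

```python
# Hedged sketch: rank CpG sites by random-forest feature importance.
# Data are simulated so that only two "CpGs" carry signal.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
n, p = 200, 20
X = rng.normal(size=(n, p))
y = (X[:, 3] + X[:, 7] > 0).astype(int)   # only features 3 and 7 matter

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top = np.argsort(rf.feature_importances_)[::-1][:2]   # two highest-ranked sites
print(sorted(top.tolist()))
```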
The final feature set of 1,280 CpG sites was selected to balance model performance, biological interpretability, and computational efficiency. This feature set captures the most relevant epigenetic signals while reducing noise and redundancy.
Data Transformation
Before feeding the data into our transformer model, we apply several transformations to optimize model performance:
Beta to M-Value Conversion
Convert beta-values (0-1 scale) to M-values via the logit transform M = log2(beta / (1 - beta)). M-values are approximately homoscedastic, which gives them better statistical properties than beta-values for differential methylation analysis.
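The conversion is a one-liner; a small offset guards against beta-values of exactly 0 or 1, where the logit is undefined (the offset size is an assumed implementation detail):

```python
# Standard beta-to-M conversion: M = log2(beta / (1 - beta)).
# eps clips extreme beta values so the logit stays finite.
import numpy as np

def beta_to_m(beta, eps=1e-6):
    beta = np.clip(beta, eps, 1 - eps)
    return np.log2(beta / (1 - beta))

print(beta_to_m(np.array([0.25, 0.5, 0.8])))
```

A beta of 0.5 maps to M = 0; hypermethylated sites get positive M-values, hypomethylated sites negative ones.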
Z-Score Standardization
Standardize M-values to have zero mean and unit variance across samples, which helps with model convergence and makes features comparable in scale.
Tokenization
Group CpG sites into tokens based on genomic proximity or functional relationships, creating a structured input format for the transformer model.
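One simple way to realise proximity-based grouping is to bin CpG sites by genomic coordinate so that nearby sites share a token; the window size below is illustrative, not the pipeline's actual setting:

```python
# Hedged sketch: assign each CpG site (by genomic position) a token id
# per fixed-size window. Window size is an assumed example value.
import numpy as np

def tokenize_by_proximity(positions, window=10_000):
    """Map sorted genomic positions to integer token ids per window."""
    return np.asarray(positions) // window

pos = np.array([1_200, 4_500, 9_900, 15_000, 21_000, 29_999])
print(tokenize_by_proximity(pos))  # -> [0 0 0 1 2 2]
```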
Data Augmentation
For training data, we apply subtle augmentations to improve model robustness, including small random noise addition and simulated batch effects.
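The two augmentations described can be sketched as additive perturbations; the noise magnitudes here are illustrative assumptions, not the study's settings:

```python
# Hedged sketch of training-time augmentation: small Gaussian noise per
# CpG plus one shared additive shift per sample to mimic a batch effect.
import numpy as np

def augment(m_values, rng, noise_sd=0.05, batch_shift_sd=0.1):
    noise = rng.normal(0, noise_sd, m_values.shape)  # per-CpG jitter
    shift = rng.normal(0, batch_shift_sd)            # shared "batch" shift
    return m_values + noise + shift

rng = np.random.default_rng(5)
x = np.zeros(1280)            # stand-in M-value vector (1,280 CpG sites)
x_aug = augment(x, rng)
print(round(float(x_aug.std()), 3))
```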
Model Training & Validation
Our transformer model is trained using a rigorous cross-validation approach to ensure robust performance and generalizability:
Self-Supervised Pretraining
The transformer is first pretrained using masked value prediction on the entire dataset (without using diagnostic labels). This allows the model to learn the inherent structure of methylation data.
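The data side of masked value prediction can be sketched as follows: mask a random subset of positions, keep the originals as reconstruction targets, and train the model to fill them in. The 15% mask rate and zero mask value are assumed example choices:

```python
# Hedged sketch of masked-value-prediction inputs: randomly mask a
# fraction of CpG positions; the unmasked original is the target.
import numpy as np

def mask_inputs(x, rng, mask_rate=0.15, mask_value=0.0):
    mask = rng.random(x.shape) < mask_rate
    x_masked = np.where(mask, mask_value, x)
    return x_masked, mask, x       # model input, mask, reconstruction target

rng = np.random.default_rng(6)
x = rng.normal(size=(4, 1280))     # batch of M-value vectors
x_masked, mask, target = mask_inputs(x, rng)
print(round(float(mask.mean()), 3))
```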
Fine-tuning with Cross-Validation
The pretrained model is fine-tuned for the diagnostic classification task using 10-fold stratified cross-validation to ensure robust evaluation.
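The cross-validation loop follows the standard scikit-learn pattern; in the sketch below a logistic regression stands in for the actual transformer fine-tuning, and the three-class labels are simulated:

```python
# Hedged sketch of 10-fold stratified cross-validation. The classifier
# is a stand-in for transformer fine-tuning; data are simulated.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 20))
y = np.array([0, 1, 2] * 33 + [0])   # ME/CFS / Long COVID / control stand-ins

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accs = []
for train_idx, test_idx in skf.split(X, y):
    clf = LogisticRegression(max_iter=500).fit(X[train_idx], y[train_idx])
    accs.append(clf.score(X[test_idx], y[test_idx]))
print(len(accs))                     # one accuracy estimate per fold
```

Stratification keeps the class proportions roughly constant across folds, which matters when diagnostic groups are imbalanced.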
Hyperparameter Optimization
We use Bayesian optimization to tune key hyperparameters, including learning rate, dropout rate, and model architecture parameters.
Independent Validation
The final model is evaluated on a completely independent validation cohort to assess real-world performance and generalizability.