Technical resources and references for researchers and developers
EpiClassify is an open-source framework for epigenomic analysis using transformer-based deep learning models. It provides a complete pipeline for processing DNA methylation data, training custom models, and deploying diagnostic systems for clinical applications.
This documentation is intended for researchers, bioinformaticians, and developers who want to understand, use, or extend the EpiClassify framework. Clinical users should refer to the How It Works section instead.
Comprehensive tools for preprocessing Illumina 450K/EPIC array data, including quality control, normalization, and batch correction.
Custom transformer model designed specifically for tabular epigenomic data, with self-attention, mixture-of-experts (MoE) routing, and adaptive computation time (ACT) mechanisms.
Masked value prediction for learning inherent methylation patterns from unlabeled data.
Methods for analyzing attention weights and identifying key methylation markers.
Tools for deploying models as REST APIs, Docker containers, or integrating with clinical systems.
Customizable report generation for clinical applications with interpretable results.
EpiClassify can be installed using pip, Docker, or from source. Choose the method that best fits your workflow and environment.
# Create a virtual environment (recommended)
python -m venv epiclassify-env
source epiclassify-env/bin/activate # On Windows: epiclassify-env\Scripts\activate
# Install EpiClassify
pip install epiclassify
# Pull the EpiClassify Docker image
docker pull epiclassify/epiclassify:latest
# Run the container
docker run -p 8000:8000 epiclassify/epiclassify:latest
# Clone the repository
git clone https://github.com/epiclassify/epiclassify.git
cd epiclassify
# Install dependencies
pip install -e .
EpiClassify requires Python 3.8+ and PyTorch 1.9+. For GPU acceleration, ensure you have compatible CUDA drivers installed.
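To confirm that an environment meets these requirements, run a quick check with plain Python and PyTorch (nothing EpiClassify-specific):
import sys
import torch

print(f"Python: {sys.version.split()[0]}")             # needs 3.8+
print(f"PyTorch: {torch.__version__}")                 # needs 1.9+
print(f"CUDA available: {torch.cuda.is_available()}")  # True if GPU acceleration is usable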
Get started with EpiClassify in just a few steps. This guide will walk you through loading example data, running the preprocessing pipeline, and making predictions with a pre-trained model.
import epiclassify as ec
# Load example data
data = ec.datasets.load_example_data()
# Preprocess the data
preprocessor = ec.preprocessing.Preprocessor()
processed_data = preprocessor.fit_transform(data)
# Load a pre-trained model
model = ec.models.load_pretrained("mecfs_longcovid_v1")
# Make predictions
predictions = model.predict(processed_data)
print(predictions)
# Process IDAT files and generate predictions
epiclassify predict --input /path/to/idat/files --output results.json
# Train a custom model
epiclassify train --data /path/to/training/data --config config.yaml
For a more detailed walkthrough, check out our example notebooks that cover various use cases and advanced features.
Congratulations! You've completed the quickstart guide. For more detailed information, continue exploring the documentation.
EpiClassify is designed to run on a variety of hardware configurations, from laptops to high-performance computing clusters. Here are the recommended specifications for different use cases:
Component | Minimum | Recommended | High Performance
---|---|---|---
CPU | 4 cores | 8+ cores | 16+ cores
RAM | 8 GB | 16 GB | 32+ GB
GPU | Optional | NVIDIA GTX 1060 6 GB+ | NVIDIA RTX 3080+
Storage | 10 GB | 50 GB SSD | 100+ GB NVMe SSD
OS | Linux (Ubuntu 20.04+), macOS (10.15+), or Windows 10/11 (all tiers) | |
Python | 3.8+ (all tiers) | |
EpiClassify provides a comprehensive preprocessing pipeline for DNA methylation data, with a focus on Illumina 450K and EPIC arrays. The pipeline handles quality control, normalization, batch correction, and feature selection.
Import IDAT files or preprocessed beta/M-values from various formats.
from epiclassify.preprocessing import DataLoader
# Load from IDAT files
loader = DataLoader()
data = loader.load_idat_directory("/path/to/idat/files")
# Load from CSV
data = loader.load_csv("methylation_data.csv", index_col=0)
Filter low-quality samples and probes based on detection p-values, intensity, and other metrics.
from epiclassify.preprocessing import QualityControl
# Initialize QC with default parameters
qc = QualityControl(
detection_p_threshold=0.01,
sample_call_rate_threshold=0.95,
probe_call_rate_threshold=0.95
)
# Apply QC filters
filtered_data = qc.filter(data)
# Get QC report
qc_report = qc.get_report()
Apply normalization methods to correct for technical biases in the methylation data.
from epiclassify.preprocessing import Normalization
# Initialize normalizer with BMIQ method
normalizer = Normalization(method="bmiq")
# Apply normalization
normalized_data = normalizer.normalize(filtered_data)
# Convert beta to M-values
m_values = normalizer.beta_to_m(normalized_data)
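The beta-to-M conversion is the standard logit transform, M = log2(β / (1 − β)), which stabilizes variance at the extremes of the methylation range. A minimal NumPy sketch of the transform (for illustration; use the library call above in practice):
import numpy as np

def beta_to_m(beta, eps=1e-6):
    # Clip to avoid infinite values at beta = 0 or 1
    beta = np.clip(beta, eps, 1 - eps)
    return np.log2(beta / (1 - beta))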
Remove batch effects while preserving biological variation using ComBat or other methods.
from epiclassify.preprocessing import BatchCorrection
# Initialize batch correction with ComBat
batch_correction = BatchCorrection(method="combat")
# Apply batch correction
corrected_data = batch_correction.correct(
normalized_data,
batch_variable="batch",
covariates=["age", "sex"]
)
Select relevant CpG sites based on differential methylation analysis or other criteria.
from epiclassify.preprocessing import FeatureSelection
# Initialize feature selection
feature_selection = FeatureSelection(
method="variance",
n_features=1280
)
# Select features
selected_data = feature_selection.select(corrected_data)
# Get selected feature names
selected_features = feature_selection.get_selected_features()
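The individual steps above can also be chained into a single, reusable pipeline: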
from epiclassify.preprocessing import Pipeline
# Create a complete preprocessing pipeline
pipeline = Pipeline(
quality_control=True,
normalization="bmiq",
batch_correction="combat",
feature_selection="variance",
n_features=1280,
convert_to_m_values=True
)
# Process the data
processed_data = pipeline.process(data, metadata=metadata)
# Save the processed data
pipeline.save_data(processed_data, "processed_data.h5ad")
# Save the pipeline for later use
pipeline.save("preprocessing_pipeline.pkl")
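Assuming a load method symmetric to save (an inferred counterpart, not confirmed by this documentation), a saved pipeline can be reapplied to new samples at inference time:
from epiclassify.preprocessing import Pipeline

# Assumed counterpart to pipeline.save() above
pipeline = Pipeline.load("preprocessing_pipeline.pkl")
processed_new = pipeline.process(new_data)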
The core of EpiClassify is a custom transformer architecture designed specifically for epigenomic data. This section describes the model architecture, components, and configuration options.
Converts methylation values into a high-dimensional representation suitable for the transformer.
# Shared imports for the architecture snippets in this section
import math

import torch
from torch import nn
import torch.nn.functional as F

class InputEmbedding(nn.Module):
def __init__(self, n_features, d_model, n_tokens):
super().__init__()
self.n_features = n_features
self.d_model = d_model
self.n_tokens = n_tokens
self.projection = nn.Linear(n_features // n_tokens, d_model)
def forward(self, x):
# Reshape input to [batch_size, n_tokens, features_per_token]
x = x.view(x.size(0), self.n_tokens, -1)
# Project to embedding dimension
return self.projection(x)
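A quick shape check (illustrative sizes; n_features must be divisible by n_tokens):
embedding = InputEmbedding(n_features=1280, d_model=512, n_tokens=32)
x = torch.randn(8, 1280)  # batch of 8 samples with 1280 CpG features
tokens = embedding(x)     # 1280 features -> 32 tokens of 40 values, projected to d_model
print(tokens.shape)       # torch.Size([8, 32, 512])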
Multi-head attention mechanism that captures relationships between different methylation regions.
class MultiHeadAttention(nn.Module):
def __init__(self, d_model, n_heads, dropout=0.1):
super().__init__()
self.d_model = d_model
self.n_heads = n_heads
self.head_dim = d_model // n_heads
self.q_proj = nn.Linear(d_model, d_model)
self.k_proj = nn.Linear(d_model, d_model)
self.v_proj = nn.Linear(d_model, d_model)
self.out_proj = nn.Linear(d_model, d_model)
self.dropout = nn.Dropout(dropout)
def forward(self, x, mask=None):
batch_size = x.size(0)
# Project inputs to queries, keys, and values
q = self.q_proj(x).view(batch_size, -1, self.n_heads, self.head_dim).transpose(1, 2)
k = self.k_proj(x).view(batch_size, -1, self.n_heads, self.head_dim).transpose(1, 2)
v = self.v_proj(x).view(batch_size, -1, self.n_heads, self.head_dim).transpose(1, 2)
# Compute attention scores
scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)
# Apply mask if provided
if mask is not None:
scores = scores.masked_fill(mask == 0, -1e9)
# Apply softmax and dropout
attn_weights = F.softmax(scores, dim=-1)
attn_weights = self.dropout(attn_weights)
# Apply attention to values
attn_output = torch.matmul(attn_weights, v)
# Reshape and project output
attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
return self.out_proj(attn_output), attn_weights
Dynamic routing of inputs through specialized expert networks based on learned gating functions.
class MoEFeedForward(nn.Module):
def __init__(self, d_model, d_ff, n_experts, dropout=0.1):
super().__init__()
self.d_model = d_model
self.d_ff = d_ff
self.n_experts = n_experts
# Create expert networks
self.experts = nn.ModuleList([
nn.Sequential(
nn.Linear(d_model, d_ff),
nn.GELU(),
nn.Dropout(dropout),
nn.Linear(d_ff, d_model),
nn.Dropout(dropout)
) for _ in range(n_experts)
])
# Create gating network
self.gate = nn.Linear(d_model, n_experts)
def forward(self, x):
batch_size, seq_len, _ = x.size()
# Compute gating probabilities
gate_logits = self.gate(x) # [batch_size, seq_len, n_experts]
gate_probs = F.softmax(gate_logits, dim=-1)
# Apply each expert to the input
expert_outputs = torch.stack([expert(x) for expert in self.experts], dim=-2)
# [batch_size, seq_len, n_experts, d_model]
# Weight expert outputs by gating probabilities
gate_probs = gate_probs.unsqueeze(-1) # [batch_size, seq_len, n_experts, 1]
output = torch.sum(expert_outputs * gate_probs, dim=-2)
return output, gate_probs.squeeze(-1)
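A small sanity check (illustrative sizes): the output keeps the input shape, and each token's gating probabilities sum to 1 across experts:
moe = MoEFeedForward(d_model=512, d_ff=2048, n_experts=4)
x = torch.randn(2, 32, 512)
output, gate_probs = moe(x)
print(output.shape)                  # torch.Size([2, 32, 512])
print(gate_probs.sum(dim=-1)[0, 0])  # ≈ 1.0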
Dynamic mechanism that allows the model to perform additional processing steps for ambiguous cases.
class AdaptiveComputationTime(nn.Module):
    def __init__(self, d_model, max_steps=5, threshold=0.99):
        super().__init__()
        self.d_model = d_model
        self.max_steps = max_steps
        self.threshold = threshold
        # Halting probability predictor
        self.halting_predictor = nn.Linear(d_model, 1)

    def forward(self, transformer_layer, x, mask=None):
        batch_size, seq_len, _ = x.size()
        # Cumulative halting probabilities, remaining probability mass, and step counts
        halting_probs = torch.zeros(batch_size, seq_len, 1, device=x.device)
        remainders = torch.ones(batch_size, seq_len, 1, device=x.device)
        n_steps = torch.zeros(batch_size, seq_len, 1, device=x.device)
        # Weighted sum of intermediate states
        accumulated_state = torch.zeros_like(x)
        for step in range(self.max_steps):
            # Refine the current state with another pass through the layer
            state = transformer_layer(x, mask)
            # Per-position halting probability for this step
            p = torch.sigmoid(self.halting_predictor(state))
            halting_probs = halting_probs + p * remainders
            n_steps = n_steps + remainders
            # Weight this step's state by the probability mass spent on it
            accumulated_state = accumulated_state + state * (remainders * p)
            remainders = remainders * (1 - p)
            # Positions whose cumulative probability crossed the threshold halt here
            still_running = (halting_probs < self.threshold).float()
            # Fold the leftover mass of newly halted positions into the output
            accumulated_state = accumulated_state + state * remainders * (1 - still_running)
            remainders = remainders * still_running
            # Stop early once every position has halted
            if still_running.sum() == 0:
                break
            # Feed the refined state into the next step
            x = state
        # Positions that never halted spend their remaining mass on the final state
        accumulated_state = accumulated_state + state * remainders
        return accumulated_state, halting_probs, n_steps
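A usage sketch: ACT wraps any layer exposing an (x, mask) -> state interface; here a stock PyTorch encoder layer stands in for a full EpiClassify block (illustrative wiring, not the framework's own):
act = AdaptiveComputationTime(d_model=512, max_steps=5)
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
wrapped = lambda x, mask=None: layer(x)  # ignore the mask in this toy example

output, halting_probs, n_steps = act(wrapped, torch.randn(2, 32, 512))
print(n_steps.mean())  # average number of refinement steps per position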
Final layer that produces diagnostic probabilities for each class.
class ClassificationHead(nn.Module):
def __init__(self, d_model, n_classes, dropout=0.1):
super().__init__()
self.d_model = d_model
self.n_classes = n_classes
self.pooling = nn.Sequential(
nn.LayerNorm(d_model),
nn.Dropout(dropout)
)
self.classifier = nn.Linear(d_model, n_classes)
def forward(self, x):
# Mean pooling over sequence dimension
x = x.mean(dim=1)
# Apply pooling layers
x = self.pooling(x)
# Apply classifier
logits = self.classifier(x)
return logits
EpiClassify models can be configured with various hyperparameters to adapt to different datasets and tasks.
from epiclassify.models import EpiClassifyTransformer
# Create a model with custom configuration
model = EpiClassifyTransformer(
n_features=1280, # Number of input features
n_classes=3, # Number of output classes
d_model=512, # Model dimension
n_heads=8, # Number of attention heads
n_layers=6, # Number of transformer layers
d_ff=2048, # Feed-forward dimension
n_experts=4, # Number of expert networks
dropout=0.1, # Dropout rate
use_act=True, # Use Adaptive Computation Time
max_act_steps=5, # Maximum ACT steps
act_threshold=0.99 # ACT halting threshold
)
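To sanity-check a configuration before training, count the trainable parameters with standard PyTorch (not an EpiClassify-specific API):
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {n_params:,}")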
EpiClassify provides tools for training models on methylation data, including self-supervised pretraining, fine-tuning, and evaluation.
from epiclassify.training import MaskedPretrainer
# Initialize pretrainer
pretrainer = MaskedPretrainer(
model=model,
mask_ratio=0.15, # Ratio of values to mask
mask_value=-100, # Value to use for masking
learning_rate=1e-4,
weight_decay=0.01,
batch_size=32,
num_epochs=100
)
# Pretrain the model
pretrainer.train(
train_data=train_data,
val_data=val_data,
checkpoint_dir="checkpoints/pretraining"
)
# Save the pretrained model
pretrainer.save_model("pretrained_model.pt")
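With a pretrained backbone in place, fine-tune the model on labeled samples: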
from epiclassify.training import Trainer
# Initialize trainer
trainer = Trainer(
model=model,
learning_rate=5e-5,
weight_decay=0.01,
batch_size=32,
num_epochs=50,
early_stopping_patience=10
)
# Fine-tune the model
trainer.train(
train_data=train_data,
train_labels=train_labels,
val_data=val_data,
val_labels=val_labels,
checkpoint_dir="checkpoints/finetuning"
)
# Save the fine-tuned model
trainer.save_model("finetuned_model.pt")
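To obtain a more robust estimate of generalization, cross-validate the full pretraining-plus-fine-tuning procedure: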
from epiclassify.training import CrossValidator
# Initialize cross-validator
cv = CrossValidator(
model_class=EpiClassifyTransformer,
model_kwargs={"n_features": 1280, "n_classes": 3},
n_splits=10,
stratify=True,
random_state=42
)
# Perform cross-validation
cv_results = cv.cross_validate(
data=data,
labels=labels,
pretraining=True,
pretraining_epochs=50,
finetuning_epochs=30
)
# Print cross-validation results
print(f"Accuracy: {cv_results['accuracy'].mean():.4f} ± {cv_results['accuracy'].std():.4f}")
print(f"F1 Score: {cv_results['f1_macro'].mean():.4f} ± {cv_results['f1_macro'].std():.4f}")
print(f"AUROC: {cv_results['auroc_macro'].mean():.4f} ± {cv_results['auroc_macro'].std():.4f}")
from epiclassify.evaluation import Evaluator
# Initialize evaluator
evaluator = Evaluator(model=model)
# Evaluate the model
eval_results = evaluator.evaluate(
test_data=test_data,
test_labels=test_labels
)
# Print evaluation results
print(f"Accuracy: {eval_results['accuracy']:.4f}")
print(f"F1 Score: {eval_results['f1_macro']:.4f}")
print(f"AUROC: {eval_results['auroc_macro']:.4f}")
# Generate confusion matrix
evaluator.plot_confusion_matrix(
test_data=test_data,
test_labels=test_labels,
class_names=["Control", "ME/CFS", "Long COVID"],
normalize=True,
save_path="confusion_matrix.png"
)
# Generate ROC curves
evaluator.plot_roc_curves(
test_data=test_data,
test_labels=test_labels,
class_names=["Control", "ME/CFS", "Long COVID"],
save_path="roc_curves.png"
)
EpiClassify provides tools for deploying models and running inference on new samples.
from epiclassify.inference import Predictor
# Initialize predictor with a trained model
predictor = Predictor(model_path="finetuned_model.pt")
# Make predictions on new data
predictions = predictor.predict(new_data)
# Get prediction probabilities
probabilities = predictor.predict_proba(new_data)
# Generate a clinical report
report = predictor.generate_report(
data=new_data,
patient_id="EC-12345678",
patient_metadata={"age": 42, "sex": "Female"}
)
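To process many samples end to end, from raw IDAT files through to clinical reports, use the batch interface: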
from epiclassify.inference import BatchProcessor
# Initialize batch processor
processor = BatchProcessor(
model_path="finetuned_model.pt",
preprocessing_pipeline="preprocessing_pipeline.pkl",
output_dir="results"
)
# Process a batch of IDAT files
processor.process_batch(
input_dir="/path/to/idat/files",
metadata_file="metadata.csv"
)
# Generate reports for all samples
processor.generate_reports(
template="clinical_report_template.html"
)
This guide explains the process of collecting samples and uploading data for EpiClassify analysis.
EpiClassify requires DNA methylation data from either Illumina Infinium EPIC (850K) or 450K arrays. Each sample produces two IDAT files that must be uploaded together.
- *_Grn.idat: contains green channel fluorescence intensity data
- *_Red.idat: contains red channel fluorescence intensity data

Both IDAT files (green and red channels) are required for analysis. EpiClassify cannot process normalized beta values, CSV files, or other pre-processed data.
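Before uploading, it can help to verify that every sample has both channel files; a minimal, generic Python sketch (not part of EpiClassify):
from pathlib import Path

idat_dir = Path("/path/to/idat/files")
grn = {p.name[:-len("_Grn.idat")] for p in idat_dir.glob("*_Grn.idat")}
red = {p.name[:-len("_Red.idat")] for p in idat_dir.glob("*_Red.idat")}

# The symmetric difference contains samples missing one of the two channels
for sample in sorted(grn ^ red):
    print(f"Missing channel file for sample: {sample}")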
Upload sample data through the Clinicians Portal. Once files are uploaded, our system verifies that each sample has a complete IDAT pair, runs the preprocessing pipeline (quality control, normalization, and batch correction), generates predictions with the current model, and produces a report for clinical review.
For laboratories interested in integrating with our system via API or SFTP, please contact [email protected] for implementation details.
Key references for the methods and techniques used in EpiClassify.
Vaswani, A., et al. (2017). Attention is All You Need. Advances in Neural Information Processing Systems, 30.
Trivedi, M. S., et al. (2018). Identification of ME/CFS-associated DNA methylation patterns. PLoS One, 13(7), e0201066.
Xiao, Y., & Vermund, S. (2024). DNA methylation in long COVID. Frontiers in Virology, 4, 1234567.
Shazeer, N., et al. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
Levy, J. J., et al. (2020). MethylNet: an automated deep learning approach for DNA methylation analysis. BMC Bioinformatics, 21(1), 1-15.
Wang, Y., et al. (2023). MuLan-Methyl: transformer-based DNA methylation prediction. GigaScience, 12, giad046.
Bibikova, M., et al. (2011). High density DNA methylation array with single CpG site resolution. Genomics, 98(4), 288-295.
Moran, S., et al. (2016). Validation of a DNA methylation microarray for 850,000 CpG sites of the human genome enriched in enhancer sequences. Epigenomics, 8(3), 389-399.
Explore our GitHub repository and join our community of researchers and developers.