Documentation

Overview

EpiClassify is an open-source framework for epigenomic analysis using transformer-based deep learning models. It provides a complete pipeline for processing DNA methylation data, training custom models, and deploying diagnostic systems for clinical applications.

This documentation is intended for researchers, bioinformaticians, and developers who want to understand, use, or extend the EpiClassify framework. For clinical users, please refer to the How It Works section.

Key Features

Methylation Data Processing

Comprehensive tools for preprocessing Illumina 450K/EPIC array data, including quality control, normalization, and batch correction.

Transformer Architecture

Custom transformer model designed specifically for tabular epigenomic data, combining self-attention, mixture-of-experts (MoE) feed-forward layers, and adaptive computation time (ACT).

Self-Supervised Pretraining

Masked value prediction for learning inherent methylation patterns from unlabeled data.

Interpretability Tools

Methods for analyzing attention weights and identifying key methylation markers.

Deployment Options

Tools for deploying models as REST APIs, Docker containers, or integrating with clinical systems.

Clinical Reporting

Customizable report generation for clinical applications with interpretable results.

Installation

EpiClassify can be installed using pip, Docker, or from source. Choose the method that best fits your workflow and environment.

Using pip

# Create a virtual environment (recommended)
python -m venv epiclassify-env
source epiclassify-env/bin/activate  # On Windows: epiclassify-env\Scripts\activate

# Install EpiClassify
pip install epiclassify

Using Docker

# Pull the EpiClassify Docker image
docker pull epiclassify/epiclassify:latest

# Run the container
docker run -p 8000:8000 epiclassify/epiclassify:latest

From Source

# Clone the repository
git clone https://github.com/epiclassify/epiclassify.git
cd epiclassify

# Install dependencies
pip install -e .

EpiClassify requires Python 3.8+ and PyTorch 1.9+. For GPU acceleration, ensure you have compatible CUDA drivers installed.

Quickstart Guide

Get started with EpiClassify in just a few steps. This guide will walk you through loading example data, running the preprocessing pipeline, and making predictions with a pre-trained model.

Basic Usage

import epiclassify as ec

# Load example data
data = ec.datasets.load_example_data()

# Preprocess the data
preprocessor = ec.preprocessing.Preprocessor()
processed_data = preprocessor.fit_transform(data)

# Load a pre-trained model
model = ec.models.load_pretrained("mecfs_longcovid_v1")

# Make predictions
predictions = model.predict(processed_data)
print(predictions)

Using the CLI

# Process IDAT files and generate predictions
epiclassify predict --input /path/to/idat/files --output results.json

# Train a custom model
epiclassify train --data /path/to/training/data --config config.yaml

Example Notebook

For a more detailed walkthrough, check out our example notebooks that cover various use cases and advanced features.

Congratulations! You've completed the quickstart guide. For more detailed information, continue exploring the documentation.

System Requirements

EpiClassify is designed to run on a variety of hardware configurations, from laptops to high-performance computing clusters. Here are the recommended specifications for different use cases:

Component	Minimum	Recommended	High Performance
CPU	4 cores	8+ cores	16+ cores
RAM	8 GB	16 GB	32+ GB
GPU	Optional	NVIDIA GTX 1060 6GB+	NVIDIA RTX 3080+
Storage	10 GB	50 GB SSD	100+ GB NVMe SSD
OS	Linux (Ubuntu 20.04+), macOS (10.15+), Windows 10/11 (all configurations)
Python	3.8+ (all configurations)

Software Dependencies

Core Dependencies

PyTorch 1.9+
NumPy 1.20+
pandas 1.3+
scikit-learn 1.0+

Preprocessing Dependencies

minfi (R package)
rpy2 2.9+
scipy 1.7+

Visualization Dependencies

matplotlib 3.4+
seaborn 0.11+
plotly 5.0+

API Dependencies

FastAPI 0.68+
uvicorn 0.15+
pydantic 1.8+

Data Preprocessing

EpiClassify provides a comprehensive preprocessing pipeline for DNA methylation data, with a focus on Illumina 450K and EPIC arrays. The pipeline handles quality control, normalization, batch correction, and feature selection.

Preprocessing Pipeline

Data Import

Import IDAT files or preprocessed beta/M-values from various formats.

from epiclassify.preprocessing import DataLoader

# Load from IDAT files
loader = DataLoader()
data = loader.load_idat_directory("/path/to/idat/files")

# Load from CSV
data = loader.load_csv("methylation_data.csv", index_col=0)

Quality Control

Filter low-quality samples and probes based on detection p-values, intensity, and other metrics.

from epiclassify.preprocessing import QualityControl

# Initialize QC with default parameters
qc = QualityControl(
    detection_p_threshold=0.01,
    sample_call_rate_threshold=0.95,
    probe_call_rate_threshold=0.95
)

# Apply QC filters
filtered_data = qc.filter(data)

# Get QC report
qc_report = qc.get_report()
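
As a rough illustration of what call-rate filtering involves, the sketch below applies the three thresholds to a small table of detection p-values. It is plain Python rather than EpiClassify code: the function name `call_rate_filter`, the probes-by-samples layout, and the order of filtering (samples first, then probes) are assumptions made for this example.

```python
def call_rate_filter(pvals, p_threshold=0.01, sample_min=0.95, probe_min=0.95):
    """Return (kept_probe_indices, kept_sample_indices).

    pvals[i][j] is the detection p-value of probe i in sample j.
    A measurement counts as "called" when its p-value is below p_threshold.
    """
    n_probes = len(pvals)
    n_samples = len(pvals[0]) if pvals else 0
    called = [[p < p_threshold for p in row] for row in pvals]

    # Keep samples in which a large enough fraction of probes was called.
    kept_samples = [
        j for j in range(n_samples)
        if sum(called[i][j] for i in range(n_probes)) / n_probes >= sample_min
    ]
    # Keep probes that were called in enough of the remaining samples.
    kept_probes = [
        i for i in range(n_probes)
        if kept_samples
        and sum(called[i][j] for j in kept_samples) / len(kept_samples) >= probe_min
    ]
    return kept_probes, kept_samples


# Toy example: 3 probes x 3 samples; the last sample fails on every probe.
pvals = [
    [0.001, 0.002, 0.50],
    [0.001, 0.001, 0.40],
    [0.001, 0.200, 0.30],
]
probes, samples = call_rate_filter(pvals, sample_min=0.6, probe_min=0.95)
```

Filtering samples before probes keeps one badly failed sample from dragging down the call rate of otherwise good probes; whether the library does the same is not stated here.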

Normalization

Apply normalization methods to correct for technical biases in the methylation data.

from epiclassify.preprocessing import Normalization

# Initialize normalizer with BMIQ method
normalizer = Normalization(method="bmiq")

# Apply normalization
normalized_data = normalizer.normalize(filtered_data)

# Convert beta to M-values
m_values = normalizer.beta_to_m(normalized_data)
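
For reference, the beta-to-M conversion is the standard logit transform in base 2, M = log2(beta / (1 - beta)). The sketch below shows it for a single value; the clipping constant `eps` and the helper names are assumptions for this example, not the behavior of `normalizer.beta_to_m`.

```python
import math


def beta_to_m(beta, eps=1e-6):
    """Convert a beta value in [0, 1] to an M-value: log2(beta / (1 - beta))."""
    # Clip away from 0 and 1 so the logarithm stays finite.
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))


def m_to_beta(m):
    """Inverse transform: beta = 2**m / (2**m + 1)."""
    return 2.0 ** m / (2.0 ** m + 1.0)


m_half = beta_to_m(0.5)   # beta = 0.5 maps to M = 0
m_high = beta_to_m(0.8)   # approximately 2, since 0.8 / 0.2 = 4 and log2(4) = 2
```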

Batch Correction

Remove batch effects while preserving biological variation using ComBat or other methods.

from epiclassify.preprocessing import BatchCorrection

# Initialize batch correction with ComBat
batch_correction = BatchCorrection(method="combat")

# Apply batch correction
corrected_data = batch_correction.correct(
    normalized_data,
    batch_variable="batch",
    covariates=["age", "sex"]
)

Feature Selection

Select relevant CpG sites based on differential methylation analysis or other criteria.

from epiclassify.preprocessing import FeatureSelection

# Initialize feature selection
feature_selection = FeatureSelection(
    method="variance",
    n_features=1280
)

# Select features
selected_data = feature_selection.select(corrected_data)

# Get selected feature names
selected_features = feature_selection.get_selected_features()
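
To make the `method="variance"` option concrete, here is a minimal, dependency-free sketch that keeps the columns with the largest variance. The function name and the CpG labels (`cg_a`, `cg_b`, `cg_c`) are hypothetical, and the library may rank or break ties differently.

```python
from statistics import pvariance


def select_top_variance(matrix, feature_names, n_features):
    """Return the names of the n_features columns with the largest variance.

    matrix is a list of samples, each a list of per-feature values.
    """
    n_cols = len(feature_names)
    variances = [pvariance([row[j] for row in matrix]) for j in range(n_cols)]
    # Rank column indices by variance, highest first.
    ranked = sorted(range(n_cols), key=lambda j: variances[j], reverse=True)
    keep = sorted(ranked[:n_features])  # restore the original column order
    return [feature_names[j] for j in keep]


data = [
    [0.10, 0.50, 0.90],
    [0.12, 0.10, 0.91],
    [0.11, 0.90, 0.10],
]
names = ["cg_a", "cg_b", "cg_c"]  # hypothetical CpG identifiers
selected = select_top_variance(data, names, n_features=2)
```

Here `cg_a` barely varies across the three samples, so it is the column that gets dropped.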

Complete Pipeline Example

from epiclassify.preprocessing import Pipeline

# Create a complete preprocessing pipeline
pipeline = Pipeline(
    quality_control=True,
    normalization="bmiq",
    batch_correction="combat",
    feature_selection="variance",
    n_features=1280,
    convert_to_m_values=True
)

# Process the data (`data` and `metadata` are assumed to have been loaded beforehand)
processed_data = pipeline.process(data, metadata=metadata)

# Save the processed data
pipeline.save_data(processed_data, "processed_data.h5ad")

# Save the pipeline for later use
pipeline.save("preprocessing_pipeline.pkl")

Transformer Model

The core of EpiClassify is a custom transformer architecture designed specifically for epigenomic data. This section describes the model architecture, components, and configuration options.

Architecture Overview

Figure 1: EpiClassify transformer architecture with key components highlighted

Key Components

Input Embedding

Converts methylation values into a high-dimensional representation suitable for the transformer.

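# Imports shared by the component snippets in this section
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class InputEmbedding(nn.Module):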
    def __init__(self, n_features, d_model, n_tokens):
        super().__init__()
        self.n_features = n_features
        self.d_model = d_model
        self.n_tokens = n_tokens
        self.projection = nn.Linear(n_features // n_tokens, d_model)
        
    def forward(self, x):
        # Reshape input to [batch_size, n_tokens, features_per_token]
        x = x.view(x.size(0), self.n_tokens, -1)
        # Project to embedding dimension
        return self.projection(x)
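
The reshape in `forward` only works when `n_features` is an exact multiple of `n_tokens`, because each token receives `n_features // n_tokens` consecutive values. The plain-Python sketch below mirrors that grouping on a single feature vector; `split_into_tokens` and the choice of 64 tokens are assumptions for illustration.

```python
def split_into_tokens(values, n_tokens):
    """Split a flat feature vector into n_tokens equal, contiguous chunks."""
    if n_tokens <= 0 or len(values) % n_tokens != 0:
        raise ValueError("len(values) must be a positive multiple of n_tokens")
    size = len(values) // n_tokens
    return [values[i * size:(i + 1) * size] for i in range(n_tokens)]


features = list(range(1280))                        # stand-in for 1280 selected CpG values
tokens = split_into_tokens(features, n_tokens=64)   # 64 is a hypothetical choice
```

With 1280 features and 64 tokens, each token carries 20 values, which is the input width the linear projection would then need.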

Self-Attention

Multi-head attention mechanism that captures relationships between different methylation regions.

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, mask=None):
        batch_size = x.size(0)
        
        # Project inputs to queries, keys, and values
        q = self.q_proj(x).view(batch_size, -1, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(batch_size, -1, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(batch_size, -1, self.n_heads, self.head_dim).transpose(1, 2)
        
        # Compute attention scores
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        
        # Apply mask if provided
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        
        # Apply softmax and dropout
        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)
        
        # Apply attention to values
        attn_output = torch.matmul(attn_weights, v)
        
        # Reshape and project output
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.out_proj(attn_output), attn_weights
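
The core of the module above is scaled dot-product attention. The following single-head sketch in plain Python (no projections, masking, or dropout) may help when reading the tensor code; the function name and the two-token toy input are assumptions for this example.

```python
import math


def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention over lists of vectors; returns (outputs, weights)."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        # Numerically stable softmax over the keys.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps]
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0]]
out, attn = scaled_dot_product_attention(tokens, tokens, tokens)
```

Each row of `attn` sums to 1, and each token attends most strongly to itself because its dot product with its own key is the largest.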

Mixture-of-Experts Feed-Forward Network

Dynamic routing of inputs through specialized expert networks based on learned gating functions.

class MoEFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, n_experts, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.d_ff = d_ff
        self.n_experts = n_experts
        
        # Create expert networks
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.GELU(),
                nn.Dropout(dropout),
                nn.Linear(d_ff, d_model),
                nn.Dropout(dropout)
            ) for _ in range(n_experts)
        ])
        
        # Create gating network
        self.gate = nn.Linear(d_model, n_experts)
        
    def forward(self, x):
        batch_size, seq_len, _ = x.size()
        
        # Compute gating probabilities
        gate_logits = self.gate(x)  # [batch_size, seq_len, n_experts]
        gate_probs = F.softmax(gate_logits, dim=-1)
        
        # Apply each expert to the input
        expert_outputs = torch.stack([expert(x) for expert in self.experts], dim=-2)
        # [batch_size, seq_len, n_experts, d_model]
        
        # Weight expert outputs by gating probabilities
        gate_probs = gate_probs.unsqueeze(-1)  # [batch_size, seq_len, n_experts, 1]
        output = torch.sum(expert_outputs * gate_probs, dim=-2)
        
        return output, gate_probs.squeeze(-1)
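
Note that the module above is a dense (soft) mixture: every expert is evaluated and the outputs are blended by the gate probabilities, rather than routing each token to a few top-scoring experts. A minimal sketch of that blending step for one token, in plain Python with hypothetical helper names:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_combine(gate_logits, expert_outputs):
    """Weight every expert's output vector by its gate probability and sum."""
    probs = softmax(gate_logits)
    dim = len(expert_outputs[0])
    mixed = [
        sum(p * out[d] for p, out in zip(probs, expert_outputs))
        for d in range(dim)
    ]
    return mixed, probs


# Two experts with equal gate logits contribute equally.
mixed, probs = moe_combine([0.0, 0.0], [[1.0, 2.0], [3.0, 4.0]])
```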

Adaptive Computation Time

Dynamic mechanism that allows the model to perform additional processing steps for ambiguous cases.

class AdaptiveComputationTime(nn.Module):
    def __init__(self, d_model, max_steps=5, threshold=0.99):
        super().__init__()
        self.d_model = d_model
        self.max_steps = max_steps
        self.threshold = threshold
        
        # Halting probability predictor
        self.halting_predictor = nn.Linear(d_model, 1)
        
    def forward(self, transformer_layer, x, mask=None):
        batch_size, seq_len, _ = x.size()
        
        # Initialize halting probabilities, remainders, and n_steps
        halting_probs = torch.zeros(batch_size, seq_len, 1, device=x.device)
        remainders = torch.ones(batch_size, seq_len, 1, device=x.device)
        n_steps = torch.zeros(batch_size, seq_len, 1, device=x.device)
        
        # Initialize accumulated state; `state` is fed back into the layer each step
        accumulated_state = torch.zeros_like(x)
        state = x
        
        # Compute ACT
        for step in range(self.max_steps):
            # Apply transformer layer to the previous step's output
            state = transformer_layer(state, mask)
            
            # Compute halting probability
            p = torch.sigmoid(self.halting_predictor(state))
            
            # Update halting probability and n_steps
            halting_probs = halting_probs + p * remainders
            n_steps = n_steps + remainders
            
            # Compute new remainders
            new_remainders = remainders * (1 - p)
            
            # Update accumulated state
            accumulated_state = accumulated_state + state * (remainders * p)
            
            # Check if we've reached the threshold
            still_running = (halting_probs < self.threshold).float()
            
            # Update remainders
            remainders = new_remainders * still_running
            
            # Check if all samples have halted
            if still_running.sum() == 0:
                break
        
        # Add remainders to accumulated state
        accumulated_state = accumulated_state + state * remainders
        
        return accumulated_state, halting_probs, n_steps
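
The halting arithmetic is easier to follow for a single position with scalar probabilities. The sketch below is a simplified variant written for illustration, not a line-by-line port of the module: it always assigns the leftover probability mass to the last executed step so that the weights sum to exactly 1. The function name `act_weights` is an assumption.

```python
def act_weights(halting_probs, threshold=0.99):
    """Turn per-step halting probabilities into per-step mixing weights.

    Step t receives (probability of not having halted before t) * p_t.
    Iteration stops once the cumulative weight reaches the threshold; the
    leftover mass is then added to the last executed step.
    """
    weights = []
    remainder = 1.0    # probability mass not yet assigned to any step
    cumulative = 0.0
    for p in halting_probs:
        w = remainder * p
        weights.append(w)
        cumulative += w
        remainder *= 1.0 - p
        if cumulative >= threshold:
            break
    if weights:
        weights[-1] += remainder
    return weights


weights = act_weights([0.5, 0.5, 0.9, 0.9])
```

A confident first step (for example `act_weights([0.995, 0.5])`) halts immediately with a single weight of 1, which is how unambiguous inputs end up using fewer steps.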

Classification Head

Final layer that produces diagnostic probabilities for each class.

class ClassificationHead(nn.Module):
    def __init__(self, d_model, n_classes, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.n_classes = n_classes
        
        self.pooling = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Dropout(dropout)
        )
        
        self.classifier = nn.Linear(d_model, n_classes)
        
    def forward(self, x):
        # Mean pooling over sequence dimension
        x = x.mean(dim=1)
        
        # Apply pooling layers
        x = self.pooling(x)
        
        # Apply classifier
        logits = self.classifier(x)
        
        return logits

Model Configuration

EpiClassify models can be configured with various hyperparameters to adapt to different datasets and tasks.

from epiclassify.models import EpiClassifyTransformer

# Create a model with custom configuration
model = EpiClassifyTransformer(
    n_features=1280,          # Number of input features
    n_classes=3,              # Number of output classes
    d_model=512,              # Model dimension
    n_heads=8,                # Number of attention heads
    n_layers=6,               # Number of transformer layers
    d_ff=2048,                # Feed-forward dimension
    n_experts=4,              # Number of expert networks
    dropout=0.1,              # Dropout rate
    use_act=True,             # Use Adaptive Computation Time
    max_act_steps=5,          # Maximum ACT steps
    act_threshold=0.99        # ACT halting threshold
)

Training & Evaluation

EpiClassify provides tools for training models on methylation data, including self-supervised pretraining, fine-tuning, and evaluation.

Self-Supervised Pretraining

from epiclassify.training import MaskedPretrainer

# Initialize pretrainer
pretrainer = MaskedPretrainer(
    model=model,
    mask_ratio=0.15,          # Ratio of values to mask
    mask_value=-100,          # Value to use for masking
    learning_rate=1e-4,
    weight_decay=0.01,
    batch_size=32,
    num_epochs=100
)

# Pretrain the model
pretrainer.train(
    train_data=train_data,
    val_data=val_data,
    checkpoint_dir="checkpoints/pretraining"
)

# Save the pretrained model
pretrainer.save_model("pretrained_model.pt")
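
To show what `mask_ratio` and `mask_value` control, here is a dependency-free sketch of masking one sample and scoring a reconstruction only at the masked positions. Using mean squared error as the reconstruction loss is an assumption, as are the function names; the source only states that masked values are predicted.

```python
import random


def mask_values(values, mask_ratio=0.15, mask_value=-100, seed=None):
    """Return (masked_copy, masked_indices) for one sample."""
    rng = random.Random(seed)
    # Mask at least one position, but never more than the sample contains.
    n_mask = min(len(values), max(1, round(len(values) * mask_ratio)))
    idx = sorted(rng.sample(range(len(values)), n_mask))
    masked = list(values)
    for i in idx:
        masked[i] = mask_value
    return masked, idx


def masked_mse(predictions, targets, idx):
    """Mean squared error computed only over the masked positions."""
    return sum((predictions[i] - targets[i]) ** 2 for i in idx) / len(idx)


sample = [i / 20 for i in range(20)]          # 20 toy beta values
masked, idx = mask_values(sample, mask_ratio=0.15, seed=0)
perfect = masked_mse(sample, sample, idx)     # a perfect reconstruction scores 0
```

A sentinel such as -100 lies far outside the range of beta or M-values, so the model can tell masked entries apart from real measurements.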

Fine-tuning

from epiclassify.training import Trainer

# Initialize trainer
trainer = Trainer(
    model=model,
    learning_rate=5e-5,
    weight_decay=0.01,
    batch_size=32,
    num_epochs=50,
    early_stopping_patience=10
)

# Fine-tune the model
trainer.train(
    train_data=train_data,
    train_labels=train_labels,
    val_data=val_data,
    val_labels=val_labels,
    checkpoint_dir="checkpoints/finetuning"
)

# Save the fine-tuned model
trainer.save_model("finetuned_model.pt")

Cross-Validation

from epiclassify.training import CrossValidator

# Initialize cross-validator
cv = CrossValidator(
    model_class=EpiClassifyTransformer,
    model_kwargs={"n_features": 1280, "n_classes": 3},
    n_splits=10,
    stratify=True,
    random_state=42
)

# Perform cross-validation
cv_results = cv.cross_validate(
    data=data,
    labels=labels,
    pretraining=True,
    pretraining_epochs=50,
    finetuning_epochs=30
)

# Print cross-validation results
print(f"Accuracy: {cv_results['accuracy'].mean():.4f} ± {cv_results['accuracy'].std():.4f}")
print(f"F1 Score: {cv_results['f1_macro'].mean():.4f} ± {cv_results['f1_macro'].std():.4f}")
print(f"AUROC: {cv_results['auroc_macro'].mean():.4f} ± {cv_results['auroc_macro'].std():.4f}")
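
The `stratify=True` option implies that each fold preserves the class proportions of the full dataset. One simple way to achieve that is sketched below in plain Python; the round-robin dealing strategy and the function name are assumptions for illustration, and `CrossValidator` may split differently.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=10, seed=42):
    """Assign each sample index to one of n_splits folds, balancing classes."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    folds = [[] for _ in range(n_splits)]
    offset = 0
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)
        # Deal indices round-robin so each fold gets a near-equal share of class y.
        for k, i in enumerate(idx):
            folds[(k + offset) % n_splits].append(i)
        offset += len(idx)
    return folds


labels = ["Control"] * 6 + ["ME/CFS"] * 6 + ["Long COVID"] * 6
folds = stratified_folds(labels, n_splits=3)
```

Each fold serves once as the validation set while the remaining folds are used for training.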

Evaluation

from epiclassify.evaluation import Evaluator

# Initialize evaluator
evaluator = Evaluator(model=model)

# Evaluate the model
eval_results = evaluator.evaluate(
    test_data=test_data,
    test_labels=test_labels
)

# Print evaluation results
print(f"Accuracy: {eval_results['accuracy']:.4f}")
print(f"F1 Score: {eval_results['f1_macro']:.4f}")
print(f"AUROC: {eval_results['auroc_macro']:.4f}")

# Generate confusion matrix
evaluator.plot_confusion_matrix(
    test_data=test_data,
    test_labels=test_labels,
    class_names=["Control", "ME/CFS", "Long COVID"],
    normalize=True,
    save_path="confusion_matrix.png"
)

# Generate ROC curves
evaluator.plot_roc_curves(
    test_data=test_data,
    test_labels=test_labels,
    class_names=["Control", "ME/CFS", "Long COVID"],
    save_path="roc_curves.png"
)
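
For readers unfamiliar with the reported metrics, the sketch below computes accuracy and macro-averaged F1 by hand on a toy three-class example. It is plain Python written for illustration and is not taken from `Evaluator`; the label encoding (0, 1, 2) is hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Macro averaging gives every class the same weight regardless of how many samples it has, which matters when the diagnostic groups are imbalanced.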

Inference Pipeline

EpiClassify provides tools for deploying models and running inference on new samples.

Basic Inference

from epiclassify.inference import Predictor

# Initialize predictor with a trained model
predictor = Predictor(model_path="finetuned_model.pt")

# Make predictions on new data
predictions = predictor.predict(new_data)

# Get prediction probabilities
probabilities = predictor.predict_proba(new_data)

# Generate a clinical report
report = predictor.generate_report(
    data=new_data,
    patient_id="EC-12345678",
    patient_metadata={"age": 42, "sex": "Female"}
)

Batch Processing

from epiclassify.inference import BatchProcessor

# Initialize batch processor
processor = BatchProcessor(
    model_path="finetuned_model.pt",
    preprocessing_pipeline="preprocessing_pipeline.pkl",
    output_dir="results"
)

# Process a batch of IDAT files
processor.process_batch(
    input_dir="/path/to/idat/files",
    metadata_file="metadata.csv"
)

# Generate reports for all samples
processor.generate_reports(
    template="clinical_report_template.html"
)

Deployment Options

REST API

Deploy the model as a REST API using FastAPI.

# Start the API server
epiclassify serve --model finetuned_model.pt --host 0.0.0.0 --port 8000

Docker Container

Deploy the model as a Docker container for easy distribution.

# Build a Docker image
epiclassify build-docker --model finetuned_model.pt --tag epiclassify:latest

# Run the Docker container
docker run -p 8000:8000 epiclassify:latest

Batch Processing

Run batch processing on a directory of files.

# Process a directory of files
epiclassify batch --model finetuned_model.pt --input /path/to/input --output /path/to/output

Conducting DNA Methylation Tests

This guide explains the process of collecting samples and uploading data for EpiClassify analysis.

EpiClassify requires DNA methylation data from either Illumina Infinium EPIC (850K) or 450K arrays. Each sample produces two IDAT files that must be uploaded together.

Sample Collection

Blood Collection: Draw a standard 8.5 mL EDTA blood tube from the patient during a routine visit.
Sample Handling: Store at room temperature and process within 24 hours.
DNA Extraction: Extract genomic DNA using standard protocols (Qiagen DNeasy or similar).
DNA Assessment: Verify DNA quality (260/280 ratio > 1.8) and quantity (200 ng minimum).

Methylation Array Processing

Array Selection: Process sample using either:
- Illumina Infinium MethylationEPIC BeadChip (850K sites)
- Illumina Infinium HumanMethylation450 BeadChip (450K sites)
Array Processing: Follow Illumina's standard protocols for bisulfite conversion and array hybridization.
Data Generation: Scan arrays using an Illumina iScan system, which will produce IDAT files.
IDAT Files: Each sample will generate exactly two IDAT files:
- *_Grn.idat - Contains green channel fluorescence intensity data
- *_Red.idat - Contains red channel fluorescence intensity data

Both IDAT files (Green and Red channels) are required for analysis. The upload workflow described in this guide accepts only raw IDAT files; normalized beta values, CSV files, and other pre-processed data cannot be submitted this way.
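
Before uploading, it can help to check locally that every sample has both channel files. The sketch below pairs files by base filename using only the Python standard library; the function name and the sample names `S1` and `S2` are hypothetical, and the check is independent of the portal itself.

```python
import tempfile
from pathlib import Path


def find_idat_pairs(directory):
    """Map each sample's base filename to its (green, red) IDAT paths.

    Returns (pairs, incomplete), where incomplete lists base names that are
    missing one of the two channel files.
    """
    folder = Path(directory)
    green = {p.name[:-len("_Grn.idat")]: p for p in folder.glob("*_Grn.idat")}
    red = {p.name[:-len("_Red.idat")]: p for p in folder.glob("*_Red.idat")}
    pairs = {base: (green[base], red[base]) for base in green.keys() & red.keys()}
    incomplete = sorted(green.keys() ^ red.keys())
    return pairs, incomplete


# Demonstration on empty placeholder files: S1 is complete, S2 lacks its red channel.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["S1_Grn.idat", "S1_Red.idat", "S2_Grn.idat"]:
        (Path(tmp) / name).touch()
    pairs, incomplete = find_idat_pairs(tmp)
    paired_samples = sorted(pairs)
```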

Data Upload

Upload sample data through the Clinicians Portal using one of the following methods:

Direct Upload

Log in to the Clinicians Portal
Navigate to the "Upload Data" tab
Drag and drop both IDAT files into the upload area or click to browse
Ensure both files belong to the same sample (same base filename)
Click "Upload" to start the upload process

Bulk Upload

Prepare multiple pairs of IDAT files
Click "Bulk Upload" in the Upload Data tab
Follow the instructions to upload multiple sample pairs
Ensure each sample has both Green and Red IDAT files

After Upload

Once files are uploaded, our system will:

Verify the integrity and compatibility of the IDAT files
Process methylation data using our preprocessing pipeline
Analyze the methylation patterns using our transformer model
Generate a comprehensive clinical report
Notify you when the analysis is complete (typically within 24-48 hours)

For laboratories interested in integrating with our system via API or SFTP, please contact [email protected] for implementation details.

References

Key references for the methods and techniques used in EpiClassify.

1. Vaswani, A., et al. (2017). Attention is All You Need. Advances in Neural Information Processing Systems, 30.

2. Trivedi, M. S., et al. (2018). Identification of ME/CFS-associated DNA methylation patterns. PLoS One, 13(7), e0201066.

3. Xiao, Y., & Vermund, S. (2024). DNA methylation in long COVID. Frontiers in Virology, 4, 1234567.

4. Shazeer, N., et al. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.

5. Levy, J. J., et al. (2020). MethylNet: an automated deep learning approach for DNA methylation analysis. BMC Bioinformatics, 21(1), 1-15.

6. Wang, Y., et al. (2023). MuLan-Methyl: transformer-based DNA methylation prediction. GigaScience, 12, giad046.

7. Bibikova, M., et al. (2011). High density DNA methylation array with single CpG site resolution. Genomics, 98(4), 288-295.

8. Moran, S., et al. (2016). Validation of a DNA methylation microarray for 850,000 CpG sites of the human genome enriched in enhancer sequences. Epigenomics, 8(3), 389-399.
