Quick Tour

Before starting this quick tour, make sure you are familiar with the basics of the PyTorch Lightning framework, including LightningDataModule and LightningModule. Also make sure you have downloaded the necessary files by running the chrombert_prepare_env command; see chrombert_prepare_env for details.
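As a quick sanity check before continuing, the sketch below tests whether the default cache files referenced in this tour are present. The paths are the defaults quoted in the sections that follow; the helper name `find_missing_files` is hypothetical and not part of chrombert. If you customized the cache directory, adjust the paths accordingly.

```python
import os

# Default cache locations quoted in this tour; adjust them if you customized the cache.
DEFAULT_FILES = {
    "hdf5_file": "~/.cache/chrombert/data/hg38_6k_1kb.hdf5",
    "region_bed": "~/.cache/chrombert/config/hg38_6k_1kb_region.bed",
    "pretrain_ckpt": "~/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt",
}


def find_missing_files(files):
    """Return the names of entries whose path is not an existing file.

    Hypothetical helper for illustration; it is not part of chrombert.
    """
    missing = []
    for name, path in files.items():
        # expanduser replaces the leading "~" with the current user's home directory
        if not os.path.isfile(os.path.expanduser(path)):
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_files(DEFAULT_FILES)
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All default files found.")
```

If anything is reported missing, rerun chrombert_prepare_env or point the configuration at your custom locations.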

OK! Let’s get started!

1 Customize the input

You can customize the model’s input by assigning parameters to the chrombert.DatasetConfig class. Key parameters include:

kind: Specifies the input format, which varies across tasks. It is crucial to assign this based on your specific task. Different tasks may require additional parameters, which you can find here.

hdf5_file: A preprocessed HDF5 file containing features for 1kb bins across the genome. This file is cached in the default directory (~/.cache/chrombert/data/hg38_6k_1kb.hdf5) upon installation, unless customized.

supervised_file: An input dataset containing at least four columns: chrom, start, end, and build_region_index. These four columns are used to locate and retrieve the features for each region. Depending on the task, you can add further columns such as label. The build_region_index for each region is listed in a file cached in the default directory (~/.cache/chrombert/config/hg38_6k_1kb_region.bed) upon installation, unless customized.
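To make the expected layout concrete, here is a minimal sketch that writes a small table with the four required columns plus an optional label column. The coordinates, index values, labels, the CSV format, and the helper name `write_supervised_table` are illustrative assumptions; in practice, build_region_index should be looked up from the region file described above, and you should confirm which file formats your chrombert version accepts.

```python
import csv
import io

# The four columns this tour lists as required in a supervised file.
REQUIRED_COLUMNS = ["chrom", "start", "end", "build_region_index"]

# Made-up example rows; real build_region_index values come from the region file.
rows = [
    {"chrom": "chr1", "start": 10000, "end": 11000, "build_region_index": 0, "label": 1},
    {"chrom": "chr1", "start": 16000, "end": 17000, "build_region_index": 1, "label": 0},
]


def write_supervised_table(rows, handle):
    """Write rows as CSV after checking that the required columns are present."""
    fieldnames = list(rows[0].keys())
    missing = [col for col in REQUIRED_COLUMNS if col not in fieldnames]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    writer = csv.DictWriter(handle, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Write to an in-memory buffer here; pass an open file handle to save to disk.
buffer = io.StringIO()
write_supervised_table(rows, buffer)
print(buffer.getvalue())
```

Checking the columns before writing catches a malformed table early, before it reaches the dataset loader.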

You can also configure other parameters like batch_size and num_workers.

import chrombert

# Create a DatasetConfig object with your settings
dc = chrombert.DatasetConfig(
    hdf5_file="~/.cache/chrombert/data/hg38_6k_1kb.hdf5",
    kind="GeneralDataset",
    supervised_file="<your_path_input_data>",
)

# Initialize inputs in whatever formats you want
ds = dc.init_dataset()  # Dataset

dl = dc.init_dataloader()  # Dataloader

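# LightningDataModule with separate files for the train, validation, and test splits
dm = chrombert.LitChromBERTFTDataModule(
    config=dc,
    train_params={"supervised_file": "<your_path_train_data>"},
    val_params={"supervised_file": "<your_path_valid_data>"},
    test_params={"supervised_file": "<your_path_test_data>"},
)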

2 Customize the model

The model structure depends on the task at hand. Use the ChromBERTFTConfig class to specify the task and configure its parameters. Remember to set the pretrain_ckpt parameter if you want to use the pre-trained ChromBERT model. The checkpoint file is cached in the default directory (~/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt) upon installation, unless customized.

Different tasks may require additional parameters, which you can find here.

# Configure the model for your task
mc = chrombert.ChromBERTFTConfig(
    task="general",
    pretrain_ckpt="~/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt",
)
model = mc.init_model()

# Optional: Manage trainable parameters
# model.freeze_pretrain(trainable=2)  # Freeze transformer layers; trainable is an int (2 is only an example value)
# model.display_trainable_parameters()  # Display the number of trainable layers

3 Customize the training process

Once the input and model are configured, you can customize the training process, including:

  • Task Type (kind): "regression" or "classification".

  • Loss Function (loss): Specify the type of loss (e.g., "bce" for binary cross-entropy).

  • Learning Rate (lr): Set the desired learning rate.

Explore other customizable training parameters here.

config_train = chrombert.finetune.TrainConfig(kind="classification",
                                             loss="bce", lr=1e-4)
pl_module = config_train.init_pl_module(model)
trainer = config_train.init_trainer()
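For reference, "bce" refers to binary cross-entropy, which for a label y in {0, 1} and a predicted probability p is -(y log p + (1 - y) log(1 - p)), averaged over samples. The sketch below computes it in plain Python purely to illustrate the formula. It is not ChromBERT's implementation, and the function name and the clipping constant eps are assumptions; a training framework may well apply the loss to logits rather than probabilities.

```python
import math


def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    if not labels or len(labels) != len(probs):
        raise ValueError("labels and probs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(labels, probs):
        # Clip p away from 0 and 1 so that log() never receives zero
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# Confident, correct predictions give a small loss; confident, wrong ones a large loss.
print(binary_cross_entropy([1, 0], [0.9, 0.1]))
print(binary_cross_entropy([1, 0], [0.1, 0.9]))
```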

4 Start training!

With everything in place, you’re ready to train the model:

trainer.fit(pl_module, datamodule=dm)

5 Task templates

To make your workflow easier, we’ve prepared a collection of ready-to-use scripts for different tasks. You can find detailed instructions and examples here.