Gene expression prediction

Gene expression is regulated by multiple regions, which can span considerable genomic distances around the transcription start site (TSS). This task uses a flank window to include several regions near the TSS, so the model sees the combined regulatory context of each gene rather than a single region.

python ft_gep.py [OPTIONS] --flank-window FLANK_WINDOW_SIZE \
--train TRAIN_PATH \
--valid VALID_PATH \
--test TEST_PATH
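
As a minimal sketch, the same invocation can be assembled from Python, for example when launching several runs from a script. The helper name and the data paths below are hypothetical placeholders; only the flags come from the option list in this document.

```python
import shlex


def build_gep_command(train, valid, test, flank_window=4, extra_options=()):
    """Assemble the ft_gep.py command line as an argument list."""
    cmd = ["python", "ft_gep.py", *extra_options]
    cmd += ["--flank-window", str(flank_window)]
    cmd += ["--train", train, "--valid", valid, "--test", test]
    return cmd


# Hypothetical file names, used only for illustration.
cmd = build_gep_command(
    "data/train.csv",
    "data/valid.csv",
    "data/test.csv",
    extra_options=["--lr", "1e-4", "--max-epochs", "10"],
)

# Print the command as a single shell-quoted string for inspection.
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) to start the run.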

Options

--lr

Learning rate. Default is 1e-4.

--warmup-ratio

Warmup ratio for the learning rate. Default is 0.1.
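
A common reading of a warmup ratio is the fraction of total optimizer steps over which the learning rate ramps up linearly to its target value. The sketch below illustrates that reading only; the function name, the linear shape, and the constant rate after warmup are assumptions, not a description of the scheduler inside ft_gep.py.

```python
def warmup_lr(step, total_steps, base_lr=1e-4, warmup_ratio=0.1):
    """Linear warmup to base_lr, then constant (assumed schedule shape)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp up proportionally to the number of completed warmup steps.
        return base_lr * (step + 1) / warmup_steps
    return base_lr


# With 1000 total steps and a ratio of 0.1, warmup covers the first 100 steps.
for step in (0, 49, 99, 500):
    print(step, warmup_lr(step, total_steps=1000))
```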

--grad-samples

Target number of samples per gradient update, scaled automatically according to batch size and GPU count. Default is 128.
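
One plausible interpretation of this option is that the script derives the number of gradient-accumulation steps so that each optimizer update sees roughly --grad-samples samples. The arithmetic below is a sketch under that assumption; the function name and the rounding rule are hypothetical.

```python
def accumulation_steps(grad_samples=128, batch_size=2, num_gpus=1):
    """Batches to accumulate per optimizer step (assumed rule)."""
    samples_per_step = batch_size * num_gpus
    # Never go below one batch per update, even for very large batches.
    return max(1, grad_samples // samples_per_step)


print(accumulation_steps())            # defaults on a single GPU
print(accumulation_steps(num_gpus=4))  # same target spread over four GPUs
```

Under this reading, adding GPUs or raising the batch size lowers the number of accumulated batches while keeping the effective samples per update roughly constant.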

--pretrain-trainable

Number of pretrained layers to be trainable. Default is 2.

--max-epochs

Maximum number of training epochs. Default is 10.

--tag

Tag of the trainer, used for grouping logged results. Default is the literal string default.

--limit-val-batches

Number of batches to use for each validation. Default is 64.

--val-check-interval

Interval for validation checks. Default is 64.

--name

Name of the training session. Default is chrombert-ft-gep.

--save-top-k

Number of top-performing checkpoints to save. Default is 3.

--checkpoint-metric

Metric for checkpointing. Default is pcc.

--checkpoint-mode

Mode for checkpointing. Default is max.
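
Taken together, --save-top-k, --checkpoint-metric, and --checkpoint-mode describe a ranking rule: keep the k checkpoints whose monitored metric is best, where best means highest for max and, presumably, lowest for min. The sketch below shows that rule on made-up metric values; the function and the checkpoint names are hypothetical, and the numbers are not real results.

```python
def select_top_k(scores, k=3, mode="max"):
    """Return the names of the k best checkpoints by metric value."""
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    # Sort checkpoint names by their metric, best first.
    ranked = sorted(scores, key=scores.get, reverse=(mode == "max"))
    return ranked[:k]


# Made-up pcc values per checkpoint, for illustration only.
pcc_by_checkpoint = {
    "step64": 0.41,
    "step128": 0.55,
    "step192": 0.52,
    "step256": 0.58,
}
print(select_top_k(pcc_by_checkpoint, k=3, mode="max"))
```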

--log-every-n-steps

Logging frequency in terms of steps. Default is 50.

--kind

Type of task, such as regression or zero_inflation. Default is regression.

--loss

Loss function to be used. Default is rmse.
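
For orientation, the sketch below computes the two quantities named in the defaults: rmse (root mean squared error) as the loss, and pcc as the checkpoint metric, assuming pcc stands for the Pearson correlation coefficient. It uses plain Python on toy values and is not the script's own implementation.

```python
import math


def rmse(pred, target):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def pearson(pred, target):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(pred)
    mean_p, mean_t = sum(pred) / n, sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


# Toy values, not real expression data: predictions are offset by 0.5.
pred = [1.0, 2.0, 3.0, 4.0]
target = [1.5, 2.5, 3.5, 4.5]
print(rmse(pred, target), pearson(pred, target))
```

The toy data shows how the two measures differ: a constant offset raises rmse but leaves the correlation unchanged.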

--train

Path to the training data. This option is required.

--valid

Path to the validation data. This option is required.

--test

Path to the test data. This option is required.

--batch-size

Batch size for training. Default is 2.

--num-workers

Number of workers for data loading. Default is 4.

--basedir

Path to the base directory for model and data files. Default is os.path.expanduser("~/.cache/chrombert/data").

-g, --genome

Genome version. Only hg38 is currently supported. Default is hg38.

-k, --ckpt

Path to the pretrained checkpoint. Optional if it can be inferred from other arguments.

--mask

Path to the mtx mask file. Optional if it can be inferred from other arguments.

-d, --hdf5-file

Path to the HDF5 file that contains the dataset. Optional if it can be inferred from other arguments.

--dropout

Dropout rate for the model. Default is 0.1.

-hr, --high-resolution

Use 200-bp resolution instead of 1-kb. Note: 200-bp resolution is not yet available; it is planned for a future release.

--flank-window

Flank window size for genomic data embedding. Default is 4.
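
To make the idea of a flank window concrete, the sketch below assumes the genome is split into fixed-size bins (1 kb, matching the resolution mentioned under --high-resolution) and that a flank window of n selects n bins on each side of the bin containing the TSS. The symmetric layout, the helper name, and the example coordinate are all assumptions for illustration; the script's actual indexing may differ.

```python
def flank_bins(tss, flank_window=4, bin_size=1000):
    """Return (start, end) intervals for the TSS bin and its flanking bins."""
    center = tss // bin_size
    bins = []
    for idx in range(center - flank_window, center + flank_window + 1):
        if idx < 0:
            continue  # skip bins that would fall before the chromosome start
        bins.append((idx * bin_size, (idx + 1) * bin_size))
    return bins


# Hypothetical TSS coordinate on some chromosome.
regions = flank_bins(tss=1_234_567)
print(len(regions), regions[0], regions[-1])
```

Under these assumptions the default window of 4 yields nine 1-kb regions per gene: the TSS bin plus four on each side.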

--gep-zero-inflation

Specifies whether to include zero inflation in the GEP header. Default is False.

--gep-parallel-embedding

Enable parallel embedding, which is faster but requires more GPU memory.

--gep-gradient-checkpoint

Use gradient checkpointing to reduce GPU memory usage during training.