Gene expression prediction
Gene expression is influenced by multiple regulatory regions, particularly those near the transcription start site (TSS), and these regions can span considerable genomic distances. This task uses a flank window to take several neighboring regions into account, giving a broader view of regulatory effects on gene expression.
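To make the flank-window idea concrete, the sketch below lists which regions a window would cover. It assumes the window is symmetric around the region containing the TSS, so that a flank window of 4 spans 9 regions; the source does not confirm this layout. The function name `flank_region_indices` and the region index 100 are hypothetical and are not part of ft_gep.py.

```python
def flank_region_indices(center_index: int, flank_window: int) -> list[int]:
    """Return the center region index plus `flank_window` neighbors on each side.

    Assumption (not confirmed by the source): the window is symmetric around
    the region containing the TSS, so it spans 2 * flank_window + 1 regions.
    """
    if flank_window < 0:
        raise ValueError("flank_window must be non-negative")
    return list(range(center_index - flank_window, center_index + flank_window + 1))


# Hypothetical example: the TSS falls in region 100 and the flank window is 4.
regions = flank_region_indices(center_index=100, flank_window=4)
print(regions)       # [96, 97, 98, 99, 100, 101, 102, 103, 104]
print(len(regions))  # 9
```

Under this assumption, a larger flank window adds two regions per step, which is one reason memory use can grow quickly with the window size.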
python ft_gep.py [OPTIONS] --flank-window FLANK_WINDOW_SIZE \
--train TRAIN_PATH \
--valid VALID_PATH \
--test TEST_PATH
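The usage line above can also be assembled programmatically, for example when launching several runs from a driver script. The sketch below only builds and prints the command; it does not run it. The helper `build_gep_command` and the `path/to/...` values are hypothetical placeholders, while the flags come from the option list on this page.

```python
import shlex


def build_gep_command(train, valid, test, flank_window=4, extra_options=()):
    """Assemble the ft_gep.py command line as a list of arguments.

    `extra_options` holds any optional flags, for example ("--lr", "1e-4").
    """
    return [
        "python", "ft_gep.py",
        *extra_options,
        "--flank-window", str(flank_window),
        "--train", train,
        "--valid", valid,
        "--test", test,
    ]


# Hypothetical placeholder paths; replace them with real dataset files.
cmd = build_gep_command(
    train="path/to/train",
    valid="path/to/valid",
    test="path/to/test",
    flank_window=4,
    extra_options=("--lr", "1e-4", "--max-epochs", "10"),
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the paths point to real files.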
Options
- --lr
Learning rate. Default is 1e-4.
- --warmup-ratio
Warmup ratio for the learning rate. Default is 0.1.
- --grad-samples
Number of samples contributing to each gradient update, scaled automatically by batch size and GPU count. Default is 128.
- --pretrain-trainable
Number of pretrained layers to be trainable. Default is 2.
- --max-epochs
Maximum number of training epochs. Default is 10.
- --tag
Tag of the trainer, used for grouping logged results. Default is default.
- --limit-val-batches
Number of batches to use for each validation. Default is 64.
- --val-check-interval
Interval for validation checks. Default is 64.
- --name
Name of the training session. Default is chrombert-ft-gep.
- --save-top-k
Number of top-performing checkpoints to save. Default is 3.
- --checkpoint-metric
Metric for checkpointing. Default is pcc.
- --checkpoint-mode
Mode for checkpointing. Default is max.
- --log-every-n-steps
Logging frequency in terms of steps. Default is 50.
- --kind
Type of task, such as regression or zero_inflation. Default is regression.
- --loss
Loss function to be used. Default is rmse.
- --train
Path to the training data. This option is required.
- --valid
Path to the validation data. This option is required.
- --test
Path to the test data. This option is required.
- --batch-size
Batch size for training. Default is 2.
- --num-workers
Number of workers for data loading. Default is 4.
- --basedir
Path to the base directory for model and data files. Default is os.path.expanduser("~/.cache/chrombert/data").
- -g, --genome
Genome version. Only hg38 is currently supported. Default is hg38.
- -k, --ckpt
Path to the pretrained checkpoint. Optional if it can be inferred from other arguments.
- --mask
Path to the mtx mask file. Optional if it can be inferred from other arguments.
- -d, --hdf5-file
Path to the HDF5 file that contains the dataset. Optional if it can be inferred from other arguments.
- --dropout
Dropout rate for the model. Default is 0.1.
- -hr, --high-resolution
Use 200-bp resolution instead of 1-kb. Note: 200-bp resolution is not yet available; it is planned for a future release.
- --flank-window
Flank window size for genomic data embedding. Default is 4.
- --gep-zero-inflation
Specifies whether to include zero inflation in the GEP header. Default is False.
- --gep-parallel-embedding
Enable parallel embedding, which is faster but requires more GPU memory.
- --gep-gradient-checkpoint
Use gradient checkpointing to reduce GPU memory usage during training.
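The defaults for --loss (rmse), --checkpoint-metric (pcc), and --checkpoint-mode (max) can be illustrated with a small pure-Python sketch. It assumes that pcc stands for the Pearson correlation coefficient and that max mode keeps checkpoints with higher metric values. How ft_gep.py computes these quantities internally is not stated here, and the helper names and sample values below are hypothetical.

```python
import math


def rmse(pred, target):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def pcc(pred, target):
    """Pearson correlation coefficient (assumed meaning of the `pcc` metric)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


def is_better(new_score, best_score, mode="max"):
    """Compare scores the way a max/min checkpoint mode would (assumption)."""
    return new_score > best_score if mode == "max" else new_score < best_score


# Hypothetical predicted and observed expression values.
predicted = [0.5, 1.2, 2.9, 4.1]
observed = [0.4, 1.0, 3.2, 4.0]
print(f"rmse = {rmse(predicted, observed):.3f}")
print(f"pcc  = {pcc(predicted, observed):.3f}")
print(is_better(0.82, 0.79, mode="max"))  # True: a higher pcc is preferred
```

With pcc as the checkpoint metric, max is the natural mode because a higher correlation indicates a better fit; an error-style metric such as rmse would instead call for min.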