Prompt-enhanced
This script fine-tunes ChromBERT with extra information supplied as prompts, such as cell-type features or DNA sequence patterns. The model uses these prompts as additional context when analyzing genomic data, which can help it make better predictions.
python ft_prompt_enhanced.py [OPTIONS] --prompt-kind KIND \
--train TRAIN_PATH \
--valid VALID_PATH \
--test TEST_PATH
# Use cache files to speed up training
python ft_prompt_enhanced.py [OPTIONS] \
--prompt-kind KIND \
--prompt-regulator-cache-file CACHE_PATH1 \
--prompt-celltype-cache-file CACHE_PATH2 \
--train TRAIN_PATH \
--valid VALID_PATH \
--test TEST_PATH
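The invocations above can also be assembled from Python, for example when several fine-tuning runs are launched from one driver script. The sketch below is only an illustration: it builds the argument list from the flags documented on this page, and the helper names `build_command` and `launch` as well as all file paths are hypothetical placeholders, not part of ChromBERT.

```python
import subprocess
import sys
from typing import List, Optional


def build_command(
    prompt_kind: str,
    train: str,
    valid: str,
    test: str,
    regulator_cache: Optional[str] = None,
    celltype_cache: Optional[str] = None,
    script: str = "ft_prompt_enhanced.py",
) -> List[str]:
    """Assemble the argument list for one fine-tuning run (hypothetical helper)."""
    # The documentation lists exactly these two prompt kinds.
    if prompt_kind not in ("cistrome", "expression"):
        raise ValueError(f"unsupported prompt kind: {prompt_kind!r}")
    cmd = [sys.executable, script, "--prompt-kind", prompt_kind]
    # Cache files are optional; add each flag only when a path is given.
    if regulator_cache is not None:
        cmd += ["--prompt-regulator-cache-file", regulator_cache]
    if celltype_cache is not None:
        cmd += ["--prompt-celltype-cache-file", celltype_cache]
    cmd += ["--train", train, "--valid", valid, "--test", test]
    return cmd


def launch(cmd: List[str]) -> subprocess.CompletedProcess:
    """Run the assembled command and raise if the script exits with an error."""
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder paths; replace them with real dataset files before launching.
    cmd = build_command("expression", "TRAIN_PATH", "VALID_PATH", "TEST_PATH")
    print(" ".join(cmd))
    # launch(cmd)  # uncomment to actually start fine-tuning
```

Keeping command construction separate from execution makes it easy to inspect or log the exact command line before any training starts.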
Options
- --lr
Learning rate. Default is 1e-4.
- --warmup-ratio
Warmup ratio. Default is 0.1.
- --grad-samples
Number of gradient samples. Automatically scaled according to the batch size and the number of GPUs. Default is 512.
- --pretrain-trainable
Number of pretrained layers to be trainable. Default is 0.
- --max-epochs
Number of epochs to train. Default is 10.
- --tag
Tag of the trainer, used for grouping logged results. Default is default.
- --limit-val-batches
Number of batches to use for each validation. Default is 64.
- --val-check-interval
Validation check interval. Default is 64.
- --name
Name of the trainer. Default is chrombert-ft-prompt-enhanced.
- --save-top-k
Save top k checkpoints. Default is 3.
- --checkpoint-metric
Checkpoint metric. Default is bce.
- --checkpoint-mode
Checkpoint mode. Default is min.
- --log-every-n-steps
Log every n steps. Default is 50.
- --kind
Kind of the task. Choose from classification, regression, or zero_inflation. Default is classification.
- --loss
Loss function. Default is focal.
- --train
Path to the training data. This option is required.
- --valid
Path to the validation data. This option is required.
- --test
Path to the test data. This option is required.
- --batch-size
Batch size. Default is 8. A larger value is recommended to speed up training.
- --num-workers
Number of workers. Default is 4.
- --basedir
Path to the base directory. Default is the value of os.path.expanduser("~/.cache/chrombert/data").
- -g, --genome
Genome version, for example hg38 or mm10. Currently only hg38 is supported. Default is hg38.
- -k, --ckpt
Path to the checkpoint used to initialize the model. Optional. Default is the pretrained checkpoint provided in the base directory.
- -d, --hdf5-file
Path to the HDF5 file that contains the dataset. Optional if it can be inferred from other arguments.
- --dropout
Dropout rate. Default is 0.1.
- -hr, --high-resolution
Use 200-bp resolution instead of 1-kb resolution. Caution: 200-bp resolution is intended for a future release of ChromBERT and is not available yet.
- --prompt-kind
Prompt data class. Choose from cistrome or expression. This option is required (default is None).
- --prompt-dim-external
Dimension of external data. Use 512 for scGPT, and 768 for ChromBERT’s embedding. Default is 512.
- --prompt-celltype-cache-file
Path to the cell-type-specific prompt cache file. Provide it to use a cache file to accelerate training. Optional. Not used by default.
- --prompt-regulator-cache-file
Path to the regulator prompt cache file. Provide it to use a cache file to accelerate training. Optional. Not used by default.
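Several of the numeric options above interact. As a rough illustration, the sketch below shows one plausible reading of how --grad-samples might translate into gradient accumulation steps given --batch-size and the number of GPUs, and how --warmup-ratio might translate into a number of warmup steps. Both formulas are assumptions made for illustration and are not taken from the ChromBERT source; the function names are hypothetical.

```python
def accumulation_steps(grad_samples: int, batch_size: int, num_gpus: int = 1) -> int:
    """Batches to accumulate so one optimizer step sees about `grad_samples` samples.

    Assumed formula: grad_samples // (batch_size * num_gpus), with a floor of 1.
    """
    if min(grad_samples, batch_size, num_gpus) < 1:
        raise ValueError("all arguments must be positive integers")
    samples_per_pass = batch_size * num_gpus  # samples seen in one forward/backward pass
    return max(1, grad_samples // samples_per_pass)


def warmup_steps(total_steps: int, warmup_ratio: float) -> int:
    """Number of warmup steps, assuming the ratio is a fraction of all optimizer steps."""
    if not 0.0 <= warmup_ratio <= 1.0:
        raise ValueError("warmup_ratio must lie in [0, 1]")
    return round(total_steps * warmup_ratio)


if __name__ == "__main__":
    # Documented defaults: --grad-samples 512, --batch-size 8, one GPU.
    print(accumulation_steps(512, 8, 1))  # 64
    # Same gradient sample target spread over four GPUs.
    print(accumulation_steps(512, 8, 4))  # 16
```

Under this assumption, raising --batch-size lowers the number of accumulated batches while the effective number of samples per update stays close to --grad-samples.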