Prompt-enhanced¶

This script allows you to fine-tune ChromBERT by adding extra information as prompts. You can include things like cell-type features or DNA sequence patterns to help the model make better predictions. The model uses these prompts as additional clues when analyzing genomic data.

python ft_prompt_enhanced.py [OPTIONS] --prompt-kind KIND \
    --train TRAIN_PATH \
    --valid VALID_PATH \
    --test TEST_PATH

# use cache file for acceleration
python ft_prompt_enhanced.py [OPTIONS] \
    --prompt-kind KIND  \
    --prompt-regulator-cache-file CACHE_PATH1 \
    --prompt-celltype-cache-file CACHE_PATH2 \
    --train TRAIN_PATH \
    --valid VALID_PATH \
    --test TEST_PATH

Options

--lr¶: Learning rate. Default is 1e-4.

--warmup-ratio¶: Warmup ratio. Default is 0.1.

--grad-samples¶: Number of gradient samples. Automatically scaled according to the batch size and GPU number. Default is 512.

--pretrain-trainable¶: Number of pretrained layers to be trainable. Default is 0.

--max-epochs¶: Number of epochs to train. Default is 10.

--tag¶: Tag of the trainer, used for grouping logged results. Default is default.

--limit-val-batches¶: Number of batches to use for each validation. Default is 64.

--val-check-interval¶: Validation check interval. Default is 64.

--name¶: Name of the trainer. Default is chrombert-ft-prompt-enhanced.

--save-top-k¶: Save top k checkpoints. Default is 3.

--checkpoint-metric¶: Checkpoint metric. Default is bce.

--checkpoint-mode¶: Checkpoint mode. Default is min.

--log-every-n-steps¶: Log every n steps. Default is 50.

--kind¶: Kind of the task. Choose from classification, regression, or zero_inflation. Default is classification.

--loss¶: Loss function. Default is focal.

--train¶: Path to the training data. This option is required.

--valid¶: Path to the validation data. This option is required.

--test¶: Path to the test data. This option is required.

--batch-size¶: Batch size. Default is 8. It’s suggested to set a larger number to accelerate training here.

--num-workers¶: Number of workers. Default is 4.

--basedir¶: Path to the base directory. Default is set to the value of os.path.expanduser("~/.cache/chrombert/data").

-g, --genome¶: Genome version. For example, hg38 or mm10. Only hg38 is supported now. Default is hg38.

-k, --ckpt¶: Path to the checkpoints used to initialize the model. Optional. Defualt is the pretrain checkpoint provided in the base directory.

--mask¶
Path to the mtx mask file. Optional if it could infered from other arguments.¶

-d, --hdf5-file¶: Path to the HDF5 file that contains the dataset. Optional if it could be inferred from other arguments.

--dropout¶: Dropout rate. Default is 0.1.

-hr, --high-resolution¶: Use 200-bp resolution instead of 1-kb resolution. Caution: 200-bp resolution is preparing for the future release of ChromBERT, which is not available yet.

--prompt-kind¶: Prompt data class. Choose from cistrome or expression. Default is None. This option is required.

--prompt-dim-external¶: Dimension of external data. Use 512 for scGPT, and 768 for ChromBERT’s embedding. Default is 512.

--prompt-celltype-cache-file¶: Path to the cell-type-specific prompt cache file. Provided if you want to use cache file to accelerate the training process. Optional. Default is not use it.

--prompt-regulator-cache-file¶: Path to the regulator prompt cache file. Provided if you want to use cache file to accelerate the training process. Optional. Default is not use it.