Prompt-enhanced

This script fine-tunes ChromBERT with extra information supplied as prompts, such as cell-type features or DNA sequence patterns. The model uses these prompts as additional context when making predictions on genomic data.

python ft_prompt_enhanced.py [OPTIONS] --prompt-kind KIND \
    --train TRAIN_PATH \
    --valid VALID_PATH \
    --test TEST_PATH

# use cache file for acceleration
python ft_prompt_enhanced.py [OPTIONS] \
    --prompt-kind KIND  \
    --prompt-regulator-cache-file CACHE_PATH1 \
    --prompt-celltype-cache-file CACHE_PATH2 \
    --train TRAIN_PATH \
    --valid VALID_PATH \
    --test TEST_PATH
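
The two invocations above differ only in the optional cache flags. As a minimal sketch, the argument list can be assembled in Python before handing it to `subprocess.run`. The helper name `build_command` and all paths below are hypothetical placeholders; only the flag names come from this page.

```python
import shlex


def build_command(prompt_kind, train, valid, test,
                  regulator_cache=None, celltype_cache=None, extra=()):
    """Assemble the ft_prompt_enhanced.py argument list (hypothetical helper)."""
    cmd = ["python", "ft_prompt_enhanced.py", *extra, "--prompt-kind", prompt_kind]
    # The cache flags are optional, so add them only when a path is given.
    if regulator_cache is not None:
        cmd += ["--prompt-regulator-cache-file", regulator_cache]
    if celltype_cache is not None:
        cmd += ["--prompt-celltype-cache-file", celltype_cache]
    cmd += ["--train", train, "--valid", valid, "--test", test]
    return cmd


# Placeholder paths; replace them with your own files.
cmd = build_command("cistrome", "path/to/train", "path/to/valid", "path/to/test",
                    extra=["--batch-size", "16"])
cached = build_command("cistrome", "path/to/train", "path/to/valid", "path/to/test",
                       regulator_cache="path/to/regulator_cache",
                       celltype_cache="path/to/celltype_cache")

# Print shell-quoted command lines for inspection.
print(shlex.join(cmd))
print(shlex.join(cached))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would launch the script without going through a shell.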

Options

--lr

Learning rate. Default is 1e-4.

--warmup-ratio

Warmup ratio. Default is 0.1.
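
To give a feel for what a warmup ratio of 0.1 means, here is a rough sketch. It assumes the ratio is the fraction of total optimizer steps spent ramping the learning rate linearly up to `--lr`. This page does not state that, so treat it as an assumption; the function names are hypothetical.

```python
def warmup_steps(warmup_ratio, total_steps):
    """Number of warmup steps, assuming the ratio is a fraction of all steps."""
    return round(warmup_ratio * total_steps)


def lr_at(step, base_lr, n_warmup):
    """Linear warmup to base_lr; held constant afterwards in this sketch."""
    if n_warmup <= 0:
        return base_lr
    return base_lr * min(1.0, step / n_warmup)


n_warmup = warmup_steps(0.1, 1000)      # documented default ratio, 1000 total steps
halfway = lr_at(50, 1e-4, n_warmup)     # halfway through the warmup phase
after = lr_at(500, 1e-4, n_warmup)      # well past the warmup phase
print(n_warmup, halfway, after)
```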

--grad-samples

Number of gradient samples. Automatically scaled according to the batch size and the number of GPUs. Default is 512.
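
One plausible reading of this scaling is that the script accumulates gradients until roughly `--grad-samples` samples have been seen. The sketch below assumes that relationship. It is not taken from the script, and the function name is hypothetical.

```python
def accumulation_steps(grad_samples, batch_size, num_gpus):
    """Batches to accumulate per optimizer step (assumed relationship)."""
    samples_per_step = batch_size * num_gpus
    # Never go below one batch per optimizer step.
    return max(1, grad_samples // samples_per_step)


single_gpu = accumulation_steps(512, 8, 1)   # documented defaults on one GPU
four_gpus = accumulation_steps(512, 8, 4)    # same settings on four GPUs
print(single_gpu, four_gpus)
```

Under this assumption, raising `--batch-size` or adding GPUs lowers the number of accumulated batches while keeping the effective number of samples per update about the same.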

--pretrain-trainable

Number of pretrained layers to be trainable. Default is 0.

--max-epochs

Number of epochs to train. Default is 10.

--tag

Tag of the trainer, used for grouping logged results. Default is default.

--limit-val-batches

Number of batches to use for each validation. Default is 64.

--val-check-interval

Validation check interval. Default is 64.

--name

Name of the trainer. Default is chrombert-ft-prompt-enhanced.

--save-top-k

Save top k checkpoints. Default is 3.

--checkpoint-metric

Checkpoint metric. Default is bce.

--checkpoint-mode

Checkpoint mode. Default is min.
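
The three checkpoint options work together: the metric is monitored, the mode says whether lower or higher is better, and only the best k checkpoints are kept. A minimal sketch of that selection logic follows, using made-up checkpoint names and bce values. It only illustrates the idea, not the script's internals.

```python
def top_k_checkpoints(scores, k=3, mode="min"):
    """Return the names of the k best checkpoints for the monitored metric."""
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")
    # With mode="min" the smallest metric value ranks first.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=(mode == "max"))
    return [name for name, _ in ranked[:k]]


# Hypothetical validation bce values recorded at four checkpoints.
bce_by_checkpoint = {"step64": 0.41, "step128": 0.35, "step192": 0.38, "step256": 0.44}
kept = top_k_checkpoints(bce_by_checkpoint, k=3, mode="min")
print(kept)
```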

--log-every-n-steps

Log every n steps. Default is 50.

--kind

Kind of the task. Choose from classification, regression, or zero_inflation. Default is classification.

--loss

Loss function. Default is focal.

--train

Path to the training data. This option is required.

--valid

Path to the validation data. This option is required.

--test

Path to the test data. This option is required.

--batch-size

Batch size. Default is 8. A larger value is recommended to speed up training.

--num-workers

Number of workers. Default is 4.

--basedir

Path to the base directory. Default is set to the value of os.path.expanduser("~/.cache/chrombert/data").
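
To see where the default base directory resolves on your machine, you can evaluate the same expression. This short check uses only the standard library; the variable name is illustrative.

```python
import os

# "~" expands to the current user's home directory.
default_basedir = os.path.expanduser("~/.cache/chrombert/data")
print(default_basedir)
```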

-g, --genome

Genome version, for example hg38 or mm10. Currently only hg38 is supported. Default is hg38.

-k, --ckpt

Path to the checkpoint used to initialize the model. Optional. Default is the pretrained checkpoint provided in the base directory.

--mask

Path to the mtx mask file. Optional if it can be inferred from other arguments.

-d, --hdf5-file

Path to the HDF5 file that contains the dataset. Optional if it can be inferred from other arguments.

--dropout

Dropout rate. Default is 0.1.

-hr, --high-resolution

Use 200-bp resolution instead of 1-kb resolution. Caution: 200-bp resolution is intended for a future release of ChromBERT and is not available yet.

--prompt-kind

Prompt data class. Choose from cistrome or expression. Default is None. This option is required.

--prompt-dim-external

Dimension of external data. Use 512 for scGPT, and 768 for ChromBERT’s embedding. Default is 512.

--prompt-celltype-cache-file

Path to the cell-type-specific prompt cache file. Provide it if you want to use a cache file to accelerate training. Optional. Not used by default.

--prompt-regulator-cache-file

Path to the regulator prompt cache file. Provide it if you want to use a cache file to accelerate training. Optional. Not used by default.