CLI Reference
Overview
We provide a set of command line scripts for your convenience. All scripts can be called in your terminal directly. See the following sections for more details.
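| Script | Description |
|---|---|
| chrombert_prepare_env | Download required files to ~/.cache/chrombert/data, or another path of your choice. |
| chrombert_make_dataset | Make a dataset for the ChromBERT forward pass. |
| chrombert_get_region_emb | Get mean-pooled TRN embeddings (region embeddings) and store them in a file. |
| chrombert_get_cistrome_emb | Get cistrome embeddings and store them in a file. |
| chrombert_get_regulator_emb | Get regulator embeddings and store them in a file. |
| chrombert_imputation_cistrome | Generate cistromes using prompt-enhanced ChromBERT. |
| chrombert_imputation_cistrome_sc | Generate cistromes using prompt-enhanced ChromBERT, specialized for single-cell data. |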
Details
chrombert_prepare_env
Download required files to ~/.cache/chrombert/data, or another path of your choice.
chrombert_prepare_env [OPTIONS]
Options
- --help
Show this message and exit.
- --basedir
The directory to store the data. Default is ~/.cache/chrombert/data.
- --hf-endpoint
The endpoint of the Hugging Face model.
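As a minimal sketch, the call below stores the data in a custom directory instead of the default. The path `/data/chrombert` is a hypothetical example, and the command is only printed, not executed, so the snippet runs without ChromBERT installed.

```python
import shlex

# Hypothetical custom location; the documented default is ~/.cache/chrombert/data.
basedir = "/data/chrombert"

prepare_cmd = ["chrombert_prepare_env", "--basedir", basedir]

# Printed rather than executed. To actually download the files,
# pass the list to subprocess.run(prepare_cmd, check=True).
print(shlex.join(prepare_cmd))
```

When a non-default directory is used here, the other scripts presumably need the same directory passed through their own `--basedir` option.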
chrombert_make_dataset
Generate general datasets for ChromBERT from bed files.
chrombert_make_dataset [OPTIONS] BED
Options
- BED
Path to the BED file.
- -o, --oname
Path to the output file. Writes to stdout if not specified. Must end with .tsv or .txt.
- --mode
Mode to generate the dataset. Choices are:
region: only consider overlap between input regions to determine the generated label. Useful for narrowPeak-like input.
all: report every overlap status, like bedtools intersect -wao. You must choose the label column yourself.
Default is region.
- --center
If used, only consider the center of the input regions.
- --label
If mode is not region, this column is used as the label. Default is the 4th column (1-based).
- --no-filter
Do not filter out regions that have no overlap.
- --basedir
Base directory for the required files. Default is set to the value of DEFAULT_BASEDIR.
- -g, --genome
Genome version, for example hg38 or mm10. Only hg38 is currently supported. Default is hg38.
- -hr, --high-resolution
Use 200-bp resolution instead of 1-kb resolution. Caution: 200-bp resolution is intended for a future release of ChromBERT and is not available yet.
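The option rules above can be checked before the script is launched. The sketch below builds the argument list and enforces the documented constraints on `--oname` and `--mode`. The helper `make_dataset_cmd` and the file names `peaks.bed` and `dataset.tsv` are hypothetical, and the script name is taken from the section heading.

```python
import shlex
from pathlib import Path

def make_dataset_cmd(bed, oname=None, mode="region", center=False):
    """Build a chrombert_make_dataset argument list from documented options."""
    if mode not in {"region", "all"}:
        raise ValueError("mode must be 'region' or 'all'")
    cmd = ["chrombert_make_dataset", bed, "--mode", mode]
    if oname is not None:
        # The reference states that the output name must end with .tsv or .txt.
        if Path(oname).suffix not in {".tsv", ".txt"}:
            raise ValueError("output name must end with .tsv or .txt")
        cmd += ["-o", oname]
    if center:
        cmd.append("--center")
    return cmd

# Hypothetical input and output file names.
dataset_cmd = make_dataset_cmd("peaks.bed", "dataset.tsv", center=True)
print(shlex.join(dataset_cmd))
```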
chrombert_get_region_emb
Extract mean-pooled TRN embeddings (region embeddings) from ChromBERT.
chrombert_get_region_emb [OPTIONS] SUPERVISED_FILE -o ONAME
Options
- SUPERVISED_FILE
Path to the supervised file.
- -o, --oname
Path to the output HDF5 file. This option is required.
- --basedir
Base directory for the required files. Default is set to the value of DEFAULT_BASEDIR.
- -g, --genome
Genome version, for example hg38 or mm10. Only hg38 is currently supported. Default is hg38.
- -k, --ckpt
Path to the pre-trained or fine-tuned checkpoint. Optional if it can be inferred from other arguments.
- --mask
Path to the matrix mask file. Optional if it can be inferred from other arguments.
- -d, --hdf5-file
Path to the HDF5 file that contains the dataset. Optional if it can be inferred from other arguments.
- -hr, --high-resolution
Use 200-bp resolution instead of 1-kb resolution. Caution: 200-bp resolution is intended for a future release of ChromBERT and is not available yet.
- --gpu
GPU index. Default is 0.
- --batch-size
Batch size. Default is 8.
- --num-workers
Number of workers for the dataloader. Default is 8.
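A typical call sets the required output path and, on a smaller machine, lowers the resource options below their defaults. This is a sketch only. The file names are hypothetical, and it assumes the supervised file is a table prepared beforehand (for example with chrombert_make_dataset), which this reference does not state explicitly.

```python
import shlex

# Hypothetical file names.
supervised_file = "dataset.tsv"
oname = "region_emb.hdf5"

region_cmd = [
    "chrombert_get_region_emb", supervised_file,
    "-o", oname,            # required output HDF5 file
    "--gpu", "0",           # GPU index (documented default: 0)
    "--batch-size", "4",    # below the documented default of 8
    "--num-workers", "2",   # below the documented default of 8
]

# Printed rather than executed, so the sketch runs without ChromBERT installed.
print(shlex.join(region_cmd))
```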
chrombert_get_cistrome_emb
Extract cistrome embeddings from ChromBERT.
chrombert_get_cistrome_emb [OPTIONS] SUPERVISED_FILE IDS... -o ONAME
Options
- SUPERVISED_FILE
Path to the supervised file.
- IDS
IDs to extract. Can be in GSMID format or in regulator:cellline format. To generate a cache file for prompts, use the regulator:cellline format.
- -o, --oname
Path to the output HDF5 file. This option is required.
- --basedir
Base directory for the required files. Default is set to the value of DEFAULT_BASEDIR.
- -g, --genome
Genome version, for example hg38 or mm10. Only hg38 is currently supported. Default is hg38.
- -k, --ckpt
Path to the pre-trained or fine-tuned checkpoint. Optional if it can be inferred from other arguments.
- --meta
Path to the meta file. Optional if it can be inferred from other arguments.
- --mask
Path to the matrix mask file. Optional if it can be inferred from other arguments.
- -d, --hdf5-file
Path to the HDF5 file that contains the dataset. Optional if it can be inferred from other arguments.
- -hr, --high-resolution
Use 200-bp resolution instead of 1-kb resolution. Caution: 200-bp resolution is intended for a future release of ChromBERT and is not available yet.
- --batch-size
Batch size. Default is 8.
- --num-workers
Number of workers for the dataloader. Default is 8.
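Because IDS accepts two formats, it can help to check each ID before starting a long run. The sketch below uses a simple heuristic that is an assumption, not the script's own parsing: a colon marks regulator:cellline, and a `GSM` prefix marks a GSMID. The ID `GSM0000000` and the file names are placeholders.

```python
import shlex

def id_kind(identifier):
    """Classify an ID by its apparent format (heuristic, not ChromBERT's parser)."""
    if ":" in identifier:
        return "regulator:cellline"
    if identifier.upper().startswith("GSM"):
        return "GSMID"
    raise ValueError(f"unrecognized ID format: {identifier!r}")

# Placeholder IDs: one per accepted format.
ids = ["ctcf:k562", "GSM0000000"]
kinds = [id_kind(i) for i in ids]

cistrome_cmd = ["chrombert_get_cistrome_emb", "dataset.tsv", *ids, "-o", "cistrome_emb.hdf5"]
print(kinds)
print(shlex.join(cistrome_cmd))
```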
chrombert_get_regulator_emb
Extract regulator embeddings from ChromBERT.
chrombert_get_regulator_emb [OPTIONS] SUPERVISED_FILE IDS... -o ONAME
Options
- SUPERVISED_FILE
Path to the supervised file.
- IDS
Regulator names to extract. Must be in lower case.
- -o, --oname
Path to the output HDF5 file. This option is required.
- --basedir
Base directory for the required files. Default is set to the value of DEFAULT_BASEDIR.
- -g, --genome
Genome version, for example hg38 or mm10. Only hg38 is currently supported. Default is hg38.
- -k, --ckpt
Path to the pre-trained or fine-tuned checkpoint. Optional if it can be inferred from other arguments.
- --meta
Path to the meta file. Optional if it can be inferred from other arguments.
- --mask
Path to the matrix mask file. Optional if it can be inferred from other arguments.
- -d, --hdf5-file
Path to the HDF5 file that contains the dataset. Optional if it can be inferred from other arguments.
- -hr, --high-resolution
Use 200-bp resolution instead of 1-kb resolution. Caution: 200-bp resolution is intended for a future release of ChromBERT and is not available yet.
- --batch-size
Batch size. Default is 8.
- --num-workers
Number of workers for the dataloader. Default is 8.
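Regulator names often arrive in mixed case from gene lists, while this script expects lower case. A minimal sketch of normalizing them before building the call is shown below. The file names are hypothetical.

```python
import shlex

# Names as they might appear in an upstream gene list.
requested = ["CTCF", "H3K27ac"]

# The reference requires lower-case regulator names.
regulators = [name.lower() for name in requested]

regulator_cmd = [
    "chrombert_get_regulator_emb", "dataset.tsv",
    *regulators,
    "-o", "regulator_emb.hdf5",
]
print(shlex.join(regulator_cmd))
```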
chrombert_imputation_cistrome
Generate predictions (a full BigWig file or a table) from ChromBERT for a given cell type, set of regions, and regulator.
Note
Either --o-bw or --o-table must be provided, depending on the output format you want.
chrombert_imputation_cistrome [OPTIONS] SUPERVISED_FILE --o-bw BW_PATH --o-table TABLE_PATH --finetune-ckpt CKPT --prompt-kind KIND
Options
- SUPERVISED_FILE
Path to the supervised file.
- --o-bw
Path to the output BigWig file.
- --o-table
Path to the output table, if you want table output.
- --prompt-kind
Prompt data class. Choose from cistrome or expression. This option is required.
- --basedir
Base directory for the required files. Default is set to the value of DEFAULT_BASEDIR.
- -g, --genome
Genome version, for example hg38 or mm10. Only hg38 is currently supported. Default is hg38.
- --pretrain-ckpt
Path to the pre-trained checkpoint. Optional if it can be inferred from other arguments.
- -d, --hdf5-file
Path to the HDF5 file that contains the dataset. Optional if it can be inferred from other arguments.
- -hr, --high-resolution
Use 200-bp resolution instead of 1-kb resolution. Caution: 200-bp resolution is intended for a future release of ChromBERT and is not available yet.
- --finetune-ckpt
Path to the fine-tuned checkpoint. Optional.
- --prompt-dim-external
Dimension of external data. Use 512 for scGPT and 768 for ChromBERT’s embedding. Default is 512.
- --prompt-celltype-cache-file
Path to the cell-type-specific prompt cache file. Optional.
- --prompt-regulator-cache-file
Path to the regulator prompt cache file. Optional.
- --prompt-celltype
The cell-type-specific prompt. For example, dnase:k562 for a cistrome prompt and k562 for an expression prompt. It can also be provided in the supervised file if the format supports it. Optional.
- --prompt-regulator
The regulator prompt. Determines the kind of output. For example, ctcf or h3k27ac. It can also be provided in the supervised file if the format supports it. Optional.
- --batch-size
Batch size. Default is 8.
- --num-workers
Number of workers for the dataloader. Default is 8.
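The requirement that at least one of `--o-bw` and `--o-table` is given can be enforced when the call is assembled. The sketch below does that and fills in the prompt options with the example values from the option descriptions. The helper `imputation_cmd` and all file names are hypothetical.

```python
import shlex

def imputation_cmd(supervised_file, prompt_kind, o_bw=None, o_table=None, extra=()):
    """Build a chrombert_imputation_cistrome argument list from documented options."""
    if prompt_kind not in {"cistrome", "expression"}:
        raise ValueError("prompt kind must be 'cistrome' or 'expression'")
    if o_bw is None and o_table is None:
        raise ValueError("provide --o-bw, --o-table, or both")
    cmd = ["chrombert_imputation_cistrome", supervised_file, "--prompt-kind", prompt_kind]
    if o_bw is not None:
        cmd += ["--o-bw", o_bw]
    if o_table is not None:
        cmd += ["--o-table", o_table]
    return cmd + list(extra)

# Hypothetical file names; prompt values follow the examples in the option list.
impute_cmd = imputation_cmd(
    "dataset.tsv", "cistrome", o_table="prediction_table.tsv",
    extra=["--finetune-ckpt", "finetuned.ckpt",
           "--prompt-celltype", "dnase:k562",
           "--prompt-regulator", "ctcf"],
)
print(shlex.join(impute_cmd))
```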
chrombert_imputation_cistrome_sc
Generate predictions (in HDF5 format) from ChromBERT, given single cells, regions, and a regulator.
chrombert_imputation_cistrome_sc [OPTIONS] SUPERVISED_FILE --o-h5 H5_PATH --finetune-ckpt CKPT --prompt-kind KIND
Options
- SUPERVISED_FILE
Path to the supervised file.
- --o-h5
Path to the output HDF5 file. This option is required.
- --prompt-kind
Prompt data class. Choose from cistrome or expression. This option is required.
- --basedir
Base directory for the required files. Default is set to the value of DEFAULT_BASEDIR.
- -g, --genome
Genome version, for example hg38 or mm10. Only hg38 is currently supported. Default is hg38.
- --pretrain-ckpt
Path to the pre-trained checkpoint. Optional if it can be inferred from other arguments.
- -d, --hdf5-file
Path to the HDF5 file that contains the dataset. Optional if it can be inferred from other arguments.
- -hr, --high-resolution
Use 200-bp resolution instead of 1-kb resolution. Caution: 200-bp resolution is intended for a future release of ChromBERT and is not available yet.
- --finetune-ckpt
Path to the fine-tuned checkpoint. Optional.
- --prompt-dim-external
Dimension of external data. Use 512 for scGPT and 768 for ChromBERT’s embedding. Default is 512.
- --prompt-celltype-cache-file
Path to the cell-type-specific prompt cache file. Optional.
- --prompt-regulator-cache-file
Path to the regulator prompt cache file. Optional.
- --prompt-regulator-cache-pin-memory
Pin memory for the regulator prompt cache for further acceleration. Default is False.
- --prompt-regulator-cache-limit
Limit on the regulator prompts cached in memory. Be mindful of your memory usage!
- --prompt-celltype
The cell-type-specific prompt. For example, dnase:k562 for a cistrome prompt and k562 for an expression prompt. It can also be provided in the supervised file if the format supports it. Optional.
- --prompt-regulator
The regulator prompt. Determines the kind of output. For example, ctcf or h3k27ac. It can also be provided in the supervised file if the format supports it. Optional.
- --batch-size
Batch size. Default is 8.
- --num-workers
Number of workers for the dataloader. Default is 8.
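The value of `--prompt-dim-external` depends on where the prompt embeddings come from, so a small lookup avoids passing a mismatched size. The sketch below is illustrative. The helper `sc_imputation_cmd`, the `prompt_source` keys, and the file names are hypothetical, and pairing scGPT embeddings with the expression prompt kind is an assumption.

```python
import shlex

# Sizes stated for --prompt-dim-external: 512 for scGPT, 768 for ChromBERT embeddings.
PROMPT_DIM = {"scgpt": 512, "chrombert": 768}

def sc_imputation_cmd(supervised_file, o_h5, prompt_kind, prompt_source, finetune_ckpt):
    """Build a chrombert_imputation_cistrome_sc argument list from documented options."""
    if prompt_kind not in {"cistrome", "expression"}:
        raise ValueError("prompt kind must be 'cistrome' or 'expression'")
    return [
        "chrombert_imputation_cistrome_sc", supervised_file,
        "--o-h5", o_h5,
        "--finetune-ckpt", finetune_ckpt,
        "--prompt-kind", prompt_kind,
        "--prompt-dim-external", str(PROMPT_DIM[prompt_source]),
    ]

# Hypothetical file names.
sc_cmd = sc_imputation_cmd(
    "dataset.tsv", "sc_prediction.hdf5", "expression", "scgpt", "finetuned.ckpt"
)
print(shlex.join(sc_cmd))
```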