CLI Reference¶

Overview¶

We provide a set of command-line scripts that you can call directly from your terminal. The sections below describe each script and its options.

Script Summary¶
Script	Description
chrombert_prepare_env	Download required files to ~/.cache/chrombert/data or another path of your choice.
chrombert_make_dataset	Make a dataset for the ChromBERT forward pass.
chrombert_get_region_emb	Get mean pooled TRN embedding (region embedding) and store in a file.
chrombert_get_cistrome_emb	Get cistrome embedding and store in a file.
chrombert_get_regulator_emb	Get regulator embedding and store in a file.
chrombert_imputation_cistrome	Generate cistromes using prompt-enhanced ChromBERT.
chrombert_imputation_cistrome_sc	Generate cistromes using prompt-enhanced ChromBERT, specialized for single-cell data.

Details¶

chrombert_prepare_env¶

Download required files to ~/.cache/chrombert/data or another path of your choice.

chrombert_prepare_env [OPTIONS]

Options

--help¶: Show this message and exit.

--basedir¶: The directory to store the data. Default is ~/.cache/chrombert/data.

--hf-endpoint¶: The endpoint of the Hugging Face model.
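
As a minimal sketch, the download step can also be scripted from Python. The helper names `prepare_env_command` and `run_if_installed` and the example directory are hypothetical; only the `--basedir` option is taken from the list above, and nothing is executed unless the script is found on `PATH`.

```python
import shlex
import shutil
import subprocess
from pathlib import Path


def prepare_env_command(basedir=None):
    """Build the argument list for chrombert_prepare_env (hypothetical helper)."""
    cmd = ["chrombert_prepare_env"]
    if basedir is not None:
        # --basedir overrides the default ~/.cache/chrombert/data location.
        cmd += ["--basedir", str(Path(basedir).expanduser())]
    return cmd


def run_if_installed(cmd):
    """Run cmd only when its executable is on PATH; return True if it ran."""
    if shutil.which(cmd[0]) is None:
        return False
    subprocess.run(cmd, check=True)
    return True


# Print the command so it can be reviewed before anything is downloaded.
cmd = prepare_env_command("~/chrombert_data")
print(shlex.join(cmd))
```

Calling `run_if_installed(cmd)` afterwards would start the download; keeping that step separate makes it easy to inspect the command first.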

chrombert_make_dataset¶

Generate general datasets for ChromBERT from bed files.

chrombert_make_dataset [OPTIONS] BED

Options

BED¶: Path to the bed file.

-o, --oname¶: Path to the output file. Stdout if not specified. Must end with .tsv or .txt.

--mode¶

Mode to generate the dataset. Choices are:

region: Use only the overlap with the input regions to determine the generated label. Useful for narrowPeak-like input.
all: Report every overlap status, like bedtools intersect -wao. You must choose the label column yourself.

Default is region.

--center¶: If used, only consider the center of the input regions.

--label¶: Column to use as the label when the mode is not region. Default is the 4th column (1-based).

--no-filter¶: Do not filter out regions that have no overlap.

--basedir¶: Base directory for the required files. Default is set to the value of DEFAULT_BASEDIR.

-g, --genome¶: Genome version. For example, hg38 or mm10. Only hg38 is supported now. Default is hg38.

-hr, --high-resolution¶: Use 200-bp resolution instead of 1-kb resolution. Caution: 200-bp resolution is being prepared for a future release of ChromBERT and is not available yet.
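
The following sketch writes a small three-column BED file and assembles a matching `chrombert_make_dataset` call. The file names, coordinates, and the helper `make_dataset_command` are illustrative assumptions; the flags and the `.tsv`/`.txt` requirement come from the options above.

```python
import shlex
import tempfile
from pathlib import Path


def make_dataset_command(bed, oname, mode="region", center=False):
    """Build the argument list for chrombert_make_dataset (hypothetical helper)."""
    if mode not in ("region", "all"):
        raise ValueError("--mode must be 'region' or 'all'")
    if not str(oname).endswith((".tsv", ".txt")):
        raise ValueError("--oname must end with .tsv or .txt")
    cmd = ["chrombert_make_dataset", "--mode", mode, "-o", str(oname)]
    if center:
        # Only the center of each input region is considered.
        cmd.append("--center")
    return cmd + [str(bed)]


# Illustrative input: chromosome, start, end (tab-separated).
workdir = Path(tempfile.mkdtemp())
bed = workdir / "peaks.bed"
bed.write_text("chr1\t10000\t10500\nchr2\t20000\t20800\n")

print(shlex.join(make_dataset_command(bed, workdir / "dataset.tsv", center=True)))
```

Checking the output extension before launching the script avoids a failed run on a large BED file.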

chrombert_get_region_emb¶

Extract mean pooled TRN embeddings (region embeddings) from ChromBERT.

chrombert_get_region_emb [OPTIONS] SUPERVISED_FILE -o ONAME

Options

SUPERVISED_FILE¶: Path to the supervised file.

-o, --oname¶: Path to the output HDF5 file. This option is required.

--basedir¶: Base directory for the required files. Default is set to the value of DEFAULT_BASEDIR.

-g, --genome¶: Genome version. For example, hg38 or mm10. Only hg38 is supported now. Default is hg38.

-k, --ckpt¶: Path to the pre-trained or fine-tuned checkpoint. Optional if it can be inferred from other arguments.

--mask¶: Path to the matrix mask file. Optional if it can be inferred from other arguments.

-d, --hdf5-file¶: Path to the HDF5 file that contains the dataset. Optional if it can be inferred from other arguments.

-hr, --high-resolution¶: Use 200-bp resolution instead of 1-kb resolution. Caution: 200-bp resolution is being prepared for a future release of ChromBERT and is not available yet.

--gpu¶: GPU index. Default is 0.

--batch-size¶: Batch size. Default is 8.

--num-workers¶: Number of workers for the dataloader. Default is 8.
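
The term "mean pooled" can be illustrated with a toy example. The sketch assumes, for illustration only, that the TRN embedding of one region is a matrix with one row per token and that the region embedding is the column-wise mean of those rows. The function name and the tiny sizes are hypothetical; the real computation happens inside ChromBERT.

```python
from statistics import fmean


def mean_pool(token_embeddings):
    """Average a list of equal-length vectors column by column."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    return [fmean(column) for column in zip(*token_embeddings)]


# Toy TRN embedding: 3 tokens x 4 dimensions (real embeddings are much larger).
trn_embedding = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
region_embedding = mean_pool(trn_embedding)
print(region_embedding)  # [2.0, 1.0, 2.0, 2.0]
```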

chrombert_get_cistrome_emb¶

Extract cistrome embeddings from ChromBERT.

chrombert_get_cistrome_emb [OPTIONS] SUPERVISED_FILE IDS... -o ONAME

Options

SUPERVISED_FILE¶: Path to the supervised file.

IDS¶: IDs to extract. Can be in GSMID format or the regulator:cellline format. To generate a cache file for prompts, use the regulator:cellline format.

-o, --oname¶: Path to the output HDF5 file. This option is required.

--basedir¶: Base directory for the required files. Default is set to the value of DEFAULT_BASEDIR.

-g, --genome¶: Genome version. For example, hg38 or mm10. Only hg38 is supported now. Default is hg38.

-k, --ckpt¶: Path to the pre-trained or fine-tuned checkpoint. Optional if it can be inferred from other arguments.

--meta¶: Path to the meta file. Optional if it can be inferred from other arguments.

--mask¶: Path to the matrix mask file. Optional if it can be inferred from other arguments.

-d, --hdf5-file¶: Path to the HDF5 file that contains the dataset. Optional if it can be inferred from other arguments.

-hr, --high-resolution¶: Use 200-bp resolution instead of 1-kb resolution. Caution: 200-bp resolution is being prepared for a future release of ChromBERT and is not available yet.

--batch-size¶: Batch size. Default is 8.

--num-workers¶: Number of workers for the dataloader. Default is 8.
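
Because IDS accepts two formats, it can help to check a list before launching a long run. The sketch below is an assumption-laden illustration: the regular expression for GSM IDs, the example identifiers, the file names, and the helper `id_format` are not part of ChromBERT.

```python
import re
import shlex

# Assumed shape of a GSM ID: "GSM" followed by digits, in either case.
GSM_PATTERN = re.compile(r"gsm\d+", re.IGNORECASE)


def id_format(identifier):
    """Return 'gsmid' or 'regulator:cellline', or raise ValueError."""
    if GSM_PATTERN.fullmatch(identifier):
        return "gsmid"
    regulator, sep, cellline = identifier.partition(":")
    if sep and regulator and cellline:
        return "regulator:cellline"
    raise ValueError(f"unrecognized ID: {identifier!r}")


ids = ["GSM1234567", "ctcf:k562"]  # illustrative values only
for identifier in ids:
    print(identifier, "->", id_format(identifier))

# Options first, then SUPERVISED_FILE, then the IDs.
cmd = ["chrombert_get_cistrome_emb", "-o", "cistrome_emb.hdf5", "dataset.tsv", *ids]
print(shlex.join(cmd))
```

When the goal is a prompt cache file, every ID should come back as `regulator:cellline`.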

chrombert_get_regulator_emb¶

Extract regulator embeddings from ChromBERT.

chrombert_get_regulator_emb [OPTIONS] SUPERVISED_FILE IDS... -o ONAME

Options

SUPERVISED_FILE¶: Path to the supervised file.

IDS¶: Regulator names to extract. Must be in lower case.

-o, --oname¶: Path to the output HDF5 file. This option is required.

--basedir¶: Base directory for the required files. Default is set to the value of DEFAULT_BASEDIR.

-g, --genome¶: Genome version. For example, hg38 or mm10. Only hg38 is supported now. Default is hg38.

-k, --ckpt¶: Path to the pre-trained or fine-tuned checkpoint. Optional if it can be inferred from other arguments.

--meta¶: Path to the meta file. Optional if it can be inferred from other arguments.

--mask¶: Path to the matrix mask file. Optional if it can be inferred from other arguments.

-d, --hdf5-file¶: Path to the HDF5 file that contains the dataset. Optional if it can be inferred from other arguments.

-hr, --high-resolution¶: Use 200-bp resolution instead of 1-kb resolution. Caution: 200-bp resolution is being prepared for a future release of ChromBERT and is not available yet.

--batch-size¶: Batch size. Default is 8.

--num-workers¶: Number of workers for the dataloader. Default is 8.
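
Since regulator names must be in lower case, a small wrapper can normalize them before the call is assembled. This is a sketch only: the helper `regulator_emb_command` and the file names are hypothetical, while the flags are the ones listed above.

```python
import shlex


def regulator_emb_command(supervised_file, oname, regulators, batch_size=8):
    """Build the argument list for chrombert_get_regulator_emb (hypothetical helper)."""
    # The IDS argument expects lower-case regulator names.
    names = [name.strip().lower() for name in regulators]
    if not names:
        raise ValueError("at least one regulator name is required")
    return [
        "chrombert_get_regulator_emb",
        "-o", str(oname),
        "--batch-size", str(batch_size),
        str(supervised_file),
        *names,
    ]


cmd = regulator_emb_command("dataset.tsv", "regulator_emb.hdf5", ["CTCF", "H3K27ac"])
print(shlex.join(cmd))
```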

chrombert_imputation_cistrome¶

Generate prediction results (a full BigWig file or a table) from ChromBERT for a given cell type, set of regions, and regulator.

Note

Either --o-bw or --o-table must be provided, depending on the output format you want.

chrombert_imputation_cistrome [OPTIONS] SUPERVISED_FILE --o-bw BW_PATH --o-table TABLE_PATH --finetune-ckpt CKPT --prompt-kind KIND

Options

SUPERVISED_FILE¶: Path to the supervised file.

--o-bw¶: Path of the output BigWig file.

--o-table¶: Path to the output table, if you want table output.

--prompt-kind¶

Prompt data class. Choose from cistrome or expression. This option is required.

--basedir¶: Base directory for the required files. Default is set to the value of DEFAULT_BASEDIR.

-g, --genome¶: Genome version. For example, hg38 or mm10. Only hg38 is supported now. Default is hg38.

--pretrain-ckpt¶: Path to the pretrain checkpoint. Optional if it could be inferred from other arguments.

-d, --hdf5-file¶: Path to the HDF5 file that contains the dataset. Optional if it could be inferred from other arguments.

-hr, --high-resolution¶: Use 200-bp resolution instead of 1-kb resolution. Caution: 200-bp resolution is being prepared for a future release of ChromBERT and is not available yet.

--finetune-ckpt¶: Path to the finetune checkpoint. Optional.

--prompt-dim-external¶: Dimension of external data. Use 512 for scGPT and 768 for ChromBERT’s embedding. Default is 512.

--prompt-celltype-cache-file¶: Path to the cell-type-specific prompt cache file. Optional.

--prompt-regulator-cache-file¶: Path to the regulator prompt cache file. Optional.

--prompt-celltype¶: The cell-type-specific prompt. For example, dnase:k562 for a cistrome prompt and k562 for an expression prompt. It can also be provided in the supervised file if the format supports it. Optional.

--prompt-regulator¶: The regulator prompt. Determines the kind of output. For example, ctcf or h3k27ac. It can also be provided in the supervised file if the format supports it. Optional.

--batch-size¶: Batch size. Default is 8.

--num-workers¶: Number of workers for the dataloader. Default is 8.
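
The two constraints stated above (a valid `--prompt-kind` and at least one of `--o-bw` or `--o-table`) can be checked before the script is launched. The helper `imputation_command` and the file names below are hypothetical; the prompt values reuse the examples from the option descriptions.

```python
import shlex


def imputation_command(supervised_file, prompt_kind, o_bw=None, o_table=None,
                       finetune_ckpt=None, prompt_celltype=None,
                       prompt_regulator=None):
    """Build the argument list for chrombert_imputation_cistrome (hypothetical helper)."""
    if prompt_kind not in ("cistrome", "expression"):
        raise ValueError("--prompt-kind must be 'cistrome' or 'expression'")
    if o_bw is None and o_table is None:
        raise ValueError("provide --o-bw, --o-table, or both")
    cmd = ["chrombert_imputation_cistrome", "--prompt-kind", prompt_kind]
    optional = (
        ("--o-bw", o_bw),
        ("--o-table", o_table),
        ("--finetune-ckpt", finetune_ckpt),
        ("--prompt-celltype", prompt_celltype),
        ("--prompt-regulator", prompt_regulator),
    )
    for flag, value in optional:
        if value is not None:  # skip options that were not requested
            cmd += [flag, str(value)]
    return cmd + [str(supervised_file)]


cmd = imputation_command(
    "dataset.tsv",
    "cistrome",
    o_table="ctcf_k562.tsv",
    prompt_celltype="dnase:k562",
    prompt_regulator="ctcf",
)
print(shlex.join(cmd))
```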

chrombert_imputation_cistrome_sc¶

Generate prediction results (in HDF5 format) from ChromBERT for the given single-cell data, regions, and regulator.

chrombert_imputation_cistrome_sc [OPTIONS] SUPERVISED_FILE --o-h5 H5_PATH --finetune-ckpt CKPT --prompt-kind KIND

Options

SUPERVISED_FILE¶: Path to the supervised file.

--o-h5¶: Path of the output HDF5 file. This option is required.

--prompt-kind¶

Prompt data class. Choose from cistrome or expression. This option is required.

--basedir¶: Base directory for the required files. Default is set to the value of DEFAULT_BASEDIR.

-g, --genome¶: Genome version. For example, hg38 or mm10. Only hg38 is supported now. Default is hg38.

--pretrain-ckpt¶: Path to the pretrain checkpoint. Optional if it could be inferred from other arguments.

-d, --hdf5-file¶: Path to the HDF5 file that contains the dataset. Optional if it could be inferred from other arguments.

-hr, --high-resolution¶: Use 200-bp resolution instead of 1-kb resolution. Caution: 200-bp resolution is being prepared for a future release of ChromBERT and is not available yet.

--finetune-ckpt¶: Path to the finetune checkpoint. Optional.

--prompt-dim-external¶: Dimension of external data. Use 512 for scGPT and 768 for ChromBERT’s embedding. Default is 512.

--prompt-celltype-cache-file¶: Path to the cell-type-specific prompt cache file. Optional.

--prompt-regulator-cache-file¶: Path to the regulator prompt cache file. Optional.

--prompt-regulator-cache-pin-memory¶: Pin memory for the regulator prompt cache for further acceleration. Default is False.

--prompt-regulator-cache-limit¶: The maximum number of regulator prompts cached in memory. Be mindful of your memory usage!

--prompt-celltype¶: The cell-type-specific prompt. For example, dnase:k562 for a cistrome prompt and k562 for an expression prompt. It can also be provided in the supervised file if the format supports it. Optional.

--prompt-regulator¶: The regulator prompt. Determines the kind of output. For example, ctcf or h3k27ac. It can also be provided in the supervised file if the format supports it. Optional.

--batch-size¶: Batch size. Default is 8.

--num-workers¶: Number of workers for the dataloader. Default is 8.
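
The value of `--prompt-dim-external` depends on where the external embedding comes from. The sketch below encodes the two dimensions named above in a lookup table; the helper `sc_imputation_command`, the source labels `scgpt` and `chrombert`, and the file names are assumptions made for illustration.

```python
import shlex

# Dimensions stated in the option description above.
PROMPT_DIM = {"scgpt": 512, "chrombert": 768}


def sc_imputation_command(supervised_file, o_h5, prompt_kind, embedding_source):
    """Build the argument list for chrombert_imputation_cistrome_sc (hypothetical helper)."""
    if prompt_kind not in ("cistrome", "expression"):
        raise ValueError("--prompt-kind must be 'cistrome' or 'expression'")
    if embedding_source not in PROMPT_DIM:
        raise ValueError(f"unknown embedding source: {embedding_source!r}")
    return [
        "chrombert_imputation_cistrome_sc",
        "--o-h5", str(o_h5),
        "--prompt-kind", prompt_kind,
        "--prompt-dim-external", str(PROMPT_DIM[embedding_source]),
        str(supervised_file),
    ]


cmd = sc_imputation_command("dataset.tsv", "prediction.hdf5", "expression", "chrombert")
print(shlex.join(cmd))
```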