CLI Reference

Overview

ChromBERT provides a set of command-line scripts that can be called directly from your terminal. The sections below describe each script and its options.

Script Summary

Script

Description

chrombert_prepare_env

Download required files to ~/.cache/chrombert/data, or to another path of your choice.

chrombert_make_dataset

Make dataset for ChromBERT forward.

chrombert_get_region_emb

Get mean pooled TRN embedding (region embedding) and store in a file.

chrombert_get_cistrome_emb

Get cistrome embedding and store in a file.

chrombert_get_regulator_emb

Get regulator embedding and store in a file.

chrombert_imputation_cistrome

Generate cistromes using prompt-enhanced ChromBERT.

chrombert_imputation_cistrome_sc

Generate cistromes using prompt-enhanced ChromBERT, specialized for single-cell data.


Details

chrombert_prepare_env

Download required files to ~/.cache/chrombert/data, or to another path of your choice.

chrombert_prepare_env [OPTIONS]

Options

--help

Show this message and exit.

--basedir

The directory to store the data. Default is ~/.cache/chrombert/data.

--hf-endpoint

The endpoint of the Hugging Face model.
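
The options above can be assembled programmatically. The following is a minimal sketch, assuming the script is installed and on your PATH; the helper name prepare_env_command and the example directory /data/chrombert are hypothetical and only illustrate how --basedir and --hf-endpoint fit together.

```python
import os
import shlex


def prepare_env_command(basedir=None, hf_endpoint=None):
    """Build the argument list for chrombert_prepare_env (hypothetical helper)."""
    cmd = ["chrombert_prepare_env"]
    if basedir is not None:
        # Expand "~" ourselves so the resulting path is unambiguous in logs.
        cmd += ["--basedir", os.path.expanduser(basedir)]
    if hf_endpoint is not None:
        cmd += ["--hf-endpoint", hf_endpoint]
    return cmd


cmd = prepare_env_command(basedir="/data/chrombert")
# Print the command instead of running it; to execute it, pass the list
# to subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

Omitting both arguments yields the bare command, which uses the default directory ~/.cache/chrombert/data.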


chrombert_make_dataset

Generate general datasets for ChromBERT from bed files.

chrombert_make_dataset [OPTIONS] BED

Options

BED

Path to the bed file.

-o, --oname

Path to the output file. Stdout if not specified. Must end with .tsv or .txt.

--mode

Mode to generate the dataset. Choices are:

  • region: only consider overlap between input regions to determine the label generated. Useful for narrowPeak-like input.

  • all: report every overlap status, like bedtools intersect -wao. You must choose the label column yourself.

Default is region.

--center

If used, only consider the center of the input regions.

--label

Column to use as the label when mode is not region. Default is the 4th column (1-based).

--no-filter

Keep regions that have no overlap instead of filtering them out.

--basedir

Base directory for the required files. Default is set to the value of DEFAULT_BASEDIR.

-g, --genome

Genome version. For example, hg38 or mm10. Only hg38 is supported now. Default is hg38.

-hr, --high-resolution

Use 200-bp resolution instead of 1-kb resolution. Caution: 200-bp resolution is being prepared for a future release of ChromBERT and is not available yet.
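
As an illustration of the options above, here is a minimal sketch that writes a small BED file and builds the matching command line. It assumes the executable is named chrombert_make_dataset, as in the section heading; the helper make_dataset_command, the file names, and the coordinates are made up for the example. The checks mirror the documented constraints on --mode and -o.

```python
import tempfile
from pathlib import Path


def make_dataset_command(bed, oname, mode="region", center=False):
    """Build the argument list for chrombert_make_dataset (hypothetical helper)."""
    if mode not in ("region", "all"):
        raise ValueError("mode must be 'region' or 'all'")
    if not str(oname).endswith((".tsv", ".txt")):
        raise ValueError("output name must end with .tsv or .txt")
    cmd = ["chrombert_make_dataset", "-o", str(oname), "--mode", mode]
    if center:
        cmd.append("--center")
    # The BED path is the positional argument and goes last, as in the synopsis.
    cmd.append(str(bed))
    return cmd


workdir = Path(tempfile.mkdtemp())
bed = workdir / "peaks.bed"
# Tab-separated chromosome, start, end (illustrative coordinates only).
bed.write_text("chr1\t10000\t11000\nchr1\t20000\t21000\n")

cmd = make_dataset_command(bed, workdir / "dataset.tsv")
print(" ".join(cmd))
```

Validating the output suffix before launching the script avoids discovering the problem only after the command has started.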


chrombert_get_region_emb

Extract mean pooled TRN embeddings (region embeddings) from ChromBERT.

chrombert_get_region_emb [OPTIONS] SUPERVISED_FILE -o ONAME

Options

SUPERVISED_FILE

Path to the supervised file.

-o, --oname

Path to the output HDF5 file. This option is required.

--basedir

Base directory for the required files. Default is set to the value of DEFAULT_BASEDIR.

-g, --genome

Genome version. For example, hg38 or mm10. Only hg38 is supported now. Default is hg38.

-k, --ckpt

Path to the pretrained or fine-tuned checkpoint. Optional if it can be inferred from other arguments.

--mask

Path to the matrix mask file. Optional if it can be inferred from other arguments.

-d, --hdf5-file

Path to the HDF5 file that contains the dataset. Optional if it can be inferred from other arguments.

-hr, --high-resolution

Use 200-bp resolution instead of 1-kb resolution. Caution: 200-bp resolution is being prepared for a future release of ChromBERT and is not available yet.

--gpu

GPU index. Default is 0.

--batch-size

Batch size. Default is 8.

--num-workers

Number of workers for the dataloader. Default is 8.
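
The sketch below shows one way to call this script from Python with a smaller batch size, for example on a GPU with limited memory. The helpers region_emb_command and run_if_installed and the file names are hypothetical; the command is executed only if the script is found on PATH.

```python
import shutil
import subprocess


def region_emb_command(supervised_file, oname, gpu=0, batch_size=8, num_workers=8):
    """Build the argument list for chrombert_get_region_emb (hypothetical helper)."""
    return [
        "chrombert_get_region_emb",
        "--gpu", str(gpu),
        "--batch-size", str(batch_size),
        "--num-workers", str(num_workers),
        supervised_file,
        "-o", oname,
    ]


def run_if_installed(cmd):
    """Run cmd when its executable is on PATH; return None otherwise."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=True)


# "dataset.tsv" and "region_emb.hdf5" are placeholder file names.
cmd = region_emb_command("dataset.tsv", "region_emb.hdf5", batch_size=4)
result = run_if_installed(cmd)
print("skipped: script not installed" if result is None else "finished")
```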


chrombert_get_cistrome_emb

Extract cistrome embeddings from ChromBERT.

chrombert_get_cistrome_emb [OPTIONS] SUPERVISED_FILE IDS... -o ONAME

Options

SUPERVISED_FILE

Path to the supervised file.

IDS

IDs to extract. Can be in GSMID format or the regulator:cellline format. To generate a cache file for prompts, use the regulator:cellline format.

-o, --oname

Path to the output HDF5 file. This option is required.

--basedir

Base directory for the required files. Default is set to the value of DEFAULT_BASEDIR.

-g, --genome

Genome version. For example, hg38 or mm10. Only hg38 is supported now. Default is hg38.

-k, --ckpt

Path to the pretrained or fine-tuned checkpoint. Optional if it can be inferred from other arguments.

--meta

Path to the meta file. Optional if it can be inferred from other arguments.

--mask

Path to the matrix mask file. Optional if it can be inferred from other arguments.

-d, --hdf5-file

Path to the HDF5 file that contains the dataset. Optional if it can be inferred from other arguments.

-hr, --high-resolution

Use 200-bp resolution instead of 1-kb resolution. Caution: 200-bp resolution is being prepared for a future release of ChromBERT and is not available yet.

--batch-size

Batch size. Default is 8.

--num-workers

Number of workers for the dataloader. Default is 8.
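
Because IDS accepts two formats, it can help to check them before starting a long run. The following is a minimal sketch under stated assumptions: a GSMID is taken to be "GSM" followed by digits (the GEO sample accession pattern), and regulator:cellline is taken to be two non-empty parts joined by a single colon. The script itself may accept or reject other forms. The helper names, file names, and the ID GSM0000000 are placeholders.

```python
import re

# Assumed patterns; the script's own validation may differ.
GSM_ID = re.compile(r"^GSM\d+$", re.IGNORECASE)
PAIR_ID = re.compile(r"^[^:\s]+:[^:\s]+$")


def classify_id(identifier):
    """Return which of the two documented ID formats a string looks like."""
    if GSM_ID.match(identifier):
        return "gsmid"
    if PAIR_ID.match(identifier):
        return "regulator:cellline"
    raise ValueError(f"unrecognized ID format: {identifier!r}")


def cistrome_emb_command(supervised_file, ids, oname):
    """Build the argument list for chrombert_get_cistrome_emb (hypothetical helper)."""
    for identifier in ids:
        classify_id(identifier)  # raises ValueError on an unexpected format
    return ["chrombert_get_cistrome_emb", supervised_file, *ids, "-o", oname]


cmd = cistrome_emb_command("dataset.tsv", ["ctcf:k562", "GSM0000000"], "cistrome_emb.hdf5")
print(" ".join(cmd))
```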


chrombert_get_regulator_emb

Extract regulator embeddings from ChromBERT.

chrombert_get_regulator_emb [OPTIONS] SUPERVISED_FILE IDS... -o ONAME

Options

SUPERVISED_FILE

Path to the supervised file.

IDS

Regulator names to extract. Must be in lower case.

-o, --oname

Path to the output HDF5 file. This option is required.

--basedir

Base directory for the required files. Default is set to the value of DEFAULT_BASEDIR.

-g, --genome

Genome version. For example, hg38 or mm10. Only hg38 is supported now. Default is hg38.

-k, --ckpt

Path to the pretrained or fine-tuned checkpoint. Optional if it can be inferred from other arguments.

--meta

Path to the meta file. Optional if it can be inferred from other arguments.

--mask

Path to the matrix mask file. Optional if it can be inferred from other arguments.

-d, --hdf5-file

Path to the HDF5 file that contains the dataset. Optional if it can be inferred from other arguments.

-hr, --high-resolution

Use 200-bp resolution instead of 1-kb resolution. Caution: 200-bp resolution is being prepared for a future release of ChromBERT and is not available yet.

--batch-size

Batch size. Default is 8.

--num-workers

Number of workers for the dataloader. Default is 8.
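
Since regulator names must be in lower case, a small wrapper can normalize them before the call. This is a minimal sketch; the helper regulator_emb_command and the file names are hypothetical.

```python
def regulator_emb_command(supervised_file, regulators, oname):
    """Build the argument list for chrombert_get_regulator_emb (hypothetical helper)."""
    # The script expects lower-case regulator names, so normalize them here.
    ids = [name.strip().lower() for name in regulators]
    if not ids:
        raise ValueError("at least one regulator name is required")
    return ["chrombert_get_regulator_emb", supervised_file, *ids, "-o", oname]


cmd = regulator_emb_command("dataset.tsv", ["CTCF", "H3K27ac"], "regulator_emb.hdf5")
print(" ".join(cmd))
```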


chrombert_imputation_cistrome

Generate prediction results (a full BigWig file or a table) from ChromBERT for a given cell type name, region, and regulator.

Note

Either --o-bw or --o-table must be provided, depending on the output format you want.

chrombert_imputation_cistrome [OPTIONS] SUPERVISED_FILE --o-bw BW_PATH --o-table TABLE_PATH --finetune-ckpt CKPT --prompt-kind KIND

Options

SUPERVISED_FILE

Path to the supervised file.

--o-bw

Path of the output BigWig file.

--o-table

Path of the output table file.

--prompt-kind

Prompt data class. Choose from cistrome or expression. This option is required.

--basedir

Base directory for the required files. Default is set to the value of DEFAULT_BASEDIR.

-g, --genome

Genome version. For example, hg38 or mm10. Only hg38 is supported now. Default is hg38.

--pretrain-ckpt

Path to the pretrained checkpoint. Optional if it can be inferred from other arguments.

-d, --hdf5-file

Path to the HDF5 file that contains the dataset. Optional if it could be inferred from other arguments.

-hr, --high-resolution

Use 200-bp resolution instead of 1-kb resolution. Caution: 200-bp resolution is being prepared for a future release of ChromBERT and is not available yet.

--finetune-ckpt

Path to the fine-tuned checkpoint. Optional.

--prompt-dim-external

Dimension of external data. Use 512 for scGPT and 768 for ChromBERT’s embedding. Default is 512.

--prompt-celltype-cache-file

Path to the cell-type-specific prompt cache file. Optional.

--prompt-regulator-cache-file

Path to the regulator prompt cache file. Optional.

--prompt-celltype

The cell-type-specific prompt. For example, dnase:k562 for a cistrome prompt and k562 for an expression prompt. It can also be provided in the supervised file if its format supports it. Optional.

--prompt-regulator

The regulator prompt. Determines the kind of output. For example, ctcf or h3k27ac. It can also be provided in the supervised file if its format supports it. Optional.

--batch-size

Batch size. Default is 8.

--num-workers

Number of workers for the dataloader. Default is 8.
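
The requirement that at least one of --o-bw and --o-table is given, together with the two allowed values of --prompt-kind, can be checked before launching the script. This is a minimal sketch; the helper imputation_command and all file names are hypothetical, and the prompt values reuse the examples from the option descriptions above.

```python
def imputation_command(supervised_file, prompt_kind, o_bw=None, o_table=None,
                       finetune_ckpt=None, prompt_celltype=None, prompt_regulator=None):
    """Build the argument list for chrombert_imputation_cistrome (hypothetical helper)."""
    if prompt_kind not in ("cistrome", "expression"):
        raise ValueError("prompt_kind must be 'cistrome' or 'expression'")
    if o_bw is None and o_table is None:
        raise ValueError("provide o_bw, o_table, or both")
    cmd = ["chrombert_imputation_cistrome", supervised_file, "--prompt-kind", prompt_kind]
    # Append only the options that were actually supplied.
    optional = [
        ("--o-bw", o_bw),
        ("--o-table", o_table),
        ("--finetune-ckpt", finetune_ckpt),
        ("--prompt-celltype", prompt_celltype),
        ("--prompt-regulator", prompt_regulator),
    ]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, value]
    return cmd


cmd = imputation_command(
    "dataset.tsv",            # placeholder supervised file
    "cistrome",
    o_bw="ctcf_k562.bw",      # placeholder output name
    prompt_celltype="dnase:k562",
    prompt_regulator="ctcf",
)
print(" ".join(cmd))
```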


chrombert_imputation_cistrome_sc

Generate prediction results (in HDF5 format) from ChromBERT for given single cells, regions, and regulators.

chrombert_imputation_cistrome_sc [OPTIONS] SUPERVISED_FILE --o-h5 H5_PATH --finetune-ckpt CKPT --prompt-kind KIND

Options

SUPERVISED_FILE

Path to the supervised file.

--o-h5

Path of the output HDF5 file. This option is required.

--prompt-kind

Prompt data class. Choose from cistrome or expression. This option is required.

--basedir

Base directory for the required files. Default is set to the value of DEFAULT_BASEDIR.

-g, --genome

Genome version. For example, hg38 or mm10. Only hg38 is supported now. Default is hg38.

--pretrain-ckpt

Path to the pretrained checkpoint. Optional if it can be inferred from other arguments.

-d, --hdf5-file

Path to the HDF5 file that contains the dataset. Optional if it could be inferred from other arguments.

-hr, --high-resolution

Use 200-bp resolution instead of 1-kb resolution. Caution: 200-bp resolution is being prepared for a future release of ChromBERT and is not available yet.

--finetune-ckpt

Path to the fine-tuned checkpoint. Optional.

--prompt-dim-external

Dimension of external data. Use 512 for scGPT and 768 for ChromBERT’s embedding. Default is 512.

--prompt-celltype-cache-file

Path to the cell-type-specific prompt cache file. Optional.

--prompt-regulator-cache-file

Path to the regulator prompt cache file. Optional.

--prompt-regulator-cache-pin-memory

Pin memory for the regulator prompt cache to further accelerate processing. Default is False.

--prompt-regulator-cache-limit

The limit on regulator prompts cached in memory. Be mindful of your memory usage!

--prompt-celltype

The cell-type-specific prompt. For example, dnase:k562 for a cistrome prompt and k562 for an expression prompt. It can also be provided in the supervised file if its format supports it. Optional.

--prompt-regulator

The regulator prompt. Determines the kind of output. For example, ctcf or h3k27ac. It can also be provided in the supervised file if its format supports it. Optional.

--batch-size

Batch size. Default is 8.

--num-workers

Number of workers for the dataloader. Default is 8.
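
The value of --prompt-dim-external depends on where the external embedding comes from: 512 for scGPT and 768 for ChromBERT's embedding. The following minimal sketch picks the dimension from the source name; the helper sc_imputation_command, the lookup table PROMPT_DIMS, and the file names are hypothetical.

```python
# Dimensions taken from the --prompt-dim-external description above.
PROMPT_DIMS = {"scgpt": 512, "chrombert": 768}


def sc_imputation_command(supervised_file, o_h5, prompt_kind, embedding_source="scgpt"):
    """Build the argument list for chrombert_imputation_cistrome_sc (hypothetical helper)."""
    if prompt_kind not in ("cistrome", "expression"):
        raise ValueError("prompt_kind must be 'cistrome' or 'expression'")
    try:
        dim = PROMPT_DIMS[embedding_source.lower()]
    except KeyError:
        raise ValueError(f"unknown embedding source: {embedding_source!r}") from None
    return [
        "chrombert_imputation_cistrome_sc",
        supervised_file,
        "--o-h5", o_h5,
        "--prompt-kind", prompt_kind,
        "--prompt-dim-external", str(dim),
    ]


cmd = sc_imputation_command("dataset.tsv", "imputed.hdf5", "expression", embedding_source="scGPT")
print(" ".join(cmd))
```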