How to use the command-line interface

Cas-OFFinder is built on OpenCL and identifies potential off-target sites of CRISPR/Cas-derived RNA-guided endonucleases (RGENs). An OpenCL device is required for Variant-aware Cas-OFFinder to work properly.

Create and activate your environment:

conda create -n crispr
conda activate crispr

Download requirements.txt and vcf-cas-offinder.py from the command-line interface directory and install all dependencies using the command:

pip install --no-cache-dir -r requirements.txt

Download the Cas-OFFinder binary file from https://github.com/pnucolab/variant-aware-cas-offinder/raw/refs/heads/main/backend/cas-offinder into the same directory as vcf-cas-offinder.py.

Install the vcflib and tabixpp packages using conda:

conda install -c bioconda vcflib=1.0.3 tabixpp=1.1.0

Download the chromosome FASTA files for your target organism. The links below are one place to find them; any other source works as well.

For Vertebrates

https://ftp.ensembl.org/pub/

For Plants

https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/

Extract all FASTA files into a directory, then index the extracted reference genome within the same directory:

samtools faidx ref.genome # replace ref.genome with the actual name of the extracted reference genome
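
As a quick sanity check after indexing, the .fai file that samtools faidx writes next to the FASTA can be read to list the sequence names and their lengths. The following is a minimal sketch, assuming the standard five-column, tab-separated .fai layout (name, length, offset, line bases, line width). The helper name read_fai and the path ref.genome.fai are illustrative and not part of Variant-aware Cas-OFFinder.

```python
from pathlib import Path


def read_fai(fai_path):
    """Return {sequence_name: length} read from a samtools .fai index file."""
    lengths = {}
    for line in Path(fai_path).read_text().splitlines():
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        # Columns: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH
        lengths[fields[0]] = int(fields[1])
    return lengths


if __name__ == "__main__":
    # Hypothetical path: replace with the index of your own reference genome.
    index = Path("ref.genome.fai")
    if index.exists():
        for name, length in read_fai(index).items():
            print(name, length)
```

If the index is missing or lists fewer sequences than expected, rerun samtools faidx before starting the pipeline.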

Ensure that the “+x” (execute) permission is set, for example with chmod +x, on the input_vcf_file and on the target organism’s reference genome directory. The Variant-aware Cas-OFFinder pipeline can then be run with:

./vcf-cas-offinder.py -i input_vcf_file_path -r reference_genome_path -t target_sequence_input_file_name -d device_id

For device_id, you can use G, C, or A.

G selects GPU devices, C selects CPUs, and A selects accelerators.
If several GPUs or CPUs are available, you can append a device number to limit which ones are used, for example G0 for GPU device ID 0 and G1 for GPU device ID 1.
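
To make the device_id convention concrete, here is a small sketch that splits a value such as G or G1 into a device class and an optional device number. It only covers the forms described in this guide (a single letter G, C, or A, optionally followed by one number). The function parse_device_id is hypothetical and is not how vcf-cas-offinder.py itself reads the option.

```python
import re

# Device classes named in this guide: G (GPU), C (CPU), A (accelerator).
DEVICE_KINDS = {"G": "GPU", "C": "CPU", "A": "accelerator"}


def parse_device_id(device_id):
    """Split a device_id such as 'G' or 'G0' into (kind, index or None)."""
    match = re.fullmatch(r"([GCA])(\d+)?", device_id)
    if match is None:
        raise ValueError(f"unrecognized device_id: {device_id!r}")
    letter, index = match.groups()
    # index is None when no device number was given (use all devices of that kind)
    return DEVICE_KINDS[letter], (int(index) if index is not None else None)


if __name__ == "__main__":
    for value in ("G", "G0", "G1", "C"):
        print(value, parse_device_id(value))
```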

For brief usage help, run:

./vcf-cas-offinder.py -h

usage: vcf-cas-offinder.py [-h] -i INPUT -r REF_PATH -t QUERY_INPUT -d DEVICE_ID

Identify potential off-target sites based on VCF files.

options:
-h, --help            show this help message and exit
-i INPUT, --input INPUT
                     input file name (Phased and single sample VCF file)
-r REF_PATH, --ref_path REF_PATH
                     Path to the target organism reference genome
-t QUERY_INPUT, --query_input QUERY_INPUT
                     target sequence in the target organism genome (input.txt file)
-d DEVICE_ID, --device_id DEVICE_ID
                     device_id(s): C for CPU and G for GPU, G0 for GPU device id=0

Create an input.txt file in the same directory as vcf-cas-offinder.py. An example of an input file:

NNNNNNNNNNNNNNNNNNNNGG
GTGAAATCTAAGTGTAGAGNNN 2
TTGTGAAATCTAAGTGTAGNNN 2
CTTCACAATTATTCGCCCANNN 2
GGGCGAATAATTGTGAAGGNNN 2
CTTACAGAAACACCTGTTANNN 2
AGATTCAAGAATTGGTACGNNN 2
AACCTTCAGTTAGTCGCTANNN 2
CACCATAGCGACTAACTGANNN 2
AGCTCAGGAAGGCCCTCATNNN 2

The first line indicates the desired pattern, including the PAM site.
The remaining lines are the query sequences and maximum mismatch numbers, separated by spaces.
The length of the desired pattern and the query sequences should be the same.
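
The three rules above can be checked before a run. The sketch below parses the text of an input file and verifies that every query has a mismatch count and the same length as the pattern. The function name parse_query_input is hypothetical, and the check reflects only the rules stated in this guide, not any validation done by the tool itself.

```python
def parse_query_input(text):
    """Parse input.txt content into (pattern, [(query, max_mismatches), ...])."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("input is empty")
    pattern = lines[0]  # first line: desired pattern including the PAM site
    queries = []
    for n, line in enumerate(lines[1:], start=1):
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"query {n}: expected 'SEQUENCE MISMATCHES'")
        seq, mismatches = parts[0], int(parts[1])
        if len(seq) != len(pattern):
            raise ValueError(
                f"query {n}: length {len(seq)} differs from pattern length {len(pattern)}"
            )
        queries.append((seq, mismatches))
    return pattern, queries


if __name__ == "__main__":
    # Reads the file named in this guide if it is present in the current directory.
    try:
        with open("input.txt") as handle:
            pattern, queries = parse_query_input(handle.read())
        print(len(pattern), len(queries))
    except FileNotFoundError:
        print("input.txt not found")
```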

Now you can run Variant-aware Cas-OFFinder as follows (using GPUs):

./vcf-cas-offinder.py -i bgzipresultcm334.vcf.gz -r /home/user/genome/pepper_ref/GCA_000512255.2_ASM51225v2_genomic.fa -t input.txt -d G
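
When the pipeline is launched from another script, the same invocation can be assembled as an argument list. This is a sketch only: build_command is a hypothetical helper, the script name is the vcf-cas-offinder.py shown in the help text, and the options are limited to the -i, -r, -t, and -d flags documented there.

```python
import shlex


def build_command(vcf_path, ref_path, query_file, device_id,
                  script="./vcf-cas-offinder.py"):
    """Assemble the argument list for the documented -i/-r/-t/-d options."""
    return [script, "-i", vcf_path, "-r", ref_path,
            "-t", query_file, "-d", device_id]


if __name__ == "__main__":
    cmd = build_command(
        "bgzipresultcm334.vcf.gz",
        "/home/user/genome/pepper_ref/GCA_000512255.2_ASM51225v2_genomic.fa",
        "input.txt",
        "G",
    )
    # Print the shell-quoted command line for inspection.
    print(shlex.join(cmd))
    # To actually start the run, pass the list to subprocess.run(cmd, check=True).
```

Passing a list rather than a single string avoids quoting problems when paths contain spaces.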

Sample results are shown below. This analysis used the pepper cultivar CM334 genome and allowed up to 2 mismatches.

GTGAAATCTAAGTGTAGAGNNN      CVCM334_CM008455.1:0    15539504        aaGAAATCTAAGTGTAGAGTGG  -       2
TTGTGAAATCTAAGTGTAGNNN      CVCM334_CM008455.1:0    195285628       TTtTGAAAaCTAAGTGTAGAGG  +       2
GTGAAATCTAAGTGTAGAGNNN      CVCM334_CM008455.1:1    15539613        aaGAAATCTAAGTGTAGAGTGG  -       2
TTGTGAAATCTAAGTGTAGNNN      CVCM334_CM008455.1:1    195287846       TTtTGAAAaCTAAGTGTAGAGG  +       2
GTGAAATCTAAGTGTAGAGNNN      CVCM334_CM008456.1:0    150109371       GTGAAATCTAAGTGTAGAGGGG  -       0
TTGTGAAATCTAAGTGTAGNNN      CVCM334_CM008456.1:0    29642154        TTGTGAgtTCTAAGTGTAGCGG  +       2
TTGTGAAATCTAAGTGTAGNNN      CVCM334_CM008456.1:0    77628291        TTGTcAAATCTAAGaGTAGAGG  +       2
TTGTGAAATCTAAGTGTAGNNN      CVCM334_CM008456.1:0    95688428        TTGTGAAAaCTAAGTGTAaAGG  -       2
TTGTGAAATCTAAGTGTAGNNN      CVCM334_CM008456.1:0    150109373       TTGTGAAATCTAAGTGTAGAGG  -       0
CTTCACAATTATTCGCCCANNN      CVCM334_CM008456.1:0    150076867       CTTCAtAgTTATTCGCCCAAGG  +       2
CTTCACAATTATTCGCCCANNN      CVCM334_CM008456.1:0    150071663       CTTCAtAgTTATTCGCCCAAGG  +       2
CTTCACAATTATTCGCCCANNN      CVCM334_CM008456.1:0    150089959       CTTCAtAATTATTtGCCCAAGG  +       2
CTTCACAATTATTCGCCCANNN      CVCM334_CM008456.1:0    150109711       CTTCACAATTATTCGCCCAAGG  -       0
CTTCACAATTATTCGCCCANNN      CVCM334_CM008456.1:0    150133601       CTTCAtAATTATTtGCCCAAGG  -       2
GGGCGAATAATTGTGAAGGNNN      CVCM334_CM008456.1:0    150076863       GGGCGAATAAcTaTGAAGGTGG  -       2
GGGCGAATAATTGTGAAGGNNN      CVCM334_CM008456.1:0    150071659       GGGCGAATAAcTaTGAAGGTGG  -       2
GGGCGAATAATTGTGAAGGNNN      CVCM334_CM008456.1:0    150089955       GGGCaAATAATTaTGAAGGTGG  -       2
GGGCGAATAATTGTGAAGGNNN      CVCM334_CM008456.1:0    150109715       GGGCGAATAATTGTGAAGGTGG  +       0
GGGCGAATAATTGTGAAGGNNN      CVCM334_CM008456.1:0    150133605       GGGCaAATAATTaTGAAGGTGG  +       2
GTGAAATCTAAGTGTAGAGNNN      CVCM334_CM008456.1:1    150111631       GTGAAATCTAAGTGTAGAGGGG  -       0
TTGTGAAATCTAAGTGTAGNNN      CVCM334_CM008456.1:1    29642642        TTGTGAgtTCTAAGTGTAGCGG  +       2
TTGTGAAATCTAAGTGTAGNNN      CVCM334_CM008456.1:1    77629000        TTGTcAAATCTAAGaGTAGAGG  +       2
TTGTGAAATCTAAGTGTAGNNN      CVCM334_CM008456.1:1    95689442        TTGTGAAAaCTAAGTGTAaAGG  -       2
TTGTGAAATCTAAGTGTAGNNN      CVCM334_CM008456.1:1    150111633       TTGTGAAATCTAAGTGTAGAGG  -       0
CTTCACAATTATTCGCCCANNN      CVCM334_CM008456.1:1    150079117       CTTCAtAgTTATTCGCCCAAGG  +       2
CTTCACAATTATTCGCCCANNN      CVCM334_CM008456.1:1    150073913       CTTCAtAgTTATTCGCCCAAGG  +       2
CTTCACAATTATTCGCCCANNN      CVCM334_CM008456.1:1    150092209       CTTCAtAATTATTtGCCCAAGG  +       2
CTTCACAATTATTCGCCCANNN      CVCM334_CM008456.1:1    150135873       CTTCAtAATTATTtGCCCAAGG  -       2
CTTCACAATTATTCGCCCANNN      CVCM334_CM008456.1:1    150111971       CTTCACAATTATTCGCCCAAGG  -       0
GGGCGAATAATTGTGAAGGNNN      CVCM334_CM008456.1:1    150111975       GGGCGAATAATTGTGAAGGTGG  +       0
GGGCGAATAATTGTGAAGGNNN      CVCM334_CM008456.1:1    150135877       GGGCaAATAATTaTGAAGGTGG  +       2
GGGCGAATAATTGTGAAGGNNN      CVCM334_CM008456.1:1    150079113       GGGCGAATAAcTaTGAAGGTGG  -       2
GGGCGAATAATTGTGAAGGNNN      CVCM334_CM008456.1:1    150073909       GGGCGAATAAcTaTGAAGGTGG  -       2
GGGCGAATAATTGTGAAGGNNN      CVCM334_CM008456.1:1    150092205       GGGCaAATAATTaTGAAGGTGG  -       2

In the second column, the number after the colon identifies the allele of each chromosome: 0 represents allele 1, and 1 represents allele 2. In the example shown above, CVCM334_CM008455, CVCM334_CM008456, etc., are chromosome identifiers found in the allelic FASTA files.
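
For downstream filtering, each result line can be split into named fields. The sketch below is an assumption-laden reading of the sample output: the column names (query, position, site, strand, mismatches) follow the upstream Cas-OFFinder output convention and are not spelled out in this guide, and parse_hit and Hit are hypothetical names. Only the chromosome:allele encoding of the second column is taken directly from the description above.

```python
from collections import namedtuple

Hit = namedtuple("Hit", "query chromosome allele position site strand mismatches")


def parse_hit(line):
    """Parse one whitespace-separated result line into a Hit."""
    query, target, position, site, strand, mismatches = line.split()
    # The second column is '<chromosome>:<n>'; n=0 is allele 1, n=1 is allele 2.
    chromosome, allele_index = target.rsplit(":", 1)
    return Hit(query, chromosome, int(allele_index) + 1, int(position),
               site, strand, int(mismatches))


if __name__ == "__main__":
    sample = ("GTGAAATCTAAGTGTAGAGNNN      CVCM334_CM008455.1:0    15539504"
              "        aaGAAATCTAAGTGTAGAGTGG  -       2")
    print(parse_hit(sample))
```

A list of such records can then be grouped by chromosome and allele to compare the sites found on each haplotype.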