How to use the command-line interface
Cas-OFFinder is built on OpenCL and identifies potential off-target sites of CRISPR/Cas-derived RNA-guided endonucleases (RGENs). An OpenCL device is therefore required for Variant-aware Cas-OFFinder to work properly.
Create your environment:
conda create -n crispr
Download requirements.txt and vcf-cas-offinder.py from the command-line interface directory and install all dependencies using the command:
pip install --no-cache-dir -r requirements.txt
Download the Cas-OFFinder binary file from https://github.com/pnucolab/variant-aware-cas-offinder/raw/refs/heads/main/backend/cas-offinder into the same directory as vcf-cas-offinder.py.
Install the vcflib and tabixpp packages with conda by running the following command:
conda install -c bioconda vcflib=1.0.3 tabixpp=1.1.0
Download the chromosome FASTA files for your target organism. The links below are one place to find them; any other source works as well.
For Vertebrates
https://ftp.ensembl.org/pub/
For Plants
https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/
Extract all FASTA files into a directory, then index the extracted reference genome within the same directory:
samtools faidx ref.genome # replace ref.genome with the actual name of the extracted reference genome
Ensure that the execute permission (the “+x” flag) is set on the input_vcf_file and on the target organism’s reference genome directory. The Variant-aware Cas-OFFinder pipeline can then be run with:
./vcf-cas-offinder.py -i input_vcf_file_path -r reference_genome_path -t target_sequence_input_file_name -d device_id
- For device_id, you can use G, C, or A.
G stands for GPU devices, C for CPUs, and A for accelerators.
If you have multiple GPUs or CPUs, you can append a device ID to limit which devices are used, for example G0 for GPU device ID 0 and G1 for GPU device ID 1.
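The device_id values described above can be checked before a run is launched. The following is a minimal sketch, assuming a device_id is a single letter (G, C, or A) optionally followed by a numeric ID, as in the examples given here. The helper parse_device_id is hypothetical and is not part of vcf-cas-offinder.py, which may accept other forms.

```python
import re

# Assumption: one letter (G, C, or A) optionally followed by digits,
# based only on the examples in this README (G, C, A, G0, G1).
DEVICE_ID_PATTERN = re.compile(r"([GCA])(\d+)?")

DEVICE_TYPES = {"G": "GPU", "C": "CPU", "A": "accelerator"}


def parse_device_id(spec):
    """Split a device_id such as 'G' or 'G1' into (device type, index or None)."""
    match = DEVICE_ID_PATTERN.fullmatch(spec)
    if match is None:
        raise ValueError(f"unrecognized device_id: {spec!r}")
    letter, index = match.groups()
    # No index means "all devices of this type".
    return DEVICE_TYPES[letter], (int(index) if index is not None else None)


if __name__ == "__main__":
    for spec in ("G", "C", "G0", "G1"):
        print(spec, parse_device_id(spec))
```

A check like this can catch a mistyped device_id before a long search starts.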
For brief help, run:
./vcf-cas-offinder.py -h
usage: vcf-cas-offinder.py [-h] -i INPUT -r REF_PATH -t QUERY_INPUT -d DEVICE_ID

Identify potential off-target sites based on VCF files.

options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        input file name (Phased and single sample VCF file)
  -r REF_PATH, --ref_path REF_PATH
                        Path to the target organism reference genome
  -t QUERY_INPUT, --query_input QUERY_INPUT
                        target sequence in the target organism genome (input.txt file)
  -d DEVICE_ID, --device_id DEVICE_ID
                        device_id(s): C for CPU and G for GPU, G0 for GPU device id=0
You should create an input.txt file in the same directory as vcf-cas-offinder.py. An example input file:
NNNNNNNNNNNNNNNNNNNNGG
GTGAAATCTAAGTGTAGAGNNN 2
TTGTGAAATCTAAGTGTAGNNN 2
CTTCACAATTATTCGCCCANNN 2
GGGCGAATAATTGTGAAGGNNN 2
CTTACAGAAACACCTGTTANNN 2
AGATTCAAGAATTGGTACGNNN 2
AACCTTCAGTTAGTCGCTANNN 2
CACCATAGCGACTAACTGANNN 2
AGCTCAGGAAGGCCCTCATNNN 2
The first line indicates the desired pattern, including the PAM site.
The remaining lines are the query sequences and maximum mismatch numbers, separated by spaces.
The length of the desired pattern and the query sequences should be the same.
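The input file rules above can be checked with a short script before running the pipeline. This is a minimal sketch, assuming the layout described here (first line is the pattern, each following line is a query sequence and a maximum mismatch number separated by whitespace). The function parse_input_text is a hypothetical helper and is not part of vcf-cas-offinder.py.

```python
def parse_input_text(text):
    """Return (pattern, [(query, max_mismatches), ...]) from input.txt content."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("input is empty")
    pattern = lines[0]
    queries = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"expected 'SEQUENCE MISMATCHES', got: {line!r}")
        sequence, max_mismatches = fields[0], int(fields[1])
        # The pattern and every query must have the same length.
        if len(sequence) != len(pattern):
            raise ValueError(
                f"{sequence!r} has length {len(sequence)}, "
                f"but the pattern has length {len(pattern)}"
            )
        queries.append((sequence, max_mismatches))
    return pattern, queries


if __name__ == "__main__":
    # A 22-character pattern ending in GG, the same length as the queries.
    example = "\n".join([
        "N" * 20 + "GG",
        "GTGAAATCTAAGTGTAGAGNNN 2",
        "TTGTGAAATCTAAGTGTAGNNN 2",
    ])
    pattern, queries = parse_input_text(example)
    print(pattern, queries)
```

To check a real file, read it first, for example with `parse_input_text(open("input.txt").read())`.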
Now you can run Variant-aware Cas-OFFinder as follows (using GPUs):
./vcf-cas-offinder.py -i bgzipresultcm334.vcf.gz -r /home/user/genome/pepper_ref/GCA_000512255.2_ASM51225v2_genomic.fa -t input.txt -d G
The sample result is given below. For this analysis, we used the pepper cultivar CM334 genome and allowed up to 2 mismatches.
GTGAAATCTAAGTGTAGAGNNN CVCM334_CM008455.1:0 15539504 aaGAAATCTAAGTGTAGAGTGG - 2
TTGTGAAATCTAAGTGTAGNNN CVCM334_CM008455.1:0 195285628 TTtTGAAAaCTAAGTGTAGAGG + 2
GTGAAATCTAAGTGTAGAGNNN CVCM334_CM008455.1:1 15539613 aaGAAATCTAAGTGTAGAGTGG - 2
TTGTGAAATCTAAGTGTAGNNN CVCM334_CM008455.1:1 195287846 TTtTGAAAaCTAAGTGTAGAGG + 2
GTGAAATCTAAGTGTAGAGNNN CVCM334_CM008456.1:0 150109371 GTGAAATCTAAGTGTAGAGGGG - 0
TTGTGAAATCTAAGTGTAGNNN CVCM334_CM008456.1:0 29642154 TTGTGAgtTCTAAGTGTAGCGG + 2
TTGTGAAATCTAAGTGTAGNNN CVCM334_CM008456.1:0 77628291 TTGTcAAATCTAAGaGTAGAGG + 2
TTGTGAAATCTAAGTGTAGNNN CVCM334_CM008456.1:0 95688428 TTGTGAAAaCTAAGTGTAaAGG - 2
TTGTGAAATCTAAGTGTAGNNN CVCM334_CM008456.1:0 150109373 TTGTGAAATCTAAGTGTAGAGG - 0
CTTCACAATTATTCGCCCANNN CVCM334_CM008456.1:0 150076867 CTTCAtAgTTATTCGCCCAAGG + 2
CTTCACAATTATTCGCCCANNN CVCM334_CM008456.1:0 150071663 CTTCAtAgTTATTCGCCCAAGG + 2
CTTCACAATTATTCGCCCANNN CVCM334_CM008456.1:0 150089959 CTTCAtAATTATTtGCCCAAGG + 2
CTTCACAATTATTCGCCCANNN CVCM334_CM008456.1:0 150109711 CTTCACAATTATTCGCCCAAGG - 0
CTTCACAATTATTCGCCCANNN CVCM334_CM008456.1:0 150133601 CTTCAtAATTATTtGCCCAAGG - 2
GGGCGAATAATTGTGAAGGNNN CVCM334_CM008456.1:0 150076863 GGGCGAATAAcTaTGAAGGTGG - 2
GGGCGAATAATTGTGAAGGNNN CVCM334_CM008456.1:0 150071659 GGGCGAATAAcTaTGAAGGTGG - 2
GGGCGAATAATTGTGAAGGNNN CVCM334_CM008456.1:0 150089955 GGGCaAATAATTaTGAAGGTGG - 2
GGGCGAATAATTGTGAAGGNNN CVCM334_CM008456.1:0 150109715 GGGCGAATAATTGTGAAGGTGG + 0
GGGCGAATAATTGTGAAGGNNN CVCM334_CM008456.1:0 150133605 GGGCaAATAATTaTGAAGGTGG + 2
GTGAAATCTAAGTGTAGAGNNN CVCM334_CM008456.1:1 150111631 GTGAAATCTAAGTGTAGAGGGG - 0
TTGTGAAATCTAAGTGTAGNNN CVCM334_CM008456.1:1 29642642 TTGTGAgtTCTAAGTGTAGCGG + 2
TTGTGAAATCTAAGTGTAGNNN CVCM334_CM008456.1:1 77629000 TTGTcAAATCTAAGaGTAGAGG + 2
TTGTGAAATCTAAGTGTAGNNN CVCM334_CM008456.1:1 95689442 TTGTGAAAaCTAAGTGTAaAGG - 2
TTGTGAAATCTAAGTGTAGNNN CVCM334_CM008456.1:1 150111633 TTGTGAAATCTAAGTGTAGAGG - 0
CTTCACAATTATTCGCCCANNN CVCM334_CM008456.1:1 150079117 CTTCAtAgTTATTCGCCCAAGG + 2
CTTCACAATTATTCGCCCANNN CVCM334_CM008456.1:1 150073913 CTTCAtAgTTATTCGCCCAAGG + 2
CTTCACAATTATTCGCCCANNN CVCM334_CM008456.1:1 150092209 CTTCAtAATTATTtGCCCAAGG + 2
CTTCACAATTATTCGCCCANNN CVCM334_CM008456.1:1 150135873 CTTCAtAATTATTtGCCCAAGG - 2
CTTCACAATTATTCGCCCANNN CVCM334_CM008456.1:1 150111971 CTTCACAATTATTCGCCCAAGG - 0
GGGCGAATAATTGTGAAGGNNN CVCM334_CM008456.1:1 150111975 GGGCGAATAATTGTGAAGGTGG + 0
GGGCGAATAATTGTGAAGGNNN CVCM334_CM008456.1:1 150135877 GGGCaAATAATTaTGAAGGTGG + 2
GGGCGAATAATTGTGAAGGNNN CVCM334_CM008456.1:1 150079113 GGGCGAATAAcTaTGAAGGTGG - 2
GGGCGAATAATTGTGAAGGNNN CVCM334_CM008456.1:1 150073909 GGGCGAATAAcTaTGAAGGTGG - 2
GGGCGAATAATTGTGAAGGNNN CVCM334_CM008456.1:1 150092205 GGGCaAATAATTaTGAAGGTGG - 2
In the second column, the number after the colon identifies the allele: 0 represents allele 1 and 1 represents allele 2 of each chromosome. In the example shown above, CVCM334_CM008455, CVCM334_CM008456, etc., are chromosome identifiers found in the allelic FASTA files.
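For downstream filtering, each result row can be split into named fields. The sketch below assumes the six whitespace-separated columns seen in the sample output (query, chromosome:allele, position, matched site, strand, mismatch count). The names OffTargetHit, parse_hit, and count_lowercase are hypothetical. Treating lowercase letters as the mismatched bases is an assumption drawn from the sample rows, where the number of lowercase letters equals the last column.

```python
from typing import NamedTuple


class OffTargetHit(NamedTuple):
    query: str
    chromosome: str
    allele: int       # 1 or 2, converted from the 0/1 suffix after the colon
    position: int
    site: str
    strand: str
    mismatches: int


def parse_hit(line):
    """Parse one result row into an OffTargetHit."""
    query, chrom_allele, position, site, strand, mismatches = line.split()
    # Split on the last colon so the chromosome identifier stays intact.
    chromosome, allele_index = chrom_allele.rsplit(":", 1)
    return OffTargetHit(
        query, chromosome, int(allele_index) + 1,
        int(position), site, strand, int(mismatches),
    )


def count_lowercase(site):
    """Count lowercase bases (assumed here to mark mismatches)."""
    return sum(1 for base in site if base.islower())


if __name__ == "__main__":
    row = ("GTGAAATCTAAGTGTAGAGNNN CVCM334_CM008455.1:0 15539504 "
           "aaGAAATCTAAGTGTAGAGTGG - 2")
    hit = parse_hit(row)
    print(hit)
    print(count_lowercase(hit.site) == hit.mismatches)
```

Because the allele is stored as 1 or 2, hits on the two alleles of one chromosome can be grouped or compared directly.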