How to use the command-line interface
=====================================

Cas-OFFinder is built on OpenCL and identifies potential off-target sites of CRISPR/Cas-derived RNA-guided endonucleases (RGENs). An OpenCL device is therefore required to run Variant-aware Cas-OFFinder.

Create and activate your environment:

.. code-block:: bash

   conda create -n crispr
   conda activate crispr

Download requirements.txt and vcf-cas-offinder.py from the command-line interface directory and install all dependencies:

.. code-block:: bash

   pip install --no-cache-dir -r requirements.txt

Download the Cas-OFFinder binary file from https://github.com/pnucolab/variant-aware-cas-offinder/raw/refs/heads/main/backend/cas-offinder into the same directory as vcf-cas-offinder.py.

Install the vcflib package using conda:

.. code-block:: bash

   conda install -c bioconda vcflib=1.0.3 tabixpp=1.1.0

Download the chromosome FASTA files for your target organism, either from the links below or from any other source.

- For vertebrates:

  .. code-block:: bash

     https://ftp.ensembl.org/pub/

- For plants:

  .. code-block:: bash

     https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/

Extract all FASTA files into a directory, then index the extracted reference genome within the same directory:

.. code-block:: bash

   samtools faidx ref.genome # replace ref.genome with the actual name of the extracted reference genome

Ensure that the ``+x`` flag is added to the input_vcf_file and the target organism's reference genome directory.

Now, the Variant-aware Cas-OFFinder pipeline can be run with:

.. code-block:: bash

   ./vcf-cas-offinder.py -i input_vcf_file_path -r reference_genome_path -t target_sequence_input_file_name -d device_id

For device_id, you can use G, C, or A:

- G selects GPU devices, C selects CPUs, and A selects accelerators.
- If you have multiple GPU or CPU devices, you can append a device number to limit which ones are used, for example G0 for GPU device ID 0 and G1 for GPU device ID 1.

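
The device_id convention described above can be illustrated with a small sketch. The helper ``parse_device_id`` is hypothetical and is not part of vcf-cas-offinder.py; it assumes only the forms mentioned here, namely a single letter (G, C, or A) optionally followed by a device number.

```python
import re

# Device letters described in this guide.
DEVICE_TYPES = {"G": "GPU", "C": "CPU", "A": "accelerator"}


def parse_device_id(device_id):
    """Split a device_id such as 'G' or 'G0' into (device type, index or None).

    Hypothetical helper for illustration; the real tool may accept more forms.
    """
    match = re.fullmatch(r"([GCA])(\d+)?", device_id)
    if match is None:
        raise ValueError(f"unrecognized device_id: {device_id!r}")
    letter, index = match.groups()
    return DEVICE_TYPES[letter], (int(index) if index is not None else None)


if __name__ == "__main__":
    for value in ("G", "G0", "G1", "C", "A"):
        print(value, "->", parse_device_id(value))
```

A value without a number (index ``None``) stands for all devices of that type, while a numbered value restricts the run to one device.
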
For brief help, run:

.. code-block:: bash

   ./vcf-cas-offinder.py -h

.. code-block:: bash

   usage: vcf-cas-offinder.py [-h] -i INPUT -r REF_PATH -t QUERY_INPUT -d DEVICE_ID

   Identify potential off-target sites based on VCF files.

   options:
     -h, --help            show this help message and exit
     -i INPUT, --input INPUT
                           input file name (Phased and single sample VCF file)
     -r REF_PATH, --ref_path REF_PATH
                           Path to the target organism reference genome
     -t QUERY_INPUT, --query_input QUERY_INPUT
                           target sequence in the target organism genome (input.txt file)
     -d DEVICE_ID, --device_id DEVICE_ID
                           device_id(s): C for CPU and G for GPU, G0 for GPU device id=0

Create an input.txt file in the same directory as vcf-cas-offinder.py. An example of an input file:

.. code-block:: bash

   NNNNNNNNNNNNNNNNNNNNGG
   GTGAAATCTAAGTGTAGAGNNN 2
   TTGTGAAATCTAAGTGTAGNNN 2
   CTTCACAATTATTCGCCCANNN 2
   GGGCGAATAATTGTGAAGGNNN 2
   CTTACAGAAACACCTGTTANNN 2
   AGATTCAAGAATTGGTACGNNN 2
   AACCTTCAGTTAGTCGCTANNN 2
   CACCATAGCGACTAACTGANNN 2
   AGCTCAGGAAGGCCCTCATNNN 2

- The first line indicates the desired pattern, including the PAM site.
- The remaining lines are the query sequences and their maximum mismatch numbers, separated by spaces.
- The desired pattern and the query sequences must all have the same length.

Now you can run Variant-aware Cas-OFFinder as follows (using GPUs):

.. code-block:: bash

   ./vcf-cas-offinder.py -i bgzipresultcm334.vcf.gz -r /home/user/genome/pepper_ref/GCA_000512255.2_ASM51225v2_genomic.fa -t input.txt -d G

The sample result is given below. For this analysis we used the Pepper cultivar (CM334) genome with 2 mismatches.

.. code-block:: bash

   GTGAAATCTAAGTGTAGAGNNN CVCM334_CM008455.1:0 15539504 aaGAAATCTAAGTGTAGAGTGG - 2
   TTGTGAAATCTAAGTGTAGNNN CVCM334_CM008455.1:0 195285628 TTtTGAAAaCTAAGTGTAGAGG + 2
   GTGAAATCTAAGTGTAGAGNNN CVCM334_CM008455.1:1 15539613 aaGAAATCTAAGTGTAGAGTGG - 2
   TTGTGAAATCTAAGTGTAGNNN CVCM334_CM008455.1:1 195287846 TTtTGAAAaCTAAGTGTAGAGG + 2
   GTGAAATCTAAGTGTAGAGNNN CVCM334_CM008456.1:0 150109371 GTGAAATCTAAGTGTAGAGGGG - 0
   TTGTGAAATCTAAGTGTAGNNN CVCM334_CM008456.1:0 29642154 TTGTGAgtTCTAAGTGTAGCGG + 2
   TTGTGAAATCTAAGTGTAGNNN CVCM334_CM008456.1:0 77628291 TTGTcAAATCTAAGaGTAGAGG + 2
   TTGTGAAATCTAAGTGTAGNNN CVCM334_CM008456.1:0 95688428 TTGTGAAAaCTAAGTGTAaAGG - 2
   TTGTGAAATCTAAGTGTAGNNN CVCM334_CM008456.1:0 150109373 TTGTGAAATCTAAGTGTAGAGG - 0
   CTTCACAATTATTCGCCCANNN CVCM334_CM008456.1:0 150076867 CTTCAtAgTTATTCGCCCAAGG + 2
   CTTCACAATTATTCGCCCANNN CVCM334_CM008456.1:0 150071663 CTTCAtAgTTATTCGCCCAAGG + 2
   CTTCACAATTATTCGCCCANNN CVCM334_CM008456.1:0 150089959 CTTCAtAATTATTtGCCCAAGG + 2
   CTTCACAATTATTCGCCCANNN CVCM334_CM008456.1:0 150109711 CTTCACAATTATTCGCCCAAGG - 0
   CTTCACAATTATTCGCCCANNN CVCM334_CM008456.1:0 150133601 CTTCAtAATTATTtGCCCAAGG - 2
   GGGCGAATAATTGTGAAGGNNN CVCM334_CM008456.1:0 150076863 GGGCGAATAAcTaTGAAGGTGG - 2
   GGGCGAATAATTGTGAAGGNNN CVCM334_CM008456.1:0 150071659 GGGCGAATAAcTaTGAAGGTGG - 2
   GGGCGAATAATTGTGAAGGNNN CVCM334_CM008456.1:0 150089955 GGGCaAATAATTaTGAAGGTGG - 2
   GGGCGAATAATTGTGAAGGNNN CVCM334_CM008456.1:0 150109715 GGGCGAATAATTGTGAAGGTGG + 0
   GGGCGAATAATTGTGAAGGNNN CVCM334_CM008456.1:0 150133605 GGGCaAATAATTaTGAAGGTGG + 2
   GTGAAATCTAAGTGTAGAGNNN CVCM334_CM008456.1:1 150111631 GTGAAATCTAAGTGTAGAGGGG - 0
   TTGTGAAATCTAAGTGTAGNNN CVCM334_CM008456.1:1 29642642 TTGTGAgtTCTAAGTGTAGCGG + 2
   TTGTGAAATCTAAGTGTAGNNN CVCM334_CM008456.1:1 77629000 TTGTcAAATCTAAGaGTAGAGG + 2
   TTGTGAAATCTAAGTGTAGNNN CVCM334_CM008456.1:1 95689442 TTGTGAAAaCTAAGTGTAaAGG - 2
   TTGTGAAATCTAAGTGTAGNNN CVCM334_CM008456.1:1 150111633 TTGTGAAATCTAAGTGTAGAGG - 0
   CTTCACAATTATTCGCCCANNN CVCM334_CM008456.1:1 150079117 CTTCAtAgTTATTCGCCCAAGG + 2
   CTTCACAATTATTCGCCCANNN CVCM334_CM008456.1:1 150073913 CTTCAtAgTTATTCGCCCAAGG + 2
   CTTCACAATTATTCGCCCANNN CVCM334_CM008456.1:1 150092209 CTTCAtAATTATTtGCCCAAGG + 2
   CTTCACAATTATTCGCCCANNN CVCM334_CM008456.1:1 150135873 CTTCAtAATTATTtGCCCAAGG - 2
   CTTCACAATTATTCGCCCANNN CVCM334_CM008456.1:1 150111971 CTTCACAATTATTCGCCCAAGG - 0
   GGGCGAATAATTGTGAAGGNNN CVCM334_CM008456.1:1 150111975 GGGCGAATAATTGTGAAGGTGG + 0
   GGGCGAATAATTGTGAAGGNNN CVCM334_CM008456.1:1 150135877 GGGCaAATAATTaTGAAGGTGG + 2
   GGGCGAATAATTGTGAAGGNNN CVCM334_CM008456.1:1 150079113 GGGCGAATAAcTaTGAAGGTGG - 2
   GGGCGAATAATTGTGAAGGNNN CVCM334_CM008456.1:1 150073909 GGGCGAATAAcTaTGAAGGTGG - 2
   GGGCGAATAATTGTGAAGGNNN CVCM334_CM008456.1:1 150092205 GGGCaAATAATTaTGAAGGTGG - 2

- In the second column, ``0`` after the colon represents allele 1, and ``1`` represents allele 2 of each chromosome. In the example shown above, CVCM334_CM008455, CVCM334_CM008456, etc. are chromosome identifiers found in the allelic FASTA files.
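
For downstream analysis, the allele suffix in the second column can be separated from the chromosome identifier. The following is a minimal sketch, not part of the pipeline: the ``Hit`` record, its field names, and ``parse_hit`` are hypothetical, and the code assumes each result line has six whitespace-separated columns in the order shown in the sample result.

```python
from collections import Counter
from typing import NamedTuple


class Hit(NamedTuple):
    """One result line; field names are illustrative, not defined by the tool."""
    query: str
    chromosome: str
    allele: int       # 1 or 2, derived from the 0/1 suffix after the colon
    position: int
    site: str
    strand: str
    mismatches: int


def parse_hit(line):
    """Parse one whitespace-separated result line into a Hit."""
    query, location, position, site, strand, mismatches = line.split()
    # Split on the last colon so the chromosome identifier stays intact.
    chromosome, allele_index = location.rsplit(":", 1)
    return Hit(query, chromosome, int(allele_index) + 1,
               int(position), site, strand, int(mismatches))


# A few lines copied from the sample result.
SAMPLE = """\
GTGAAATCTAAGTGTAGAGNNN CVCM334_CM008455.1:0 15539504 aaGAAATCTAAGTGTAGAGTGG - 2
GTGAAATCTAAGTGTAGAGNNN CVCM334_CM008455.1:1 15539613 aaGAAATCTAAGTGTAGAGTGG - 2
GTGAAATCTAAGTGTAGAGNNN CVCM334_CM008456.1:0 150109371 GTGAAATCTAAGTGTAGAGGGG - 0
"""

if __name__ == "__main__":
    hits = [parse_hit(line) for line in SAMPLE.splitlines() if line.strip()]
    per_allele = Counter((hit.chromosome, hit.allele) for hit in hits)
    for (chromosome, allele), count in sorted(per_allele.items()):
        print(f"{chromosome} allele {allele}: {count} site(s)")
```

Grouping by chromosome and allele in this way makes it easy to compare the sites reported for allele 1 and allele 2 of the same chromosome.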