Gene Neighbourhood Analysis Tool

Help

Input Help

Either amino acid sequences in FASTA format or PSI-BLAST output generated by GNAT may be used as the primary input. Each FASTA entry consists of a header line followed by one or more sequence lines. Several FASTA entries may be provided, one after another, as shown below:

>Cas1 ASZ84974.1 [Staphylococcus argenteus]

MKDVIYVENHYFVTVKENSIKFRNVIDKSEKFYLFEEIEAIIFDHYKSYFSHKLVIKCIENDIAIIFCDKKHSPLTQLISSYGMTHRLQRIQSQFQLSGRTRDRIWKKIVVNKIINQSKCLENNLHNENVKLLVNLAKDVSSGDKSNKEAQAARIYFKDLYGKQFKRGRYNDIINSGLNYGYSILRSFIKKELALHGFEMSLDIKHHSKENPFNLADDIIEVFRPFIDNIVYDIVFKNNINTFDINEKKLLLNVLYEKCIIDKKVVRLLDSIKIVVQSLIKCYDENTPTPLSLPKMIEVGN

>Cas2 SNX31426.1 [Limosilactobacillus fermentum]

MRYRIMRLMVMFDLPTDTSQQRKQYRQFRKKLLNEGFIMIQYSVYVRVCTTRSAAEFLERRLKNYLPAQGIIQSLMLTEKQYSDMHFLLGDEIEEVRNSSKRTIVL

Note that the name used to label query matches is taken from the item immediately following the '>' on the FASTA header line. Thus, matches to the first query will be labelled as 'Cas1', and matches to the second will be labelled as 'Cas2'. If the accession number is the first item on the line, matches will be labelled with the query accession rather than the query name.
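The labelling rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not GNAT's own code; the function name fasta_labels is hypothetical, and the sequences in the example are shortened for brevity.

```python
def fasta_labels(fasta_text):
    """Return one label per FASTA entry: the first whitespace-delimited
    item after '>' on each header line."""
    labels = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            # Assumption: a bare '>' header yields an empty label in this sketch.
            labels.append(fields[0] if fields else "")
    return labels


# Sequences shortened for illustration. In the second entry the accession
# comes first, so it becomes the label instead of the name.
example = """>Cas1 ASZ84974.1 [Staphylococcus argenteus]
MKDVIYVENHYF
>SNX31426.1 Cas2 [Limosilactobacillus fermentum]
MRYRIMRLMVMF
"""
print(fasta_labels(example))  # ['Cas1', 'SNX31426.1']
```

Checking the labels this way before submission can help avoid matches being labelled with an accession when a name was intended.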

To bypass the time-consuming PSI-BLAST step, existing PSI-BLAST output may be provided instead. It must be formatted the same way as the output generated by GNAT, so the outfmt option should be specified as follows:

-outfmt "6 qacc sacc pident qlen slen evalue"

This causes each line of the PSI-BLAST output to consist of the query accession, the match accession (itself a series of fields delimited by '|'), the percent identity of the match versus the query, the query length, the match length, and the e-value of the match. These items are tab-delimited, as follows:

Cas1 ./GCA_026515185.1_PDT001502071.1_genomic.gbff.gz|DAKKFC010000023|HCV8640298.1|cas1|18006|18911|type 95.681 301 301 1.07e-162

Cas1 ./GCA_027515845.1_PDT001554527.1_genomic.gbff.gz|DALVGR010000013|HDD3950907.1|cas1|52853|53758|type 95.681 301 301 1.07e-162

Cas1 ./GCA_027754765.1_PDT001566508.1_genomic.gbff.gz|DAMLLS010000002|HDF5833897.1|cas1|205885|206790|type 95.681 301 301 1.07e-162

Cas1 ./GCA_027966205.1_PDT001578719.1_genomic.gbff.gz|DAMTGY010000001|HDH2137809.1|cas1|27113|28018|type 95.681 301 301 1.07e-162

Cas1 ./GCA_027033375.1_PDT001534964.1_genomic.gbff.gz|DALEUO010000001|HDA3386104.1|cas1|206121|207026|type 95.681 301 301 1.07e-162

Cas1 ./GCA_027509165.1_PDT001554862.1_genomic.gbff.gz|DALUTV010000002|HDD3025489.1|cas1|18402|19307|type 95.681 301 301 1.07e-162

Cas2 ./GCA_000361085.1_Stap_aure_M0408_V1_genomic.gbff.gz|KB821326|ENK08569.1|?|746694|747002|CRISPR-associated 46.078 106 102 2.02e-26

Cas2 ./GCA_024020075.1_PDT001337643.1_genomic.gbff.gz|DAHQPH010000002|HCG2514186.1|cas2|35882|36190|CRISPR-associated 48.352 106 102 2.55e-26

Cas2 ./GCA_024022575.1_PDT001337516.1_genomic.gbff.gz|DAHQUB010000002|HCG2858496.1|cas2|35882|36190|CRISPR-associated 48.352 106 102 2.55e-26
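As a rough illustration of this layout, the following Python sketch splits one output line into its six columns and then splits the match accession on '|'. The names Hit and parse_hit are hypothetical and not part of GNAT. The meaning of the individual '|'-delimited fields is not spelled out above, so the sketch keeps them as a plain list; in the example lines the third field looks like a protein accession, but that reading is an assumption.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    query: str          # qacc: label of the query
    subject: List[str]  # sacc, split on '|'
    pident: float       # percent identity of the match versus the query
    qlen: int           # query length
    slen: int           # match length
    evalue: float       # e-value of the match


def parse_hit(line):
    """Parse one line written with -outfmt "6 qacc sacc pident qlen slen evalue"."""
    # split() with no argument accepts tabs (the documented delimiter)
    # as well as runs of spaces.
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, found {len(fields)}")
    qacc, sacc, pident, qlen, slen, evalue = fields
    return Hit(qacc, sacc.split("|"), float(pident), int(qlen), int(slen), float(evalue))


line = (
    "Cas2\t./GCA_000361085.1_Stap_aure_M0408_V1_genomic.gbff.gz|KB821326|"
    "ENK08569.1|?|746694|747002|CRISPR-associated\t46.078\t106\t102\t2.02e-26"
)
hit = parse_hit(line)
print(hit.query, hit.subject[2])  # Cas2 ENK08569.1
```

A check like this can be used to confirm that a PSI-BLAST output file has the expected six columns before it is uploaded.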

As optional inputs, the user may specify a label file (generated as .label by GNAT), which is updated with new matches and used to label additional matches. If no label file is specified, a blank one will be generated and filled exclusively with the query-match mapping of the current job. The default e-value and number of iterations for PSI-BLAST searches may also be modified. In addition, the user may modify the default length threshold and percent identity (PID) threshold used to filter the generated genomic neighbourhoods, as well as the default number of flanking genes used to generate the genomic neighbourhoods. The user may also specify whether phylogenetic trees and pie charts should be generated.
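To make the flanking-gene setting concrete, here is a minimal Python sketch of one way a neighbourhood could be cut out around a match. It assumes the genes of a contig are available as an ordered list and that the neighbourhood is simply truncated at the ends of the contig. Both points are assumptions for illustration; the function name neighbourhood is hypothetical and GNAT's actual procedure may differ.

```python
def neighbourhood(genes, match_index, flanking):
    """Return the matched gene plus up to `flanking` genes on each side.

    genes       -- ordered list of genes on one contig (assumed input form)
    match_index -- position of the PSI-BLAST match within `genes`
    flanking    -- number of flanking genes requested on each side
    """
    # Clamp the window so that it never runs past either end of the contig.
    start = max(0, match_index - flanking)
    end = min(len(genes), match_index + flanking + 1)
    return genes[start:end]


genes = ["geneA", "geneB", "geneC", "cas1", "cas2", "geneF", "geneG"]
print(neighbourhood(genes, 3, 2))  # ['geneB', 'geneC', 'cas1', 'cas2', 'geneF']
print(neighbourhood(genes, 0, 2))  # ['geneA', 'geneB', 'geneC']
```

Under these assumptions, a larger number of flanking genes gives wider neighbourhoods, and matches near the end of a contig give shorter ones.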


Output Help

Important: output directories will be removed from the server 7 days after submission. Be sure to download any data prior to deletion.

Upon submission of a job, the user will be provided with a job ID. Submitting this ID on the job page redirects the user to an output page containing the output generated by their job so far.

On the left side of the output page, the user will be presented with a directory browser with which they can browse through their output directory. In the root of the output directory, the user will find the FASTA file used as a query (if a FASTA sequence was input), the PSI-BLAST output file, the match label file, and one directory per query that resulted in PSI-BLAST matches, containing that query's genomic neighbourhoods and clinker plots.

Each of these query directories contains a subdirectory holding clusters of all of the non-redundant genomic neighbourhoods, a subdirectory holding the redundant genomic neighbourhoods (i.e., genomic neighbourhoods that were completely identical to another), a mapping file recording the number of redundant genomic neighbourhoods represented by each non-redundant genomic neighbourhood, and two subdirectories containing the genomic neighbourhoods whose match was above and below the PID threshold, respectively. These PID subdirectories are further divided into subdirectories separating the genomic neighbourhoods into those whose match is within and outside of the length threshold of the query protein. Within these subdirectories are clusters of approximately 20 genomic neighbourhoods, which were plotted with clinker. Finally, at each level in the output subdirectories, phylogenetic trees and pie charts are generated if 'Generate Taxonomic Data' was left checked in the optional input settings.
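The two-level split by PID and length can be pictured with the short Python sketch below. It is illustrative only: the bin names, the function name classify, and the threshold values in the example are hypothetical. The sketch also assumes that the length threshold is a fraction of the query length within which the match length must fall, which is not confirmed by this page, as is the choice to count a match exactly at the PID threshold as "above".

```python
def classify(pident, qlen, slen, pid_threshold, length_threshold):
    """Return the (PID bin, length bin) that a match would fall into.

    Assumption: the match is 'within' the length threshold when its length
    differs from the query length by at most length_threshold * qlen.
    """
    # Assumption: a match exactly at the PID threshold counts as 'above'.
    pid_bin = "above_pid" if pident >= pid_threshold else "below_pid"
    within = abs(slen - qlen) <= length_threshold * qlen
    length_bin = "within_length" if within else "outside_length"
    return pid_bin, length_bin


# Values taken from two of the PSI-BLAST example lines in the Input Help;
# the thresholds (70.0 and 0.2) are hypothetical, not GNAT's defaults.
print(classify(95.681, 301, 301, 70.0, 0.2))  # ('above_pid', 'within_length')
print(classify(46.078, 106, 102, 70.0, 0.2))  # ('below_pid', 'within_length')
```

Under these assumptions, every genomic neighbourhood lands in exactly one of four bins, which mirrors the nesting of the PID and length subdirectories described above.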

When a file is selected in the directory browser, it will be opened on the right side of the output page, where it can be downloaded or closed using the buttons on the right side of the file header. Furthermore, the entire output directory retrieved by the job ID may be downloaded using the button at the top of the directory browser.

Note that the .html clinker plots are most responsive on Google Chrome. If you are having problems loading/interacting with these plots, try using Chrome.