GNAT (Gene Neighbourhood Analysis Tool) is a generalizable tool used to visualize the genomic neighbourhoods of protein(s) of interest and any homologous proteins found within a specified database. Either a protein FASTA file containing the protein(s) of interest or a PSI-BLAST .out file generated by GNAT can be used as input (for more information, see the help section below). This tool is particularly useful for exploring the genomic neighbourhoods of genes that tend to cluster, such as those in bacterial operons. Accordingly, the database on the server contains Staphylococcus aureus sequences from NCBI's FTP site.
Upon submitting a sequence to GNAT, the user will be redirected to a page with their job ID, which can be used to access the results when they are ready. GNAT uses the submitted FASTA sequence(s) as PSI-BLAST queries against the Staphylococcus aureus database. The matches found are then used to generate genomic neighbourhoods consisting of a set number of genes (default=20) upstream and downstream of each matching protein. These genomic neighbourhoods are then filtered based on the match's percent identity (default=30%) and length (default=20%) relative to the query protein. The filtered genomic neighbourhoods are then clustered into groups of approximately 20 and visualized using a modified version of clinker to generate interactive plots with the matches labelled. Phylogenetic trees and pie charts are also generated at every step of the pipeline to show the distribution of the matches across the database.
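The neighbourhood-building and filtering steps described above can be illustrated with a minimal Python sketch. This is not GNAT's own code: the function names are hypothetical, and it assumes that the thresholds are inclusive, that the length threshold is a symmetric percentage deviation from the query length, and that a neighbourhood is simply truncated at the end of a contig.

```python
def flanking_window(genes, match_index, flank=20):
    """Return the match plus up to `flank` genes on either side.

    Assumption: the window is truncated at the ends of the gene list.
    """
    start = max(0, match_index - flank)
    return genes[start:match_index + flank + 1]


def passes_pid(percent_identity, pid_threshold=30.0):
    """True if the match meets the percent identity threshold (assumed inclusive)."""
    return percent_identity >= pid_threshold


def within_length(query_length, match_length, length_threshold=20.0):
    """True if the match length deviates from the query length by at most
    `length_threshold` percent (assumed symmetric and inclusive)."""
    deviation = abs(match_length - query_length) / query_length * 100
    return deviation <= length_threshold


# Example: a 102-residue match to a 106-residue query at 48.352% identity.
print(passes_pid(48.352))          # True
print(within_length(106, 102))     # True
print(len(flanking_window(list(range(100)), 50)))  # 41
```

Keeping the two filters as separate functions mirrors the way results are sorted first by PID threshold and then by length threshold.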
Either an amino acid sequence in FASTA format or PSI-BLAST output generated by GNAT may be used as the primary input. The amino acid sequence(s) consist of a header line followed by one or more sequence lines. Several FASTA entries may be provided, one after another, as shown below:
>Cas1 ASZ84974.1 [Staphylococcus argenteus]
MKDVIYVENHYFVTVKENSIKFRNVIDKSEKFYLFEEIEAIIFDHYKSYFSHKLVIKCIENDIAIIFCDKKHSPLTQLISSYGMTHRLQRIQSQFQLSGRTRDRIWKKIVVNKIINQSKCLENNLHNENVKLLVNLAKDVSSGDKSNKEAQAARIYFKDLYGKQFKRGRYNDIINSGLNYGYSILRSFIKKELALHGFEMSLDIKHHSKENPFNLADDIIEVFRPFIDNIVYDIVFKNNINTFDINEKKLLLNVLYEKCIIDKKVVRLLDSIKIVVQSLIKCYDENTPTPLSLPKMIEVGN
>Cas2 SNX31426.1 [Limosilactobacillus fermentum]
MRYRIMRLMVMFDLPTDTSQQRKQYRQFRKKLLNEGFIMIQYSVYVRVCTTRSAAEFLERRLKNYLPAQGIIQSLMLTEKQYSDMHFLLGDEIEEVRNSSKRTIVL
Note that the name used to label query matches is pulled from the item immediately following the '>' on the FASTA header line. Thus, matches to the first query will be labelled as 'Cas1', and matches to the second will be labelled as 'Cas2'. If the accession number is the first item on the line, matches will be labelled with the query accession rather than the query name.
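To check which labels a FASTA file would produce before submitting it, the labelling rule above can be sketched in Python. The function name is hypothetical, and the sketch assumes that "item" means the first whitespace-delimited token after the '>'.

```python
def fasta_labels(fasta_text):
    """Return the first whitespace-delimited token of each FASTA header line."""
    labels = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            # An empty header yields an empty label.
            labels.append(tokens[0] if tokens else "")
    return labels


example = """>Cas1 ASZ84974.1 [Staphylococcus argenteus]
MKDVIYVENHYFVTVKENSIKFRNVIDKSEKFYLFEEIEAIIFDHYKSYFSHKLVIKCIEN
>Cas2 SNX31426.1 [Limosilactobacillus fermentum]
MRYRIMRLMVMFDLPTDTSQQRKQYRQFRKKLLNEGFIMIQYSVYVRVCTTRSAAEFLERR
"""
print(fasta_labels(example))  # ['Cas1', 'Cas2']
```

Placing the name first on the header line therefore gives a readable label, while placing the accession first gives an accession label.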
To bypass the time-consuming PSI-BLAST step, PSI-BLAST output may be provided instead. This must be formatted in the same way as the output generated by GNAT; the outfmt tag should be specified as such:
This causes each line of the PSI-BLAST output to consist of the query accession, the match accession (a '|'-delimited identifier), the percent identity of the match versus the query, the query length, the match length, and the e-value of the match. These items are tab-delimited, as follows:
Cas1 ./GCA_026515185.1_PDT001502071.1_genomic.gbff.gz|DAKKFC010000023|HCV8640298.1|cas1|18006|18911|type 95.681 301 301 1.07e-162
Cas1 ./GCA_027515845.1_PDT001554527.1_genomic.gbff.gz|DALVGR010000013|HDD3950907.1|cas1|52853|53758|type 95.681 301 301 1.07e-162
Cas1 ./GCA_027754765.1_PDT001566508.1_genomic.gbff.gz|DAMLLS010000002|HDF5833897.1|cas1|205885|206790|type 95.681 301 301 1.07e-162
Cas1 ./GCA_027966205.1_PDT001578719.1_genomic.gbff.gz|DAMTGY010000001|HDH2137809.1|cas1|27113|28018|type 95.681 301 301 1.07e-162
Cas1 ./GCA_027033375.1_PDT001534964.1_genomic.gbff.gz|DALEUO010000001|HDA3386104.1|cas1|206121|207026|type 95.681 301 301 1.07e-162
Cas1 ./GCA_027509165.1_PDT001554862.1_genomic.gbff.gz|DALUTV010000002|HDD3025489.1|cas1|18402|19307|type 95.681 301 301 1.07e-162
Cas2 ./GCA_000361085.1_Stap_aure_M0408_V1_genomic.gbff.gz|KB821326|ENK08569.1|?|746694|747002|CRISPR-associated 46.078 106 102 2.02e-26
Cas2 ./GCA_024020075.1_PDT001337643.1_genomic.gbff.gz|DAHQPH010000002|HCG2514186.1|cas2|35882|36190|CRISPR-associated 48.352 106 102 2.55e-26
Cas2 ./GCA_024022575.1_PDT001337516.1_genomic.gbff.gz|DAHQUB010000002|HCG2858496.1|cas2|35882|36190|CRISPR-associated 48.352 106 102 2.55e-26
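The six-column layout described above can be read with a short Python sketch. The names `Hit` and `parse_hit` are hypothetical and not part of GNAT. The sketch splits on tabs when they are present; otherwise it falls back to taking the first field from the left and the four numeric fields from the right, so that a match identifier containing spaces stays intact. The meaning of the individual '|'-separated fields is not specified here, so they are kept as a plain list.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    query: str
    match_fields: List[str]   # match identifier split on '|'
    percent_identity: float
    query_length: int
    match_length: int
    evalue: float


def parse_hit(line):
    """Parse one line of the six-column PSI-BLAST output described above."""
    line = line.strip()
    if "\t" in line:
        columns = line.split("\t")
    else:
        # Fallback for space-separated text: query first, numbers last.
        query, rest = line.split(None, 1)
        columns = [query] + rest.rsplit(None, 4)
    if len(columns) != 6:
        raise ValueError(f"expected 6 columns, got {len(columns)}")
    query, match, pid, qlen, mlen, evalue = columns
    return Hit(query, match.split("|"), float(pid), int(qlen), int(mlen), float(evalue))


line = ("Cas1 ./GCA_026515185.1_PDT001502071.1_genomic.gbff.gz|DAKKFC010000023|"
        "HCV8640298.1|cas1|18006|18911|type 95.681 301 301 1.07e-162")
hit = parse_hit(line)
print(hit.query, hit.percent_identity, hit.query_length, hit.match_length)
```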
As optional inputs, the user may specify a label file (generated as .label by GNAT) which is updated with new matches and used to label additional matches. If no label file is specified, a blank one will be generated and filled exclusively with the query-match mapping of the current job. The default e-value and number of iterations for PSI-BLAST searches may also be modified. In addition, the user may modify the default length threshold and PID threshold used to filter the generated genomic neighbourhoods, as well as the default number of flanking genes used to generate the genomic neighbourhoods. The user may also specify whether or not they wish phylogenetic trees and pie charts to be generated.
Important: output directories will be removed from the server 7 days after submission. Be sure to download any data prior to deletion.
Upon submission of a job, the user will be provided with a job ID. This ID can then be submitted to the job page. This will redirect the user to an output page containing the output generated by their job so far.
On the left side of the output page, the user will be presented with a directory browser with which they can browse through their output directory. In the root of the output directory, the user will find the FASTA file used as a query (if a FASTA sequence was input), the PSI-BLAST output file, the match label file, and a directory containing genomic neighbourhoods and clinker plots for each query that resulted in PSI-BLAST matches. Each of these directories contains a subdirectory of clusters of all of the non-redundant genomic neighbourhoods, a subdirectory of redundant genomic neighbourhoods (i.e., genomic neighbourhoods that were completely identical to another), a mapping file recording the number of redundant genomic neighbourhoods represented by each non-redundant genomic neighbourhood, and subdirectories containing the genomic neighbourhoods whose match was above and below the PID threshold. These PID subdirectories are further divided into subdirectories separating the genomic neighbourhoods into those whose match is within and outside of the length threshold of the query protein. Within these subdirectories are clusters of approximately 20 genomic neighbourhoods which were plotted with clinker. Finally, at each level in the output subdirectories, phylogenetic trees and pie charts are generated if 'Generate Taxonomic Data' was left checked in the optional input settings.
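The relationship between non-redundant neighbourhoods, redundant neighbourhoods, and the mapping file can be pictured with a small Python sketch. It is an illustration only: the function name is hypothetical, and it assumes that two neighbourhoods are "completely identical" when their ordered gene lists are equal and that the first one encountered is kept as the representative. GNAT's actual comparison and file format may differ.

```python
def collapse_redundant(neighbourhoods):
    """Map each representative neighbourhood name to the number of
    identical (redundant) neighbourhoods it stands in for.

    `neighbourhoods` maps a name to an ordered list of gene identifiers.
    """
    representative_for = {}  # gene tuple -> name of first neighbourhood seen
    redundant_counts = {}    # representative name -> number of duplicates
    for name, genes in neighbourhoods.items():
        key = tuple(genes)
        if key in representative_for:
            redundant_counts[representative_for[key]] += 1
        else:
            representative_for[key] = name
            redundant_counts[name] = 0
    return redundant_counts


counts = collapse_redundant({
    "genome_A": ["cas1", "cas2", "csn2"],
    "genome_B": ["cas1", "cas2", "csn2"],
    "genome_C": ["cas1", "cas2"],
})
print(counts)  # {'genome_A': 1, 'genome_C': 0}
```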
When a file is selected in the directory browser, it will be opened on the right side of the output screen, where it can be downloaded or closed using the buttons on the right side of the file header. Furthermore, the entire output directory retrieved by the job ID may be downloaded using the button at the top of the directory browser.
Note that the .html clinker plots are most responsive on Google Chrome. If you are having problems loading/interacting with these plots, try using Chrome.