Genepidgin select

Goals

Genepidgin select generates gene product names from alignments to proteins in curated libraries (currently FIGfam, KEGG, Pfam, RefSeq, SwissProt and TIGRFAM). Blast and hmmer alignments from those libraries are read into Genepidgin via simple data formats (.pidginb and .pidginh, respectively), where they are sifted through to find the best name.

Selection Recipe

Summary

Sort qualifying sources, preferring: hmmer alignments to blast alignments, a lower e-value in hmmer hits, and a higher percent identity in blast hits. Walk through the sorted list until we find a name that remains informative after running through Genepidgin cleaner.

Details

Group all evidence by dest_id and consider each dest_id independently.

Over the course of this search, if a name filters to something uninformative (via Genepidgin cleaner), then examine the next relevant source, until either a valid source and name are found, or no sources remain and the name “hypothetical protein” is assigned.

Start by examining the hmmer hits. Remove hits that are neither TIGRFAM equivalogs nor Pfam hits labeled as equivalog-equivalents by JCVI. Next, remove hits whose score is less than its family_trusted_cutoff (see .pidginh). Take the name of the hit with the lowest e-value. If multiple hits have equivalent e-values, select the hit with the highest bit score.

If a dest_id has no hmmer hits deemed sutable for naming, examine the blast evidence (see .pidginb or .blastm8), calculating the following terms:

source_coverage = (source_stop - source_start + 1) / source_len
dest_coverage = (dest_stop - dest_start + 1) / dest_len
min_coverage = min(source_coverage, dest_coverage)

source_pct_identity = num_identities / source_len
dest_pct_identity = num_identities / dest_len
min_pct_identity = min(source_pct_identity, dest_pct_identity)

upper_pct_identity = max(min_pct_identity for all hits whose min_coverage ≥ 0.6)
lower_pct_identity = max(0.5, upper_pct_identity - 0.05)

Cluster all hits associated with dest_id that have min_coverage ≥ 0.6 and whose min_pct_identity is between upper_pct_identity and lower_pct_identity (inclusive). If upper_pct_identity < lower_pct_identity, ignore all hits.

If the cluster is not empty, and any of the hits in the cluster has a source_auth (see .pidginb) of KEGG, then select the name from the one with the highest min_pct_identity. If there are no hits from KEGG, proceed to SwissProt hits, then FIGfam and finally RefSeq, searching in each bin for the hit with the highest min_pct_identity within that bin.

Usage

Given a series of data files, use the selection recipe described above to determine product names for the given genes.

genepidgin select (options) [inputfiles]

options:
  -o --output    : where to save files, defaults to ./pidgin_names.txt
  -e --etymology : where to save etymology (debug), defaults to ./pidgin_etymology.txt
  -h --help      : this information
  --use_custom_blast_cutoffs : use different cutoffs for different sources

The format of Input and Output files are described below.

Input

Any number of input files following the following three formats are permitted. The ordering of the files, and the ordering of the lines within the files, does not matter. No tabs, newlines, or control characters are permitted in any of these fields.

.pidginb

All files with the extension .pidginb are assumed to contain BLAST alignments.

Each line in a .pidginb file will consist of the following tab-separated fields:

0.  dest_id STRING an identifier for a destination protein (i.e., a protein that
    should receive a name)
1.  dest_start INTEGER 1-based index of first aligned amino acid in destination
    protein
2.  dest_stop INTEGER 1-based index of last aligned amino acid in destination
    protein
3.  dest_len INTEGER number of amino acids in destination protein
4.  source_id STRING an identifier for a source protein (i.e., a protein whose
    name should be considered for assignment to the destination protein)
5.  source_start INTEGER 1-based index of first aligned amino acid in source
    protein
6.  source_stop INTEGER 1-based index of last aligned amino acid in source
    protein
7.  source_len INTEGER number of amino acids in source protein
8.  source_auth STRING the source of the data, used for heuristic processing,
    must be one of:
      - "FIGfam"
      - "KEGG"
      - "RefSeq"
      - "SwissProt"
9.  num_identities INTEGER number of exact amino acid matches in alignment
10. num_similarities INTEGER number of similar amino acid matches in alignment
11. raw_name STRING the name of the source protein
12. comment STRING can be used for any purpose

A sample line:

7000002454063496        134     581     448     7000000120703332        127     596     470     FIGfam      151     227     FIG029094-5 IncW plasmid conjugative protein TrwB (TraD homolog)

.pidginh

All files with the extension .pidginh are assumed to contain HMMER alignments.

Note: per-domain scores are ignored; we consider the whole hit only.

Each line in a .pidginh file will consist of the following tab-separated fields:

0.  dest_id STRING an identifier for a destination protein (i.e., a protein that
    should receive a name)
1.  dest_start INTEGER 1-based index of first aligned amino acid in destination
    protein
2.  dest_stop INTEGER 1-based index of last aligned amino acid in destination
    protein
3.  dest_len INTEGER number of amino acids in destination protein
4.  source_id STRING an identifier for a source family (i.e., a profile whose
    name should be considered for assignment to the destination protein)
    currently should be a TIGRFAM or Pfam id.
5.  source_start INTEGER 1-based index of first aligned position in source family
6.  source_stop INTEGER 1-based index of last aligned position in source family
7.  source_len INTEGER number of positions in source family
8.  score FLOAT score reported by hmmer
9.  family_trusted_cutoff FLOAT
10. e_value FLOAT+INTEGER in the format X.XXeY where X.XX is a positive float and Y is an integer
11. raw_name STRING the name of the source family
12. comment STRING can be used for any purpose

A sample line:

7000002454071269        3       140     138     TIGRfam 13      155     143     83.519997       80.000000   -21.585027      ribosomal-protein-alanine acetyltransferase

.blastm8

Blast -m8 format is also acceptable, but requires also submitting a name key via --ref, as m8 format contains no names. This method also has much slower execution time.

It is assumed that all names derived from .blastm8 have lower priority than other sources.

Output

The names of these files are governed by the option usage, as described above.

Names

Each line of the name file has four columns:

0. dest_id STRING an identifier for a destination protein (i.e., a protein that should receive a name)
1. name STRING the best available name for the destination protein
2. source_id STRING the id of the blast or hmmer hit used to name this protein
3. comment STRING the comment field from the line used to name this protein

A snippet from a names.txt from a development run:

7000002454076078   fructose-1-6-bisphosphatase   FIGfam    run on library updated 2009/10/22
7000002454076081   hypothetical protein          (blank)   (blank)

Note that hypothetical proteins don’t have the final two fields, as they did not pick up a name from the given sources.

Etymology

The etymology file consists of a sequence of entries. Each entry describes the process by which the resulting name was given, showing tracking information as data is discarded and then summary information of how the name was cleaned up (plugs directly into Genepidgin cleaner) before it is presented.

Entries are separated by five equals signs and a newline: =====

Each entry begins with the dest_id alone on the first line of the block. Convenient for searching!

A snippet of a local run:

7000002454076078
1 hmmer source found.
0 hmmer sources were removed due to not meeting the trusted family score.
One hmmer source had a good name.
Found an acceptable name in the hmmer sources. The one we liked best came from:
./test/Rho_sphaeroides_241_HMMERTRANSCRIPTS_17.pidginh:2013
This source's name was cleaned up by genepidgin:
filtered name in 1 step:
0) original: Fructose-1-6-bisphosphatase
1)   reason: protein names should not start with a capital letter
    pattern: (?:(?<=similar to )|^)([A-Z])(?=[a-z][a-z]+([ /,-]|$))
   filtered: fructose-1-6-bisphosphatase
Final name: fructose-1-6-bisphosphatase
=====
7000002454076081
0 hmmer sources found.
No name was derived from hmmer sources.
2 blast sources were found.
0 blast sources were removed by filtering for low coverage (<0.6).
The highest percent identity of any remaining blast source is 0.992. The lowest is 0.945.
0 blast sources were removed due to not being within the percent identity window (0.992, 0.942).
All 2 blast sources had names that filtered to nothing.
No name was ultimately selected from any of the supplied sources.
Final name: hypothetical protein

use_custom_blast_cutoffs

As hardcoded in pidgin.select, widens the cutoff margin for blast hits (currently KEGG-only, check source for details).

Note

Please see Credits for contributor information.

Project Versions

Table Of Contents

Previous topic

Genepidgin compare

Next topic

Credits

This Page