Genepidgin *select*
===================

Goals
-----

Genepidgin *select* generates gene product names from alignments to proteins in curated libraries (currently FIGfam, KEGG, Pfam, RefSeq, SwissProt and TIGRFAM). Blast and hmmer alignments from those libraries are read into Genepidgin via simple data formats (`.pidginb`_ and `.pidginh`_, respectively), where they are sifted through to find the best name.

Selection Recipe
----------------

Summary
~~~~~~~

Sort qualifying sources, preferring: hmmer alignments to blast alignments, a lower e-value in hmmer hits, and a higher percent identity in blast hits. Walk through the sorted list until we find a name that remains informative after running through :doc:`cleaner`.

Details
~~~~~~~

Group all evidence by ``dest_id`` and consider each ``dest_id`` independently.

Over the course of this search, if a name filters to something uninformative (via :doc:`cleaner`), then examine the next relevant source, until either a valid source and name are found, or no sources remain and the name "hypothetical protein" is assigned.

Start by examining the hmmer hits. Remove hits that are neither TIGRFAM equivalogs nor Pfam hits labeled as equivalog-equivalents by JCVI. Next, remove hits whose score is less than its ``family_trusted_cutoff`` (see `.pidginh`_). Take the name of the hit with the lowest e-value. If multiple hits have equivalent e-values, select the hit with the highest bit score.

If a ``dest_id`` has no hmmer hits deemed sutable for naming, examine the blast evidence (see `.pidginb`_ or `.blastm8`_), calculating the following terms:

::

    source_coverage = (source_stop - source_start + 1) / source_len
    dest_coverage = (dest_stop - dest_start + 1) / dest_len
    min_coverage = min(source_coverage, dest_coverage)

    source_pct_identity = num_identities / source_len
    dest_pct_identity = num_identities / dest_len
    min_pct_identity = min(source_pct_identity, dest_pct_identity)

    upper_pct_identity = max(min_pct_identity for all hits whose min_coverage ≥ 0.6)
    lower_pct_identity = max(0.5, upper_pct_identity - 0.05)

Cluster all hits associated with ``dest_id`` that have ``min_coverage`` ≥ 0.6 and whose ``min_pct_identity`` is between ``upper_pct_identity`` and ``lower_pct_identity`` (inclusive). If ``upper_pct_identity`` < ``lower_pct_identity``, ignore all hits.

If the cluster is not empty, and any of the hits in the cluster has a ``source_auth`` (see `.pidginb`_) of KEGG, then select the name from the one with the highest ``min_pct_identity``. If there are no hits from KEGG, proceed to SwissProt hits, then FIGfam and finally RefSeq, searching in each bin for the hit with the highest ``min_pct_identity`` within that bin.

Usage
-----

Given a series of data files, use the selection recipe described above to determine product names for the given genes.

::

    genepidgin select (options) [inputfiles]

    options:
      -o --output    : where to save files, defaults to ./pidgin_names.txt
      -e --etymology : where to save etymology (debug), defaults to ./pidgin_etymology.txt
      -h --help      : this information
      --use_custom_blast_cutoffs : use different cutoffs for different sources

The format of Input and Output files are described below.

Input
-----

Any number of input files following the following three formats are permitted. The ordering of the files, and the ordering of the lines within the files, does not matter. No tabs, newlines, or control characters are permitted in any of these fields.

``.pidginb``
~~~~~~~~~~~~

All files with the extension ``.pidginb`` are assumed to contain BLAST alignments.

Each line in a ``.pidginb`` file will consist of the following tab-separated fields:

::

    0.  dest_id STRING an identifier for a destination protein (i.e., a protein that
        should receive a name)
    1.  dest_start INTEGER 1-based index of first aligned amino acid in destination
        protein
    2.  dest_stop INTEGER 1-based index of last aligned amino acid in destination
        protein
    3.  dest_len INTEGER number of amino acids in destination protein
    4.  source_id STRING an identifier for a source protein (i.e., a protein whose
        name should be considered for assignment to the destination protein)
    5.  source_start INTEGER 1-based index of first aligned amino acid in source
        protein
    6.  source_stop INTEGER 1-based index of last aligned amino acid in source
        protein
    7.  source_len INTEGER number of amino acids in source protein
    8.  source_auth STRING the source of the data, used for heuristic processing,
        must be one of:
          - "FIGfam"
          - "KEGG"
          - "RefSeq"
          - "SwissProt"
    9.  num_identities INTEGER number of exact amino acid matches in alignment
    10. num_similarities INTEGER number of similar amino acid matches in alignment
    11. raw_name STRING the name of the source protein
    12. comment STRING can be used for any purpose

A sample line:

::

    7000002454063496        134     581     448     7000000120703332        127     596     470     FIGfam      151     227     FIG029094-5 IncW plasmid conjugative protein TrwB (TraD homolog)

``.pidginh``
~~~~~~~~~~~~

All files with the extension ``.pidginh`` are assumed to contain HMMER alignments.

Note: per-domain scores are ignored; we consider the whole hit only.

Each line in a ``.pidginh`` file will consist of the following tab-separated fields:

::

    0.  dest_id STRING an identifier for a destination protein (i.e., a protein that
        should receive a name)
    1.  dest_start INTEGER 1-based index of first aligned amino acid in destination
        protein
    2.  dest_stop INTEGER 1-based index of last aligned amino acid in destination
        protein
    3.  dest_len INTEGER number of amino acids in destination protein
    4.  source_id STRING an identifier for a source family (i.e., a profile whose
        name should be considered for assignment to the destination protein)
        currently should be a TIGRFAM or Pfam id.
    5.  source_start INTEGER 1-based index of first aligned position in source family
    6.  source_stop INTEGER 1-based index of last aligned position in source family
    7.  source_len INTEGER number of positions in source family
    8.  score FLOAT score reported by hmmer
    9.  family_trusted_cutoff FLOAT
    10. e_value FLOAT+INTEGER in the format X.XXeY where X.XX is a positive float and Y is an integer
    11. raw_name STRING the name of the source family
    12. comment STRING can be used for any purpose

A sample line:

::

    7000002454071269        3       140     138     TIGRfam 13      155     143     83.519997       80.000000   -21.585027      ribosomal-protein-alanine acetyltransferase

``.blastm8``
~~~~~~~~~~~~

Blast ``-m8`` format is also acceptable, but requires also submitting a name key via ``--ref``, as m8 format contains no names. This method also has much slower execution time.

It is assumed that all names derived from ``.blastm8`` have lower priority than other sources.

Output
------

The names of these files are governed by the option usage, as described above.

Names
~~~~~

Each line of the name file has four columns:

::

    0. dest_id STRING an identifier for a destination protein (i.e., a protein that should receive a name)
    1. name STRING the best available name for the destination protein
    2. source_id STRING the id of the blast or hmmer hit used to name this protein
    3. comment STRING the comment field from the line used to name this protein

A snippet from a names.txt from a development run:

::

    7000002454076078   fructose-1-6-bisphosphatase   FIGfam    run on library updated 2009/10/22
    7000002454076081   hypothetical protein          (blank)   (blank)

Note that hypothetical proteins don't have the final two fields, as they did not pick up a name from the given sources.

Etymology
~~~~~~~~~

The etymology file consists of a sequence of entries. Each entry describes the process by which the resulting name was given, showing tracking information as data is discarded and then summary information of how the name was cleaned up (plugs directly into :doc:`cleaner`) before it is presented.

Entries are separated by five equals signs and a newline: ``=====``

Each entry begins with the dest\_id alone on the first line of the block. Convenient for searching!

A snippet of a local run:

::

    7000002454076078
    1 hmmer source found.
    0 hmmer sources were removed due to not meeting the trusted family score.
    One hmmer source had a good name.
    Found an acceptable name in the hmmer sources. The one we liked best came from:
    ./test/Rho_sphaeroides_241_HMMERTRANSCRIPTS_17.pidginh:2013
    This source's name was cleaned up by genepidgin:
    filtered name in 1 step:
    0) original: Fructose-1-6-bisphosphatase
    1)   reason: protein names should not start with a capital letter
        pattern: (?:(?<=similar to )|^)([A-Z])(?=[a-z][a-z]+([ /,-]|$))
       filtered: fructose-1-6-bisphosphatase
    Final name: fructose-1-6-bisphosphatase
    =====
    7000002454076081
    0 hmmer sources found.
    No name was derived from hmmer sources.
    2 blast sources were found.
    0 blast sources were removed by filtering for low coverage (<0.6).
    The highest percent identity of any remaining blast source is 0.992. The lowest is 0.945.
    0 blast sources were removed due to not being within the percent identity window (0.992, 0.942).
    All 2 blast sources had names that filtered to nothing.
    No name was ultimately selected from any of the supplied sources.
    Final name: hypothetical protein

use_custom_blast_cutoffs
~~~~~~~~~~~~~~~~~~~~~~~~

As hardcoded in pidgin.select, widens the cutoff margin for blast hits (currently KEGG-only, check source for details).

.. note:: Please see :doc:`credits` for contributor information.