Genepidgin cleaner

Note: this logic in particular dates from a different era of computing. Today you’d almost certainly be better off deriving a name by homology against a tightly governed protein library than by loose alignment against a less controlled one. Use GO.

Genepidgin cleaner standardizes the format of gene product names derived from diverse databases, including FIGfam, KEGG, Pfam, RefSeq, SwissProt and TIGRFAM. It’s the product of many years of production genome annotation.

This software package consists of a large collection of heuristics, formatting rules and regular expressions which are designed to take a name from any of Genepidgin’s supported databases and present it in a common style. Though our regexp library is large, it is not infinite; thus, Genepidgin cleaner cannot detect every possible name error. However, the vast majority of source names end up better and more informative for having gone through Genepidgin cleaner.

Goals

  • Names should agree with the prevailing conventions in cases where such conventions can be easily identified and agreed upon.
  • Names should be as clear and concise as possible.
  • Names should not be descriptive phrases that define function (for example, “protein involved in folding” is not useful, but “chaperonin” is).
  • Names should not include programmatic references.
  • Names should be derived from high-confidence alignments to homologous proteins. Names generated by Genepidgin, once deposited in public databases, may themselves be used as a basis to name other genes transitively. To prevent the propagation of incorrect product names, only high-confidence alignments should be used for naming.
  • Prefer no name or an obviously generic name to an uninformative name.
  • Prefer lowercase words in everything but acronyms and proper names.
  • Prefer a standardized expression of common protein names.
  • Prefer American English spelling.
  • Use only 7-bit ASCII characters, so that names render correctly on every computing platform.

Steps in Filtering Process

The following list is a rough description of the steps involved in processing a name. This list is not a literal description of the layout of the code, but rather a high-level overview of how Genepidgin cleaner works.

whole name filtering and deletion
Sometimes, names are published into the global protein namespace that are obviously the output of a malformed SQL query or accidentally copied Excel spreadsheet. We process these before doing anything else, extracting useful information when possible.
typo correction
People misspell (for example) hypothetical and transporter in many, many ways. Correcting these names early prevents later filters from missing human-obvious corrections.
uninformative clause removal
Subclauses that are globally uninformative are removed. For example, documented proteins should not have their functions described within their names, so phrases like “X involved with Y” simply become “X”.
clause replacement
The largest transformations happen here, where names are changed to become more consistent. For example, the phrase “transport family protein” becomes “transporter”.
organism names
The vast majority of the time, specific organism names are not informative when copied across species by homology or alignment. We remove them.
id removal
Many published genes have obvious database ids. Genepidgin does not transitively assign these to new gene annotations.
punctuation cleanup
Removing ids and other phrases often leaves bad punctuation and/or leftover parentheses, which must themselves be removed.
standardize format
The grammatical structures of product names are improved late in the cleaner process. This category assumes that by this point, the name is a keeper, and simply reformats it for consistent presentation.
final sanity check
If, after filtering, the entire name is still uninformative, such as “CDS” or “small secreted protein”, then the name is misleading and is dropped.
capitalization
Finally, Genepidgin tries to establish consistent capitalization: only proper names and acronyms are capitalized.
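
As a rough illustration of how ordered stages like these can be chained, here is a minimal Python sketch. It is not Genepidgin's code: the rule tables (TYPOS, STAGES, UNINFORMATIVE) and the clean function are hypothetical, built only from the examples mentioned in the list above, and the real regexp library is far larger.

```python
import re

# All rules below are made up for illustration; they are NOT Genepidgin's rules.
TYPOS = {"hypotheical": "hypothetical", "tranporter": "transporter"}

# (stage label, compiled pattern, replacement), applied in this order.
STAGES = [
    ("uninformative clause removal",
     re.compile(r"\s+involved\s+(?:with|in)\s+.*$"), ""),
    ("clause replacement",
     re.compile(r"\btransport\s+family\s+protein\b"), "transporter"),
    ("punctuation cleanup",
     re.compile(r"[\s,;-]+$"), ""),
]

# Names that carry no information once everything else has been stripped.
UNINFORMATIVE = {"cds", "small secreted protein"}


def fix_typos(name):
    """Replace known misspellings word by word (early, so later rules match)."""
    return " ".join(TYPOS.get(word.lower(), word) for word in name.split())


def clean(name):
    """Run the stages in order; return "" if the result is uninformative."""
    name = fix_typos(name)
    for _label, pattern, replacement in STAGES:
        name = pattern.sub(replacement, name)
    name = name.strip()
    if name.lower() in UNINFORMATIVE:
        return ""  # final sanity check: drop the name entirely
    return name


print(clean("chaperonin involved with folding"))
print(clean("sugar transport family protein"))
print(repr(clean("CDS")))
```

Running typo correction first matters in this sketch for the same reason given above: a misspelled "tranporter" would otherwise slip past any later rule that looks for "transporter".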

How to Use Genepidgin cleaner

All files used as input and output are in the Simple Name File Format.

via the command-line

  • cleaner takes a name and applies the full list of filters to it. A name can be filtered to an empty string by this function; the output of the command will tell you why. Names that are filtered to nothing are ones Genepidgin considers to be uninformative.
$ genepidgin cleaner <inputfile> <outputfile>

Setting the -d flag makes genepidgin return a default name ("hypothetical protein") when a name would otherwise be blank.

usage doc

$ genepidgin cleaner -h

usage: cmdline.py cleaner [-h] [--silent] [--default] input output

positional arguments:
  input          filename with names to clean
  output         output file

optional arguments:
  -h, --help     show this help message and exit
  --silent, -s   display etymology to stdout during compute
  --default, -d  return default name "hypothetical project" when names filter
                 to nothing, else return emptry string

via Python

From inside your Python shell, let’s set up your first test case.

>>> import pidgin.cleaner
>>> bname = pidgin.cleaner.BioName()
>>> name = "BT002689 glycine/betaine/L-proline ABC transport protein, periplasmic-binding protein [Desulfovibrio desulfuricans subsp. desulfuricans str. G20]"

Instantiating BioName compiles a couple hundred regular expressions, so creating a new BioName object for every name to be changed can get expensive. A single BioName object can reformat any number of names, so callers need only instantiate the class once.
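
The compile-once pattern can be sketched in isolation. The TinyCleaner class and its two rules below are hypothetical stand-ins, not the real BioName; with the real class, the equivalent would be to create bname once and call bname.filter on each name in a loop.

```python
import re


class TinyCleaner:
    """Hypothetical stand-in for a cleaner class; NOT the real BioName."""

    # Made-up rules: strip a trailing "[organism]" tag, then leading spaces.
    RULES = [
        (r"\s*\[[^\]]*\]\s*$", ""),
        (r"^\s+", ""),
    ]

    def __init__(self):
        # The regular expressions are compiled once, here,
        # rather than on every call to filter().
        self._compiled = [(re.compile(p), r) for p, r in self.RULES]

    def filter(self, name):
        for pattern, replacement in self._compiled:
            name = pattern.sub(replacement, name)
        return name


cleaner = TinyCleaner()  # instantiate once...
names = ["  kinase [Escherichia coli]", "chaperonin"]
cleaned = [cleaner.filter(n) for n in names]  # ...reuse for every name
print(cleaned)
```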

Under the hood, cleaner calls either filter or cleanup. When everyone had different default names, this distinction was more meaningful, but now everyone follows UniProt’s hypothetical protein standard.

This name contains a great deal of spurious and unreliable information. A quick clean of this name...

>>> cleaned = bname.filter(name)
>>> print cleaned
"glycine/betaine/L-proline ABC transporter"

To see what happened during the filter process, we set getOutput to true when we call filter. Note the additional returned value.

>>> (cleaned, process_string) = bname.filter(name, getOutput=1)
>>> print cleaned
"glycine/betaine/L-proline ABC transporter"
>>> print process_string
filtered name in 5 steps:
0) original: BT002689 glycine/betaine/L-proline ABC transport protein, periplasmic-binding protein [Desulfovibrio desulfuricans subsp. desulfuricans str. G20]
1)   reason: transport protein -> transporter
    pattern: \btransport(er)?\s+protein\b
   filtered: BT002689 glycine/betaine/L-proline ABC transporter, periplasmic-binding protein [Desulfovibrio desulfuricans subsp. desulfuricans str. G20]
2)   reason: id
    pattern: \b[A-Za-z0-9]+\d{4,}(?<!\b(?:DUF|UPF)\d{4})\b(?!\s*(kD(a)?|-like|family|protein\s+family))
   filtered:  glycine/betaine/L-proline ABC transporter, periplasmic-binding protein [Desulfovibrio desulfuricans subsp. desulfuricans str. G20]
3)   reason: delete spaces at beginning of name
    pattern: ^\s+
   filtered: glycine/betaine/L-proline ABC transporter, periplasmic-binding protein [Desulfovibrio desulfuricans subsp. desulfuricans str. G20]
4)   reason: delete closing brackets at end of name
    pattern: (?:\[[^]]*)\]\s*$
   filtered: glycine/betaine/L-proline ABC transporter, periplasmic-binding protein
5)   reason: delete notes after commas, dashes, semicolon--except when followed by family or superfamily
    pattern: [-,;]\s+(?!family)(?!superfamily).*
   filtered: glycine/betaine/L-proline ABC transporter

(Note that process_string is a single multiline string, which looks good when printed but bad when simply exported.)
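
The trace above can be approximated with nothing but the standard re module. The sketch below copies the five patterns verbatim from the trace; the replacement strings are inferred from the "filtered" lines, and the RULES table and filter_with_trace function are hypothetical names, not part of Genepidgin. It covers only these five rules, whereas the real cleaner applies many more.

```python
import re

# (reason, pattern, replacement). Patterns are copied from the trace above;
# replacements are inferred from its "filtered" lines (an assumption).
RULES = [
    ("transport protein -> transporter",
     r"\btransport(er)?\s+protein\b", "transporter"),
    ("id",
     r"\b[A-Za-z0-9]+\d{4,}(?<!\b(?:DUF|UPF)\d{4})\b"
     r"(?!\s*(kD(a)?|-like|family|protein\s+family))", ""),
    ("delete spaces at beginning of name", r"^\s+", ""),
    ("delete closing brackets at end of name", r"(?:\[[^]]*)\]\s*$", ""),
    ("delete notes after commas, dashes, semicolon",
     r"[-,;]\s+(?!family)(?!superfamily).*", ""),
]


def filter_with_trace(name):
    """Apply RULES in order; record a step each time the name changes."""
    steps = ["0) original: " + name]
    for reason, pattern, replacement in RULES:
        new_name = re.sub(pattern, replacement, name)
        if new_name != name:
            steps.append(
                f"{len(steps)})   reason: {reason}\n"
                f"    pattern: {pattern}\n"
                f"   filtered: {new_name}"
            )
            name = new_name
    header = f"filtered name in {len(steps) - 1} steps:"
    return name, "\n".join([header] + steps)


name = ("BT002689 glycine/betaine/L-proline ABC transport protein, "
        "periplasmic-binding protein "
        "[Desulfovibrio desulfuricans subsp. desulfuricans str. G20]")
cleaned, process_string = filter_with_trace(name)
print(cleaned)
print(process_string)
```

Recording a step only when a rule actually changes the name keeps the trace short: rules that do not match leave no entry.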

Refer to the documentation in the code for more information on parameters. It’s fairly well commented, if not clear.

Note

Please see Credits for contributor information.
