Note: this logic comes from a different era of computing. You would almost certainly be better off with a homology-derived name against a tightly governed protein library than with a loose alignment against a less controlled one. Use GO.
Genepidgin cleaner standardizes the format of gene product names derived from diverse databases, including FIGfam, KEGG, Pfam, RefSeq, SwissProt and TIGRFAM. It’s the product of many years of production genome annotation.
This software package consists of a large collection of heuristics, formatting rules and regular expressions which are designed to take a name from any of Genepidgin’s supported databases and present it in a common style. Though our regexp library is large, it is not infinite; thus, Genepidgin cleaner cannot detect every possible name error. However, the vast majority of source names end up better and more informative for having gone through Genepidgin cleaner.
The following list is a rough description of the steps involved in processing a name. This list is not a literal description of the layout of the code, but rather a high-level overview of how Genepidgin cleaner works.
All files used as input and output are in the Simple Name File Format.
$ genepidgin cleaner <inputfile> <outputfile>
Setting the -d flag indicates that genepidgin should return a default name ("hypothetical protein") when a name would otherwise be blank.
$ genepidgin cleaner -h
usage: cmdline.py cleaner [-h] [--silent] [--default] input output
positional arguments:
input filename with names to clean
output output file
optional arguments:
-h, --help show this help message and exit
--silent, -s display etymology to stdout during compute
--default, -d return default name "hypothetical protein" when names filter
to nothing, else return empty string
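The effect of the -d flag can be pictured with a few lines of Python. This is only a sketch of the behavior described above, not Genepidgin's code: the function name `finalize` and its `use_default` parameter are hypothetical.

```python
# Hypothetical helper illustrating the -d behavior; not part of Genepidgin.
DEFAULT_NAME = "hypothetical protein"


def finalize(cleaned, use_default=False):
    """Return the cleaned name, falling back when it filtered to nothing."""
    cleaned = cleaned.strip()
    if cleaned:
        return cleaned
    # With the default enabled, a blank name becomes "hypothetical protein";
    # otherwise the caller receives an empty string.
    return DEFAULT_NAME if use_default else ""


print(finalize("", use_default=True))
print(finalize("sugar transporter"))
```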
From inside your python shell, let’s set up your first test case.
>>> import pidgin.cleaner
>>> bname = pidgin.cleaner.BioName()
>>> name = "BT002689 glycine/betaine/L-proline ABC transport protein, periplasmic-binding protein [Desulfovibrio desulfuricans subsp. desulfuricans str. G20]"
Instantiating BioName compiles a couple hundred regular expressions, so creating a new BioName object for every name to be changed can get expensive. A single BioName object can reformat any number of names, so callers need only instantiate the class once.
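The instantiate-once pattern can be shown without Genepidgin installed. The sketch below uses a hypothetical stand-in class, `TinyCleaner`, with two illustrative patterns. It is not BioName's implementation and only mirrors the idea that the constructor does the expensive compilation.

```python
import re


class TinyCleaner(object):
    """Hypothetical stand-in for BioName; not Genepidgin's implementation."""

    # Two illustrative rules; the real library compiles a couple hundred.
    PATTERNS = [
        (r"\btransport(er)?\s+protein\b", "transporter"),
        (r"^\s+", ""),
    ]

    def __init__(self):
        # The expensive part: compile every pattern once, at construction.
        self.rules = [(re.compile(p), repl) for p, repl in self.PATTERNS]

    def filter(self, name):
        # Apply the precompiled rules in order.
        for regex, repl in self.rules:
            name = regex.sub(repl, name)
        return name


cleaner = TinyCleaner()  # instantiate once
names = ["  ABC transport protein", "sugar transporter protein"]
cleaned_names = [cleaner.filter(n) for n in names]  # reuse for every name
print(cleaned_names)
```

Creating the object inside the loop would give the same results but would repeat the compilation for every name.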
Under the hood, cleaner calls either filter or cleanup. When everyone had different default names, this distinction was more meaningful, but now everyone follows UniProt’s hypothetical protein standard.
This name contains a great deal of spurious and unreliable information. A quick cleaning of this name...
>>> cleaned = bname.filter(name)
>>> print cleaned
"glycine/betaine/L-proline ABC transporter"
To see what happened during the filter process, we set getOutput to true when we call filter. Note the additional returned value.
>>> (cleaned, process_string) = bname.filter(name, getOutput=1)
>>> print cleaned
"glycine/betaine/L-proline ABC transporter"
>>> print process_string
filtered name in 5 steps:
0) original: BT002689 glycine/betaine/L-proline ABC transport protein, periplasmic-binding protein [Desulfovibrio desulfuricans subsp. desulfuricans str. G20]
1) reason: transport protein -> transporter
pattern: \btransport(er)?\s+protein\b
filtered: BT002689 glycine/betaine/L-proline ABC transporter, periplasmic-binding protein [Desulfovibrio desulfuricans subsp. desulfuricans str. G20]
2) reason: id
pattern: \b[A-Za-z0-9]+\d{4,}(?<!\b(?:DUF|UPF)\d{4})\b(?!\s*(kD(a)?|-like|family|protein\s+family))
filtered: glycine/betaine/L-proline ABC transporter, periplasmic-binding protein [Desulfovibrio desulfuricans subsp. desulfuricans str. G20]
3) reason: delete spaces at beginning of name
pattern: ^\s+
filtered: glycine/betaine/L-proline ABC transporter, periplasmic-binding protein [Desulfovibrio desulfuricans subsp. desulfuricans str. G20]
4) reason: delete closing brackets at end of name
pattern: (?:\[[^]]*)\]\s*$
filtered: glycine/betaine/L-proline ABC transporter, periplasmic-binding protein
5) reason: delete notes after commas, dashes, semicolon--except when followed by family or superfamily
pattern: [-,;]\s+(?!family)(?!superfamily).*
filtered: glycine/betaine/L-proline ABC transporter
(Note that process_string is a single multiline string, which looks good when printed but bad when simply exported.)
Refer to the documentation in the code for more information on parameters. The code is fairly well commented, if not always clear.
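The trace above can be replayed with nothing but Python's `re` module. The following is a minimal sketch, not Genepidgin's code: the reasons and patterns are copied from the trace, while the replacement strings, the `RULES` list, and the `apply_rules` function are assumptions inferred from the before/after text of each step.

```python
import re

# (reason, pattern, replacement) triples. Reasons and patterns come from the
# trace; the replacement strings are assumptions inferred from each step.
RULES = [
    ("transport protein -> transporter",
     r"\btransport(er)?\s+protein\b", "transporter"),
    ("id",
     r"\b[A-Za-z0-9]+\d{4,}(?<!\b(?:DUF|UPF)\d{4})\b"
     r"(?!\s*(kD(a)?|-like|family|protein\s+family))", ""),
    ("delete spaces at beginning of name", r"^\s+", ""),
    ("delete closing brackets at end of name", r"(?:\[[^]]*)\]\s*$", ""),
    ("delete notes after commas, dashes, semicolon",
     r"[-,;]\s+(?!family)(?!superfamily).*", ""),
]

# Compile once so the rules can be reused for many names.
COMPILED = [(reason, re.compile(pattern), repl)
            for reason, pattern, repl in RULES]


def apply_rules(name):
    """Apply each rule in order; return the final name and the changing steps."""
    steps = []
    for reason, regex, repl in COMPILED:
        new_name = regex.sub(repl, name)
        if new_name != name:
            # Record only the rules that actually altered the name.
            steps.append((reason, new_name))
            name = new_name
    return name, steps


name = ("BT002689 glycine/betaine/L-proline ABC transport protein, "
        "periplasmic-binding protein "
        "[Desulfovibrio desulfuricans subsp. desulfuricans str. G20]")
cleaned, steps = apply_rules(name)
print(cleaned)
for number, (reason, result) in enumerate(steps, 1):
    print("%d) %s: %s" % (number, reason, result))
```

For this one input the sketch ends at the same name as the trace, glycine/betaine/L-proline ABC transporter. It covers only these five rules, so it should not be expected to match Genepidgin on other names.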
Note
Please see Credits for contributor information.