Genepidgin *cleaner*
====================

**Note**: this logic in particular is from a different era of computing, and
you'd almost certainly be better off with a homology-derived name against a
tightly governed protein library than a loose alignment against a less
controlled one. Use GO.

Genepidgin *cleaner* standardizes the format of gene product names derived
from diverse databases, including FIGfam, KEGG, Pfam, RefSeq, SwissProt and
TIGRFAM. It's the product of many years of production genome annotation.

This software package consists of a large collection of heuristics,
formatting rules and regular expressions which are designed to take a name
from any of Genepidgin's supported databases and present it in a common
style.

Though our regexp library is large, it is not infinite; thus, Genepidgin
*cleaner* cannot detect every possible name error. However, the vast majority
of source names end up better and more informative for having gone through
Genepidgin *cleaner*.

Goals
-----

- Names should agree with the prevailing conventions in cases where such
  conventions can be easily identified and agreed upon.
- Names should be as clear and concise as possible.
- Names should not be descriptive phrases that define function (for example,
  "protein involved in folding" is not useful, but "chaperonin" is).
- Names should not include programmatic references.
- Names should be derived from high-confidence alignments to homologous
  proteins. Names generated by Genepidgin, once deposited in public
  databases, may themselves be used as a basis to name other genes
  transitively. To prevent the propagation of incorrect product names, only
  high-confidence alignments should be used for naming.
- Prefer no name or an obviously generic name to an uninformative name.
- Prefer lowercase words in everything but acronyms and proper names.
- Prefer a standardized expression of common protein names.
- Prefer American English spelling.
- Use only 7-bit ASCII characters, so that names render correctly on every
  computing platform.

Steps in Filtering Process
--------------------------

The following list is a rough description of the steps involved in processing
a name. This list is not a literal description of the layout of the code, but
rather a high-level overview of how Genepidgin *cleaner* works.

whole name filtering and deletion
    Sometimes, names are published into the global protein namespace that are
    obviously the output of a malformed SQL query or an accidentally copied
    Excel spreadsheet. We process these before doing anything else,
    extracting useful information when possible.

typo correction
    People misspell (for example) *hypothetical* and *transporter* in many,
    many ways. Correcting these names early prevents later filters from
    missing human-obvious corrections.

uninformative clause removal
    Subclauses that are globally uninformative are removed. For example,
    documented proteins should not have their functions described within
    their names, so phrases like "X involved with Y" simply become "X".

clause replacement
    The largest transformations happen here, where names are changed to
    become more consistent. For example, the phrase "transport family
    protein" becomes "transporter".

organism names
    The vast majority of the time, specific organism names are not
    informative when copied across species by homology or alignment. We
    remove them.

id removal
    Many published genes have obvious database ids. Genepidgin does not
    transitively assign these to new gene annotations.

punctuation cleanup
    Removing ids and other phrases often leaves bad punctuation and/or
    leftover parentheses, which must then be removed themselves.

standardize format
    The grammatical structures of product names are improved late in the
    cleaner process. This category assumes that by this point, the name is a
    keeper, and simply reformats it for consistent presentation.

final sanity check
    If, after filtering, the entire name is otherwise uninformative, such as
    "CDS" or "small secreted protein", then the name is misleading and will
    be dropped.

capitalization
    Finally, Genepidgin tries to establish consistent capitalization: only
    proper names and acronyms are capitalized.

How to Use Genepidgin *cleaner*
-------------------------------

All files used as input and output are in the
`Simple Name File Format <#simple>`_.

via the command-line
~~~~~~~~~~~~~~~~~~~~

- **``cleaner``** takes a name and applies the full list of filters to it. A
  name can be filtered to an empty string by this function; the output of the
  command will tell you why. Names that are filtered to nothing are ones
  Genepidgin considers to be uninformative.

::

    $ genepidgin cleaner

Setting the ``-d`` flag indicates that genepidgin should return a default
name (``"hypothetical protein"``) when a name would otherwise be blank.

usage doc
^^^^^^^^^

::

    $ genepidgin cleaner -h
    usage: cmdline.py cleaner [-h] [--silent] [--default] input output

    positional arguments:
      input          filename with names to clean
      output         output file

    optional arguments:
      -h, --help     show this help message and exit
      --silent, -s   display etymology to stdout during compute
      --default, -d  return default name "hypothetical project" when names
                     filter to nothing, else return emptry string

via Python
~~~~~~~~~~

From inside your Python shell, let's set up your first test case.

::

    >>> import pidgin.cleaner
    >>> bname = pidgin.cleaner.BioName()
    >>> name = "BT002689 glycine/betaine/L-proline ABC transport protein, periplasmic-binding protein [Desulfovibrio desulfuricans subsp. desulfuricans str. G20]"

Instantiating ``BioName`` compiles a couple hundred regular expressions, so
creating a new ``BioName`` object for every name to be changed can get
expensive. A single ``BioName`` object can reformat any number of names, so
callers need only instantiate the class once.

Under the hood, ``cleaner`` calls either ``filter`` or ``cleanup``.
When everyone had different default names, this distinction was more
meaningful, but now everyone follows UniProt's *hypothetical protein*
standard.

The example name above contains a great deal of spurious and unreliable
information. Passing it through ``filter`` cleans it up:

::

    >>> cleaned = bname.filter(name)
    >>> print(cleaned)
    "glycine/betaine/L-proline ABC transporter"

To see what happened during the filter process, we set ``getOutput`` to true
when we call ``filter``. Note the additional returned value.

::

    >>> (cleaned, process_string) = bname.filter(name, getOutput=1)
    >>> print(cleaned)
    "glycine/betaine/L-proline ABC transporter"
    >>> print(process_string)
    filtered name in 5 steps:
    0) original: BT002689 glycine/betaine/L-proline ABC transport protein, periplasmic-binding protein [Desulfovibrio desulfuricans subsp. desulfuricans str. G20]
    1) reason: transport protein -> transporter
       pattern: \btransport(er)?\s+protein\b
       filtered: BT002689 glycine/betaine/L-proline ABC transporter, periplasmic-binding protein [Desulfovibrio desulfuricans subsp. desulfuricans str. G20]
    2) reason: id
       pattern: \b[A-Za-z0-9]+\d{4,}(?