Genepidgin compare

Genepidgin compare uses a combination of edit distance and longest-common-substring calculations to estimate the degree of similarity between two or more protein names.


To compare two names, we

  1. decompose each name into tokens,
  2. remove uninformative tokens,
  3. rearrange the tokens in such a way as to... - minimize the edit distance between them, and - maximize the length of common token substrings
  4. report a single number between 0 and 1 (inclusive) summarizing the distance between the two names.

In more detail:

  1. decompose each name into tokens

    First, we split the names up by spaces, remove EC numbers and punctuation and other sorts of extra characters, convert everything to lowercase, etc.

    in: “Ribosomal protein, S23-type” out: “ribosomal” · “protein” · “s23-type”

  2. remove uninformative tokens

    In this step we strike out words that are only useful in a grammatical sense, including an, and, in, is, of, the, etc. We also remove weasel words, such as generic, hypothetical, related, etc. Finally, we remove glue words, such as associated, class, component, protein, system, and type. When these words are stripped we are left with a “core” name that identifies the protein; different namers may use different glue words to format the core name and we ignore those.

    in: “ribosomal” · “protein” · “s23-type” out: “ribosomal” · “s23”

    Because we strip out noninformative tokens, we count all of the following strings as equal.

    • “predicted protein”
    • “putative protein”
    • “hypothetical protein”
    • “conserved hypothetical protein”
  3. rearrange the tokens in such a way as to ...

    Finding the best edit distance between two names of, say, 4 tokens each is a bit tricky, because it’s possible that the lowest cumulative edit distance will involve one or more sub-optimal individual token matches. In fact there are cases where the lowest distance is composed entirely of sub-optimal token pairings. So we need to try a lot of combinations. To do this we precompute two scores for each pair of tokens, and build two n × n matrices to hold them. We then score all possible paths with distinct pairwise token pairings via these matrices. For each path we combine two scores: we try to minimize the normalized edit distance between token pairs, and we try to maximize the length of the longest pairwise common substrings between pairs of tokens.

    In one matrix, we store the pairwise token-token edit distance, using the Damerau-Levenshtein distance, leveraging the excellent Python implementation by Michael Homer. We normalize the edit distance by dividing it by the number of characters in the longer token. The other *n* × *n* matrix holds the length of the longest common substring between each pair of tokens. Our LCS finder is similar to that published on the Wikipedia.

    In the case where the protein names have different numbers of tokens, we build square matrices from the largest dimension, padding the shorter dimension with empty tokens. There also are heuristics to handle cases where a token in one name is composed of two or more tokens in the other. The special handling for these special cases is too detailed for this document; see the source or contact the authors for details.

    Note that token order has no effect on the distance between two names.

  4. report a single number between 0 and 1 (inclusive) summarizing the distance between the two names.

    A perfect token-token match is really good. A lot of perfect matches are really, really good. Long common substrings are fairly good. The Damerau-Levenshtein distance can return higher distances than we might like for these three types of token matches. On the other hand, maximizing the length of the longest common substring(s) has its own set of problems. After a great deal of trial and error, we have settled on the following equation, which has worked well on genome-scale scoring studies across a variety of prokaryotes.

    "Genepidgin" distance =
        SUM(per-token normalized edit distance) *
        (1 - (SUM(per-token LCS length) / LENGTH(longer name))) *
        (1 / COUNT(compared tokens))

    The first line of this distance metric weights each pair of tokens equally. Thus a “SecG” · “SecG” match counts just as much as a “phosphoribosylglycinamide” · “phosphoribosylglycinamide” match.

    The second line of the metric weights each character equally, thereby lowering the distances between long tokens that differ only slightly, for example

    2,3,4,5-tetrahydropyridine-2,6-dicarboxylate 2,3,4,5-tetrahydropyridine-2-carboxylate

    The third line of the distance metric above simply normalizes the score from 0 to 1. A distance of 0 indicates the names have identical information content and are essentially equivalent. A distance of 1 indicates the names have nothing in common.

How to use Genepidgin compare

Given at least two input files, one reference and one or more queries, score the distance (using genepidgin.distance.DistanceTool()) between the names found in the files.

genepidgin compare (options) <reference_file> <query_file> [<query_file2> ...]

  --help: this information

All input files must be in the Simple Name File Format.

This tool will create one output file per query file. The per-query output file(s) will have name(s) of the form <query_file>.compared.

If there are multiple query files, a summary file containing the closest query match for each reference name will also be created. The summary file will be named <reference_file>.summary.

Each line in the two-way comparison result will consist of the following tab-separated fields:

0.  ID. This is the string from the first field of the entry from the reference file.
1.  Score. The distance between the two names.
2.  Reference name. The reference name used for the comparison.
3.  Query name. The query name used for the comparison.

If a summary file is generated, each line in that file will consist of the following tab-separated fields:

0.  ID. This is the string from the first field of the entry from the reference file.
1.  Score. The distance between the two names.
2.  Reference name. The reference name used for the comparison.
3.  Best query name. The best matching query name.  In cases where multiple query names scored identically, the first name with that score will appear here. (This will typically only happen for completely dissimilar names)
4.  Best query source. The basename of the file which held the best query name. (ex: query_file1) In cases where multiple query names scored identically, multiple basenames will be present in this column, separated by semicolons. (ex: query_file1;query_file2)

Results are presented in the same order as in the input reference file. Names in query files that correspond to an ID not present in the reference file will be ignored. Names in the reference file with no corresponding query are scored as a complete miss (1.0). Input query and reference files may reside in any directory, but no two files may have the same basename.

Genepidgin compare Score Range

The distribution in accuracy is not linear between 0.0 and 1.0; that is, after a certain level of dissimilarity it doesn’t matter how much more dissimilar two names are.

The following table presents a quick guide to the interpretation of distance scores.

score likelihood of functional match
=0.0 functionally identical
0.0 - 0.1 excellent match
0.1 - 0.3 good match
0.3 - 0.5 possibly similar, with potentially significant distances
0.5 - 1.0 not generally useful
=1.0 completely different

There is support for using the output of Genepidgin compare directly within Python; consult genepidgin/ for details.


Please see Credits for contributor information.

Project Versions

Table Of Contents

Previous topic

Genepidgin cleaner

Next topic

Genepidgin select

This Page