Edit Distance
In spell checking and gene matching, one needs to find strings in a dictionary that are close matches to a given string. We need to determine an appropriate notion of "closeness". A natural measure of the distance between two strings is the extent to which they can be aligned. We can display this notion of alignment by writing one string on top of the other. For instance if we take the two strings SNOWY and SUNNY and look at possible alignments we have two possible arrangements depicted below.
S _ N O W Y _ S N O W _ Y
S U N N _ Y S U N _ _ _ Y
Cost = 3 Cost = 5
The "_" indicates a gap; and any number of these can be placed in either string. The cost of an alignment is the number of columns in which the letters differ. The minimum cost of changing SNOWY into SUNNY is shown to be 3; and this cost will be referred to as the edit distance between these two strings. Converting SNOWY into SUNNY involves three edit operations on SNOWY: inserting a U after the first S, converting an O into an N, and deleting a W. In our example the cost of each of these edit operation was taken to be 1, however, it is not necessary to treat each of these operations equally. One may assign a higher cost to an insert or a delete operation than to a substitution since inserts and deletes involve moving other letters in the string.
In general, there are many different alignments between two strings and we need to find an efficient way of searching through all of them to find the best one. We will use a dynamic programming approach to solve this problem. Dynamic programming is used in problems that can be formulated in terms of their sub-problems, and where these sub-problems are overlapping -- the same sub-problem has to be solved repeatedly.
What are the sub-problems of the edit distance problem? If we are given two strings x = [x1, x2, …, xm] and y = [y1, y2, …, yn], we will let E[m, n] denote the minimum edit distance between them. If we consider the last two characters xm and yn and try to align them, we have the following three cases: the last two characters will be aligned (and either match or can be changed by substitution at a cost of d(xm, yn) ), a "_" will be placed at the end of string x and it will be shifted left to be matched with the rest of string y, or a "_" will be placed at the end of string y and it will be shifted left to be compared with string x. if we assign a unit cost to each of the three edit operations, we can write
E(m, n) = min(E(m-1, n) + 1, E(m, n-1) + 1, E(m-1, n-1) + diff(xm, yn))
where diff(xm, yn) = 0 if xm = yn, and 1 if xm ¹ yn
The problem of finding the minimum edit distance between two strings of length m and n is now formulated in terms of finding the minimum edit distance for each of the three subproblems.
The dynamic programming solution to this problem is to create a table of size (m + 1) by (n + 1) -- to include the empty sub-strings. The cost of converting an empty string into a string of length i is the cost of doing i inserts, and similarly the cost of converting a string of length j into an empty string is the cost of j deletes. In the example below we are converting EXPONENTIAL into POLYNOMIAL and we begin by constructing the following table.
|
|
|
P |
O |
L |
Y |
N |
O |
M |
I |
A |
L |
|
|
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
|
E |
1 |
|
|
|
|
|
|
|
|
|
|
|
X |
2 |
|
|
|
|
|
|
|
|
|
|
|
P |
3 |
|
|
|
|
|
|
|
|
|
|
|
O |
4 |
|
|
|
|
|
|
|
|
|
|
|
N |
5 |
|
|
|
|
|
|
|
|
|
|
|
E |
6 |
|
|
|
|
|
|
|
|
|
|
|
N |
7 |
|
|
|
|
|
|
|
|
|
|
|
T |
8 |
|
|
|
|
|
|
|
|
|
|
|
I |
9 |
|
|
|
|
|
|
|
|
|
|
|
A |
10 |
|
|
|
|
|
|
|
|
|
|
|
L |
11 |
|
|
|
|
|
|
|
|
|
|
In the fashion of dynamic programming, we will fill in the rest of this table from top to bottom, left to right. To fill in the entry for the cell Ei,j we need only look at the three cells Ei-1,j, Ei, j-1, and Ei-1,j-1.
The body of the algorithm for edit distance becomes:
for (int i = 0; i <= m; i++)
E[i][0] = i;
for (int j = 0; j <= n; j++)
E[0][j] = j;
for (int i = 1; i <= m; i++)
for (int j = 1; j <= n; j++)
E[i][j] = Math.min(E[i-1][j] + 1, E[i][j-1] + 1, E[i-1][j-1] + diff(i, j) );
return E[m][n];
Using this algorithm to complete the table we obtain 6 as the lowest cost to convert EXPONENTIAL into POLYNOMIAL.
|
|
|
P |
O |
L |
Y |
N |
O |
M |
I |
A |
L |
|
|
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
|
E |
1 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
|
X |
2 |
2 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
|
P |
3 |
2 |
3 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
|
O |
4 |
3 |
2 |
3 |
4 |
5 |
5 |
6 |
7 |
8 |
9 |
|
N |
5 |
4 |
3 |
3 |
4 |
4 |
5 |
6 |
7 |
8 |
9 |
|
E |
6 |
5 |
4 |
4 |
4 |
5 |
5 |
6 |
7 |
8 |
9 |
|
N |
7 |
6 |
5 |
5 |
5 |
4 |
5 |
6 |
7 |
8 |
9 |
|
T |
8 |
7 |
6 |
6 |
6 |
5 |
5 |
6 |
7 |
8 |
9 |
|
I |
9 |
8 |
7 |
7 |
7 |
6 |
6 |
6 |
6 |
7 |
8 |
|
A |
10 |
9 |
8 |
8 |
8 |
7 |
7 |
7 |
7 |
6 |
7 |
|
L |
11 |
10 |
9 |
8 |
9 |
8 |
8 |
8 |
8 |
7 |
6 |
To retrieve the minimum cost matching we need to keep a second table, or a second field within each of the cells of a single table, to indicate the direction of the previous cell that was used to determine the value placed in the ijth cell. In the event that neighboring cells have the same value, extend along the diagonal first and to the right as the second tie breaker (as done in the code that follows). To retrieve the minimum cost alignment, start at the bottom-right cell and follow the direction of the arrows to the top left corner (marked 0). The path traversed by the minimum cost alignment in this example is bolded in the table below (and indicated with double diagonals). Everywhere a ¬ appears in the path, a "_" is placed in the string labeling the rows at that position; Everywhere a appears in the path, a "_" is placed at that location in the string that labels the columns of the table; and everywhere a diagonal arrow appears in the path, the letters either match, or a substitution is made in the target string (POLYNOMIAL). The code for the edit distance algorithm now becomes
public static cell[][] editDistance(cell[][] E, String x, String y) {
for (int i = 0; i <= x.length; i++) {
E[i][0].cost = i;
E[i][0].dir = 'h'; //directions will be {'h', 'v', 'd'}
}
for (int j = 0; j <= y.length; j++) {
E[0][j].cost = j;
E[0][j].dir = 'v';
}
for (i = 1; i <= x.length; i++)
for (j = 1; j <= y.length; j++) {
if (x.charAt[i - 1] == y.charAt[j - 1]) {
E[i][j].cost = E[i - 1][j - 1].cost;
E[i][j].dir = 'd';
}
else if (E[i-1][j-1].cost <= E[i-1][j].cost &&
E[i-1]-j-1].cost <= E[i][j-1].cost) {
E[i][j].cost = E[i-1[j-1].cost + 1;
E[i][j].dir = 'd';
}
else if (E[i][j-1].cost <= E[i-1[j].cost) {
E[i][j].cost = E[i][j-1].cost + 1;
E[i][j].dir = 'h';
}
else {
E[i][j].cost = E[i-1][j].cost + 1;
E[i][j].dir = 'v';
}
}
return E;
}
|
|
|
P |
O |
L |
Y |
N |
O |
M |
I |
A |
L |
|
|
0 |
¬ |
¬ |
¬ |
¬ |
¬ |
¬ |
¬ |
¬ |
¬ |
¬ |
|
E |
| |
\ |
¬ |
¬ |
¬ |
¬ |
¬ |
¬ |
¬ |
¬ |
¬ |
|
X |
| |
\ |
\ |
¬ |
¬ |
¬ |
¬ |
¬ |
¬ |
¬ |
¬ |
|
P |
| |
\\ |
\ |
\ |
\ |
\ |
\ |
\ |
\ |
\ |
\ |
|
O |
| |
\ |
\\ |
¬ |
¬ |
¬ |
¬ |
¬ |
¬ |
¬ |
¬ |
|
N |
| |
| |
| |
\\ |
\ |
\ |
¬ |
¬ |
¬ |
¬ |
¬ |
|
E |
| |
| |
| |
\ |
\\ |
\ |
\ |
\ |
\ |
\ |
\ |
|
N |
| |
| |
| |
| |
\ |
\\ |
¬ |
\ |
\ |
\ |
\ |
|
T |
| |
| |
| |
\ |
\ |
| |
¬ |
\\ |
\ |
\ |
\ |
|
I |
| |
| |
| |
\ |
\ |
\ |
\ |
\ |
\\ |
¬ |
¬ |
|
A |
| |
| |
| |
\ |
\ |
\ |
\ |
\ |
\ |
\\ |
¬ |
|
L |
| |
| |
| |
\ |
\ |
| |
\ |
\ |
\ |
| |
\\ |
The resulting best alignment match from this table is:
E
X P O N E N
_ T I
A L
_
_ P O L Y N
O M I
A L
In this example, all of the edit operations of substitute, insert, and delete, were given an equal cost of 1. Other weight could have been assigned to each operation, and in general the assignment of weight could be expressed as a matrix, consisting of (|S| + 1) x (|S| + 1) rows and columns where |S| is the number of characters in the character set from which the strings are formed. If we let xi and xj be any two characters in the set S, then the cost matrix entries would consist of the following weights -- w(xi,xj) if xi == xj; w(xi,xj) if xi != xj; w(xi,_); and w(_,xi).
The above code would need to be slightly modified to retrieve and include the weight value in each of the conditional statements.