World's most popular travel blog for travel bloggers.

# Length of an Edit Script and the Number of Deletions

, ,
Problem Detail:

I am reading the 1989 paper called "An O(NP) Sequence Comparison Algorithm" by Wu, Manber, Myers, and Miller. The algorithm sounds like a good fit for a project I'm doing at work. I have found some implementations in my target language that I could reuse, but I want to make sure that I understand the algorithm (and the code) because it is so crucial to my project.

I am a ways into the paper, and seem to be understanding it, but there's something important in the abstract that still doesn't make sense to me. This is the relevant portion:

Let $A$ and $B$ be two sequences of length $M$ and $N$ respectively, where without loss of generality $N \ge M$, and let $D$ be the length of a shortest edit script between them. A parameter related to $D$ is the number of deletions in such a script, $P = D/2 - (N - M)/2$. We present an algorithm...

This relationship between $P$, $D$, $N$, and $M$ is not proven in the paper. And when a try using it with a simple example (turning "xy" into "x", for instance), I get a nonsensical answer. Can someone please explain the relationship?

This is a straightforward consequence of the fact that the edit script only contains insert and delete operations. It follows from a little bit of basic arithmetic -- it's nothing especially deep.

Consider a script that contains $I$ insert operations and $P$ delete operations. Then the total number of operations in the edit script is $I+P$. If we're given that the total number of operations is $D$, we know $D=I+P$. Also we know $M+I-P=N$ (since each insert operation increases the length of the document, and each delete decreases it). Solving these two simultaneous equations yields the relationship you list.