
Finding interesting anagrams

Problem Detail: 

Say that $a_1a_2\ldots a_n$ and $b_1b_2\ldots b_n$ are two strings of the same length. An anagramming of two strings is a bijective mapping $p:[1\ldots n]\to[1\ldots n]$ such that $a_i = b_{p(i)}$ for each $i$.

There might be more than one anagramming for the same pair of strings. For example, if $a=$ `abcab` and $b=$ `cabab`, we have $p_1: [1,2,3,4,5]\to[4,5,1,2,3]$ and $p_2: [1,2,3,4,5] \to [2,5,1,4,3]$, among others.

We'll say that the weight $w(p)$ of an anagramming $p$ is the number of cuts one must make in the first string to get chunks that can be rearranged to obtain the second string. Formally, this is the number of values of $i\in[1\ldots n-1]$ for which $p(i)+1\ne p(i+1)$. That is, it is the number of points at which $p$ does not increase by exactly 1. For example, $w(p_1) = 1$ and $w(p_2) = 4$, because $p_1$ cuts 12345 once, into the chunks 123 and 45, and $p_2$ cuts 12345 four times, into five chunks.
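The weight of a given anagramming is straightforward to compute directly from this definition. A minimal sketch (the function name `weight` and the 0-indexed list representation are my own conventions, not from the question):

```python
def weight(p):
    """Weight of an anagramming p, given as a 0-indexed list where
    p[i] is the position in b that position i of a maps to.
    Counts the indices at which p does not increase by exactly 1."""
    return sum(1 for i in range(len(p) - 1) if p[i] + 1 != p[i + 1])

# The two example anagrammings of a = "abcab", b = "cabab",
# converted from the 1-indexed notation above:
p1 = [3, 4, 0, 1, 2]   # (4,5,1,2,3) in 1-indexed form
p2 = [1, 4, 0, 3, 2]   # (2,5,1,4,3) in 1-indexed form
print(weight(p1))  # → 1
print(weight(p2))  # → 4
```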

Suppose there exists an anagramming for two strings $a$ and $b$. Then at least one anagramming must have least weight. Let's say this one is lightest. (There might be multiple lightest anagrammings; I don't care because I am interested only in the weights.)

Question

I want an algorithm which, given two strings for which an anagramming exists, efficiently yields the exact weight of the lightest anagramming of the two strings. It is all right if the algorithm also yields a lightest anagramming, but it need not.

It is a fairly simple matter to generate all anagrammings and weigh them, but there may be many, so I would prefer a method that finds light anagrammings directly.
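For reference, the generate-and-weigh approach mentioned above can be sketched as follows. This is my own illustrative brute force, exponential in the string length, so it only makes the slowness concrete rather than solving the stated problem:

```python
from itertools import permutations

def lightest_weight(a, b):
    """Brute force: try every permutation of positions, keep those
    that are valid anagrammings of a onto b, and return the minimum
    weight found. Returns None if no anagramming exists.
    Exponential time; for illustration only."""
    n = len(a)
    if n != len(b) or sorted(a) != sorted(b):
        return None
    best = None
    for p in permutations(range(n)):
        # p is a valid anagramming iff a[i] == b[p[i]] for every i
        if all(a[i] == b[p[i]] for i in range(n)):
            w = sum(1 for i in range(n - 1) if p[i] + 1 != p[i + 1])
            if best is None or w < best:
                best = w
    return best

print(lightest_weight("abcab", "cabab"))  # → 1
```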


Motivation

The reason this problem is of interest is as follows. It is very easy to make the computer search the dictionary and find anagrams, pairs of words that contain exactly the same letters. But many of the anagrams produced are uninteresting. For instance, the longest examples to be found in Webster's Second International Dictionary are:

cholecystoduodenostomy
duodenocholecystostomy

The problem should be clear: these are uninteresting because they admit a very light anagramming that simply exchanges the cholecysto, duodeno, and stomy sections, for a weight of 2. On the other hand, this much shorter example is much more surprising and interesting:

coastline
sectional

Here the lightest anagramming has weight 8.

I have a program that uses this method to locate interesting anagrams, namely those for which all anagrammings are of high weight. But it does this by generating and weighing all possible anagrammings, which is slow.

Asked By : Mark Dominus

Answered By : Tsuyoshi Ito

This problem is known as the "minimum common string partition problem." (More precisely, the answer in the minimum common string partition problem equals the answer in your problem plus 1.) Unfortunately, it is NP-hard, even with the restriction that each letter occurs at most twice in each of the input strings, as is proved by Goldstein, Kolman, and Zheng [GKZ05]. This means that no polynomial-time algorithm exists unless P=NP. (Of course, if each letter occurs at most once, then the problem is trivial because there is only one anagramming.)

On the positive side, the same authors [GKZ05] give a polynomial-time 1.1037-approximation algorithm under the same restriction. (A "1.1037-approximation algorithm" means an algorithm which may not output the correct answer A but is guaranteed to output a value B such that A ≤ B ≤ 1.1037A.) They also give a linear-time 4-approximation algorithm under the weaker restriction that each letter occurs at most three times in each of the input strings.

[GKZ05] Avraham Goldstein, Petr Kolman, and Jie Zheng. Minimum common string partition problem: Hardness and approximations. Electronic Journal of Combinatorics, 12, article R50, 2005. http://www.combinatorics.org/ojs/index.php/eljc/article/view/v12i1r50


Question Source : http://cs.stackexchange.com/questions/2259
