16.2.3. Closest String

The final example we consider before moving on to more advanced branching rules for vertex cover, independent set, and cluster editing is the closest string problem. This problem is interesting because the number of calls made by an invocation of the algorithm discussed here is not constant: It depends on the parameter \(k\).

Given two strings \(s_1 = x_{1,1} x_{1,2} \cdots x_{1,n}\) and \(s_2 = x_{2,1} x_{2,2} \cdots x_{2,n}\) of length \(n\), the Hamming distance \(d(s_1, s_2)\) between \(s_1\) and \(s_2\) is the number of positions where \(s_1\) and \(s_2\) differ:

\[d(s_1, s_2) = |\{ 1 \le i \le n \mid x_{1,i} \ne x_{2,i} \}|.\]

Closest String Problem: Given a set \(S = \{s_1, \ldots, s_t\}\) of strings of length \(n\), find a string \(s\) that minimizes

\[d_{\max}(s, S) = \max_{1 \le i \le t} d(s, s_i).\]

As a parameterized version, given an instance \((S, k)\), we want to decide whether there exists a string \(s\) such that \(d_{\max}(s, S) \le k\). This problem has applications in bioinformatics, where it is also often referred to as the consensus string problem. Thus, we refer to \(s\) as the consensus string of \(s_1, \ldots, s_t\) from here on.

While the branching algorithm we develop for this problem is still fairly elementary, the analysis required to obtain an effective branching rule is more complicated than for the problems we studied so far. If there exists a consensus string \(s\) with \(d_{\max}(s, S) \le k\), then, by definition, \(d(s, s_1) \le k\). Thus, \(s\) can be obtained by replacing at most \(k\) characters in \(s_1\). We replace the input instance \((S, k)\) with an instance \((s_1, S_2, k, k)\), where \(S_2 = \{s_2, \ldots, s_t\}\). An instance \((s_1, S_2, k_1, k_2)\) is a yes-instance if and only if there exists a string \(s\) such that \(d(s, s_1) \le k_1\) and \(d(s, s_i) \le k_2\) for all \(s_i \in S_2\). In particular, \((s_1, S_2, k, k)\) is a yes-instance if and only if \((S, k)\) is a yes-instance. We use the following algorithm CS to decide whether a given instance \((s_1, S_2, k_1, k_2)\) is a yes-instance and, if so, find a string \(s\) such that \(d(s, s_1) \le k_1\) and \(d(s, s_i) \le k_2\) for all \(s_i \in S_2\).

\(\boldsymbol{\textbf{CS}(s_1, S_2, k_1, k_2)}\):

  • If \(k_1 < 0\), then \((s_1, S_2, k_1, k_2)\) is clearly a no-instance: There is no string \(s\) with \(d(s,s_1) < 0\). Thus, we return "No".

  • If \(k_1 \ge 0\) and \(d(s_1, s_i) \le k_2\) for all \(s_i \in S_2\), then the string \(s = s_1\) satisfies \(d(s,s_1) = 0 \le k_1\) and \(d(s,s_i) \le k_2\) for all \(s_i \in S_2\). Thus, \((s_1, S_2, k_1, k_2)\) is a yes-instance and we return \(s = s_1\) as the desired consensus string.

  • If there exists a string \(s_i \in S_2\) with \(d(s_1, s_i) > k_2\), then we choose an arbitrary set \(D\) of \(k_2+1\) positions where \(s_1\) and \(s_i\) differ, that is, \(|D| = k_2+1\) and

    \[D \subseteq \{ j \in \{1, \ldots, n\} \mid x_{1,j} \ne x_{i,j} \}.\]

    If a consensus string \(s\) satisfies \(d(s, s_i) \le k_2\), then \(s\) and \(s_i\) must agree in at least one of these \(k_2+1\) positions. Thus, since our goal is to obtain \(s\) from \(s_1\), we must transform \(s_1\) by replacing \(x_{1,j}\) with \(x_{i,j}\) for some \(j \in D\) and then making up to \(k_1-1\) additional edits. This leads to the following branching rule: Let \(s_1^{(j)}\) be the string obtained from \(s_1\) by substituting \(x_{1,j}\) with \(x_{i,j}\):

    \[s_1^{(j)} = x_{1,1}x_{1,2}\cdots x_{1,j-1}x_{i,j}x_{1,j+1}x_{1,j+2}\cdots x_{1,n}.\]

    Then make \(k_2+1\) recursive calls \(\boldsymbol{\textbf{CS}(s_1^{(j)}, S_2, k_1-1, k_2)}\), for all \(j \in D\). If any of these recursive calls returns a string \(s\), then return this string \(s\) (choosing arbitrarily between strings returned by different recursive calls). If all recursive calls return "No", then return "No".

Lemma 16.7: It takes \(O^*\bigl((k+1)^k\bigr)\) time to decide whether a set \(S\) of strings of length \(n\) has a consensus string \(s\) that satisfies \(d(s, s_i) \le k\) for all \(s_i \in S\) and, if so, find such a string \(s\).

Proof: The above algorithm applied to the instance \((s_1, S_2, k, k)\) has branching vector \(\underbrace{(1, \ldots, 1)}_{k+1}\) and thus has running time \(O^*\bigl((k+1)^k\bigr)\). We have to prove its correctness.

If \(k_1 < 0\), then there exists no string \(s\) at all with \(d(s, s_1) \le k_1\). Thus, the invocation \(\boldsymbol{\textbf{CS}(s_1, S_2, k_1, k_2)}\) correctly returns "No".

If \(k_1 \ge 0\) and \(d(s_1, s_i) \le k_2\) for all \(s_i \in S_2\), then the string \(s = s_1\) satisfies \(d(s, s_1) = 0 \le k_1\) and \(d(s, s_i) \le k_2\) for all \(s_i \in S_2\). Thus, \((s_1, S_2, k_1, k_2)\) is a yes-instance and the invocation \(\boldsymbol{\textbf{CS}(s_1, S_2, k_1, k_2)}\) correctly returns the string \(s_1\).

If \(k_1 \ge 0\) and \(d(s_1, s_i) > k_2\) for some string \(s_i \in S_2\), then define the set \(D\) and the set of strings \(\bigl\{s_1^{(j)} \mid j \in D\bigr\}\) as in the algorithm.

If any recursive call \(\boldsymbol{\textbf{CS}\bigl(s_1^{(j)}, S_2, k_1-1, k_2\bigr)}\) returns a string \(s\), then \(d\bigl(s, s_1^{(j)}\bigr) \le k_1-1\) and \(d(s,s_h) \le k_2\) for all \(s_h \in S_2\). Since \(d\bigl(s_1^{(j)}, s_1\bigr) = 1\), we have \(d(s,s_1) \le d\bigl(s, s_1^{(j)}\bigr) + d\bigl(s_1^{(j)},s_1\bigr) \le k_1-1 + 1 = k_1\). Thus, \((s_1, S_2, k_1, k_2)\) is a yes-instance and the algorithm correctly returns the string \(s\).

If all recursive calls return "No", then we need to prove that \((s_1, S_2, k_1, k_2)\) is a no-instance. Assume the contrary. Then there exists a string \(s = y_1 y_2 \cdots y_n\) with \(d(s, s_1) \le k_1\) and \(d(s, s_h) \le k_2\) for all \(s_h \in S_2\). In particular, \(d(s, s_i) \le k_2\). Since \(|D| = k_2 + 1\), this implies that there exists an index \(j \in D\) such that \(y_j = x_{i,j} = x_{1,j}^{(j)} \ne x_{1,j}\). Thus, \(d\bigl(s, s_1^{(j)}\bigr) = d(s, s_1) - 1 \le k_1 - 1\) and \(\bigl(s_1^{(j)}, S_2, k_1-1, k_2\bigr)\) is a yes-instance. In particular, the recursive call \(\boldsymbol{\textbf{CS}\bigl(s_1^{(j)}, S_2, k_1-1, k_2\bigr)}\) does not return "No", a contradiction. ▨


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.