15.2.2. Cluster Editing
Consider the problem of partitioning a set of \(n\) data points into clusters. Often, the first step is to compute an undirected graph \(G\) whose vertices correspond to the data points and with an edge \((u, v)\) between two points if \(u\) and \(v\) are sufficiently similar that they should be placed in the same cluster, based on an appropriate similarity criterion that depends on the application. If there is no edge \((u, v)\), then \(u\) and \(v\) are sufficiently dissimilar that they should not be placed into the same cluster. A clustering now partitions the set \(V\) of vertices (data points) into a number of subsets \(V_1, \ldots, V_t\). This clustering would be perfect if \((u, v) \in E\) for every pair of vertices in the same set \(V_i\), and \((u, v) \notin E\) for every pair of vertices in different sets \(V_i\) and \(V_j\). In this case, \(V_1, \ldots, V_t\) are simply the vertex sets of the connected components of \(G\) and each connected component is a clique (a complete graph).
The real world looks more messy: We may end up having to place two vertices \(u\) and \(v\) in the same cluster even though there is no edge \((u, v)\). This can happen if there exists a path from \(u\) to \(v\) in \(G\). We may also end up placing vertices \(u\) and \(v\) into different clusters even though they are connected by an edge. This can happen if \(u\) is connected to some set of vertices \(V_i\), \(v\) is connected to some set of vertices \(V_j\), but there are not many edges between \(V_i\) and \(V_j\).
What is the best cluster partition based on such noisy input? A perfect answer does not exist and the definition of a good answer depends on the application. One approach is to aim for a cluster partition that violates as few edges as possible: It minimizes the number of vertex pairs where either \(u\) and \(v\) end up in the same cluster even though there is no edge \((u, v)\) or \(u\) and \(v\) end up in different clusters even though there is an edge \((u, v)\). The cluster editing problem formalizes this task:
A graph is a clique if every pair of edges in the graph is connected by an edge.
Cluster Editing Problem: Given an undirected graph \(G\), find the minimum number of edges to add to or remove from \(G\) such that the connected components of the resulting graph \(G^*\) are cliques.
We call \(G^*\) a cluster graph, an edge addition or deletion an edge edit, and the set of edge edits that produce \(G^*\) from \(G\) a cluster edit set.
Cluster editing is yet another classical NP-hard problem. In the parameterized version, we are given a parameter \(k\) along with \(G\) and our task is to decide whether \(G\) has a cluster edit set of size at most \(k\). Here we prove that this problem has a problem kernel with \(O\bigl(k^2\bigr)\) vertices and \(O\bigl(k^3\bigr)\) edges. The kernelization algorithm applies the following three reduction rules. It first applies the first two rules until neither is applicable anymore. Once neither of the first two rules is applicable, it applies the third rule once to obtain the final kernel.
Many shared neighbours: If \(k > 0\) and there are \(k+3\) vertices \(u, v, w_1, \ldots, w_{k+1}\) in \(G\) such that \((u,v) \notin G\) and \(w_1, \ldots, w_{k+1}\) are neighbours of both \(u\) and \(v\) in \(G\), then add \((u,v)\) to \(G\) and decrease \(k\) by \(1\). See Figure 15.1.
Many unique neighbours: If \(k > 0\) and there are \(k+3\) vertices \(u, v, w_1, \ldots, w_{k+1}\) in \(G\) such that \((u,v) \in G\) and each of the vertices \(w_1, \ldots, w_{k+1}\) is a neighbour of \(u\) or \(v\) but not both, then remove \((u,v)\) from \(G\) and decrease \(k\) by \(1\). See Figure 15.2.
Remove cliques: Remove every connected component of \(G\) that is a clique.
Figure 15.1: Coming soon!
Figure 15.2: Coming soon!
Note that we obtain a fully reduced instance after at most \(k\) applications of the first two rules and one application of the third rule because every application of the first rule decreases \(k\) by one and the first two rules are applicable only as long as \(k\) is positive.
Lemma 15.6: The Many Shared Neighbours rule is sound: If \(k > 0\) and there exist \(k + 3\) vertices \(u, v, w_1, \ldots, w_{k+1}\) such that \((u,v) \notin G\) and \(w_1, \ldots, w_{k+1}\) are neighbours of \(u\) and \(v\), then \(G = (V, E)\) has a cluster edit set of size at most \(k\) if and only if the graph \(G' = (V, E \cup \{(u,v)\})\) has a cluster edit set of size at most \(k - 1\).
Proof: First assume that \(G^*\) is a cluster graph obtainable from \(G'\) using at most \(k-1\) edits. Then \(G^*\) can also be obtained from \(G\) by first adding the edge \((u,v)\), to produce \(G'\), and then applying the at most \(k - 1\) edits that produce \(G^*\) from \(G'\). Thus, \(G\) has a cluster edit set of size at most \(k\) if \(G'\) has a cluster edit set of size at most \(k - 1\).
Conversely, assume that \(G^*\) is a cluster graph obtainable from \(G\) using at most \(k\) edits. We prove that \(G^*\) must contain the edge \((u,v)\). This implies that the cluster edit set that produces \(G^*\) from \(G\) includes the addition of the edge \((u,v)\). This addition produces \(G'\) from \(G\), and \(G^*\) can then be obtained from \(G'\) using the remaining \(k - 1\) edits. Thus, \(G'\) has a cluster edit set of size at most \(k - 1\) if \(G\) has a cluster edit set of size at most \(k\).
To prove that \(G^*\) must contain the edge \((u,v)\), assume that it does not. Then \(u\) and \(v\) belong to different connected components of \(G^*\) because \(G^*\) is a cluster graph. This implies that the cluster edits that produce \(G^*\) from \(G\) must remove at least one edge from every path in \(G\) from \(u\) to \(v\). However, \(G\) contains \(k + 1\) edge-disjoint paths from \(u\) to \(v\), namely the paths \(\langle u, w_1, v\rangle, \ldots, \langle u, w_{k+1}, v \rangle\). To remove one edge from each of these paths, we need to remove at least \(k + 1\) edges from \(G\), that is, \(G^*\) cannot be obtained from \(G\) using at most \(k\) edge edits if \(G^*\) is a cluster graph and \((u,v) \notin G^*\). ▨
Lemma 15.7: The Many Unique Neighbours rule is sound: If \(k > 0\) and there exist \(k + 3\) vertices \(u, v, w_1, \ldots, w_{k+1}\) such that \((u,v) \in G\) and each of the vertices \(w_1, \ldots, w_{k+1}\) is a neighbour of \(u\) or \(v\) but not both, then \(G = (V, E)\) has a cluster edit set of size at most \(k\) if and only if the graph \(G' = (V, E \setminus \{(u,v)\})\) has a cluster edit set of size at most \(k - 1\).
Proof: First assume that \(G^*\) is a cluster graph obtainable from \(G'\) using at most \(k-1\) edits. Then \(G^*\) can also be obtained from \(G\) by first removing the edge \((u,v)\), to produce \(G'\), and then applying the at most \(k - 1\) edits that produce \(G^*\) from \(G'\). Thus, \(G\) has a cluster edit set of size at most \(k\) if \(G'\) has a cluster edit set of size at most \(k - 1\).
Conversely, assume that \(G^*\) is a cluster graph obtainable from \(G\) using at most \(k\) edits. We prove that \(G^*\) cannot contain the edge \((u,v)\). This implies that the cluster edit set that produces \(G^*\) from \(G\) includes the removal of the edge \((u,v)\). This removal produces \(G'\) from \(G\), and \(G^*\) can then be obtained from \(G'\) using the remaining \(k - 1\) edits. Thus, \(G'\) has a cluster edit set of size at most \(k - 1\) if \(G\) has a cluster edit set of size at most \(k\).
To prove that \(G^*\) cannot contain the edge \((u,v)\), assume that it does. Then \(u\) and \(v\) belong to the same connected component of \(G^*\). Since \(G^*\) is a cluster graph, any vertex that is in the same connected component must be connected to both \(u\) and \(v\); any vertex that is in a different connected component cannot be connected to either \(u\) or \(v\). Now assume w.l.o.g. that there exists an index \(h\) such that \(w_1, \ldots, w_h\) belong to the same connected component as \(u\) and \(v\) and \(w_{h+1}, \ldots, w_{k+1}\) belong to different connected components than \(u\) and \(v\). Since \(G\) contains only one of the edges \((u,w_i)\) or \((v,w_i)\) for all \(1 \le i \le h\), the cluster edit set producing \(G^*\) from \(G\) must include the addition of one of these two edges for all \(1 \le i \le h\). Similarly, this cluster edit set must remove one of the edges \((u,w_i)\) or \((v,w_i)\) for all \(h < i \le k+1\). Thus, it takes at least \(k + 1\) edits to produce \(G^*\) from \(G\) if \(G^*\) is a cluster graph that contains the edge \((u,v)\). ▨
Lemma 15.8: The Remove Cliques rule is sound: If \(G'\) is the graph obtained from \(G\) by removing all connected components that are cliques, then \(G'\) has a cluster edit set of size at most \(k\) if and only if \(G\) does.
Proof: Let \(V_C\) be the vertex set of all connected components of \(G\) that are cliques. Then this rule produces the graph \(G' = G[V \setminus V_C]\). If \((G', k)\) is a yes-instance, then \(G'\) has a cluster edit set \(M\) of size at most \(k\). Let \(G^*\) be the cluster graph produced by applying the edits in \(M\) to \(G'\). Applying the edits in \(M\) to \(G\) produces the cluster graph \(G^* \cup G[V_C]\). Thus, \((G, k)\) is also a yes-instance.
If \((G, k)\) is a yes-instance, then \(G\) has a cluster edit set \(M\) of size at most \(k\). Let \(G^*\) be the cluster graph obtained by applying the edits in \(M\) to \(G\). Since every vertex-induced subgraph of a cluster graph is itself a cluster graph, \(G^*[V \setminus V_C]\) is a cluster graph that can be obtained from \(G' = G[V \setminus V_C]\) using the subset of edits in \(M\) involving only edges \((u, v)\) such that \(u, v \notin V_C\). Thus, \((G', k)\) is also a yes-instance. ▨
By Lemma 15.6–15.8, the three reduction rules are sound. To bound the size of a fully reduced instance with respect to these reduction rules, we need one more observation:
Observation 15.9: If \(M\) is a minimum cluster edit set for a graph \(G\), then every edge edit in \(M\) adds or removes an edge \((u, v)\) such that \(u\) and \(v\) belong to the same connected component of \(G\).
Proof: Let \(G^*\) be the cluster graph obtained by applying the edge edits in \(M\) to \(G\) and let \(V_1, \ldots, V_t\) be the vertex sets of the connected components of \(G\). For all \(1 \le i \le t\), let \(M_i\) be the set of edge edits in \(M\) that add or remove edges \((u, v)\) with \(u, v \in V_i\). Since \(G^*[V_i]\) is a vertex-induced subgraph of \(G^*\), it is also a cluster graph, obtained from \(G[V_i]\) using the edge edits in \(M_i\). Thus, \(G^*[V_1] \cup \cdots \cup G^*[V_t]\) is a cluster graph obtainable from \(G\) using the edits in \(M_1 \cup \ldots \cup M_t\). Therefore, if \(M \supset M_1 \cup \ldots \cup M_t\), then \(M\) is not a minimum cluster edit set for \(G\). ▨
The following lemma now shows that a fully reduced instance \((G, k)\) with more than \((2k+1)2k\) vertices or more than \(\binom{2k+1}{2}2k + k\) edges is a no-instance:
Lemma 15.10: If \((G, k)\) is a fully reduced yes-instance with respect to the Many Shared Neighbours, Many Unique Neighbours, and Remove Cliques rules, then \(G\) has at most \((2k+1)2k\) vertices and at most \(\binom{2k+1}{2}2k + k\) edges.
Proof: Assume that \((G, k)\) is a fully reduced yes-instance with respect to the Many Shared Neighbours, Many Unique Neighbours, and Remove Cliques rules, and let \(G^*\) be a clique graph that can be obtained from \(G\) using a set \(M\) of at most \(k\) edge edits.
Let \(G_1, \ldots, G_t\) be the connected components of \(G\). By Observation 15.9, we can partition \(M\) into \(t\) subsets \(M_1, \ldots, M_t\) such that every edit in \(M_i\) adds or removes an edge between two vertices in \(G_i\). Since \(G\) is fully reduced with respect to the remove-cliques rule, none of the connected components \(G_1, \ldots, G_t\) is a clique. This implies that none of the sets \(M_1, \ldots, M_t\) is empty. Since these sets are disjoint, \(M = M_1 \cup \cdots \cup M_t\), and \(|M| \le k\), this shows that \(t \le k\). Even if all edits in \(M_i\) are edge deletions, then these deletions partition \(G_i\) into at most \(|M_i| + 1\) connected components of \(G^*\). Thus, \(G^*\) has at most
\[\sum_{i=1}^t (|M_i| + 1) = |M| + t \le 2k\]
connected components.
Next we prove that no connected component of \(G^*\) has more than \(2k + 1\) vertices. Therefore, \(G^*\) has at most \((2k+1)2k\) vertices and at most \(\binom{2k+1}{2}2k\) edges. Since \(G\) has the same vertex set as \(G^*\), this shows that \(G\) has at most \((2k+1)2k\) vertices. Since \(G^*\) is obtained from \(G\) using at most \(k\) edge edits, it also shows that \(G\) has at most \(\binom{2k+1}{2}2k + k\) edges.
To prove that no connected component of \(G^*\) has more than \(2k+1\) vertices, assume the contrary and let \(C\) be a connected component of \(G^*\) with at least \(2k+2\) vertices. Let \(V_C\) be the vertex set of \(C\).
First we prove that \(G[V_C]\) is a clique. Assume the contrary. Then there exist two vertices \(u, v \in V_C\) such that \((u,v) \notin G\). Let \(w_1, \ldots, w_{2k}\) be \(2k\) vertices in \(V_C \setminus \{u,v\}\). Since \(G^*\) contains the edges \((u,w_i)\) and \((v,w_i)\) for all \(1 \le i \le 2k\), \(G^*\) is obtained from \(G\) using at most \(k\) edge edits, and one of these edge edits must be the addition of the edge \((u,v)\), there are at least \(k+1\) vertices among \(w_1, \ldots, w_{2k}\) such that \(G\) contains both edges \((u,w_i)\) and \((v,w_i)\). Thus, the Many Shared Neighbours rule applies to the edge \((u,v)\) and \(G\) is not fully reduced, a contradiction.
Since \(G\) is fully reduced, no connected component of \(G\) is a clique. Since \(G[V_C]\) is a clique, this implies that \(G[V_C]\) is properly contained in a connected component of \(G\), that is, there exists an edge \((u,v) \in G\) such that \(u \in V_C\) and \(v \notin V_C\). Let \(w_1, \ldots, w_{2k+1}\) be \(2k+1\) vertices in \(V_C \setminus \{u\}\). \(G^*\) contains the edge \((u,w_i)\) but not the edge \((v,w_i)\) for all \(1 \le i \le 2k+1\). Since \(G^*\) is obtained from \(G\) using at most \(k\) edge edits, there exist at least \(k+1\) vertices among \(w_1, \ldots, w_{2k+1}\) such that \(G\) contains the edge \((u,w_i)\) but not the edge \((v,w_i)\). Thus, the Many Unique Neighbours rule applies to the edge \((u,v)\) and \(G\) is not fully reduced, a contradiction. This proves that every connected component of \(G^*\) has at most \(2k+1\) vertices and, as argued above, the lemma follows. ▨

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.