Fair AI/ML 6 Some Open Problems about Non-centroid Clustering
In this post, we focus on upper and lower bound of core where non-clusterings are.
Let the loss (or cost) function for every data points be maximum distance between any other points and itself:
We have known that a clustering in the exact core must exist given .
For the approximation of core: The Greedy Capture algorithm computes the upper bound of in polynomial time. And it have the lower bound of .
What about the gap between 1 and 2???
There are some open problems and I am trying to answer.
Complexity for Auditing the Core
❓What is the time complexity for verifying whether a given clustering is in the exact core?
TL;DR It is a NP-hard problem.
The analysis of verifying problem:
We need to check if any subset with can form a cluster such that the maximum distance between any two points in is less than each agent’s maximum distance to others in its current cluster. If such a subset exists, then the clustering violates the core property.
Consider a reduction from another NP-complete problem . The problem asks if there is a clique of a specified size in an undirected graph .
Problem Setup: We have an undirected graph with vertices. We need to decide if a given clustering is in the core, with each data point’s cost being the maximum distance between itself and all other points in the same cluster.
Constructing the Graph: To reduce to our core-checking problem, we will build a graph with the properties:
- For every edge , the distance is less than
- the maximum distance between and all other points in ’s cluster
- the maximum distance between and all other points in ’s cluster.
Formulating the Core Condition in Terms of : If a cluster is in the core, no subset of data points, with , can improve their maximum distance cost by clustering together. If we interpret each cluster as requiring a subset of nodes that are all pairwise connected (thus forming a clique), we can state the core problem as follows:
- Determine if there exists a subset of nodes of size at least such that every pair of nodes in it are connected.
- This requires identifying the largest clique in and checking if it can be at least as large as , which directly maps to finding the maximum clique.
Verifying Core Membership: To verify if a clustering is in the core using this construction, we would:
- Check if there exists a subset with such that all nodes in have edges to each other (i.e., a clique).
- If such a subset exists, it would imply that these nodes could deviate by forming their own cluster with a smaller maximum distance, violating the core property.
- Thus, solving this for all clusters to determine if the clustering as a whole satisfies the core property would require us to solve a clique-finding problem.
Completing the NP-Hardness Reduction: Since finding a maximum clique is NP-hard, and since verifying the core property requires solving the maximum clique problem to check for the presence of sufficiently large cliques in each cluster, the problem of deciding if a clustering is in the core is NP-hard.
Complexity for Compute Core
❓What is the complexity for computing a clustering in the exact core?
I don’t have a clear idea yet.
But it maybe harder than . Maybe , or ?
Complexity for Compute α-Core
❓What is the complexity for computing a clustering in -core ()?
Is there exist polynomial algorithm?