Fair AI/ML 7 The Existence of Core for Max Distance Loss
The Existence of Core
Is the core empty?
In the paper, we have proved that Greedy Capture computes -core in time for maximum distance loss.
❓Is the the lower bound for code?
Special Case: Binary Distance Value
Consider a graph , where for every pair , if , else .
If there exist a graph such that no matter how we assign the clusters, there always exist a clique of size at least , then the lower bound of is proved.
Split&Merge Algorithm
In any given aforementioned graph , the cluster in the core always exist. (But this can not prove the actual core always exist.)
Consider the algorithm below:
Algorithm Split-and-Merge:
Input: Graph ,
Output: Clustering in the actual core
- while , do // SPLIT PHASE
- while , do // MERGE PHASE
- return
Analysis of Split&Merge
It’s straightforward to see that Split&Merge is a greedy algorithm.
Since Split&Merge uses as an oracle, it is a NP solution.
🎯Theorem: Split-and-Merge always return the cluster in the actual core.
Claim: If a set of points form a cluster, which means, the cost of everybody is . All of them have no incentive to form a coalition with anyone outside of the clique.
Thus before merging, any subset of that intersects at least two item-sets in , can not be a clique.
❗By pigeonhole principle, before merging, any item-set in with size already greater than would not be picked in the merge phase! All data points in these clusters have no incentive to deviate.
In the merge phase, every point who would have been merged have the cost . But they can not find other points to form a clique with size at least .
If there exist a clique with size at least in the merge phase, by pigeonhole principle and the property of maximum clique, it should be chosen in the split phase. Which leads to a contradiction.
🤔Can we extend Split&Merge to more general cases?
Special Case: Trinary Distance Value
Consider a group of points , for every pair of points, their distance can only be . (Now matter how the distances are, they always subject to triangle inequality)
If there exists an instance, where no matter how we assign clusters, the approximation of core cannot be better than , then we prove that the bound for inapproximability of core is .
It seems that we can extend Split&Merge in this case to produce clustering in the core.
Extend Split&Merge for Trinary Distance Value
The extended version of Split&Merge:
Input: and the distance between each pair of points
- construct a graph , where corresponds to the set of data points. For a pair , there is an edge if the distance between and is .
- while do // find the clique of distance with large enough size
- construct a graph , where corresponds to the free data points. For a pair , there is an edge if the distance between and is either or
- Split&Merge // run Split&Merge on remaining points
- return
Analysis Split&Merge for Trinary Case
For simplicity, we introduce the notation of core-violator: A core-violator is a subset with size , where each point in has a strictly better cost if they are clustered together.
🤔Claim: extended version of Split&Merge always return a clustering in the core.
Assume that Split&Merge returns a clustering not in the core. Which means there exist a core-violator .
Since the minimal distance is and the cost of each points in the clique in is exactly before Split&Merge, core-violator can’t exist there.
The only possible existence of core-violator is in the Split&Merge step.
- Claim: It is not possible that the a free point at step 5 can collude with any point in the clique formed in step 4.
- The only possible core-violator can formed only among free points at step 5.
- ❗️By the property of step 4, any subset of free points at step 5 can not form a core-violator with distance between any pair. Otherwise they they should be clustered in step 4.
- Thus, the possible core-violator must contain a free point improving cost from to , meaning that there exist a large enough clique in after the merge phase.
- By pigeonhole principle, this contradicts the property of split phase (This large enough clique must be chosen in split phase).
Thus the extended version of Split&Merge guarantees the core property.
❓Can we further extend Split&Merge to general cases where there are distances?
The answer is YES if I can successfully extend splir&merge to general cases
Extend Split&Merge to General Cases
For data points in a metric space, given number of clusters .
We construct the complete weighted undirected graph where corresponds to the set of data points, for each pair , the weight is the distance between and . There are edges in .
First, we sort the edges by weight.
Start from another graph with and no edge. We iteratively add an edge from the scratch in the ascending order.
- At each time, we check whether there exist a clique of size at least .
- If so, we immediately include them into a cluster and delete them temporarily.
🤔Intuitively I think this extension outputs a cluster in the core. But I haven’t well organized a formal proof.