Fair AI/ML 8: Split and Merge Algorithm on Average Cost
My master's thesis is locked in a battle with Non-Centroid Clustering. My advisor and I both hope to
- explore more properties of the Split and Merge algorithm
- extend the Split and Merge algorithm to more general settings
We still only consider:
binary distance
the distance between each pair of data points is either $0$ or $1$.
non-metric
no restriction from the triangle inequality.
asymmetric distance
$d(i, j)$ need not equal $d(j, i)$.
Equivalent Setting
These three conditions together are equivalent to APPROVAL VOTING.
- Each data point is a voter.
- Each voter approves a subset of the other data points (namely, the points at distance $0$ from it).
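The approval-voting view can be made concrete with a small sketch. The encoding below (an `approves` dict inducing a 0/1 distance) is my own illustration, not notation from the post:

```python
# Approval profile: each voter approves a subset of the other data points.
approves = {
    "a": {"b", "c"},
    "b": {"a"},
    "c": {"a", "b"},
}

def dist(i, j):
    """Induced binary distance: 0 if i approves j, else 1.
    It is asymmetric (a approving b need not imply b approving a) and
    non-metric (no triangle inequality is enforced)."""
    return 0 if j in approves[i] else 1

print(dist("a", "b"))  # 0: a approves b
print(dist("b", "c"))  # 1: b does not approve c
```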
Cost Definition
The maximum and average costs are then defined accordingly. For an agent $i$ in cluster $C$:
The maximum cost: $\ell_i^{\max}(C) = \max_{j \in C \setminus \{i\}} d(i, j)$
The average cost: $\ell_i^{\mathrm{avg}}(C) = \frac{1}{|C| - 1} \sum_{j \in C \setminus \{i\}} d(i, j)$
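These two cost functions translate directly into code. A minimal sketch, with a toy binary distance of my own as the example:

```python
# Per-agent costs in a cluster, under an arbitrary (binary) distance d(i, j).

def max_cost(i, cluster, d):
    """Maximum cost of agent i: largest distance to another cluster member."""
    others = [j for j in cluster if j != i]
    return max(d(i, j) for j in others)

def avg_cost(i, cluster, d):
    """Average cost of agent i: mean distance to the other cluster members."""
    others = [j for j in cluster if j != i]
    return sum(d(i, j) for j in others) / len(others)

# Toy binary, asymmetric distance: 0 iff the ordered pair is an approval.
approved = {(1, 2), (2, 1), (1, 3)}
d = lambda i, j: 0 if (i, j) in approved else 1

print(max_cost(1, [1, 2, 3], d))  # 0: agent 1 approves both 2 and 3
print(avg_cost(3, [1, 2, 3], d))  # 1.0: agent 3 approves neither 1 nor 2
```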
It’s not hard to prove that the Split and Merge algorithm still computes a clustering in the core w.r.t. the maximum cost with BINARY, ASYMMETRIC, NON-METRIC “distance”.
On Average Cost
🎯Theorem (On Average Cost): the Split and Merge algorithm computes a clustering in the $(\lceil n/k \rceil - 1)$-core w.r.t. the average cost.
Assume there exists a subset of at least $\lceil n/k \rceil$ agents in which every agent would improve by more than a multiplicative factor of $\lceil n/k \rceil - 1$.
In the Split and Merge algorithm, the average cost of a data point is at most $1$, since distances are binary.
Therefore, in each subset of at least $\lceil n/k \rceil$ agents, there exists one agent that does not improve its maximum cost (it stays at $1$). (Considering the maximum cost, Split and Merge computes a solution in the core, which means that for each such subset, not all agents can be strictly better off. If that agent's current maximum cost were $0$, its average cost would be $0$ and no multiplicative improvement would be possible at all.)
Thus, the average cost of this agent is at most $1$, while its possible improved average cost is at least $\frac{1}{\lceil n/k \rceil - 1}$: its maximum cost inside the deviating coalition is $1$, so at least one of its $\lceil n/k \rceil - 1$ distances there equals $1$. Hence the improvement cannot be greater than a factor of $\lceil n/k \rceil - 1$.
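The $\alpha$-core condition itself can be checked by brute force on tiny instances. The helper below is my own sketch (not from the post), and for brevity it only checks coalitions of the minimum size $\lceil n/k \rceil$:

```python
from itertools import combinations
from math import ceil

# Brute-force check of the alpha-core condition w.r.t. average cost:
# a coalition blocks at factor alpha if EVERY member's average cost inside
# the coalition beats its current cost by more than a multiplicative alpha.

def avg_cost(i, group, d):
    others = [j for j in group if j != i]
    return sum(d(i, j) for j in others) / len(others)

def in_alpha_core(clustering, points, k, d, alpha):
    current = {i: avg_cost(i, c, d) for c in clustering for i in c}
    size = ceil(len(points) / k)
    # For brevity, only coalitions of exactly ceil(n/k) agents are checked.
    for coalition in combinations(points, size):
        if all(alpha * avg_cost(i, coalition, d) < current[i] for i in coalition):
            return False  # this coalition blocks at factor alpha
    return True

# Mutually approving pair {0, 1}; agents 2 and 3 approve nobody.
d = lambda i, j: 0 if (i, j) in {(0, 1), (1, 0)} else 1
print(in_alpha_core([[0, 1], [2, 3]], [0, 1, 2, 3], 2, d, alpha=1))  # True
print(in_alpha_core([[0, 2], [1, 3]], [0, 1, 2, 3], 2, d, alpha=1))  # False
```

Splitting the mutually approving pair makes the coalition $\{0, 1\}$ a blocking coalition, which is why the second call returns `False`.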
Can we improve this factor? No idea… 🤔
Breakthrough: A Variation of Split and Merge Computes Better Approximation w.r.t. Average Cost
Non-metric, asymmetric and binary distance.
Average cost function.
Construct a directed graph: add an edge $i \to j$ iff $d(i, j) = 0$ (i.e., $i$ approves $j$).
Split phase: do iteratively
If there exists an $\lceil n/k \rceil$-clique (a group of $\lceil n/k \rceil$ points with pairwise distance $0$), put it in one cluster and discard it.
Else, if there exists such a cluster, put it in one cluster and discard it:
the largest group of points s.t. no point is at distance $1$ from any other point in the group (i.e., a maximum clique).
// This step is as hard as MAXCLIQUE.
Merge phase: pick the smallest clusters and merge them so that there are at most $k$ clusters.
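A toy implementation of this variation, under my own encoding (a distance function `d(i, j) -> {0, 1}`; a "clique" is a group with pairwise distance $0$ in both directions). Cliques are found by brute force, since the second level of the split phase is as hard as MAXCLIQUE, so this is exponential and only meant for tiny instances:

```python
from itertools import combinations
from math import ceil

def is_clique(group, d):
    """No point in the group is at distance 1 from any other point in it."""
    return all(d(i, j) == 0 for i in group for j in group if i != j)

def find_clique_of_size(points, size, d):
    for group in combinations(sorted(points), size):
        if is_clique(group, d):
            return set(group)
    return None

def split_and_merge_variation(points, k, d):
    size = ceil(len(points) / k)
    remaining, clusters = set(points), []
    # Split phase, level one: repeatedly extract ceil(n/k)-cliques.
    while (clique := find_clique_of_size(remaining, size, d)) is not None:
        clusters.append(clique)
        remaining -= clique
    # Split phase, level two: repeatedly extract a maximum clique
    # (brute force: try sizes from largest to smallest).
    while remaining:
        for s in range(len(remaining), 0, -1):
            clique = find_clique_of_size(remaining, s, d)
            if clique is not None:
                clusters.append(clique)
                remaining -= clique
                break
    # Merge phase: merge the two smallest clusters until at most k remain.
    while len(clusters) > k:
        clusters.sort(key=len)
        clusters[1] |= clusters[0]
        clusters.pop(0)
    return clusters

# Two mutually approving pairs, k = 2: each pair is a ceil(n/k)-clique.
d = lambda i, j: 0 if (i, j) in {(0, 1), (1, 0), (2, 3), (3, 2)} else 1
print(split_and_merge_variation([0, 1, 2, 3], 2, d))  # [{0, 1}, {2, 3}]
```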
🎯Theorem ($\alpha$-core w.r.t. average cost): this variation outputs a clustering solution in the $\alpha$-core for a factor $\alpha$ smaller than $\lceil n/k \rceil - 1$.
Assume there exists a subset of $\lceil n/k \rceil$ agents, each improving by more than a factor of $\alpha$. In the variation, the average cost of an agent is at most $1$.
In each subset of $\lceil n/k \rceil$ agents, there exists one agent whose possible improved average cost is bounded from below (by the greedy property of the second level of the split phase).
Hence that agent's improvement cannot be more than a factor of $\alpha$, which leads to a contradiction.
Following Directions on Average Cost
From the paper, we learn that for symmetric, metric, arbitrary distances, there is a lower bound greater than $1$ on the achievable core approximation factor w.r.t. average cost, i.e., the exact core can be empty.
🤔But what if we consider binary distance and remove the symmetric and metric limitations? Could the exact core be non-empty in this case?
🤔What about graphs with certain structures? (scale-free, small-world networks…)