Read the paper Proportional Fairness in Non-Centroid Clustering by Caragiannis et al.

This paper was accepted at NeurIPS 2024!

Introduction

$n$ points, $k$ clusters $C=(C_1,\cdots,C_k)$, distance metric $d$.

$k$-means objective: $\sum_{i=1}^k\frac{1}{|C_i|}\sum_{x,y\in C_i}d(x,y)^2$.

The paper considers the scenario in which there are no cluster centers.

Questions:

  1. Can we obtain proportional fairness (PROP) guarantees for non-centroid clustering?
  2. Do the algorithms for centroid clustering also work in the non-centroid setting?
  3. Can we audit the PROP guarantee of a given clustering algorithm?

Contribution

Each agent $i$ has a loss function $\ell_i$; her loss under clustering $C$ is $\ell_i(C(i))$, where $C(i)$ denotes the cluster containing her.

  • $\ell_i$ can be arbitrary
  • $\ell_i$ can be based on a distance metric

PROP fairness guarantees:

  1. the core
  2. its relaxation, fully justified representation (FJR)

Summary of results

| Loss function | Core upper bound | Core lower bound | FJR |
| --- | --- | --- | --- |
| Arbitrary | $\infty$ | $\infty$ | $1$ |
| Average | $O(n/k)$ (poly time) | $1.3$ | $1$ ($4$ in poly time) |
| Maximum | $2$ (poly time) | $1$ | $1$ ($2$ in poly time) |

Treating FJR as a relaxation of the core, this paper proposes GreedyCohesiveClustering, an adaptation of the Greedy Cohesive Rule from social choice theory.

PROP in centroid clustering by Chen et al.

PROP revisited by Micha and Shah

FJR in centroid clustering by Aziz et al.

Model

$[t]=\{1,\cdots,t\}$ for $t\in\mathbb{N}$. Given a set $N$ of $n$ agents and the desired number of clusters $k$.

Each agent $i\in N$ has an associated loss function $\ell_i:2^N\setminus 2^{N\setminus\{i\}}\to\mathbb{R}_{\ge0}$, where $\ell_i(S)$ is the cost to agent $i$ for being part of cluster $S$. A $k$-clustering $C=(C_1,\cdots,C_k)$ is a partition of $N$ into $k$ clusters, where $C_t\cap C_{t'}=\emptyset$ for $t\neq t'$ and $\bigcup_{t=1}^k C_t=N$. For simplicity, $C(i)$ denotes the cluster containing $i$.

The loss of agent $i$ is $\ell_i(C(i))$.

Loss functions

  • A distance metric over $N$ is given by $d:N\times N\to\mathbb{R}_{\ge0}$, which satisfies $d(i,i)=0$, symmetry, and the triangle inequality.
  • Arbitrary losses. The loss $\ell_i(S)$ is an arbitrary non-negative number for each agent $i\in N$ and cluster $S\ni i$.
  • Average loss. Given a distance metric $d$ over $N$, $\ell_i(S)=\frac{1}{|S|}\sum_{j\in S}d(i,j)$ for each agent $i\in N$ and cluster $S\ni i$. Agent $i$ prefers the agents in her cluster to be close to her on average.
  • Maximum loss. Given a distance metric $d$, $\ell_i(S)=\max_{j\in S}d(i,j)$. Agent $i$ prefers that no agent in her cluster be too far from her. (Both distance-based losses are sketched in code below.)
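
To make the two distance-based losses concrete, here is a minimal Python sketch (my own illustration, not code from the paper; distances are given as a matrix `d`):

```python
def average_loss(i, S, d):
    """Average loss: mean distance from agent i to the members of its
    cluster S (S includes i itself, whose distance d[i][i] is 0)."""
    return sum(d[i][j] for j in S) / len(S)

def maximum_loss(i, S, d):
    """Maximum loss: distance from agent i to the farthest member of S."""
    return max(d[i][j] for j in S)

# Toy example: 4 agents on a line at positions 0, 1, 2, 10.
pos = [0, 1, 2, 10]
d = [[abs(p - q) for q in pos] for p in pos]
print(average_loss(0, [0, 1, 2], d))  # (0 + 1 + 2) / 3 = 1.0
print(maximum_loss(0, [0, 1, 2], d))  # 2
```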

Core

A clustering is in the core if no group of agents $S\subseteq N$ can choose another outcome such that

  1. they are entitled to deviate based on their proportion of the whole population, $|S|/|N|\ge 1/k$, and
  2. every member of group $S$ is happier.

📖Definition 1 ($\alpha$-core). For $\alpha\ge1$, a $k$-clustering $C=(C_1,\cdots,C_k)$ is said to be in the $\alpha$-core if there is no group of agents $S\subseteq N$ with $|S|\ge n/k$ s.t. $\alpha\cdot\ell_i(S)<\ell_i(C(i))$ for all $i\in S$. The $1$-core is simply the core.

If a group $S$ exists that violates the $\alpha$-core guarantee, we say that it deviates and refer to it as a deviating coalition.
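
As a sanity check on Definition 1, here is a brute-force search for a deviating coalition (my own sketch, exponential in $n$, so toy instances only; `loss(i, S)` is agent $i$'s loss for cluster `S`):

```python
from itertools import combinations
from math import ceil

def find_core_violation(N, k, loss, clustering, alpha=1.0):
    """Return a coalition S with |S| >= n/k such that
    alpha * loss(i, S) < loss(i, C(i)) for every i in S,
    or None if `clustering` is in the alpha-core.
    `clustering` is a list of clusters (tuples of agents)."""
    n = len(N)
    cluster_of = {i: tuple(c) for c in clustering for i in c}
    for size in range(ceil(n / k), n + 1):
        for S in combinations(N, size):
            if all(alpha * loss(i, S) < loss(i, cluster_of[i]) for i in S):
                return S  # deviating coalition
    return None

# With the line example above, no coalition deviates from [(0, 1), (2, 3)]
# under the average loss, so that clustering is in the (1-)core:
# find_core_violation([0, 1, 2, 3], 2,
#                     lambda i, S: average_loss(i, S, d),
#                     [(0, 1), (2, 3)])  # -> None
```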

🎯Theorem 1. For arbitrary losses, there exists an instance in which no $\alpha$-core clustering exists for any finite $\alpha$.

🎯Theorem 2. For the average loss, there exists an instance in which no $\alpha$-core clustering exists for $\alpha<\frac{1+\sqrt3}{2}\approx1.366$.

❓Open problem 1: For the maximum loss, does there always exist a clustering in the core?

❓Open problem 2: For the average loss, does there always exist a clustering in the $\alpha$-core for some constant $\alpha$?

GreedyCohesiveClustering Algorithm

It uses a subroutine $\mathcal A$ which, given a subset of agents $N'\subseteq N$, the metric $d$, and a threshold $\tau$, finds a cohesive cluster $S$.

The threshold $\tau$ indicates the smallest size at which a group of agents deserves to form its own cluster.

Here $\mathcal A$ is SmallestAgentBall: it finds the smallest ball centered at an agent that captures at least $\tau$ agents, and returns a set of $\tau$ agents from this ball.

  • Natural choice: $\tau=\lceil n/k\rceil$

So GreedyCohesiveClustering with SmallestAgentBall iteratively finds the smallest agent-centered ball containing $\lceil n/k\rceil$ agents and removes the $\lceil n/k\rceil$ agents in that ball, until fewer than $\lceil n/k\rceil$ agents remain, at which point all remaining agents are put into one cluster and any remaining clusters are left empty.

Pseudo code for GreedyCohesiveClustering

Input: $N,d,k$

Output: $k$-clustering $C$

  1. $N'\gets N$ // remaining set of agents
  2. $j\gets1$ // index of the current cluster
  3. while $N'\neq\emptyset$ do
    • $C_j\gets\mathcal A(N',d,\lceil n/k\rceil)$ // find and remove the next cohesive cluster
    • $N'\gets N'\setminus C_j$
    • $j\gets j+1$
  4. $C_j,C_{j+1},\cdots,C_k\gets\emptyset$
  5. return $C=(C_1,\cdots,C_k)$

Pseudo code for SmallestAgentBall

Input: subset of agents $N'\subseteq N$, metric $d$, threshold $\tau$

Output: a cluster $S$

  1. if $|N'|\le\tau$ then return $S\gets N'$
  2. for each $i\in N'$ do
    • $\ell_i\gets$ the $\tau$-th closest agent in $N'$ to agent $i$ // ties broken arbitrarily
    • $r_i\gets d(i,\ell_i)$ // radius of the smallest ball centered at agent $i$ capturing at least $\tau$ agents
  3. $i^*\gets\arg\min_{i\in N'}r_i$
  4. return $S\gets$ the set of $\tau$ closest agents in $N'$ to agent $i^*$
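
A direct Python transcription of the two pseudocode blocks above (a sketch under my own naming and tie-breaking; distances are again a matrix `d`):

```python
from math import ceil

def smallest_agent_ball(N_prime, d, tau):
    """Return tau agents from the smallest agent-centered ball in N'
    capturing at least tau agents (the center agent counts)."""
    N_prime = list(N_prime)
    if len(N_prime) <= tau:
        return N_prime
    # r_i = distance to the tau-th closest agent in N' (ties broken arbitrarily)
    def radius(i):
        return sorted(d[i][j] for j in N_prime)[tau - 1]
    i_star = min(N_prime, key=radius)
    return sorted(N_prime, key=lambda j: d[i_star][j])[:tau]

def greedy_cohesive_clustering(N, d, k, A=smallest_agent_ball):
    """Repeatedly carve off a cohesive cluster of ceil(n/k) agents,
    then pad with empty clusters up to k."""
    tau = ceil(len(N) / k)
    remaining, clusters = set(N), []
    while remaining:
        C_j = A(remaining, d, tau)
        clusters.append(set(C_j))
        remaining -= set(C_j)
    clusters += [set() for _ in range(k - len(clusters))]
    return clusters

# On the 4-agent line example above:
# greedy_cohesive_clustering([0, 1, 2, 3], d, 2)  # -> [{0, 1}, {2, 3}]
```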

GreedyCohesiveClustering is in fact an adaptation of the Greedy Capture algorithm by Chen et al. There are two key differences:

  1. It grows balls centered at the agents instead of at all feasible candidate centers.
  2. It stops growing a ball as soon as it captures $\lceil n/k\rceil$ agents, whereas Greedy Capture continues growing it.

🎯Theorem 3. For the average (resp. maximum) loss, the Greedy Capture algorithm is guaranteed to return a clustering in the $(2\lceil n/k\rceil-3)$-core (resp. $2$-core) in $O(kn)$ time, and the bounds are tight.

Fully Justified Representation

Peters et al. introduced FJR as a relaxation of the core in the context of approval-based committee selection.

📖 Definition 2 ($\alpha$-FJR). For $\alpha\ge1$, a $k$-clustering $C$ satisfies $\alpha$-FJR if there is no group of agents $S\subseteq N$ with $|S|\ge n/k$ s.t. $\alpha\cdot\ell_i(S)<\min_{j\in S}\ell_j(C(j))$ for each $i\in S$, i.e., if $\alpha\cdot\max_{i\in S}\ell_i(S)<\min_{j\in S}\ell_j(C(j))$. We refer to $1$-FJR as FJR.

🤔Proposition 1. For $\alpha\ge1$, $\alpha$-core implies $\alpha$-FJR for arbitrary loss functions.

Proof sketch: trivially $\ell_i(C(i))\ge\min_{j\in S}\ell_j(C(j))$ for every $i\in S$, so any coalition $S$ violating $\alpha$-FJR satisfies $\alpha\cdot\ell_i(S)<\min_{j\in S}\ell_j(C(j))\le\ell_i(C(i))$ for all $i\in S$, and hence also violates the $\alpha$-core.

Arbitrary Loss Functions

An FJR clustering is guaranteed to exist even for arbitrary losses.

📖Definition 3 (most cohesive cluster). Given a set of agents $N'\subseteq N$ and a threshold $\tau$, the MostCohesiveCluster problem asks to find a cluster $S\subseteq N'$ of size at least $\tau$ s.t. the maximum loss of any $i\in S$ for $S$ is minimized, i.e., find $\arg\min_{S\subseteq N':|S|\ge\tau}\max_{i\in S}\ell_i(S)$.

For $\lambda\ge1$, a $\lambda$-approximate solution $S$ satisfies $\max_{i\in S}\ell_i(S)\le\lambda\cdot\max_{i\in S'}\ell_i(S')$ for all $S'\subseteq N'$ with $|S'|\ge\tau$, and a $\lambda$-approximation algorithm returns a $\lambda$-approximate solution on every instance.
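
Definition 3's objective can be made concrete with an exhaustive-search solver (my own sketch; exponential time, consistent with the hardness discussion below; `loss(i, S)` as before):

```python
from itertools import combinations

def most_cohesive_cluster(N_prime, tau, loss):
    """Exact (1-approximate) solution to MostCohesiveCluster by brute force:
    the cluster S with |S| >= tau minimizing max_{i in S} loss(i, S)."""
    candidates = (S for size in range(tau, len(N_prime) + 1)
                  for S in combinations(sorted(N_prime), size))
    return min(candidates, key=lambda S: max(loss(i, S) for i in S))
```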

Plugging a $\lambda$-approximation algorithm $\mathcal A$ for the MostCohesiveCluster problem into the GreedyCohesiveClustering algorithm yields a $\lambda$-FJR clustering, as the next theorem shows.

🎯Theorem 4. For arbitrary losses, $\alpha\ge1$, and an $\alpha$-approximation algorithm $\mathcal A$ for the MostCohesiveCluster problem, GreedyCohesiveClustering is guaranteed to return an $\alpha$-FJR clustering. Hence, an FJR clustering is guaranteed to exist.

Proof omitted.

Average and Maximum Loss Functions

Let $\mathcal A^*$ be an exact algorithm for the MostCohesiveCluster problem for the average loss.

It may not run in polynomial time: an exact algorithm could be used to detect whether a given undirected graph admits a clique of at least a given size, which is NP-complete.

🤔Lemma 1. For the average (resp. maximum) loss, SmallestAgentBall is a $4$-approximation (resp. $2$-approximation) algorithm for the MostCohesiveCluster problem, and this is tight.

🤔Corollary 1. The Greedy Capture algorithm is guaranteed to return a clustering that is $4$-FJR (resp. $2$-FJR) for the average (resp. maximum) loss.

❓Open problem 3: For the average or maximum loss, what is the smallest $\alpha$ for which an $\alpha$-FJR clustering can be computed in polynomial time, assuming $P\neq NP$?

❓Open problem 4: What is the smallest $\alpha$ s.t. there always exists a clustering that is simultaneously $\alpha$-FJR for both the average loss and the maximum loss?

Auditing FJR

The ideas used to compute an approximately FJR clustering can also be used to audit the FJR approximation of any given clustering.

📖Definition 4 ($\lambda$-approximate FJR auditing). $\mathcal A$ is a $\lambda$-approximate FJR auditing algorithm if, given any clustering $C$, it returns $\theta$ s.t. the exact FJR approximation of $C$ is in $[\theta,\lambda\cdot\theta]$.
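
For intuition about what is being audited, here is an exact but exponential-time auditor that computes the FJR approximation of a clustering directly from Definition 2 (a brute-force sketch of my own, not the paper's AuditFJR):

```python
from itertools import combinations
from math import ceil

def audit_fjr_exact(N, k, loss, clustering):
    """Exact FJR approximation of `clustering`: the largest ratio
    min_{j in S} loss(j, C(j)) / max_{i in S} loss(i, S) over all
    coalitions S with |S| >= n/k, floored at 1 since alpha >= 1.
    Exponential time; toy instances only."""
    n = len(N)
    cluster_of = {i: tuple(c) for c in clustering for i in c}
    theta = 1.0
    for size in range(ceil(n / k), n + 1):
        for S in combinations(N, size):
            stay = min(loss(j, cluster_of[j]) for j in S)
            deviate = max(loss(i, S) for i in S)
            if deviate == 0 and stay > 0:
                return float("inf")  # no finite alpha suffices
            if deviate > 0:
                theta = max(theta, stay / deviate)
    return theta
```

As Theorem 5 below states, the paper's AuditFJR instead makes calls to an approximation algorithm $\mathcal A$ for MostCohesiveCluster, trading exactness for efficiency.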

🎯Theorem 5. For $\lambda\ge1$, if $\mathcal A$ is a $\lambda$-approximation algorithm for the MostCohesiveCluster problem, then AuditFJR($\mathcal A$) is a $\lambda$-approximate FJR auditing algorithm. Given Lemma 1, it follows that for the average (resp. maximum) loss, AuditFJR(SmallestAgentBall) is an efficient $4$-approximate (resp. $2$-approximate) FJR auditing algorithm.

❓Open problem 5: Does there exist a polynomial-time $\alpha$-approximate core auditing algorithm for some constant $\alpha$?

🎯Theorem 6. Assuming $P\neq NP$, there does not exist a polynomial-time $\lambda$-approximate FJR auditing algorithm for the maximum loss, for any $\lambda<2$.

Experiments

Omitted

Discussion

Summary: This paper initiates the study of proportional fairness in non-centroid clustering and presents several open questions for further exploration.

Key Open Questions:

  • Achieving a better approximation than $O(n/k)$ for the core under the average loss.
  • Determining whether the core is always non-empty for the maximum loss.

Core Non-Emptiness Results:

  • The core is proven to be always non-empty for the maximum loss in 1-dimensional metric spaces (Appendix C.1).
  • For the average loss, however, the core can be empty even on the line (Appendix C.2).

Differences Between Clustering Settings:

  • Highlights significant distinctions between centroid and non-centroid clustering.

Future Directions:

  • Investigate proportional fairness guarantees when an agent’s loss depends on both their cluster center and other agents in their cluster.
  • Explore the possibility of choosing the number of clusters $k$ intrinsically; currently it must be fixed in advance for proportional fairness.

Compatibility with Classical Algorithms:

  • While traditional algorithms like $k$-means and $k$-centers may be incompatible with the core and FJR (Appendix E), studying conditions under which they might align with fairness criteria, or can be computed efficiently, is an interesting avenue.