I will begin a new project this semester (Fall '24): FAIRNESS IN AI/ML THROUGH THE LENS OF SOCIAL CHOICE.

The basic outline follows the tutorial Fairness in AI/ML via Social Choice by Evi Micha & Nisarg Shah at EC 2024.

The first paper, which I read last semester, is Fairness in Federated Learning via Core-Stability; it was one of the possible topics.

Brief notes are here.

This semester I will try to understand some of the mathematical proofs.

Core-Stability

Revisit the definition of core-stability: a predictor $\theta\in P$ is called core-stable if there exist no other $\theta'\in P$ and no subset $S$ of agents such that $\frac{|S|}{n}u_i(\theta')\ge u_i(\theta)$ for all $i\in S$, with at least one strict inequality.

⚠️⚠️⚠️ What is WRONG with the claim

The claim: setting $S=[n]$ in the definition, there is no predictor $\theta'\in P$ such that $\sum_{i\in[n]}\frac{u_i(\theta')}{u_i(\theta)}>n$.

According to the definition, when we plug in $S=[n]$, there exists no $\theta'\in P$ s.t.

$$u_i(\theta')\ge u_i(\theta)\ \forall i\in S\ \Leftrightarrow\ \frac{u_i(\theta')}{u_i(\theta)}\ge1\ \forall i\in S$$

with at least one strict inequality.

The problem is: can we directly sum these inequalities up and conclude $\sum_i\frac{u_i(\theta')}{u_i(\theta)}>n$?

There is a counterexample:

  • $n=2$, $\theta$ is core-stable; there are only two feasible predictors $\theta,\theta'$ with $\theta'\neq\theta$, and $\epsilon\to0^+$.

    | | agent 1 | agent 2 |
    |---|---|---|
    | $\theta$ | $1$ | $1$ |
    | $\theta'$ | $2-\epsilon$ | $1-\epsilon$ |
  • Verifying the core-stability of $\theta$:

    • For $S=\{1\}$: $\frac{|S|}{n}u_1(\theta')=\frac{1}{2}u_1(\theta')=1-\frac{\epsilon}{2}<1=u_1(\theta)$, which satisfies the condition.
    • For $S=\{2\}$: $\frac{1}{2}u_2(\theta')=\frac{1-\epsilon}{2}<1=u_2(\theta)$, which satisfies the condition.
    • For $S=\{1,2\}$: $u_1(\theta')=2-\epsilon>1=u_1(\theta)$ but $u_2(\theta')=1-\epsilon<1=u_2(\theta)$, which also satisfies the condition, since some scaled utility is less than the original utility.

    But $\sum_i\frac{u_i(\theta')}{u_i(\theta)}=\frac{u_1(\theta')}{u_1(\theta)}+\frac{u_2(\theta')}{u_2(\theta)}=3-2\epsilon>2$!
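To sanity-check this counterexample numerically, here is a minimal sketch (a brute-force helper of my own, not from the paper) that tests the core-stability condition over a finite set of feasible predictors, given as a utility matrix `U[p][i]` $=u_i(\theta_p)$:

```python
from itertools import combinations

def is_core_stable(U, p, n):
    """Brute-force check: is predictor p blocked by any coalition S via any q?"""
    for q in range(len(U)):  # candidate deviation theta'
        if q == p:
            continue
        for size in range(1, n + 1):
            for S in combinations(range(n), size):
                scaled = [size / n * U[q][i] for i in S]
                orig = [U[p][i] for i in S]
                if (all(s >= o for s, o in zip(scaled, orig))
                        and any(s > o for s, o in zip(scaled, orig))):
                    return False  # coalition S blocks p via q
    return True

eps = 1e-6
U = [[1.0, 1.0],          # theta
     [2 - eps, 1 - eps]]  # theta'
print(is_core_stable(U, 0, 2))                   # True: theta is core-stable
print(sum(U[1][i] / U[0][i] for i in range(2)))  # ~3 > n = 2, so the claim fails
```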

The scaling factor $|S|/n$ suggests a kind of proportionality here. Basically, core-stability means no coalition of agents can obtain a large (proportionally scaled) benefit by deviating together.

Core means “no deviating subgroup”

Core-stability is a very strong requirement; it exists only in special cases.

Advantage

A core-stable predictor can be fairer than FedAvg. In the example below, $M$ is a very large integer.

| | Agent 1 | Agent 2 | Agent 3 |
|---|---|---|---|
| $\theta_1$ | $a$ | $a$ | $M\cdot a$ |
| $\theta_2$ | $10a$ | $10a$ | $0.9M\cdot a$ |

Under FedAvg, $\theta_1$ is preferred (Agent 3's huge utility dominates the average), but this is not fair to Agents 1 and 2: they have an incentive to collude and deviate to $\theta_2$.

Robustness to agents with poor local data quality:

❗Scaling the utility of any single agent does not change the core-stable allocation.
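This invariance is easy to confirm with the helper above: the scaling constant cancels on both sides of $\frac{|S|}{n}u_i(\theta')\ge u_i(\theta)$.

```python
# Scale agent 1's utilities in the counterexample by an arbitrary c > 0;
# the core-stability verdict is unchanged because c cancels in the
# blocking condition.
c = 1000.0
U_scaled = [[c * row[0], row[1]] for row in U]
print(is_core_stable(U_scaled, 0, 2))  # still True
```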

❗Some Ideas around Sum-of-ratio Property

Is there a scenario in which none of the predictors is core-stable?

Yes. The example

| | agent 1 | agent 2 |
|---|---|---|
| $\theta$ | $1$ | $2+\epsilon$ |
| $\theta'$ | $2+\epsilon$ | $1$ |

shows that neither $\theta$ nor $\theta'$ is core-stable: agent 1 alone blocks $\theta$ via $\theta'$, since $\frac{1}{2}(2+\epsilon)>1$, and symmetrically agent 2 alone blocks $\theta'$.
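Checking this with the same brute-force helper from above:

```python
eps = 1e-6
U2 = [[1.0, 2 + eps],  # theta
      [2 + eps, 1.0]]  # theta'
# Agent 1 alone blocks theta via theta' since (1/2)(2 + eps) > 1,
# and symmetrically agent 2 alone blocks theta'.
print([is_core_stable(U2, p, 2) for p in range(2)])  # [False, False]
```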

Some New Definitions

Ideas from Iannis. I was amazed by his ability to invent some new notions 😮

📖Definition of pessimistic price of core-stability:

$$PoCS_{\text{pessimistic}}=\max_{\theta\text{ is core-stable}}\ \max_{\theta'\in P}\sum_{i\in[n]}\frac{u_i(\theta')}{u_i(\theta)}$$

📖Definition of optimistic price of core-stability:

$$PoCS_{\text{optimistic}}=\min_{\theta\text{ is core-stable}}\ \max_{\theta'\in P}\sum_{i\in[n]}\frac{u_i(\theta')}{u_i(\theta)}$$
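For a finite predictor set, both quantities can be computed directly. A sketch reusing `is_core_stable` from above (the helper names are mine):

```python
def max_sum_of_ratios(U, base, n):
    """max over theta' in P of sum_i u_i(theta') / u_i(theta_base)."""
    return max(sum(U[q][i] / U[base][i] for i in range(n))
               for q in range(len(U)))

def pocs(U, n):
    """Return (pessimistic, optimistic) PoCS, or None if no core-stable predictor."""
    core = [p for p in range(len(U)) if is_core_stable(U, p, n)]
    if not core:
        return None
    vals = [max_sum_of_ratios(U, p, n) for p in core]
    return max(vals), min(vals)
```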

Conjecture about Price of Core-Stability (Has been disproved)

If the set of core-stable predictors is non-empty, is the minimum of $\max_{\theta'\in P}\sum_{i\in[n]}\frac{u_i(\theta')}{u_i(\theta)}$ over all $\theta\in P$ always attained by a core-stable predictor?

We want to prove or disprove this conjecture.

Let’s consider the scenario with two agents.

| agents \ predictors | $\theta$ | $\theta''$ |
|---|---|---|
| 1 | $1$ | ? |
| 2 | $1$ | ? |

Assume that $\theta$ is core-stable. The question is: does there exist a predictor $\theta''$ that minimizes $\max_{\theta'\in P}\sum_{i\in[n]}\frac{u_i(\theta')}{u_i(\theta'')}$ yet is not core-stable? If the answer is yes, then we have disproved the conjecture.

Disprove it

Consider the scenario with 2 agents and 3 feasible predictors:

| agents \ predictors | $\theta_1$ | $\theta_2$ | $\theta_3$ |
|---|---|---|---|
| 1 | $1$ | $1-\epsilon$ | $2$ |
| 2 | $1$ | $2$ | $1-\epsilon$ |

By symmetry, $\theta_2$ and $\theta_3$ are equivalent.

Check whether $\theta_1$ is core-stable

  • Consider $S=\{1\}$:

    $$\frac{1}{2}u_1(\theta_2)=\frac{1-\epsilon}{2}<1=u_1(\theta_1)\\ \frac{1}{2}u_1(\theta_3)=\frac{2}{2}=1\le1=u_1(\theta_1)$$

  • Consider $S=\{2\}$:

    $$\frac{1}{2}u_2(\theta_3)=\frac{1-\epsilon}{2}<1=u_2(\theta_1)\\ \frac{1}{2}u_2(\theta_2)=\frac{2}{2}=1\le1=u_2(\theta_1)$$

  • Consider $S=\{1,2\}$:

    $$\frac{2}{2}u_1(\theta_2)=1-\epsilon<u_1(\theta_1)=1,\quad\frac{2}{2}u_2(\theta_2)=2>u_2(\theta_1)=1\\ \frac{2}{2}u_1(\theta_3)=2>u_1(\theta_1)=1,\quad\frac{2}{2}u_2(\theta_3)=1-\epsilon<u_2(\theta_1)=1$$

No deviation makes every member of some subset weakly better off with at least one strictly better off (the singleton deviations via $\theta_3$ and $\theta_2$ only achieve ties). Thus, $\theta_1$ is core-stable.

Check whether $\theta_2,\theta_3$ are core-stable

It suffices to check only $\theta_2$.

  • Consider $S=\{1\}$:

    $$\frac{1}{2}u_1(\theta_1)=\frac{1}{2}<u_1(\theta_2)=1-\epsilon\\ \frac{1}{2}u_1(\theta_3)=1>u_1(\theta_2)=1-\epsilon$$

    Here the scaled utility under $\theta_3$ is strictly better than agent 1's utility under $\theta_2$, so $S=\{1\}$ blocks $\theta_2$ via $\theta_3$.

Thus, $\theta_2,\theta_3$ are not core-stable.

Check the Sum of Ratios

For $\theta_1$:

$$\sum_{i\in\{1,2\}}\frac{u_i(\theta_2)}{u_i(\theta_1)}=3-\epsilon\\ \sum_{i\in\{1,2\}}\frac{u_i(\theta_3)}{u_i(\theta_1)}=3-\epsilon\\ \therefore\max_{\theta'\neq\theta_1}\sum_{i\in\{1,2\}}\frac{u_i(\theta')}{u_i(\theta_1)}\approx3$$

For $\theta_2$:

$$\sum_{i\in\{1,2\}}\frac{u_i(\theta_1)}{u_i(\theta_2)}=\frac{1}{2}+\frac{1}{1-\epsilon}\approx\frac{3}{2}\\ \sum_{i\in\{1,2\}}\frac{u_i(\theta_3)}{u_i(\theta_2)}=\frac{2}{1-\epsilon}+\frac{1-\epsilon}{2}\approx\frac{5}{2}\\ \therefore\max_{\theta'\neq\theta_2}\sum_{i\in\{1,2\}}\frac{u_i(\theta')}{u_i(\theta_2)}\approx\frac{5}{2}$$

By symmetry, $\max_{\theta'\neq\theta_3}\sum_{i\in\{1,2\}}\frac{u_i(\theta')}{u_i(\theta_3)}\approx\frac{5}{2}$.

The max sum-of-ratios is not minimized by the only core-stable predictor: $\theta_1$ gives $\approx3$, while the non-core-stable $\theta_2,\theta_3$ give $\approx\frac{5}{2}$. This disproves the conjecture.
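The same numbers fall out of the helpers defined earlier:

```python
eps = 1e-6
U3 = [[1.0, 1.0],      # theta_1
      [1 - eps, 2.0],  # theta_2
      [2.0, 1 - eps]]  # theta_3
print([is_core_stable(U3, p, 2) for p in range(3)])     # [True, False, False]
print([max_sum_of_ratios(U3, p, 2) for p in range(3)])  # [~3.0, ~2.5, ~2.5]
# The minimum over all predictors is attained at theta_2/theta_3, which are
# not core-stable; the only core-stable predictor theta_1 gives ~3.
```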

Core-Stability in FL

A core-stable predictor exists (and CoreFed finds it) in linear regression and logistic regression (classification).

An approximately core-stable predictor exists in deep neural networks.

Existence

🤔A core-stable predictor exists in FL if:

  1. The utility functions are continuous.

  2. Every conical combination of the utility functions has a convex set of maximizers:

    $\forall\langle\alpha_1,\cdots,\alpha_n\rangle\in\mathbb R_{\ge0}^n$, the set $C=\arg\max_{\theta\in P}\sum_{i\in[n]}\alpha_iu_i(\theta)$ is convex.

🤔Important theorem (Kakutani's Fixed Point Theorem): a set-valued function $\phi:D\to2^D$ admits a fixed point, i.e., $\exists d\in D$ s.t. $d\in\phi(d)$, if

  1. $D$ is non-empty, compact, and convex.

Compact: the set is closed (it contains all its boundary points) and bounded (it can be covered by a “sphere” of some finite radius).

Convex: a set is convex if, for any two points $x$ and $y$ in the set, the line segment connecting $x$ and $y$ also lies entirely within the set. Mathematically, for any $x,y\in S$ and any $t\in[0,1]$, the point $tx+(1-t)y\in S$.

  2. $\forall d\in D$, $\phi(d)$ is non-empty, compact, and convex.
  3. $\phi(\cdot)$ has a closed graph, i.e., for all sequences $(d_i)_{i\in\mathbb N}$ converging to $d^*$ and $(e_i)_{i\in\mathbb N}$ converging to $e^*$ with $d_i\in D$ and $e_i\in\phi(d_i)$, we have $e^*\in\phi(d^*)$.

Define Core-Stable Feasible Predictors

$\forall\theta\in P$, define $\phi(\theta)=\{\arg\max_{d\in P}\sum_{i\in[n]}\frac{u_i(d)}{u_i(\theta)}\}$.

Lemma 4.1: Let $\theta\in P$ be s.t. $\theta\in\phi(\theta)$ (i.e., $\theta$ is a fixed point). Then $\theta$ is a core-stable predictor.

Proof by contradiction: if some coalition $S$ blocked $\theta$ via $\theta'$, then $\frac{u_i(\theta')}{u_i(\theta)}\ge\frac{n}{|S|}$ for all $i\in S$ with at least one strict inequality, so $\sum_{i\in[n]}\frac{u_i(\theta')}{u_i(\theta)}>n=\sum_{i\in[n]}\frac{u_i(\theta)}{u_i(\theta)}$, contradicting $\theta\in\phi(\theta)$.

By Kakutani's theorem, it suffices to show that $\phi(\cdot)$ has a closed graph to ensure that it admits a fixed point.

Lemma 4.2: $\phi(\cdot)$ has a closed graph.

Hard proof.

🤔Theorem 1. In FL, where utilities are continuous and any conical combination of utilities is convex, a core-stable predictor exists.

⚠️If we change $d\in P$ to $d\in\mathcal B(\theta,r)$ and the two conditions still hold within a radius $r$ of every point, then the resulting predictor is locally core-stable. The conditions tend to hold in DNNs for small $r$.

Computation of Core-Stable Predictor

There is an efficient way to compute a core-stable predictor.

It is the optimum of a convex program:

$$\text{maximize }\ \mathcal L(\theta)=\sum_{i\in[n]}\log(u_i(\theta))\ \text{ or }\ \prod_{i\in[n]}u_i(\theta)\\ \text{subject to }\theta\in P$$

The objective reminds me of Nash Social Welfare 😎

The utility of each agent is concave, so the program above is a concave maximization subject to a convex constraint, i.e., a convex program.
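As a concrete illustration, here is a minimal gradient-ascent sketch of the $\log$-form program with synthetic linear-regression utilities $u_i(\theta)=M-\mathrm{MSE}_i(\theta)$; the data, the bound $M$, the step size, and the iteration count are all my own assumptions, and $P$ is taken to be all of $\mathbb R^d$ so no projection is needed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, d = 3, 5
X = [rng.normal(size=(50, d)) for _ in range(n_agents)]
w = [rng.normal(size=d) for _ in range(n_agents)]  # heterogeneous agents
y = [X[i] @ w[i] for i in range(n_agents)]
M = 100.0  # assumed loss bound, large enough to keep every u_i positive

def utility_and_grad(theta, i):
    """u_i(theta) = M - MSE_i(theta) and its gradient."""
    r = X[i] @ theta - y[i]
    return M - np.mean(r ** 2), -2 * X[i].T @ r / len(r)

theta = np.zeros(d)
for _ in range(2000):
    # gradient of sum_i log u_i(theta) is sum_i (grad u_i) / u_i
    g = np.zeros(d)
    for i in range(n_agents):
        u, gu = utility_and_grad(theta, i)
        g += gu / u
    theta += 0.5 * g

# utilities at the (approximate) Nash-welfare solution
print([utility_and_grad(theta, i)[0] for i in range(n_agents)])
```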

🤔Theorem 2. If $u_i$ is concave for all $i$, then any predictor $\theta^*$ that maximizes the convex program is core-stable.

Proof by contradiction. Since $\theta^*$ maximizes the concave objective over $P$, first-order optimality (together with concavity of each $u_i$) gives, for every $\theta'\in P$,

$$\sum_{i\in[n]}\frac{u_i(\theta')}{u_i(\theta^*)}\le n.$$

If $\theta^*$ were not core-stable, there would exist $\theta'$ and a subset $S$ s.t.

$$u_i(\theta')\ge\frac{n}{|S|}u_i(\theta^*)$$

for all $i\in S$, with at least one strict inequality.

Thus,

$$\sum_{i\in[n]}\frac{u_i(\theta')}{u_i(\theta^*)}\ge\sum_{i\in S}\frac{u_i(\theta')}{u_i(\theta^*)}>\frac{n}{|S|}\cdot|S|=n,$$

a contradiction.

$\theta^*$ must be Pareto optimal.

CoreFed

Stochastic Gradient Descent.

$n$ finite samples $\{(x_i,y_i)\}_{i\in[n]}$ drawn from the data distribution $\mathcal P$ constitute the empirical distribution $\hat{\mathcal P}_n$. The gradient can be expressed as a conical combination of the gradients of each group:

$$\nabla_\theta\mathcal L(\theta)=\sum_{i\in[n]}\frac{\nabla_\theta u_i(\theta)}{u_i(\theta)}=\sum_{i\in[n]}\frac{-\nabla_\theta\mathbb E_{(x,y)\sim\hat{\mathcal P}_n^{(i)}}\ell(f_\theta(x),y)}{M_i-\mathbb E_{(x,y)\sim\hat{\mathcal P}_n^{(i)}}\ell(f_\theta(x),y)}$$

For each group, reweight its gradient by $(M_i-\mathbb E_{(x,y)\sim\hat{\mathcal P}_n^{(i)}}\ell(f_\theta(x),y))^{-1}$ and then sum up to get the final weight update. This is the base protocol.

In CoreFed the model weights are directly averaged and aggregated at each iteration.

Pseudocode of CoreFed

Input: number of clients $K$, number of rounds $T$, epochs $E$, learning rate $\eta$.

Output: model weights $\theta^T$

  1. For $t\gets0$ to $T-1$ do

    • Server selects a subset of $K$ devices $S_t$

    • Server sends weights $\theta^t$ to all selected devices

    • Each selected device $s\in S_t$ updates $\theta^t$ for $E$ epochs of SGD with learning rate $\eta$ to obtain new weights $\bar\theta_s^t$

    • Each selected device $s\in S_t$ computes

      $$\Delta\theta_s^t=\bar\theta_s^t-\theta^t,\quad\mathcal L_s^t=\frac{1}{|\mathcal D_s|}\sum_{i=1}^{|\mathcal D_s|}\ell(f_{\theta^t}(x_s^{(i)}),y_s^{(i)})$$

      where $\mathcal D_s=\{(x_s^{(i)},y_s^{(i)}):i\in[|\mathcal D_s|]\}$ is the training dataset on device $s$

    • Each selected device $s\in S_t$ sends $\Delta\theta_s^t$ and $\mathcal L_s^t$ back to the server

    • Server updates $\theta^{t+1}$ following

      $$\theta^{t+1}\gets\theta^t+\frac{1}{|S_t|}\sum_{s\in S_t}\frac{\Delta\theta_s^t}{M_s-\mathcal L_s^t}.$$

end
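A compact Python sketch of one round of the loop above; `client.local_sgd` and `client.eval_loss` are hypothetical stand-ins for the local-training and loss-evaluation steps, and `M[s]` is each client's assumed loss upper bound:

```python
import numpy as np

def corefed_round(theta, clients, M, epochs, lr):
    """One server round: collect local updates, reweight each by 1/(M_s - L_s^t)."""
    deltas, losses = [], []
    for client in clients:
        theta_bar = client.local_sgd(theta.copy(), epochs, lr)  # hypothetical API
        deltas.append(theta_bar - theta)                        # Delta theta_s^t
        losses.append(client.eval_loss(theta))                  # L_s^t at theta^t
    weighted = [delta / (M[s] - losses[s]) for s, delta in enumerate(deltas)]
    return theta + np.mean(weighted, axis=0)
```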

Core-Stability in Linear Regression and Classification with Logistic Regression

The utility function is concave:

$u_i$ is concave $\Leftrightarrow$ $\ell$ is convex.

$n$ finite samples $\{(x_i,y_i)\}_{i\in[n]}$ drawn from the data distribution $\mathcal P$ constitute the empirical distribution $\hat{\mathcal P}_n$.

Linear Regression

$f_\theta(x)=\theta^\top x$.

The regression loss $\mathbb E_{(x,y)\sim\hat{\mathcal P}_n^{(i)}}\ell(f_\theta(x),y)=\mathbb E_{(x,y)\sim\hat{\mathcal P}_n^{(i)}}(\theta^\top x-y)^2$ is convex in $\theta$.

🤔Lemma: CoreFed determines a core-stable predictor in a federated learning setting training linear regression.

Classification with Logistic Regression

Given a classifier $\theta$ and a scalar $c\in\mathbb R$, agent $i$'s loss is $\ell_i(\theta,c)=\frac{\|\theta\|_2}{2}+\alpha\cdot\sum_{j}\log(\exp(-y_j(\theta^\top x_j+c))+1)$, where the sum runs over agent $i$'s samples $(x_j,y_j)$. $\ell_i$ is convex.

Thus, $u_i(\theta,c)=M_i-\ell_i(\theta,c)$, where $M_i=\max_{\theta\in P,c\in\mathbb R}\ell_i(\theta,c)$, is concave.

🤔Lemma: CoreFed determines a core-stable predictor in a federated learning setting training classification with logistic regression.
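For concreteness, a small sketch of this loss/utility pair; $\alpha$, the data, and $M_i$ are placeholders here.

```python
import numpy as np

def logistic_loss(theta, c, X, y, alpha):
    """l_i(theta, c) = ||theta||_2 / 2 + alpha * sum_j log(1 + exp(-y_j (theta^T x_j + c)))."""
    margins = y * (X @ theta + c)
    return np.linalg.norm(theta) / 2 + alpha * np.sum(np.log1p(np.exp(-margins)))

def logistic_utility(theta, c, X, y, alpha, M_i):
    # Concave in (theta, c) because the loss is convex.
    return M_i - logistic_loss(theta, c, X, y, alpha)
```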

Approximate Core-Stability in DNN

Skip them!