I will begin a new project this fall (24 Fall): FAIRNESS IN AI/ML THROUGH THE LENS OF SOCIAL CHOICE
The basic outline follows the tutorial Fairness in AI/ML via Social Choice by Evi Micha & Nisarg Shah at EC 2024.
The first paper, Fairness in Federated Learning via Core-Stability, I already read last semester; it was one of the possible topics.
Brief notes are here.
I will try to understand some of the mathematical proofs this semester.
Core-Stability
Revisit the definition of core-stability: A predictor $\theta \in P$ is called core-stable if there exist no other $\theta' \in P$ and no subset $S$ of agents such that $\frac{|S|}{n} u_i(\theta') \geq u_i(\theta)$ for all $i \in S$, with at least one strict inequality.
⚠️⚠️⚠️What is WRONG with the claim
The claim: setting $S=[n]$ in the definition, there is no predictor $\theta' \in P$ with $\sum_{i\in[n]} \frac{u_i(\theta')}{u_i(\theta)} > n$.
According to the definition, when we plug in $S=[n]$, there exists no $\theta' \in P$ s.t.
$$u_i(\theta') \geq u_i(\theta)\ \forall i \in S \iff \frac{u_i(\theta')}{u_i(\theta)} \geq 1\ \forall i \in S$$
with at least one strict inequality.
The problem is: can we pass from "no $\theta'$ with every ratio $\geq 1$ (and one strict)" to "no $\theta'$ with $\sum_i \frac{u_i(\theta')}{u_i(\theta)} > n$" just by summing the inequalities?
There is a counterexample:
$n=2$, $\theta$ is core-stable, and there are only two feasible predictors $\theta, \theta'$ with $\theta' \neq \theta$; take $\epsilon \to 0^+$.
|  | agent 1 | agent 2 |
|---|---|---|
| $\theta$ | 1 | 1 |
| $\theta'$ | $2-\epsilon$ | $1-\epsilon$ |
Verifying the core-stability of θ:
- For $S=\{1\}$, we have $\frac{|S|}{n}u_1(\theta') = \frac{1}{2}u_1(\theta') = 1-\frac{\epsilon}{2} < 1 = u_1(\theta)$, so this coalition cannot block.
- For $S=\{2\}$, we have $\frac{1}{2}u_2(\theta') = \frac{1-\epsilon}{2} < 1 = u_2(\theta)$, so this coalition cannot block either.
- For $S=\{1,2\}$, we have $u_1(\theta') = 2-\epsilon > 1 = u_1(\theta)$ but $u_2(\theta') = 1-\epsilon < 1 = u_2(\theta)$, so not every scaled utility weakly improves and this coalition cannot block.
But $\sum_i \frac{u_i(\theta')}{u_i(\theta)} = \frac{u_1(\theta')}{u_1(\theta)} + \frac{u_2(\theta')}{u_2(\theta)} = 3-2\epsilon > 2$! So the claim does not follow from core-stability.
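To make the definition concrete, here is a small brute-force checker for finite examples like this one (my own sketch; `utilities` maps each predictor name to its per-agent utility list, and all names are hypothetical):

```python
from itertools import combinations

def is_core_stable(utilities, theta, n):
    """Return True if predictor `theta` is core-stable: no alternative
    predictor and coalition S satisfy (|S|/n) * u_i(alt) >= u_i(theta)
    for all i in S with at least one strict inequality."""
    for alt, u_alt in utilities.items():
        if alt == theta:
            continue
        for size in range(1, n + 1):
            for S in combinations(range(n), size):
                scaled = [size / n * u_alt[i] for i in S]
                base = [utilities[theta][i] for i in S]
                if all(s >= b for s, b in zip(scaled, base)) and \
                   any(s > b for s, b in zip(scaled, base)):
                    return False  # coalition S blocks theta via alt
    return True

eps = 1e-3
U = {"theta": [1.0, 1.0], "theta_prime": [2 - eps, 1 - eps]}
print(is_core_stable(U, "theta", n=2))                              # True: theta is core-stable
print(sum(U["theta_prime"][i] / U["theta"][i] for i in range(2)))   # 3 - 2*eps, which exceeds n = 2
```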
The scaling factor $|S|/n$ suggests a kind of proportionality: core-stability means no coalition of agents can obtain a large (proportionally scaled) benefit by deviating together.
Core means “no deviating subgroup”
Core stability is a very strong requirement; core-stable solutions exist only in special cases.
Advantage
Fairer than FedAvg. In the table below, $M$ is a very large integer.
|  | Agent 1 | Agent 2 | Agent 3 |
|---|---|---|---|
| $\theta_1$ | $a$ | $a$ | $M \cdot a$ |
| $\theta_2$ | $10a$ | $10a$ | $0.9M \cdot a$ |
Under FedAvg (which favors the higher total utility), $\theta_1$ is preferred because Agent 3's huge utility dominates, but this is unfair to Agents 1 and 2: they have an incentive to collude and deviate toward $\theta_2$.
Robustness to poor local data quality of agents.
❗Scaling the utility of any single agent does not change which predictors are core-stable.
❗Some Ideas around Sum-of-ratio Property
Is there a scenario in which no predictor is core-stable?
Yes. The example
|  | agent 1 | agent 2 |
|---|---|---|
| $\theta$ | 1 | $2+\epsilon$ |
| $\theta'$ | $2+\epsilon$ | 1 |
shows that neither $\theta$ nor $\theta'$ is core-stable: coalition $S=\{1\}$ blocks $\theta$ by deviating to $\theta'$, since $\frac{1}{2}(2+\epsilon) = 1 + \frac{\epsilon}{2} > 1 = u_1(\theta)$; symmetrically, $S=\{2\}$ blocks $\theta'$.
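Using the hypothetical `is_core_stable` helper sketched earlier, this can be checked directly:

```python
eps = 1e-3
U = {"theta": [1.0, 2 + eps], "theta_prime": [2 + eps, 1.0]}
print(is_core_stable(U, "theta", n=2), is_core_stable(U, "theta_prime", n=2))  # False False
```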
Some New Definitions
Ideas from Iannis. I was amazed by his ability to invent some new notions 😮
📖Definition of the pessimistic price of core-stability:
$$\mathrm{PoCS}_{\text{pessimistic}} = \max_{\theta \text{ core-stable}} \; \max_{\theta' \in P} \; \sum_{i\in[n]} \frac{u_i(\theta')}{u_i(\theta)}$$
📖Definition of the optimistic price of core-stability:
$$\mathrm{PoCS}_{\text{optimistic}} = \min_{\theta \text{ core-stable}} \; \max_{\theta' \in P} \; \sum_{i\in[n]} \frac{u_i(\theta')}{u_i(\theta)}$$
Conjecture about Price of Core-Stability (Has been disproved)
If the set of core-stable predictors is non-empty, does some core-stable predictor attain the global minimum of $\max_{\theta' \in P} \sum_{i\in[n]} \frac{u_i(\theta')}{u_i(\theta)}$ over all $\theta \in P$? In other words, is the optimistic price of core-stability equal to the minimum of this quantity over all feasible predictors?
We want to prove or disprove this conjecture.
Let’s consider the scenario with two agents.
| agents \ predictors | $\theta$ | … | $\theta''$ | … |
|---|---|---|---|---|
| 1 | 1 | … | ? | … |
| 2 | 1 | … | ? | … |
Assume that $\theta$ is core-stable. The question is: does there exist a predictor $\theta''$ that is not core-stable and minimizes $\max_{\theta' \in P} \sum_{i\in[n]} \frac{u_i(\theta')}{u_i(\theta'')}$, while every core-stable predictor attains a strictly larger value? If so, the conjecture is disproved.
Disprove it
Consider the scenario with 2 agents and 3 feasible predictors:
| agents \ predictors | $\theta_1$ | $\theta_2$ | $\theta_3$ |
|---|---|---|---|
| 1 | 1 | $1-\epsilon$ | 2 |
| 2 | 1 | 2 | $1-\epsilon$ |
By symmetry, θ2 and θ3 are equivalent.
Check whether θ1 is core-stable
Consider S={1}:
$$\frac{1}{2}u_1(\theta_2) = \frac{1-\epsilon}{2} < 1 = u_1(\theta_1), \qquad \frac{1}{2}u_1(\theta_3) = \frac{2}{2} = 1 \leq 1 = u_1(\theta_1)$$
Consider S={2}:
$$\frac{1}{2}u_2(\theta_3) = \frac{1-\epsilon}{2} < 1 = u_2(\theta_1), \qquad \frac{1}{2}u_2(\theta_2) = \frac{2}{2} = 1 \leq 1 = u_2(\theta_1)$$
Consider S={1,2}:
$$\frac{2}{2}u_1(\theta_2) = 1-\epsilon < u_1(\theta_1) = 1, \quad \frac{2}{2}u_2(\theta_2) = 2 > u_2(\theta_1) = 1; \qquad \frac{2}{2}u_1(\theta_3) = 2 > u_1(\theta_1) = 1, \quad \frac{2}{2}u_2(\theta_3) = 1-\epsilon < u_2(\theta_1) = 1$$
No coalition and alternative predictor make every scaled utility at least the original utility with at least one strict inequality. Thus $\theta_1$ is core-stable.
Check whether θ2,θ3 are core-stable
It suffices to check only $\theta_2$. Coalition $S=\{1\}$ blocks $\theta_2$ by deviating to $\theta_3$: $\frac{1}{2}u_1(\theta_3) = 1 > 1-\epsilon = u_1(\theta_2)$.
Thus $\theta_2, \theta_3$ are not core-stable.
Check the Sum of Ratio
For θ1:
$$\sum_{i\in\{1,2\}} \frac{u_i(\theta_2)}{u_i(\theta_1)} = 3-\epsilon, \qquad \sum_{i\in\{1,2\}} \frac{u_i(\theta_3)}{u_i(\theta_1)} = 3-\epsilon, \qquad \therefore \max_{\theta' \neq \theta_1} \sum_{i\in\{1,2\}} \frac{u_i(\theta')}{u_i(\theta_1)} \approx 3$$
For θ2:
$$\sum_{i\in\{1,2\}} \frac{u_i(\theta_1)}{u_i(\theta_2)} = \frac{1}{1-\epsilon} + \frac{1}{2} \approx \frac{3}{2}, \qquad \sum_{i\in\{1,2\}} \frac{u_i(\theta_3)}{u_i(\theta_2)} = \frac{2}{1-\epsilon} + \frac{1-\epsilon}{2} \approx \frac{5}{2}, \qquad \therefore \max_{\theta' \neq \theta_2} \sum_{i\in\{1,2\}} \frac{u_i(\theta')}{u_i(\theta_2)} \approx \frac{5}{2}$$
By symmetry, $\max_{\theta' \neq \theta_3} \sum_{i\in\{1,2\}} \frac{u_i(\theta')}{u_i(\theta_3)} \approx \frac{5}{2}$.
The max-of-sum is therefore not minimized by the only core-stable predictor: $\theta_1$ attains $\approx 3$ while the non-core-stable $\theta_2, \theta_3$ attain $\approx \frac{5}{2}$. This disproves the conjecture.
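The whole disproof can also be checked numerically with the same hypothetical brute-force helper from above:

```python
eps = 1e-3
U = {"theta1": [1.0, 1.0], "theta2": [1 - eps, 2.0], "theta3": [2.0, 1 - eps]}
for t in U:
    stable = is_core_stable(U, t, n=2)
    worst = max(sum(U[a][i] / U[t][i] for i in range(2)) for a in U if a != t)
    print(t, stable, worst)
# theta1 is the only core-stable predictor, yet its max-of-sum (~3) is larger
# than that of the non-core-stable theta2 / theta3 (~5/2).
```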
Core-Stability in FL
A core-stable predictor (computed by CoreFed) exists for linear regression and for classification with logistic regression.
An approximately core-stable predictor exists for deep neural networks.
Existence
🤔Core-stability exists in FL if:
The utility functions are continuous.
Every conical combination of the utility functions attains its maximum over $P$ on a convex set:
$\forall \langle \alpha_1, \cdots, \alpha_n \rangle \in \mathbb{R}_{\geq 0}^n$, the set $C = \arg\max_{\theta \in P} \sum_{i\in[n]} \alpha_i u_i(\theta)$ is convex.
🤔Important theorem (Kakutani's Fixed Point Theorem): A set-valued function $\phi: D \to 2^D$ admits a fixed point, i.e. $\exists d \in D$ s.t. $d \in \phi(d)$, if
- $D$ is non-empty, compact, and convex.
Compact: the set is closed (it contains all its boundary points) and bounded (it is contained in a ball of some finite radius).
Convex: a set $S$ is convex if, for any two points $x$ and $y$ in the set, the line segment connecting $x$ and $y$ also lies entirely within the set. Mathematically, for any $x, y \in S$ and any $t \in [0,1]$, the point $tx + (1-t)y \in S$.
- $\forall d \in D$, $\phi(d)$ is non-empty, compact, and convex.
- $\phi(\cdot)$ has a closed graph, i.e., for all sequences $(d_i)_{i\in\mathbb{N}}$ converging to $d^*$ and $(e_i)_{i\in\mathbb{N}}$ converging to $e^*$ with $d_i \in D$ and $e_i \in \phi(d_i)$, we have $e^* \in \phi(d^*)$.
Define Core-Stable Feasible Predictors
$\forall \theta \in P$, $\phi(\theta) = \arg\max_{d \in P} \sum_{i\in[n]} \frac{u_i(d)}{u_i(\theta)}$.
Lemma 4.1: Let θ∈P be s.t. θ∈ϕ(θ) (It is a fixed point). Then θ is a core-stable predictor.
Proof by contradiction: if $\theta$ were not core-stable, a blocking coalition $S$ and predictor $\theta'$ would give $\sum_{i\in[n]} \frac{u_i(\theta')}{u_i(\theta)} \geq \sum_{i\in S} \frac{u_i(\theta')}{u_i(\theta)} > \frac{n}{|S|} \cdot |S| = n = \sum_{i\in[n]} \frac{u_i(\theta)}{u_i(\theta)}$, contradicting $\theta \in \phi(\theta)$.
By Takutani’s Theorem, it suffices to show that ϕ(⋅) has a closed graph, to ensure that it admits a fixed point.
Lemma 4.2: ϕ(⋅) has a closed graph.
Hard proof.
🤔Theorem 1. In FL, where the utilities are continuous and every conical combination of the utilities attains its maximum over $P$ on a convex set, a core-stable predictor exists.
⚠️If we change $d \in P$ to $d \in B(\theta, r)$ and the two conditions above still hold within a radius $r$ of every point, then the fixed point is a locally core-stable predictor. These conditions tend to hold in DNNs for small $r$.
Computation of Core-Stable Predictor
There is an efficient way to compute a core-stable predictor.
It is the optimum of a convex program:
$$\text{maximize } L(\theta) = \sum_{i\in[n]} \log(u_i(\theta)) \ \Big(\text{equivalently, } \prod_{i\in[n]} u_i(\theta)\Big) \quad \text{subject to } \theta \in P$$
The objective reminds me of Nash Social Welfare 😎
Each agent's utility is concave, so $\log(u_i(\theta))$ is concave and the program above is a concave maximization subject to a convex constraint, i.e., a convex program.
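As a sanity check, the program can be written down directly with a convex-optimization library. Below is a minimal sketch using cvxpy, with a toy linear-regression setting and made-up values for $M_i$ and the feasible set $P$ (these are illustrative assumptions, not the paper's setup):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
d = 5
# Hypothetical toy data: 3 agents, each with its own small regression dataset.
agents = [(rng.normal(size=(20, d)), rng.normal(size=20)) for _ in range(3)]

theta = cp.Variable(d)
utilities = []
for X, y in agents:
    loss = cp.sum_squares(X @ theta - y) / X.shape[0]
    M_i = 100.0  # assumed upper bound on this agent's loss over P, so u_i stays positive
    utilities.append(M_i - loss)  # u_i(theta) = M_i - empirical loss (concave in theta)

# Maximize sum_i log u_i(theta): the log-Nash-welfare objective.
objective = cp.Maximize(sum(cp.log(u) for u in utilities))
constraints = [cp.norm(theta, 2) <= 10.0]  # a convex stand-in for the feasible set P
problem = cp.Problem(objective, constraints)
problem.solve()
print("predictor maximizing the Nash welfare program:", theta.value)
```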
🤔Theorem 2. If $u_i$ is concave for all $i$, then any predictor $\theta^*$ that maximizes the convex program is core-stable.
Proof by contradiction. Since $\theta^*$ maximizes the concave objective over the convex set $P$, every $\theta' \in P$ satisfies
$$\sum_{i\in[n]} \frac{u_i(\theta')}{u_i(\theta^*)} \leq n.$$
If $\theta^*$ is not core-stable, there must exist $\theta' \in P$ and a subset $S$ s.t.
$$u_i(\theta') \geq \frac{n}{|S|} u_i(\theta^*)$$
for all $i \in S$, with at least one strict inequality. Then
$$\sum_{i\in[n]} \frac{u_i(\theta')}{u_i(\theta^*)} \geq \sum_{i\in S} \frac{u_i(\theta')}{u_i(\theta^*)} > \frac{n}{|S|} \cdot |S| = n,$$
admitting a contradiction.
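For the first inequality, here is a short derivation (my own sketch, assuming each $u_i$ is differentiable; the paper's argument may differ). Since $P$ is convex and $\theta^*$ maximizes $L(\theta) = \sum_i \log u_i(\theta)$ over $P$, for any $\theta' \in P$:
$$0 \geq \lim_{t\to 0^+} \frac{L(\theta^* + t(\theta'-\theta^*)) - L(\theta^*)}{t} = \sum_{i\in[n]} \frac{\nabla u_i(\theta^*)^\top (\theta'-\theta^*)}{u_i(\theta^*)} \geq \sum_{i\in[n]} \frac{u_i(\theta') - u_i(\theta^*)}{u_i(\theta^*)} = \sum_{i\in[n]} \frac{u_i(\theta')}{u_i(\theta^*)} - n,$$
where the last inequality uses concavity of each $u_i$ (so $u_i(\theta') \leq u_i(\theta^*) + \nabla u_i(\theta^*)^\top (\theta'-\theta^*)$) and $u_i(\theta^*) > 0$. Rearranging gives $\sum_{i\in[n]} \frac{u_i(\theta')}{u_i(\theta^*)} \leq n$.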
In particular, $\theta^*$ must be Pareto optimal (apply the definition with $S = [n]$).
CoreFed
Stochastic Gradient Descent.
Each agent (group) $i$ holds finite samples drawn from its data distribution, which constitute its empirical distribution $\hat{P}_n^{(i)}$. The gradient of the objective can be expressed as a conical combination of the per-group gradients:
$$\nabla_\theta L(\theta) = \sum_{i\in[n]} \frac{\nabla_\theta u_i(\theta)}{u_i(\theta)} = \sum_{i\in[n]} \frac{-\nabla_\theta \mathbb{E}_{(x,y)\sim \hat{P}_n^{(i)}}\big[\ell(f_\theta(x), y)\big]}{M_i - \mathbb{E}_{(x,y)\sim \hat{P}_n^{(i)}}\big[\ell(f_\theta(x), y)\big]}$$
For each group, reweight its gradient by $\big(M_i - \mathbb{E}_{(x,y)\sim \hat{P}_n^{(i)}}[\ell(f_\theta(x), y)]\big)^{-1}$, i.e. by the inverse of its current utility, and then sum up to get the final weight update. This is the base protocol.
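A minimal sketch of one base-protocol step in Python (hypothetical names; the per-group loss gradients and losses are assumed to already be computed):

```python
import numpy as np

def base_protocol_step(theta, group_grads, group_losses, M, lr=0.1):
    """Reweight each group's loss gradient by 1 / (M_i - loss_i) = 1 / u_i(theta)
    and sum; this is the gradient of sum_i log u_i(theta), so take an ascent step."""
    direction = sum(-g / (M_i - L_i)
                    for g, L_i, M_i in zip(group_grads, group_losses, M))
    return theta + lr * direction

theta = np.zeros(3)
group_grads = [np.array([0.2, -0.1, 0.0]), np.array([-0.3, 0.1, 0.4])]  # grad of each group's loss
group_losses = [0.7, 1.1]   # current per-group losses
M = [5.0, 5.0]              # assumed loss upper bounds M_i
theta = base_protocol_step(theta, group_grads, group_losses, M)
print(theta)
```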
In CoreFed, instead of single gradient steps, each device's locally updated weights are used: their deltas are reweighted by the same factor, then averaged and aggregated at each round.
Pseudo Code of CoreFed
Input: number of clients $K$, number of rounds $T$, epochs $E$, learning rate $\eta$.
Output: model weights $\theta_T$
For $t \leftarrow 0$ to $T-1$ do
Server selects a subset $S_t$ of $K$ devices
Server sends weights $\theta_t$ to all selected devices
Each selected device $s \in S_t$ updates $\theta_t$ for $E$ epochs of SGD with learning rate $\eta$ to obtain new weights $\bar{\theta}_s^t$
Each selected device $s \in S_t$ computes
$$\Delta\theta_s^t = \bar{\theta}_s^t - \theta_t, \qquad L_s^t = \frac{1}{|D_s|} \sum_{i=1}^{|D_s|} \ell\big(f_{\theta_t}(x_s^{(i)}), y_s^{(i)}\big)$$
where $D_s = \{(x_s^{(i)}, y_s^{(i)}) : i \in [|D_s|]\}$ is the training dataset on device $s$
Each selected device $s \in S_t$ sends $\Delta\theta_s^t$ and $L_s^t$ back to the server
Server updates $\theta_{t+1}$ following
$$\theta_{t+1} \leftarrow \theta_t + \frac{1}{|S_t|} \sum_{s \in S_t} \frac{\Delta\theta_s^t}{M_s - L_s^t}$$
end
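A minimal sketch of the server-side CoreFed aggregation step (my own illustration with hypothetical names; local training is abstracted away):

```python
import numpy as np

def corefed_aggregate(theta_t, deltas, losses, M):
    """Server update: reweight each device's delta by 1 / (M_s - L_s^t),
    average over the selected devices, and add to the current weights."""
    update = sum(d / (M_s - L_s) for d, L_s, M_s in zip(deltas, losses, M)) / len(deltas)
    return theta_t + update

theta_t = np.zeros(4)
deltas = [np.random.randn(4) * 0.01 for _ in range(3)]  # Delta theta_s^t from local SGD
losses = [0.8, 0.5, 1.2]                                # L_s^t reported by each device
M = [10.0, 10.0, 10.0]                                  # assumed loss upper bounds M_s
theta_next = corefed_aggregate(theta_t, deltas, losses, M)
print(theta_next)
```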
Core-Stability in Linear Regression and Classification with Logistic Regression
Utility function is concave.
$u_i$ is concave in $\theta$ $\Leftrightarrow$ the expected loss $\mathbb{E}[\ell(f_\theta(x), y)]$ is convex in $\theta$ (since $u_i(\theta) = M_i - \mathbb{E}[\ell(f_\theta(x), y)]$).
As before, each agent $i$ holds finite samples drawn from its data distribution, constituting its empirical distribution $\hat{P}_n^{(i)}$.
Linear Regression
$f_\theta(x) = \theta^\top x$.
The regression loss $\mathbb{E}_{(x,y)\sim \hat{P}_n^{(i)}}[\ell(f_\theta(x), y)] = \mathbb{E}_{(x,y)\sim \hat{P}_n^{(i)}}[(\theta^\top x - y)^2]$ is convex in $\theta$, so $u_i$ is concave.
🤔Lemma: CoreFed determines a core-stable predictor in a federated learning setting training linear regression
Classification with Logistic Regression
Given a classifier $\theta$ and a scalar $c \in \mathbb{R}$, agent $i$'s loss is $\ell_i(\theta, c) = \frac{\|\theta\|^2}{2} + \alpha \cdot \sum_{(x_j, y_j) \in D_i} \log\big(\exp(-y_j(\theta^\top x_j + c)) + 1\big)$, which is convex in $(\theta, c)$.
Thus $u_i(\theta, c) = M_i - \ell_i(\theta, c)$, where $M_i = \max_{\theta \in P, c \in \mathbb{R}} \ell_i(\theta, c)$, is concave.
🤔Lemma: CoreFed determines a core-stable predictor in a federated learning setting training classification with logistic regression.
Approximate Core-Stability in DNN
Skip them!