硕士论文就跟 Non-Centroid Clustering 杠上了。我和导师都希望能

  • 探索 Split and Merge 算法的更多性质
  • 扩展 Split and Merge 算法到更一般的情况

仍然只考虑:

  1. binary distance

    the distance of each pair of data point is either 11 or 22.

  2. non-metric

    no restriction of triangle inequality

  3. asymmetric distance

    dis(a,b)dis(b,a)dis(a,b)\neq dis(b,a)

Equivalent Setting

Those three conditions is equivalent to APPROVAL VOTING.

  • Each data point is a voter.
  • Each voter approve a subset of other data points

Cost Definition

The maximum and average costs are therefore defined equivalently and accordingly.

The maximum cost:

cost(i)={1 if jCi,i doesn’t approve j0 otherwisecost(i)=\begin{cases} 1\text{ if }\exist j\in C_i,i\text{ doesn't approve }j\\ 0\text{ otherwise} \end{cases}

The minimum cost:

cost(i)=#jCi, s.t. i doesn’t approveCicost(i)=\frac{\#j\in C_i\text{, s.t. }i\text{ doesn't approve}}{|C_i|}

It’s not hard to prove that Split and Merge algorithm also computes a clustering in the core with BINARY, ASYMMETRIC, NON-METRIC “distance”.

On Average Cost

🎯Theorem (On Average Cost): Split and Merge algorithm computes a clustering in n/k\lceil n/k\rceil-core.

Assume there exist a subset VV of n/k\lceil n/k\rceil agents where each agent would improve more than multiplicative factor n/k\lceil n/k\rceil.

In Split and Merge algorithm, the average cost of data point ii is at most Ci1Ci\frac{|C_i|-1}{|C_i|}.

Therefore, for each subset of n/k\lceil n/k\rceil, there exist one agent does not improve its cost (from 11 to 00). (Consider maximum cost, Split and Merge computes the solution in the core, which means for each subset of n/k\lceil n/k\rceil, not all agent would be strictly better off.)

Thus, the average cost of this agent is at most 11. Possible improved average cost is at least 1n/k\frac{1}{\lceil n/k\rceil}, meaning that the improvement can not be greater than factor of n/k\lceil n/k\rceil.

Can we improve this n/k\lceil n/k\rceil factor? No idea… 🤔

Following Directions on Average Cost

From the paper, we learn that for symmetric, metric, arbitrary distance, the lower bound of core approximation factor is 1.31.3.

🤔But what if we consider binary distance remove symmetric and metric limitation? Could the core be non-empty in this case?

🤔What about graphs with certain structures? (scale-free, small-world networks…)