硕士论文就跟 Non-Centroid Clustering 杠上了。我和导师都希望能

  • 探索 Split and Merge 算法的更多性质
  • 扩展 Split and Merge 算法到更一般的情况

仍然只考虑:

  1. binary distance

    the distance of each pair of data point is either 11 or 22.

  2. non-metric

    no restriction of triangle inequality

  3. asymmetric distance

    dis(a,b)dis(b,a)dis(a,b)\neq dis(b,a)

Equivalent Setting

Those three conditions is equivalent to APPROVAL VOTING.

  • Each data point is a voter.
  • Each voter approve a subset of other data points

Cost Definition

The maximum and average costs are therefore defined equivalently and accordingly.

The maximum cost:

cost(i)={1 if jCi,i doesn’t approve j0 otherwisecost(i)=\begin{cases} 1\text{ if }\exist j\in C_i,i\text{ doesn't approve }j\\ 0\text{ otherwise} \end{cases}

The minimum cost:

cost(i)=#jCi, s.t. i doesn’t approveCicost(i)=\frac{\#j\in C_i\text{, s.t. }i\text{ doesn't approve}}{|C_i|}

It’s not hard to prove that Split and Merge algorithm also computes a clustering in the core with BINARY, ASYMMETRIC, NON-METRIC “distance”.

On Average Cost

🎯Theorem (On Average Cost): Split and Merge algorithm computes a clustering in n/k\lceil n/k\rceil-core.

Assume there exist a subset VV of n/k\lceil n/k\rceil agents where each agent would improve more than multiplicative factor n/k\lceil n/k\rceil.

In Split and Merge algorithm, the average cost of data point ii is at most Ci1Ci\frac{|C_i|-1}{|C_i|}.

Therefore, for each subset of n/k\lceil n/k\rceil, there exist one agent does not improve its cost (from 11 to 00). (Consider maximum cost, Split and Merge computes the solution in the core, which means for each subset of n/k\lceil n/k\rceil, not all agent would be strictly better off.)

Thus, the average cost of this agent is at most 11. Possible improved average cost is at least 1n/k\frac{1}{\lceil n/k\rceil}, meaning that the improvement can not be greater than factor of n/k\lceil n/k\rceil.

Can we improve this n/k\lceil n/k\rceil factor? No idea… 🤔

Breakthrough: A Variation of Split and Merge Computes Better Approximation w.r.t. Average Cost

Non-metric, asymmetric and binary distance.

Average cost function.

  1. Construct directed graph.

  2. Split phase: Do iteratively

    • If there exist n/k\lceil n/k\rceil-clique, put them in one cluster and discard them.

    • else if there exist such a cluster, put then in one cluster and discard them:

      The largest group of points s.t. no points is at distance 11 from n/k\sqrt{\lceil n/k\rceil} other points.

      // This step is as hard as MAXCLIQUE.

  3. Merge phase: Pick the smallest clusters, merge them in order to have at most kk clusters.

🎯Theorem (n/k\sqrt{\lceil n/k\rceil}-core w.r.t. average cost): This variation outputs a clustering solution in n/k\sqrt{\lceil n/k\rceil}-core.

Assume there exist a subset VV of n/k\lceil n/k\rceil agent improving more than n/k\sqrt{\lceil n/k\rceil}. In the variation, average cost of an agent ii is Ci1Ci<1\le\frac{|C_i|-1}{|C_i|}\lt1.

For each subset of n/k\lceil n/k\rceil agents, there exist one agent having average cost n/kn/k\le\frac{\sqrt{\lceil n/k\rceil}}{\lceil n/k\rceil}. (By the greedy property of second level of split phase)

The possible improved average cost is at least 1n/k\frac{1}{\lceil n/k\rceil}. Meaning that improvement can not be more than n/k\sqrt{\lceil n/k\rceil}, this leads to contradiction.

Following Directions on Average Cost

From the paper, we learn that for symmetric, metric, arbitrary distance, the lower bound of core approximation factor is 1.31.3.

🤔But what if we consider binary distance remove symmetric and metric limitation? Could the core be non-empty in this case?

🤔What about graphs with certain structures? (scale-free, small-world networks…)