Photo by

1. Hierarchical Clustering Approach

  • A typical clustering analysis approach via partitioning data set sequentially
  • Construct nested partitions layer by layer via grouping objects into a tree of clusters (without the need to know the number of clusters in advance)
  • Use (generalized) distance matrix as clustering criteria.

2. Strategies:Agglomerative Vs. Divisive

Two sequential clustering strategies for constructing a tree of clusters

  • Agglomerative: a bottom-up strategy
    • Initially each data object is in its own (atomic) cluster
    • Then merge these atomic clusters into larger and larger clusters
  • Divisive: a top-down strategy
    • Initially all objects are in one single cluster
    • Then the cluster is subdivided into smaller and smaller clusters

3. Hierarchical Clustering

The number of dendrograms with n: \(\text{leafs}=\frac{(2n-3)!}{[2^{(n-2)}(n-2)!]}\)

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

4. Compute distances between clusters

  • Single Link: smallest distance between an element in one cluster and an element in the other. e.g.,
  • Complete Link: largest distance between an element in one cluster and an element in the other, i.e.,
  • Average: avg distance between elements in one cluster and elements in the other, i.e.,

5. Summary

hierarchical algorithm is a sequential clustering algorithm

  • use distance matrix to construct a tree of clusters (dendrogram)
  • hierarchical representation without the need of knowing # of clusters

major weakness of agglomerative clustering methods

  • can never undo what was done previously
  • sensitive to cluster distance measures and noise/outliers
  • less efficient: \(O(n^2\log n)\), where n is the number of total objects

6. Comparison with k-means

  • kmeans clustering produces a single partitioning
  • hierarchical clustering can give different partitionings depending on the level-of-resolution we are looking at
  • kmeans clustering needs the number of clusters to be specified
  • hierarchical clustering doesn’t need the number of clusters to be specified
  • kmeans clustering is usually more efficient run-time wise
  • hierarchical clustering can be slow (has to make several merge/split decisions)
  • no clear consensus on which of the two produces better clustering