Hierarchical Clustering in Data Mining

Hierarchy is more informative structure rather than the unstructured set of clusters returned by non hierarchical clustering.

Basic algorithm:
  • Compute the proximity (similarity) matrix.
  • Let each data point be cluster.
  • Merge the two closest clusters.
  • Update the proximity matrix until only one cluster remains.
Different approaches to define the cluster between the clusters.

1. Single linkage
In this algorithm, the pair of clusters having shortest distance is considered, if  there exists the similarity between two clusters.

2. Complete linkage
In this method, the distance between one cluster and another cluster should be equal to the greatest distance from any member of one cluster to any member of the other cluster.

3. Average linkage
In this method, the distance between one cluster and another cluster should be equal to average distance from any member of one cluster to any member of the other cluster.

Dendrogram

It is a tree structure diagram which illustrates hierarchical clustering techniques. Each level shows clusters for that level.

dendrogram

Agglomerative hierarchical clustering

  • It is a bottom-up approach, in which clusters have  sub-clusters.
  • The process is explained in the following flowchart.
agglomerative hierarchical clustering

Example: For given distance matrix, draw single link, complete link and average link dendrogram.

ABCDE
A0
B20
C630
D11970
E98540

Solution: Single link: From given distance matrix.

Step 1:

A, BCDE
A, B0
C30
D970
E8540

Step 2:

A, B, CDE
A, B, C00
D74
E500

Step 3:

A, B, CD, E
A, B, C0
D, E50

single link dendogramm

2. Complete link: Find maximum distance to draw complete link.

Step 1:

A, BCDE
A, B0
C60
D1170
E9540

Step 2:

A, BCD, E
A, B0
C60
D, E1170

Step 3:

A, B, CD, E
A, B, C0
D, E110

complete link dendogramm

3. Average link: To draw average link, compute the average link.

Step 1:
I)  6+3/2 = 4.5
ii) 11+9/2 = 10
iii)  9+8/2 = 8.5

A, BCDE
A, B0
C4.5
D1070
E8.5540

Step 2:
I) 11+9+9+8/2 = 9.25
II) 7+5/ 2 = 6

A, BCD, E
A, B0
C4.50
D, E9.2560

Step 3:
11+9.25+9.25+8+7+5/6 = 8.20

A, B, CD, E
A, B, C0
D, E8.200

Comparison between Single link, Complete link and Average link based on distance formula.

Single linkComplete linkAverage link
Handles  non-elliptical shapes.
It is sensitive to noise and outliers.
Less sensitive to noise and outliers.It strikes a balance between single and complete links.

It strikes a balance between single and complete links.