How to Find the Centroid in a Clustering Analysis

Clusters are groups of data that have similar characteristics.

Cluster analysis is a method of organizing data into representative groups based upon similar characteristics. Each member of a cluster has more in common with other members of the same cluster than with members of the other groups. The most representative point within the group is called the centroid. Usually, this is the mean of the data points in the cluster, calculated separately for each variable.
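As a formula, the mean-based centroid can be written as follows. The notation is an assumption made here for illustration: the cluster holds n points x_1 through x_n, each a vector with one entry per variable, and c denotes the centroid.

```latex
\mathbf{c} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i
```

The sum is taken variable by variable, so with two variables the centroid is simply the pair (mean of the first variable, mean of the second variable).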

    Organize the data. If the data consists of a single variable, a histogram might be appropriate. If two variables are involved, graph the data on a coordinate plane. For example, if you were looking at the height and weight of school children in a classroom, plot a point for each child on a graph, with weight on the horizontal axis and height on the vertical axis. If more than two variables are involved, matrices may be needed to display the data.

    Group the data into clusters. Each cluster should consist of points of data that lie close to one another. In the height and weight example, group any points of data that appear to be close together. The number of clusters, and whether every point of data has to be in a cluster, may depend upon the purposes of the study.

    For each cluster, add the values of all members, keeping each variable separate. For example, if a cluster of data consisted of the points (80, 56), (75, 53), (60, 50), and (68, 54), the sum of the values would be (283, 213).

    Divide the total by the number of members of the cluster. In the example above, 283 divided by four is 70.75, and 213 divided by four is 53.25, so the centroid of the cluster is (70.75, 53.25).

    Plot the cluster centroids and determine whether any points are closer to a centroid of another cluster than they are to the centroid of their own cluster. If any points are closer to a different centroid, redistribute them to the cluster containing the closer centroid.

    Repeat the previous three steps (summing, dividing and reassigning) until every point of data is in the cluster whose centroid is closest to it.
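The steps above can also be sketched in code. The following is a minimal Python illustration, not a definitive implementation. It uses the four points from the worked example as one cluster. The second cluster, including a deliberately misplaced point (66, 52), is hypothetical data invented for this sketch, as are the function names. It also assumes that "closest" means straight-line (Euclidean) distance. This repeat-until-stable procedure is commonly known as k-means clustering.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def centroid(points):
    """Add each variable across all members, then divide by the member count."""
    n = len(points)
    return tuple(sum(values) / n for values in zip(*points))


def assign(points, centroids):
    """Place every point in the cluster whose centroid is closest to it."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def refine(clusters, max_rounds=100):
    """Repeat the sum, divide and reassign steps until no point changes cluster."""
    for _ in range(max_rounds):
        centroids = [centroid(c) for c in clusters]
        points = [p for c in clusters for p in c]
        # Drop any cluster that lost all of its members (a choice made for this sketch).
        regrouped = [c for c in assign(points, centroids) if c]
        if regrouped == clusters:
            break
        clusters = regrouped
    return clusters, [centroid(c) for c in clusters]


# The cluster from the worked example in the article.
article_cluster = [(80, 56), (75, 53), (60, 50), (68, 54)]
print(centroid(article_cluster))  # (70.75, 53.25)

# Hypothetical second cluster; (66, 52) is placed here on purpose as a poor first guess.
second_cluster = [(40, 45), (42, 47), (45, 46), (66, 52)]

final_clusters, final_centroids = refine([article_cluster, second_cluster])
print(final_clusters)
print(final_centroids)
```

In this sketch the misplaced point (66, 52) moves to the first cluster after one round, which shifts that cluster's centroid, and the second round finds nothing left to move.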

    Things You'll Need

    • Calculator
    • Graph paper

    Tips

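    • If the centroid has to be a particular point of data instead of a midpoint between the data, then the median may be used to determine it, instead of the mean. With a single variable and an odd number of points, the median is always one of the data points. With two or more variables, the median of each variable taken separately may not match any actual point, so choose the member with the smallest total distance to the other members instead (often called the medoid).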
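When the representative has to be an actual data point, one possible approach is to pick the member with the smallest total distance to all the other members, often called the medoid. This is a technique swapped in here for illustration, not something the tip above spells out. The sketch below applies it to the cluster from the worked example and compares it with the per-variable median. The function name `medoid` is hypothetical, and the distance is assumed to be Euclidean.

```python
from math import dist
from statistics import median


def medoid(points):
    """Return the member with the smallest total distance to all members."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))


cluster = [(80, 56), (75, 53), (60, 50), (68, 54)]

# Median of each variable taken separately; this need not be an actual data point.
per_variable_median = tuple(median(values) for values in zip(*cluster))
print(per_variable_median)  # (71.5, 53.5)

# The medoid is always one of the original points.
print(medoid(cluster))  # (68, 54)
```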
