Clustering

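Clustering is an unsupervised learning approach that groups data points into "clusters" so that points in the same cluster are more similar to each other than to points in other clusters. How similarity is measured, and how the groups are formed, depends on the clustering technique. There are many varieties of clustering, such as hierarchical and k-means.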
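To make the idea concrete, here is a minimal sketch of k-means written in plain Python. It is an illustration only, not the code from the notebooks in this guide: the function name `k_means`, the sample `points`, and the `seed` and `max_iterations` parameters are all assumptions made for this example. It alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points.

```python
import math
import random


def k_means(points, k, max_iterations=100, seed=0):
    """Group points into k clusters by alternating assignment and update steps."""
    rng = random.Random(seed)
    # Assumption for this sketch: start from k data points picked at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point is labeled with its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids


# Hypothetical sample data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

The first three points should end up sharing one label and the last three the other. Which group is numbered 0 depends on the random starting centroids. For real projects, a tested library implementation is usually a better choice than hand-written code like this.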

Common Applications

Common Industries

Agriscience
Healthcare
Marketing
Tech/Social Media

Common Problem Types

Anomaly Identification
Market Segmentation
Genetic/Biological Analysis
Recommender Systems
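
As a rough illustration of the first problem type above, one simple heuristic for anomaly identification is to flag points that lie far from every cluster centroid. The sketch below is an assumption for illustration, not a method prescribed by this guide: the `centroids` are hard-coded stand-ins for the output of an earlier clustering step, and the `threshold` value is hypothetical and would need tuning for real data.

```python
import math

# Hypothetical centroids, standing in for the result of a prior clustering run.
centroids = [(1.0, 1.5), (8.5, 8.5)]

# Hypothetical data: four points near a centroid and one far from both.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 8.5), (5.0, 0.0)]

# Assumed cutoff: a point farther than this from every centroid is flagged.
threshold = 3.0


def distance_to_nearest_centroid(point, centroids):
    """Return the Euclidean distance from point to its closest centroid."""
    return min(math.dist(point, c) for c in centroids)


anomalies = [
    p for p in points
    if distance_to_nearest_centroid(p, centroids) > threshold
]
print(anomalies)
```

With these made-up values, only the point (5.0, 0.0) is flagged, because it is more than 3.0 units away from both centroids.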

Code Examples

All of the code examples are written in Python, unless otherwise noted.

Containers

These code examples are Jupyter notebooks that run in a container, which comes with all the data, libraries, and code you’ll need to run them. Click here to learn why you should be using containers and how to do so.

# Pull the container image (only needs to be run once)
docker pull ghcr.io/thedatamine/starter-guides:k-means-clustering

# Run the container, mapping port 8888 in the container to port 8888 on the host
docker run -p 8888:8888 -it ghcr.io/thedatamine/starter-guides:k-means-clustering

Need help implementing any of this code? Feel free to reach out to datamine-help@purdue.edu and we can help!

Resources

All resources are chosen by Data Mine staff to be of decent quality, and most if not all content is free.

Websites

What is Clustering? (Google)

Hierarchical Clustering (W3Schools)

What is Cluster Analysis (MathWorks)

Videos

4 Basic Types of Cluster Analysis used in Data Analytics (~9 minutes)

Clustering (~16 minutes)

K-means Clustering (~8 minutes)

Hierarchical Clustering (~16 minutes)

Books

An Introduction to Clustering With R (2020)

Introduction to Statistical Learning (using Python and R), see Chapter 12.4 (2022)

Partitional Clustering Via Nonsmooth Optimization: Clustering Via Optimization (2020)

Articles

Survey on Hierarchical Clustering for Machine Learning (2023)

Classifying Patients Operated for Degenerative Lumbar Spondylolisthesis: A Machine-Learning Clustering Analysis to Identify Patterns of Clinical Presentation (2020)

Multi-assignment clustering: Machine learning from a biological perspective (2021)