Since October is Breast Cancer Awareness Month, I wanted to do some type of analysis using a breast cancer dataset. There is a popular dataset available on Kaggle and the UCI Machine Learning Repository that contains diagnosis data which looked like a good contender for an interesting analysis.

The data used in the analysis were originally provided by the University of Wisconsin. The variables are numerical, computed from digitized images of a fine needle aspirate of a breast mass, and they describe characteristics of the cell nuclei present in each image.

Each observation corresponds to one sample image and is described by 30 features (10 original features, each summarized by 3 metrics: the mean, the standard error, and the worst value, i.e. the average of the 3 largest):

radius (mean of distances from center to points on the perimeter)
texture (standard deviation of gray-scale values)
perimeter
area
smoothness (local variation in radius lengths)
compactness (perimeter² / area - 1.0)
concavity (severity of concave portions of the contour)
concave points (number of concave portions of the contour)
symmetry
fractal dimension (“coastline approximation” - 1)

Each observation has a label of M (Malignant) or B (Benign).

This is a popular dataset, and many analyses already exist that focus on using the features to classify observations as benign or malignant. Given the plethora of similar analyses, I wanted to do something different. In this analysis, k-means clustering is used to create two clusters based on the 30 features, in an attempt to naturally segment the data into benign and malignant groups.

Normalization

The first step in the analysis is to normalize all features in order to put them on the same scale. K-means groups observations by distance, so without this step features with larger numeric ranges dominate the result (e.g., perimeter and smoothness have completely different scales, and perimeter would be treated as more important purely because its values are larger).

There are many ways to normalize data, but my preference is “squeezing” each feature to a value between 0 and 1. This is done by subtracting the feature's minimum value from each observation and dividing by the feature's range (maximum minus minimum). The downside of normalizing is that it makes interpretation more difficult, because the values being used are no longer on the original scale. We can save the minimum and maximum values so we can later transform the data back to its original scale. This is done by multiplying the normalized value by the range of the original values and then adding back the original minimum.
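As a minimal sketch of the scaling and its inverse described above, the two steps could look like this in plain Python. The function names and the sample values are illustrative assumptions, not taken from the original analysis or from the dataset.

```python
from typing import List, Tuple


def min_max_normalize(values: List[float]) -> Tuple[List[float], float, float]:
    """Scale one feature column to [0, 1]; also return the min and max used."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant column has no spread; map every value to 0.0.
        return [0.0 for _ in values], lo, hi
    return [(v - lo) / span for v in values], lo, hi


def min_max_restore(scaled: List[float], lo: float, hi: float) -> List[float]:
    """Invert the scaling: multiply by the range, then add back the minimum."""
    return [s * (hi - lo) + lo for s in scaled]


# Made-up values standing in for one feature column (not real dataset values).
perimeter = [50.0, 75.0, 100.0, 150.0]
scaled, lo, hi = min_max_normalize(perimeter)
restored = min_max_restore(scaled, lo, hi)
print(scaled)    # values between 0 and 1
print(restored)  # back on the original scale
```

Keeping `lo` and `hi` for every feature is what makes the later back-transformation of the cluster centers possible.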

Clustering

Typically, the first step in clustering after preparing your data is to choose how many clusters are ideal to use. A scree/elbow plot is useful for this and shows you an ideal number of clusters solely from an error perspective. It’s a good start, but it needs to be paired with some knowledge of the problem. In our case, since there are only two outcomes (benign vs malignant), it does not make sense to use more than two clusters.
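The error that an elbow plot tracks is commonly the within-cluster sum of squares for each candidate number of clusters. The sketch below computes it with a small hand-written k-means (Lloyd's algorithm) on made-up two-dimensional points. It is not the implementation used in this analysis, and all names and data in it are assumptions for illustration.

```python
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points: List[Point], k: int, seed: int = 0,
            max_iter: int = 100) -> Tuple[List[List[float]], List[int]]:
    """Plain Lloyd's algorithm: returns (centers, cluster index per point)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assignment: List[int] = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the centers are stable
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment


def within_cluster_ss(points: List[Point], centers: List[List[float]],
                      assignment: List[int]) -> float:
    """Sum of squared distances from each point to its assigned center."""
    return sum(squared_distance(p, centers[a]) for p, a in zip(points, assignment))


# Made-up, already normalized points forming two obvious groups.
toy = [(0.10, 0.10), (0.15, 0.20), (0.20, 0.10), (0.10, 0.20),
       (0.80, 0.90), (0.90, 0.80), (0.85, 0.85), (0.90, 0.90)]

wcss_by_k = {}
for k in range(1, 5):
    centers, assignment = k_means(toy, k)
    wcss_by_k[k] = within_cluster_ss(toy, centers, assignment)

for k, wcss in wcss_by_k.items():
    print(k, round(wcss, 4))
```

On data like this the error drops sharply from one cluster to two and only slightly afterwards, which is the "elbow" the plot is named for.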

Before clustering, it’s important to remove the variable that lists the diagnosis of the observation. The goal is to see if we can segment the data by diagnosis using the characteristics of the nuclei, and we do not want the actual diagnoses to influence the results.

Running the clustering algorithm for two clusters results in the following breakdown of cluster sizes:

Roughly 200 observations were bucketed in Cluster 1, while about 400 were bucketed in Cluster 2. Next, we can turn to our primary aim: checking whether the clustering was able to separate the benign diagnoses from the malignant ones.

The above plot looks at the percent of observations in each cluster that have a malignant diagnosis. This shows that the clustering was able to segment the diagnoses very well without ever seeing them. Cluster 1’s observations are ~95% malignant cases, while Cluster 2’s observations are only ~8% malignant cases.
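The percentage in that comparison is simply the share of malignant labels within each cluster, computed after the clustering is finished. A minimal sketch, assuming cluster assignments and diagnoses are held in two parallel lists (the function name and the sample values are made up for illustration):

```python
from collections import Counter
from typing import Dict, List


def malignant_share(clusters: List[int], diagnoses: List[str]) -> Dict[int, float]:
    """Percent of observations labelled 'M' within each cluster."""
    totals = Counter(clusters)
    malignant = Counter(c for c, d in zip(clusters, diagnoses) if d == "M")
    return {c: 100.0 * malignant[c] / totals[c] for c in sorted(totals)}


# Made-up assignments and labels, for illustration only (not real results).
clusters = [1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
diagnoses = ["M", "M", "M", "B", "B", "B", "B", "B", "M", "B"]

shares = malignant_share(clusters, diagnoses)
print(shares)
```

The diagnoses enter only at this evaluation step; they play no part in forming the clusters.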

This is impressive because we took out the diagnosis variable before clustering, so this is done solely with the other features.

Important Features

Finally, we can look at some of the important features by diving into the centers of the clusters. First, it would be useful to transform the features back into their original scale.

Overall, feature centers are much higher in Cluster 1 and lower in Cluster 2. In this data, larger feature values tend to go with malignant cases and smaller values with benign ones, which is plausible given that most of the features measure the size or irregularity of the nuclei. However, it would still be interesting to look at some of the highest feature centers for Cluster 1 and see which are most indicative of a malignant vs benign diagnosis.

The three highest centers in the first cluster are concave.points_worst, perimeter_mean, and radius_mean. Below are the values of these three features, in their original scale, for both clusters, which lets us compare the two clusters directly.
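Picking the highest centers and converting them back can be sketched as follows: rank the normalized center values for one cluster, then undo the min-max scaling with the saved minimum and maximum. The feature names, center values, and ranges below are placeholders, not results from this analysis.

```python
from typing import Dict, List, Tuple


def top_features(center: Dict[str, float], n: int = 3) -> List[str]:
    """Names of the n features with the highest normalized center value."""
    return sorted(center, key=center.get, reverse=True)[:n]


def to_original_scale(center: Dict[str, float],
                      ranges: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Undo min-max scaling: value * (max - min) + min for each feature."""
    restored = {}
    for name, value in center.items():
        lo, hi = ranges[name]
        restored[name] = value * (hi - lo) + lo
    return restored


# Placeholder cluster center on the 0-1 scale (hypothetical values).
center = {"feature_a": 0.25, "feature_b": 0.75, "feature_c": 0.5, "feature_d": 0.125}
# Placeholder (minimum, maximum) pairs saved during normalization.
ranges = {"feature_a": (10.0, 30.0), "feature_b": (0.0, 4.0),
          "feature_c": (100.0, 200.0), "feature_d": (1.0, 9.0)}

top = top_features(center)
original = to_original_scale(center, ranges)
print(top)
print({name: original[name] for name in top})
```

Ranking on the normalized scale keeps the comparison fair across features, while reporting on the original scale keeps the numbers interpretable.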

Here we can see the differences between the two clusters based on the three largest centers in the first cluster. On average, the worst concave points value in the first cluster is more than double that of the second cluster, the mean perimeter is ~1.5 times larger, and the mean radius is also quite a bit higher than in the second cluster.

This gives us an idea of what type of features matter when making a diagnosis, and while it seems obvious that larger nuclei are more likely malignant, it is still good to see it represented by the algorithm.

This analysis used clustering in a semi-supervised way. Typically, clustering is used solely for unsupervised learning where there are no given labels and the goal is to use features of data to cluster into segments. However, in this case, after performing the segmentation, the labels are brought into the analysis and used to see how well the data was segmented and the importance of features.

This type of approach has been successfully used at CompassRed, and it is a helpful technique which allows use of attributes not part of the clustering process to help label segments (as done here by labeling the first cluster as malignant and the second cluster as benign).

The healthcare space is becoming a popular field for data science, and there is a lot that can be done to improve people’s lives.