Explainable Clustering — The Introduction of Recursive Embedding and Clustering and its Application (2024)

1. Introduction of Recursive embedding and clustering method

Spotify has around 600 million users worldwide. A deeper understanding of its users is essential for providing a better user experience, but the sheer scale makes this challenging. User segmentation via clustering is an effective approach, yet at this scale it is also difficult to explain or examine why a given cluster exists. Therefore, Spotify developed an explainable clustering method to find significant clusters and explain why they exist. The overall architecture is shown below.

[Figure 1: Overall architecture of the recursive embedding and clustering method]

It mainly comprises two stages.

  1. Find a low-dimensional representation of the original data using UMAP, cluster it with HDBSCAN, and examine the clusters’ characteristics.
  2. Train XGBoost using the raw data as input and the HDBSCAN cluster labels as targets, then understand each feature’s contribution using SHAP.

Let’s dive into it step by step.

For the first stage, we use UMAP and HDBSCAN to construct understandable clusters. Let’s quickly recap UMAP and HDBSCAN.

1.1 UMAP

[Figure 2: UMAP illustration]

Uniform Manifold Approximation and Projection (UMAP) is a non-linear dimensionality reduction algorithm. UMAP learns the manifold structure of the input data in its high-dimensional space and finds a low-dimensional representation that preserves that structure as much as possible. It belongs to the same family as t-SNE. One advantage of UMAP is that it is more robust at preserving the high-dimensional structure than t-SNE because of its loss function (for reference, t-SNE uses KL divergence, while UMAP uses a cross-entropy loss). You can find more details in this blog [3].

Thus, we use UMAP as the dimensionality reduction algorithm to reduce the input data to 2 or 3 dimensions.

1.2 HDBSCAN

[Figure 3: HDBSCAN illustration]

Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) is a density-based clustering method. It is fast and robust and finds meaningful clusters if there are any. The algorithm overview from the official guide is below.

  1. Transform the space according to the density/sparsity.
  2. Build the minimum spanning tree of the distance-weighted graph.
  3. Construct a cluster hierarchy of connected components.
  4. Condense the cluster hierarchy based on minimum cluster size.
  5. Extract the stable clusters from the condensed tree.

Therefore, we utilize HDBSCAN to find dense clusters based on the UMAP representation. We can assign labels using HDBSCAN and check the features of clusters by examining the raw data.

In summary, we conduct two processes in the first stage.

  1. Condense the high-dimensional input data into a low-dimensional representation using UMAP.
  2. Find clusters in the low-dimensional UMAP representation using HDBSCAN.

After this procedure, we can visualize the clusters and check their details using the raw data.
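To make the first stage concrete, here is a minimal sketch using randomly generated placeholder data (the feature matrix, parameter values, and variable names are illustrative, not Spotify’s actual pipeline):

import numpy as np
import umap
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import HDBSCAN  # or: from hdbscan import HDBSCAN, depending on your install

X = np.random.rand(500, 8)  # stand-in for the raw, high-dimensional user features

# Stage 1a: reduce the scaled features to a 2-dimensional representation with UMAP.
scaled = StandardScaler().fit_transform(X)
embedding = umap.UMAP(n_components=2, n_neighbors=15, random_state=42).fit_transform(scaled)

# Stage 1b: find dense clusters in the low-dimensional representation with HDBSCAN.
labels = HDBSCAN(min_cluster_size=10).fit(embedding).labels_  # label -1 means noise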

1.3 XGBoost + SHAP

XGBoost is a popular gradient-boosted tree algorithm that solves many data science problems quickly and accurately. SHAP, on the other hand, is an explainability method that works with any machine learning algorithm; it lets us check each feature’s contribution. Since we don’t want to re-run the time-consuming first stage every time, in the second stage we train an XGBoost model using the high-dimensional raw data as input and the HDBSCAN labels as targets. Note that we train XGBoost as a one-versus-all model for each cluster, which allows very fast, accurate training and a precise understanding of feature importance. The process is illustrated below:

[Figure 4: One-versus-all XGBoost + SHAP training process]

We train one XGBoost classifier per label and obtain detailed feature importances for each cluster using SHAP, as sketched below.
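A minimal sketch of this second stage, continuing the placeholder variables scaled and labels from the earlier sketch (the hyperparameters and column names here are illustrative):

import pandas as pd
import shap
from xgboost import XGBClassifier

X_df = pd.DataFrame(scaled, columns=[f'feat_{i}' for i in range(scaled.shape[1])])

for cluster_id in sorted(set(labels) - {-1}):   # skip the noise label
    y = (labels == cluster_id).astype(int)      # one-versus-all target for this cluster
    clf = XGBClassifier(n_estimators=50, max_depth=3, objective='binary:logistic')
    clf.fit(X_df, y)

    # SHAP shows which features push a user into (or out of) this cluster.
    explainer = shap.Explainer(clf, X_df)
    shap_values = explainer(X_df)
    shap.plots.bar(shap_values)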

So far, we have gone through the algorithmic perspective of this method. In the next section, let’s see the power of this method using real-world data!

2. Real-world data application: TikTok user profile dataset

In this final section, I will guide you through applying the recursive embedding and clustering method to the TikTok user profile dataset [4]. As its name suggests, the dataset contains user information, with each row representing one user. Let’s implement it step by step.

As the first step, you need to set up an environment. I used Ubuntu 20.04 and Python 3.10. First, create the virtual environment using conda.

conda create --name rclustering python==3.10 -y
conda activate rclustering
conda install pip
## optional: To avoid installing libraries into another environment,
## check which pip will be used to install packages
which pip
# I use /opt/conda/envs/rclustering/bin/pip in my environment.

Next, you need to install the following libraries.

conda install -c conda-forge umap-learn=0.5.6 -y
conda install -c plotly plotly=5.23.0 -y
conda install -c conda-forge pandas=2.2.2 -y
conda install -c conda-forge nbformat -y
conda install -c conda-forge py-xgboost -y
conda install -c conda-forge shap -y
conda install -c conda-forge matplotlib

We’ve finished setting up the environment, so let’s look at the TikTok user profile dataset. I will attach the whole code that I used later.

The TikTok user profile dataset contains 1,000 rows and 19 columns. The first five rows are shown below.

df = pd.read_csv('../TikTok profiles dataset (Public web data).csv')
df.head()
[Figure 5: First five rows of the TikTok user profile dataset]

Besides columns that are unrelated in this setting (timestamp, biolink, create_time, id, top_videos, URL, and profile_pic_url), the dataset contains the following columns.

  • account_id : unique identifier for each account.
  • nickname : account name.
  • biography : account biography.
  • awg_engagement_rate : average engagement rate.
  • comment_engagement_rate : engagement rate based on comments.
  • like_engagement_rate : engagement rate based on likes.
  • is_verified : the verified status on TikTok.
  • followers : the number of followers.
  • following : the number of creators the user follows.
  • likes : the number of likes the user has obtained.
  • videos_count : the number of videos.

The dataset distribution is shown below.

from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(rows=3, cols=2)
viz_columns = ['awg_engagement_rate', 'comment_engagement_rate', 'like_engagement_rate', 'followers', 'following', 'videos_count']

for i in range(len(viz_columns)):
    x = i % 2 + 1
    y = i // 2 + 1

    col = viz_columns[i]
    values = df[col]

    fig.add_trace(go.Histogram(x=values, name=viz_columns[i]), row=y, col=x)
fig.show()

Explainable Clustering — The Introduction of Recursive Embedding and Clustering and its Application (6)

As you can see, the distribution of all features tends to be right-skewed, and each feature’s scale varies.

For the next step, we need to preprocess the data. In this case, we can drop account_id, nickname, and biography for simplicity. If you want to use text information such as nicknames and biographies, you need a language model to encode it. Moreover, we have one categorical variable, is_verified, which we encode using LabelEncoder. Finally, before running UMAP, we rescale the input variables using StandardScaler.
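A minimal sketch of that preprocessing could look like the following (the full code is attached later; this version simply produces the sample_df, scaled_columns, and scaled_features variables used in the snippets below):

from sklearn.preprocessing import LabelEncoder, StandardScaler

# Drop the text columns; they would need a language model to encode.
sample_df = df.drop(columns=['account_id', 'nickname', 'biography'])

# Encode the categorical is_verified flag.
sample_df['is_verified'] = LabelEncoder().fit_transform(sample_df['is_verified'])

# Rescale the input variables before UMAP.
scaled_columns = ['awg_engagement_rate', 'comment_engagement_rate', 'like_engagement_rate',
                  'is_verified', 'followers', 'following', 'videos_count']
scaled_features = StandardScaler().fit_transform(sample_df[scaled_columns].values)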

Now, let’s apply UMAP. UMAP has several hyperparameters we need to adjust; “metric” and “neighbor_num” tend to make the biggest difference in the result. For reference, “metric” defines how the distance between points is measured, and “neighbor_num” (UMAP’s n_neighbors) balances local and global structure. In this blog, I vary the metric and the number of neighbors to find a distinguishable low-dimensional representation. The following graphs show the UMAP results.

import umap
import plotly.express as px

def apply_umap(features: np.ndarray, random_state: int = 7, metric: str = 'euclidean', neighbor_num: int = 15):
    reducer = umap.UMAP(random_state=random_state, n_neighbors=neighbor_num, metric=metric)
    embedding = reducer.fit_transform(features)

    return reducer, embedding

def visualize_sample(df: pd.DataFrame, embedding: np.ndarray, metric: str):
    df['umap_feat1'] = embedding[:, 0]
    df['umap_feat2'] = embedding[:, 1]

    fig = px.scatter(df, x='umap_feat1', y='umap_feat2', hover_data=scaled_columns, title=f'{metric}')
    fig.show()

metric = 'euclidean'
_, embedding = apply_umap(scaled_features, neighbor_num=10, metric=metric)
visualize_sample(sample_df, embedding, metric)

[Figure 7: UMAP projection of the TikTok data (euclidean metric)]

The projection using the Euclidean metric shows moderate separability, so I chose it as the input for HDBSCAN.

The next step is the HDBSCAN part. HDBSCAN also has hyperparameters, such as the minimum number of points assigned to one cluster (min_cluster_size). Note that you need to tune these hyperparameters to suit your data. The HDBSCAN clustering result is shown below.

# HDBSCAN
from sklearn.cluster import HDBSCAN  # or: from hdbscan import HDBSCAN, depending on your install

hdb = HDBSCAN(min_cluster_size=10)
hdb.fit(embedding)
labels = hdb.labels_

sample_df['HDBSCAN_label'] = labels
# Change the type of label for better visualization in plotly
sample_df['HDBSCAN_label'] = sample_df['HDBSCAN_label'].astype(str)

fig = px.scatter(sample_df, x='umap_feat1', y='umap_feat2', color='HDBSCAN_label', hover_data=scaled_columns)
fig.show()

The left chart shows the HDBSCAN result for all data, and the right one shows the result with the unassigned (noise) points removed. HDBSCAN found 23 clusters in this data. Examining the clusters thoroughly, I found that each cluster has distinct characteristics. Some of the main clusters are described below.

  • Cluster 0: verified influencers.
  • Cluster 1: non-verified accounts with many followers but a low engagement rate.
  • Cluster 4: a typical number of followers and following, with a higher engagement level.
  • Cluster 13: accounts that follow more creators than they have followers.

To characterize a cluster, I visualize the feature distributions within it. For example, we can do it as follows:

# cluster 1
cls1_df = sample_df[sample_df['HDBSCAN_label'] == '1']

fig = make_subplots(rows=3, cols=2)
viz_columns = ['awg_engagement_rate', 'comment_engagement_rate', 'like_engagement_rate', 'followers', 'following', 'videos_count']

for i in range(len(viz_columns)):
    x = i % 2 + 1
    y = i // 2 + 1

    col = viz_columns[i]
    values = cls1_df[col]

    fig.add_trace(go.Histogram(x=values, name=viz_columns[i]), row=y, col=x)
fig.show()

[Figure 8: Feature distributions within cluster 1]

The above example shows cluster 1’s feature distributions. As you can see, this cluster has many followers but low values for the three engagement-rate features.

Now, let’s train XGBoost for each cluster to verify that these interpretations are correct. In this blog, I skip the small clusters because they tend to be difficult to train on and produce less business value than the bigger ones. The following figures show the results for the bigger clusters.

import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as pl
import shap
import xgboost
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight
from xgboost import XGBClassifier

# To skip clusters that are too small to train on.
# If you want to force a specific cluster to be trained, add its label to the ok_labels list.
def label_check(df_length: int, label: str, minimum_num: int = 40, ok_labels: list = []):
    if label in ok_labels:
        return True

    if label == '-1':  # noise points
        return False

    if df_length < minimum_num:
        return False
    else:
        return True

minimum_num = 30  # the minimum cluster size to train on
ok_labels = ['9']
labels = sample_df['HDBSCAN_label'].unique().tolist()

os.makedirs('models', exist_ok=True)

for label in labels:
    df_length = len(sample_df[sample_df['HDBSCAN_label'] == label])

    if not label_check(df_length, label, minimum_num=minimum_num, ok_labels=ok_labels):
        print(f'skip Label {label}')
        continue

    print(f'processing label {label}')
    sample_df['label'] = 0
    sample_df.loc[sample_df['HDBSCAN_label'] == label, 'label'] = 1

    X_train, X_test, y_train, y_test = train_test_split(sample_df[scaled_columns].values,
                                                        sample_df['label'],
                                                        stratify=sample_df['label'],
                                                        test_size=0.05)
    print(np.unique(y_train, return_counts=True)[-1])
    train_df = pd.DataFrame(X_train, columns=scaled_columns)
    test_df = pd.DataFrame(X_test, columns=scaled_columns)

    # Balance the positive (in-cluster) and negative samples.
    classes_weights = class_weight.compute_sample_weight(
        class_weight='balanced',
        y=y_train
    )

    tree = XGBClassifier(n_estimators=2, max_depth=2, objective='binary:logistic')
    # sample_weight is a fit-time argument, not a constructor argument.
    tree.fit(train_df, y_train, sample_weight=classes_weights)

    preds = tree.predict(test_df)

    print(f"\nLabel {label}: Classification Report:")
    print(classification_report(y_test, preds))

    xgboost.plot_importance(tree)
    pl.title(f"Label {label}: xgboost.plot_importance(model)")
    pl.show()

    # Explain the one-versus-all classifier with SHAP.
    explainer = shap.Explainer(tree, train_df)
    shap_values = explainer(train_df)

    shap.plots.bar(shap_values)

    tree.save_model(f'models/xgboost-{label}.model')

[Figures 9 and 10: XGBoost feature importances and SHAP bar plots for the larger clusters]

According to the results, XGBoost tends to focus on four features: “videos_count,” “following,” “followers,” and “awg_engagement_rate.” Furthermore, the SHAP explanation matches the interpretation from the first stage, so XGBoost + SHAP gives you a better understanding of the clusters found there. Note that the models might overfit if your data size is small, so you may need to add regularization to prevent this.

Once you have trained the XGBoost models, you can reuse them on new data, as sketched below. Now you have a powerful tool to easily explain to your stakeholders why each cluster exists.
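For example, a saved per-cluster model can be reloaded and applied to newly collected profiles (new_df here is a hypothetical DataFrame preprocessed exactly like the training data):

from xgboost import XGBClassifier

clf = XGBClassifier()
clf.load_model('models/xgboost-1.model')  # the cluster-1 model saved above

# new_df must contain the same feature columns, preprocessed the same way as at training time.
new_features = new_df[scaled_columns]
membership_prob = clf.predict_proba(new_features)[:, 1]  # probability of belonging to cluster 1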
