1. Introduction to the Recursive Embedding and Clustering method
Spotify has around 600 million users worldwide. A deeper understanding of its users is essential to providing a better user experience, but the sheer scale makes this challenging. User segmentation by clustering is an effective approach, yet at this scale it is also difficult to explain or examine why a given cluster exists. Therefore, Spotify developed an explainable clustering method to find significant clusters and explain why they exist. The overall architecture is shown below.
It mainly comprises two stages.
- Find a low-dimensional representation of the original data using UMAP and cluster it using HDBSCAN. Examine the clusters’ characteristics.
- Train XGBoost using the raw data as input and the HDBSCAN cluster labels as targets, then understand each feature’s contribution using SHAP.
Let’s dive into it step by step.
For the first stage, we utilize UMAP and HDBSCAN to construct understandable clusters. Let’s quickly recap UMAP and HDBSCAN.
1.1 UMAP
Uniform Manifold Approximation and Projection (UMAP) is a non-linear dimension reduction algorithm. UMAP learns the manifold structure of the input data in its high-dimensional space and finds a low-dimensional representation that preserves this structure as much as possible. It falls into the same category as t-SNE. One advantage of UMAP is that it is more robust at preserving the high-dimensional structure than t-SNE because of its loss function (for reference, t-SNE uses KL divergence, while UMAP uses a cross-entropy loss). You can see more details in this blog [3].
Thus, we utilize UMAP as the dimension reduction algorithm to reduce the input data to 2 or 3 dimensions.
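As a minimal sketch of this step (the toy feature matrix X and the hyperparameter values below are my own illustrative assumptions, not part of Spotify’s pipeline):
import numpy as np
import umap

# Toy input: rows are users, columns are numeric features (illustrative)
X = np.random.rand(1000, 20)

# Reduce to 2 dimensions while preserving the manifold structure as much as possible
reducer = umap.UMAP(n_components=2, n_neighbors=15, metric='euclidean', random_state=42)
embedding = reducer.fit_transform(X)  # shape: (1000, 2)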
1.2 HDBSCAN
Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) is a density-based clustering method. It is fast and robust and finds meaningful clusters if there are any. The algorithm overview from the official guide is below.
- Transform the space according to the density/sparsity.
- Build the minimum spanning tree of the distance-weighted graph.
- Construct a cluster hierarchy of connected components.
- Condense the cluster hierarchy based on minimum cluster size.
- Extract the stable clusters from the condensed tree.
Therefore, we utilize HDBSCAN to find dense clusters based on the UMAP representation. We can assign labels using HDBSCAN and check the features of clusters by examining the raw data.
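Continuing the sketch above, clustering the UMAP embedding might look like this (the min_cluster_size value is an illustrative assumption):
from sklearn.cluster import HDBSCAN  # scikit-learn >= 1.3; the hdbscan package also works

# Cluster the 2D UMAP embedding; points in no dense region get the noise label -1
clusterer = HDBSCAN(min_cluster_size=15)
labels = clusterer.fit_predict(embedding)
print(f'found {labels.max() + 1} clusters')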
In summary, we conduct two processes in the first stage.
- Condense the high-dimensional information of the input data into a low-dimensional representation using UMAP.
- Find clusters in the low-dimensional UMAP representation using HDBSCAN.
After this procedure, we can visualize the clusters and check their details using the raw data.
1.3 XGBoost + SHAP
XGBoost is a popular gradient-boosted tree algorithm. It solves many data science problems quickly and accurately. SHAP, on the other hand, is an explainable AI method that works with any machine learning algorithm; it lets us check each feature’s contribution to a prediction. Since we don’t want to repeat the first stage many times, which is time-consuming, in the second stage we train an XGBoost model using the high-dimensional data as input and the HDBSCAN labels as targets. Note that we use XGBoost as a one-versus-all model for each cluster, which allows very fast, accurate training and a precise understanding of feature importance. The illustration of this process is as follows:
We train one XGBoost classifier per cluster label and obtain detailed feature importances using SHAP, as sketched below.
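Continuing the same sketch, a one-versus-all classifier and its SHAP explanation for a single cluster could look like the following, where X is the raw feature matrix and labels are the HDBSCAN labels from above (cluster id 0 is an assumed example):
from xgboost import XGBClassifier
import shap

# One-versus-all target: does a user belong to the example cluster (id 0)?
y = (labels == 0).astype(int)

# Train a small binary XGBoost model on the raw high-dimensional features
model = XGBClassifier(n_estimators=50, max_depth=3, objective='binary:logistic')
model.fit(X, y)

# SHAP shows how much each raw feature pushes a user toward this cluster
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)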
So far, we have gone through the algorithmic perspective of this method. In the next section, let’s see the power of this method using real-world data!
2. Real-world Data Application: TikTok User Profile Dataset
In this final section, I will guide you through applying the recursive embedding and clustering method to the TikTok user profile dataset [4]. As its name suggests, the TikTok user profile dataset contains user information, and each row represents one user. Let’s implement it step by step.
As the first step, you need to set up an environment. I used Ubuntu 20.04 and Python 3.10. First, I create a virtual environment using conda.
conda create --name rclustering python==3.10 -y
conda activate rclustering
conda install pip
## optional: To avoid installing libraries into your local environment,
## check which pip will be used to install the libraries
which pip
# I use /opt/conda/envs/rclustering/bin/pip in my environment.
Next, you need to install the following libraries.
conda install -c conda-forge umap-learn=0.5.6 -y
conda install -c plotly plotly=5.23.0 -y
conda install -c conda-forge pandas=2.2.2 -y
conda install -c conda-forge nbformat -y
conda install -c conda-forge py-xgboost -y
conda install -c conda-forge shap -y
conda install -c conda-forge matplotlib -y
We’ve finished setting up the environment, so let’s look at the TikTok user profile dataset. I will attach the whole code I used later.
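For reference, the snippets below assume the following imports; the exact sources of HDBSCAN and the pl alias are my assumptions, so adjust them to your environment.
import os
import numpy as np
import pandas as pd
import umap
import xgboost
from xgboost import XGBClassifier
import shap
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import matplotlib.pyplot as pl
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.cluster import HDBSCAN  # requires scikit-learn >= 1.3; or: from hdbscan import HDBSCAN
from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight
from sklearn.metrics import classification_report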
The TikTok user profile dataset contains 1,000 rows and 19 columns. The first five rows are shown below.
df = pd.read_csv('../TikTok profiles dataset (Public web data).csv')
df.head()
Besides columns unrelated to this setting (timestamp, biolink, create_time, id, top_videos, URL, and profile_pic_url), the dataset contains the following columns.
- account_id : unique identifier for each account.
- nickname : account name.
- biography : account’s biography.
- awg_engagement_rate : average engagement rate.
- comment_engagement_rate : engagement rate based on comments.
- like_engagement_rate : engagement rate based on likes.
- is_verified : the verified status on TikTok.
- followers : the number of followers.
- following : the number of creators the account follows.
- likes : the number of likes the user has received.
- videos_count : the number of videos.
The dataset distribution is shown below.
fig = make_subplots(rows=3, cols=2)
viz_columns = ['awg_engagement_rate', 'comment_engagement_rate', 'like_engagement_rate', 'followers', 'following', 'videos_count']
for i in range(len(viz_columns)):
    x = i % 2 + 1
    y = i // 2 + 1
    col = viz_columns[i]
    values = df[col]
    fig.add_trace(go.Histogram(x=values, name=viz_columns[i]), row=y, col=x)
fig.show()
As you can see, the distribution of all features tends to be right-skewed, and each feature’s scale varies.
For the next step, we need to preprocess the data. In this case, we drop account_id, nickname, and biography for simplicity. If you want to use text information such as nicknames and biographies, you need a language model to encode it. Moreover, we have one categorical variable, is_verified, so we encode it using LabelEncoder. Before proceeding to UMAP, we also rescale the input variables using StandardScaler.
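A minimal sketch of this preprocessing step is shown below; the dropped column names and the scaled_columns list are my assumptions about the dataset, and sample_df, scaled_columns, and scaled_features are the variables the later snippets rely on.
# Drop columns we don't use in this setting (column names are assumptions; adjust as needed)
drop_columns = ['account_id', 'nickname', 'biography', 'timestamp', 'biolink',
                'create_time', 'id', 'top_videos', 'url', 'profile_pic_url']
sample_df = df.drop(columns=[c for c in drop_columns if c in df.columns]).copy()

# Encode the categorical variable is_verified
sample_df['is_verified'] = LabelEncoder().fit_transform(sample_df['is_verified'])

# Rescale the input variables before UMAP
scaled_columns = ['awg_engagement_rate', 'comment_engagement_rate', 'like_engagement_rate',
                  'is_verified', 'followers', 'following', 'likes', 'videos_count']
scaled_features = StandardScaler().fit_transform(sample_df[scaled_columns].values)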
Now, let’s apply UMAP. UMAP has some hyperparameters that we need to adjust; “metric” and “n_neighbors” (neighbor_num in the helper function below) tend to make the biggest difference in the result. For reference, “metric” defines how the distance between points is measured, and “n_neighbors” balances local and global structure. In this blog, I vary the metric and the number of neighbors to find a distinguishable low-dimensional representation. The following graphs show the UMAP results.
def apply_umap(features: np.ndarray, random_state: int = 7, metric: str = 'euclidean', neighbor_num: int = 15):
    reducer = umap.UMAP(random_state=random_state, n_neighbors=neighbor_num, metric=metric)
    embedding = reducer.fit_transform(features)
    return reducer, embedding

def visualize_sample(df: pd.DataFrame, embedding: np.ndarray, metric: str):
    df['umap_feat1'] = embedding[:, 0]
    df['umap_feat2'] = embedding[:, 1]
    fig = px.scatter(df, x='umap_feat1', y='umap_feat2', hover_data=scaled_columns, title=f'{metric}')
    fig.show()
metric = 'euclidean'
_, embedding = apply_umap(scaled_features, neighbor_num=10, metric=metric)
visualize_sample(sample_df, embedding, metric)
The case using the Euclidean metric shows acceptable separation, so I chose that embedding as the input for HDBSCAN.
The next step is the HDBSCAN part. HDBSCAN also has hyperparameters, such as the minimum number of points assigned to one cluster. Note that you need to adjust these hyperparameters to suit your data. The HDBSCAN clustering result is shown below.
# HDBSCAN
hdb = HDBSCAN(min_cluster_size=10)
hdb.fit(embedding)
labels = hdb.labels_
sample_df['HDBSCAN_label'] = labels
# Change the type of label for better visualization in plotly
sample_df['HDBSCAN_label'] = sample_df['HDBSCAN_label'].astype(str)
fig = px.scatter(sample_df, x='umap_feat1', y='umap_feat2', color='HDBSCAN_label', hover_data=scaled_columns)
fig.show()
The left chart shows the HDBSCAN result using all data, and the right one shows the result excluding data not assigned to any cluster. HDBSCAN found 23 clusters in this data. Examining the clusters thoroughly, I found that each cluster has distinct characteristics. Some of the main clusters are described below.
- Cluster 0: verified influencer accounts.
- Cluster 1: non-verified accounts with many followers but a low engagement rate.
- Cluster 4: accounts with a normal number of followers and following, and a higher engagement level.
- Cluster 13: accounts that follow more creators than they have followers.
To find the characteristics of a cluster, I visualize the feature distributions within that cluster. For example, we can do it as follows:
# cluster 1
cls1_df = sample_df[sample_df['HDBSCAN_label'] == '1']
fig = make_subplots(rows=3, cols=2)
viz_columns = ['awg_engagement_rate', 'comment_engagement_rate', 'like_engagement_rate', 'followers', 'following', 'videos_count']
for i in range(len(viz_columns)):
    x = i % 2 + 1
    y = i // 2 + 1
    col = viz_columns[i]
    values = cls1_df[col]
    fig.add_trace(go.Histogram(x=values, name=viz_columns[i]), row=y, col=x)
fig.show()
The above example shows cluster 1’s feature distributions. As you can see, this cluster has many followers but low values for the three engagement features.
Now, let’s train XGBoost for each cluster to verify that my assumption is correct. In this blog, I skip training on the small clusters because they tend to be difficult to train on and produce less business value than the bigger ones. The following figures show the results for the bigger clusters.
# To cut off the smaller clusters
# If you want to keep a specific cluster anyway, add its label to the ok_labels list
def label_check(df_length: int, label: str, minimum_num: int = 40, ok_labels: list = []):
    if label in ok_labels:
        return True
    if label == '-1':  # -1 is HDBSCAN's noise label
        return False
    if df_length < minimum_num:
        return False
    else:
        return True
minimum_num = 30 # the minimum number of the cluster size
ok_labels = ['9']
labels = sample_df['HDBSCAN_label'].unique().tolist()
os.makedirs('models', exist_ok=True)
for label in labels:
    df_length = len(sample_df[sample_df['HDBSCAN_label'] == label])
    if not label_check(df_length, label, minimum_num=minimum_num, ok_labels=ok_labels):
        print(f'skip Label {label}')
        continue
    print(f'processing label {label}')
    # Build a one-versus-all target for the current cluster
    sample_df['label'] = 0
    sample_df.loc[sample_df['HDBSCAN_label'] == label, 'label'] = 1
    X_train, X_test, y_train, y_test = train_test_split(sample_df[scaled_columns].values,
                                                        sample_df['label'],
                                                        stratify=sample_df['label'],
                                                        test_size=0.05)
    print(np.unique(y_train, return_counts=True)[-1])
    train_df = pd.DataFrame(X_train, columns=scaled_columns)
    test_df = pd.DataFrame(X_test, columns=scaled_columns)
    # Balance the positive and negative classes via sample weights
    classes_weights = class_weight.compute_sample_weight(
        class_weight='balanced',
        y=y_train
    )
    tree = XGBClassifier(n_estimators=2, max_depth=2, objective='binary:logistic')
    tree.fit(train_df, y_train, sample_weight=classes_weights)
    preds = tree.predict(test_df)
    print(f"\nLabel{label}: Classification Report:")
    print(classification_report(y_test, preds))
    xgboost.plot_importance(tree)
    pl.title(f"Label{label}: xgboost.plot_importance(model)")
    pl.show()
    explainer = shap.Explainer(tree, train_df)
    shap_values = explainer(train_df)
    shap.plots.bar(shap_values)
    tree.save_model(f'models/xgboost-{label}.model')
According to the results, XGBoost tends to focus on four features: “videos_count,” “following,” “followers,” and “awg_engagement_rate.” Furthermore, the explanation from SHAP matches my assumption. Therefore, you can gain a better understanding of the clusters from the first stage via XGBoost + SHAP. Note that the model might overfit if your data size is small, so you may need to add regularization to prevent it.
Once you have trained the XGBoost models, you can reuse them whenever you get new data. Isn’t that convenient? You now have a powerful tool to easily explain to your stakeholders why each cluster exists.
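For example, a minimal sketch of reusing one of the saved models on newly arrived data might look like this; the cluster id 1, the stand-in new_df, and the assumption that new data has already been preprocessed into the same scaled_columns are all illustrative.
# Load the saved one-versus-all model for an example cluster (here, cluster 1)
loaded = XGBClassifier()
loaded.load_model('models/xgboost-1.model')

# Stand-in for newly arrived, already-preprocessed user data
new_df = sample_df.sample(5, random_state=0).copy()

# Probability that each new user belongs to cluster 1
new_df['cluster_1_score'] = loaded.predict_proba(new_df[scaled_columns])[:, 1]
print(new_df[['cluster_1_score']])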