When using MiniBatchKMeans with BERTopic, it's common for some documents to end up without a meaningful topic assignment, typically due to:
• High dimensionality of embeddings: raw embeddings may be too sparse or poorly separated for the clusterer.
• Noise in the data: some documents do not clearly belong to any cluster.
How to solve this issue:
Tune n_clusters in MiniBatchKMeans:
• Start by testing different values for n_clusters (a small sweep sketch follows the snippet below). If it's too low, distinct topics may merge; if it's too high, topics fragment into many small, redundant clusters.
from bertopic import BERTopic
from sklearn.cluster import MiniBatchKMeans

# Any sklearn-style clusterer can be passed via the hdbscan_model argument
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=42)
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2", hdbscan_model=cluster_model)
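One simple way to run that sweep is to embed the corpus once and score a few candidate cluster counts with the silhouette score; a minimal sketch, assuming documents is your corpus as a list of strings:

from sentence_transformers import SentenceTransformer
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

# Embed once, then compare a few candidate values of n_clusters
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(documents)
for k in (20, 50, 100):
    labels = MiniBatchKMeans(n_clusters=k, random_state=42).fit_predict(embeddings)
    print(f"n_clusters={k}: silhouette={silhouette_score(embeddings, labels):.3f}")

Higher silhouette values indicate better-separated clusters; the exact candidate values (20, 50, 100) are illustrative.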
Use a Different Clustering Algorithm:
BERTopic supports swapping in other clustering models. HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise), BERTopic's default, adapts to clusters of varying density and explicitly labels documents that fit no cluster as outliers (topic -1).
Example:
from bertopic import BERTopic
from hdbscan import HDBSCAN

# prediction_data=True is optional; it lets the fitted model assign topics to new documents later
cluster_model = HDBSCAN(min_cluster_size=10, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2", hdbscan_model=cluster_model)
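If HDBSCAN still leaves many documents in topic -1, newer BERTopic releases (0.13 and later) provide reduce_outliers to reassign them after fitting; a minimal sketch, again assuming documents is your corpus as a list of strings:

topics, probs = topic_model.fit_transform(documents)
new_topics = topic_model.reduce_outliers(documents, topics)  # reassign -1 docs to their closest topics
topic_model.update_topics(documents, topics=new_topics)      # refresh the topic representations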
Reduce Dimensionality Before Clustering:
Use dimensionality reduction (e.g., UMAP) to make the data more clusterable:
from bertopic import BERTopic
from umap import UMAP

# Project the embeddings down to 5 dimensions before clustering
umap_model = UMAP(n_neighbors=15, n_components=5, metric='cosine')
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2", umap_model=umap_model)
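These pieces compose: a sketch combining UMAP and HDBSCAN in one model, with random_state added so runs are reproducible (UMAP is stochastic by default):

from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=10, metric='euclidean', cluster_selection_method='eom')
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2", umap_model=umap_model, hdbscan_model=hdbscan_model)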
Analyze Unassigned Data:
Inspect what makes the unassigned documents different: they may be genuine outliers, or too generic to anchor a distinct topic.
Example:
topics, probs = topic_model.fit_transform(documents)  # topic -1 marks outlier documents
unassigned_data = [doc for doc, topic in zip(documents, topics) if topic == -1]
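get_topic_info() is also handy here: it returns a DataFrame of topic sizes, with the outlier topic listed as -1:

info = topic_model.get_topic_info()  # columns include Topic, Count, and Name
print(info.head())                   # the row with Topic == -1 is the outlier bucket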
Increase Training Data Size:
If your dataset is too small, clustering might struggle to find consistent patterns.
Adjust BERTopic Parameters:
• min_topic_size: set a smaller value to allow smaller topics to form.
• n_gram_range: experiment with different n-gram ranges for the topic representations.
topic_model = BERTopic(n_gram_range=(1, 3), min_topic_size=5)
Refine Preprocessing:
Ensure text data is clean, normalized, and free of irrelevant tokens or stopwords.
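Rather than scrubbing stopwords out of the raw documents (which can degrade the sentence embeddings), BERTopic lets you remove them only at the keyword-extraction stage via a custom CountVectorizer; a minimal sketch:

from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

# Stopwords are dropped only when building topic keywords;
# the embedding model still sees the full sentences
vectorizer_model = CountVectorizer(stop_words="english", min_df=2)
topic_model = BERTopic(vectorizer_model=vectorizer_model)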
Debugging:
• After making changes, check how many documents remain in the outlier topic:
unclustered_count = sum(1 for t in topics if t == -1)
print(f"Unclustered points: {unclustered_count}")