Choosing a Clustering Algorithm in Binning Policy
Binning-PG™
Will using different clustering algorithms affect the results of binning policy generation? For example, will I get different end results if I use K-Means instead of DBSCAN, or vice versa? Or is it possible to use both clustering algorithms (i.e., alternate between them) to get the ground truth? How do you know which clustering algorithm is the best one to use?
💬 Comments section
Hi Sophie,
The choice of clustering algorithm can significantly affect the outcome of binning policy generation. Each algorithm is based on different assumptions about data distribution, which leads to different clustering results in terms of shape, size, and cluster count.
❌ Limitations of DBSCAN in Our Use Case:
- While DBSCAN is theoretically good at detecting arbitrarily shaped clusters and noise, it performs poorly in high-dimensional spaces. We are working with 14 features, where pairwise distances become less discriminative and a single density threshold is hard to set.
- It requires manual tuning of hyperparameters like ε (radius) and minPts, which are highly sensitive and dataset-dependent.
- In practice, DBSCAN fails to consistently form meaningful clusters, often resulting in a large noisy group or many small fragmented clusters, making it ineffective for stable binning policy generation.
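To make the sensitivity to ε concrete, here is a minimal sketch using scikit-learn. It is not the Binning-PG implementation: the data is a synthetic 14-feature stand-in generated with `make_blobs`, and the helper name `dbscan_summary`, the ε grid, and `min_pts=10` are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): 14 features, 4 blobs.
X, _ = make_blobs(n_samples=600, n_features=14, centers=4,
                  cluster_std=2.0, random_state=0)
X = StandardScaler().fit_transform(X)


def dbscan_summary(X, eps, min_pts=10):
    """Return (number of clusters, fraction of points labeled as noise)."""
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
    # DBSCAN marks noise with the label -1, which is not a cluster.
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    noise_fraction = float(np.mean(labels == -1))
    return n_clusters, noise_fraction


# Sweep eps and watch how the cluster count and noise share change.
summaries = {eps: dbscan_summary(X, eps) for eps in (0.5, 1.0, 1.5, 2.0, 3.0)}
for eps, (n_clusters, noise) in summaries.items():
    print(f"eps={eps:.1f}  clusters={n_clusters}  noise={noise:.0%}")
```

Running a sweep like this on your own data shows how quickly the result can move between "almost everything is noise" and "almost everything is one cluster" as ε changes.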
✅ Advantages of K-Means:
- K-Means scales better in high-dimensional spaces and, with k-means++ initialization, is much less prone to the poor convergence caused by bad initial centroids.
- It allows for integration with dimensionality reduction (e.g., PCA) and feature selection, improving cluster quality.
- More importantly, K-Means provides interpretable and structured bins, which are more suitable for rule-based binning policy deployment.
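As a rough illustration of how these pieces fit together, the sketch below chains scaling, PCA, and K-Means with k-means++ initialization in a scikit-learn pipeline. The synthetic data, the 95% variance threshold, the choice of 4 clusters, and the names `binning_model` and `bin_ids` are all assumptions made for the example, not settings taken from Binning-PG.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): 14 features, 4 blobs.
X, _ = make_blobs(n_samples=600, n_features=14, centers=4, random_state=0)

binning_model = make_pipeline(
    StandardScaler(),          # put all features on a comparable scale
    PCA(n_components=0.95),    # keep enough components for 95% of the variance
    KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0),
)

# Each sample gets a bin id in {0, 1, 2, 3}.
bin_ids = binning_model.fit_predict(X)

# New samples are assigned to the nearest centroid in the reduced space.
new_bin_ids = binning_model.predict(X[:5])

# One centroid per bin; these are what make the bins easy to describe.
centroids = binning_model[-1].cluster_centers_
print(centroids.shape)
```

Because every bin is summarized by a single centroid, assigning a new sample is a deterministic nearest-centroid lookup, which is what makes the result convenient for rule-based deployment.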
✅ Can We Combine K-Means and DBSCAN?
While it is theoretically possible to compare or ensemble results from different clustering algorithms, alternating between DBSCAN and K-Means does not provide a definitive "ground truth":
- Their clustering assumptions (density-based vs. distance-based) are fundamentally different.
- Without ground truth labels, there’s no absolute reference to determine which result is correct.
- Instead, we rely on unsupervised validation metrics (e.g., Silhouette Score, where higher is better, and Davies-Bouldin Index, where lower is better) to identify the best clustering result for our application.
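A minimal sketch of this kind of validation with scikit-learn is shown below. The data is again a synthetic 14-feature stand-in, and the helper `validate`, the range of k, and the rule of picking the highest silhouette are assumptions for illustration only. The same helper could score labels from any algorithm, as long as at least two distinct labels are present.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): 14 features, 4 blobs.
X, _ = make_blobs(n_samples=600, n_features=14, centers=4, random_state=0)
X = StandardScaler().fit_transform(X)


def validate(X, labels):
    """Return (silhouette, Davies-Bouldin) for one labeling of X."""
    # Silhouette: higher is better. Davies-Bouldin: lower is better.
    return silhouette_score(X, labels), davies_bouldin_score(X, labels)


scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10,
                    random_state=0).fit_predict(X)
    scores[k] = validate(X, labels)
    print(f"k={k}  silhouette={scores[k][0]:.3f}  davies_bouldin={scores[k][1]:.3f}")

# One simple selection rule (assumption): take the k with the highest silhouette.
best_k = max(scores, key=lambda k: scores[k][0])
print("best k by silhouette:", best_k)
```

These scores rank candidate clusterings against each other. They do not prove that any one of them is the true structure of the data.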
✅ Conclusion:
Given our dataset’s characteristics (high dimensionality and the need for stable, interpretable bins), K-Means with k-means++ initialization and feature optimization (e.g., PCA) has proven to be significantly more practical and effective than DBSCAN for binning policy generation.