Comparison Of Clustering Algorithms In Analyzing E-Commerce Data From Kaggle
DOI:
https://doi.org/10.59888/insight.v4i1.105
Keywords:
clustering algorithm, e-commerce, K-Means, DBSCAN, customer segmentation, RFM
Abstract
The rapid growth of the e-commerce industry produces a huge volume of transaction data, so appropriate analysis techniques are needed to extract information that supports business decision-making. This study compares the performance of three clustering algorithms, namely K-Means, DBSCAN, and Hierarchical Clustering, in analyzing an e-commerce dataset sourced from the Kaggle platform. The dataset used is "Online Retail II", published by Daqing Chen through the UCI Machine Learning Repository and Kaggle, containing 541,909 transactions from an online retail company in the UK. After data cleaning, 406,829 valid transactions from 4,372 unique customers were used as the basis for analysis. The data were analyzed using the RFM (Recency, Frequency, Monetary) approach, with the three RFM values serving as clustering features for customer segmentation. Algorithm performance was evaluated using three internal validation metrics: the Silhouette Score, the Davies-Bouldin Index (DBI), and the Calinski-Harabasz Index (CHI). The results showed that K-Means with k=3 performed best, with the highest Silhouette Score (0.612) and the lowest DBI (0.842), followed by Hierarchical Clustering with Ward linkage and then DBSCAN. K-Means was also more computationally efficient, with an execution time of 1.23 seconds compared with 8.72 seconds for Hierarchical Clustering. The resulting segmentation identified three main customer groups: High-value Customers (31.0%), Medium-value Customers (43.3%), and Low-value/At-Risk Customers (25.7%), the last group consisting of passive or at-risk customers. These findings have practical implications for e-commerce businesses, particularly in designing segmented marketing strategies, loyalty programs, and customer reactivation campaigns, and in choosing a clustering algorithm that matches their data characteristics and business analytics needs.
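
The workflow summarized in the abstract (RFM features per customer, three clustering algorithms, three internal validation metrics) can be sketched roughly as follows. This is a minimal illustration using pandas and scikit-learn, not the authors' implementation. The column names (`CustomerID`, `Invoice`, `InvoiceDate`, `Quantity`, `Price`), the standardization step, the DBSCAN parameters `eps` and `min_samples`, and the synthetic demo data are all assumptions made for the example; only k=3 and Ward linkage come from the abstract.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import StandardScaler


def build_rfm(tx, snapshot):
    """Aggregate transactions into one Recency/Frequency/Monetary row per customer.

    Column names are hypothetical; adapt them to the actual dataset.
    """
    tx = tx.assign(Amount=tx["Quantity"] * tx["Price"])
    grouped = tx.groupby("CustomerID")
    return pd.DataFrame({
        # Days between the snapshot date and the customer's last purchase.
        "Recency": (snapshot - grouped["InvoiceDate"].max()).dt.days,
        # Number of distinct invoices.
        "Frequency": grouped["Invoice"].nunique(),
        # Total amount spent.
        "Monetary": grouped["Amount"].sum(),
    })


def evaluate(X, labels):
    """Return the three internal metrics, ignoring DBSCAN noise points (label -1)."""
    mask = labels != -1
    if len(set(labels[mask])) < 2:
        return None  # the metrics are undefined for fewer than two clusters
    return {
        "silhouette": float(silhouette_score(X[mask], labels[mask])),
        "dbi": float(davies_bouldin_score(X[mask], labels[mask])),
        "chi": float(calinski_harabasz_score(X[mask], labels[mask])),
    }


def compare(rfm, k=3, eps=0.5, min_samples=5):
    """Fit the three algorithms on standardized RFM features and score each one."""
    # Standardization is an assumption; the abstract does not state the preprocessing.
    X = StandardScaler().fit_transform(rfm[["Recency", "Frequency", "Monetary"]])
    models = {
        "kmeans": KMeans(n_clusters=k, n_init=10, random_state=0),
        "hierarchical_ward": AgglomerativeClustering(n_clusters=k, linkage="ward"),
        "dbscan": DBSCAN(eps=eps, min_samples=min_samples),  # assumed parameters
    }
    return {name: evaluate(X, model.fit_predict(X)) for name, model in models.items()}


# Synthetic RFM table with three well-separated groups, for demonstration only.
rng = np.random.default_rng(0)
centers = [(10, 20, 2000), (60, 8, 600), (200, 2, 100)]
demo_rfm = pd.DataFrame(
    np.vstack([rng.normal(c, (3, 2, 50), size=(50, 3)) for c in centers]),
    columns=["Recency", "Frequency", "Monetary"],
)
for name, scores in compare(demo_rfm).items():
    print(name, scores)
```

A higher Silhouette Score and CHI and a lower DBI indicate more compact, better-separated clusters, which is how the abstract ranks the algorithms. Excluding DBSCAN noise points before scoring is one possible convention; it tends to flatter DBSCAN, so the choice should be reported alongside the results.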




