PhD Dissertation at the College of Information Technology Discusses the Optimal K Value Estimation for the K-Means Algorithm in Data Stream Clustering
Duhaa Fadill Abbas
The Software Department at the College of Information Technology, University of Babylon, held the defense of a PhD dissertation titled "Optimal K Value Estimation for the K-Means Algorithm in Data Stream Clustering." The dissertation was presented by researcher Abeer Mahmoud Hassan Ahmed under the supervision of Dr. Saad Talib Hassoun on Thursday, March 13, 2025, in the College Conference Hall.
The dissertation addresses a fundamental challenge in data stream clustering, focusing on enhancing the K-Means algorithm, which is widely used due to its simplicity and efficiency. The research proposes novel models to determine the optimal K value for clustering continuous data streams, which are characterized by high velocity, large volume, heterogeneity, and real-time generation from multiple sources.
Proposed Models and Methodologies
The study introduces advanced clustering techniques, including:
- Adaptive Dynamic Diameter and Boundary Threshold (ADDBT), which uses a Probability Density Function (PDF) in place of the conventional Gaussian distribution.
- Prototype Multi-Channel (P-M-C) Model, which estimates the optimal K value based on prototype selection and frequency distribution.
Experimental Evaluation and Real-World Data Collection
The proposed models were evaluated using eight streaming datasets, including four benchmark datasets:
- Iris Dataset
- Household Electricity Consumption Stream
- Global Traffic Signal Data Stream
- KDD Cup Data Stream
Additionally, real-time sensor-based data were collected and analyzed, including:
- Weather data recorded for 10 months in Hilla, Iraq.
- Human physiological data collected using wearable sensors over four months.
- Child health data from six children, monitored over four months.
- Medical patient data gathered from 150 individuals in a pharmacy setting.
Key Findings and Contributions
The results demonstrate significant improvements in clustering quality compared to traditional algorithms such as CMFT Kernel-K-Means, CluStream, and D-Stream. Measured by the Silhouette Score, clustering quality improved by 10% on average, with gains ranging from 45% to 200% when compared to KM-STREEM++, IAPKM, and other baseline approaches.
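For readers unfamiliar with the Silhouette Score, the following Python sketch shows how the metric can be used to choose K on a single window of a data stream. It is a generic textbook baseline and not the dissertation's ADDBT or P-M-C model: it runs plain K-Means for each candidate K and keeps the K with the highest mean silhouette. The function names (`farthest_first_centers`, `kmeans`, `silhouette_score`, `estimate_k`), the farthest-first initialization, and the synthetic `window` data are all assumptions made for illustration.

```python
import math
import random


def farthest_first_centers(points, k):
    """Deterministic init (an assumption): start at points[0], then repeatedly
    add the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, iters=100):
    """Plain Lloyd's K-Means; returns one cluster label per point."""
    centers = farthest_first_centers(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (range -1 to 1)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def estimate_k(points, k_values):
    """Return (best K, {K: score}) using the highest silhouette as the criterion."""
    scores = {k: silhouette_score(points, kmeans(points, k)) for k in k_values}
    return max(scores, key=scores.get), scores


# Synthetic stand-in for one stream window: three well-separated 2-D groups.
rng = random.Random(0)
blobs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
window = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)) for cx, cy in blobs for _ in range(20)]
best_k, scores = estimate_k(window, range(2, 7))
print(best_k, {k: round(s, 3) for k, s in scores.items()})
```

Scanning every candidate K this way requires rerunning the clustering for each value, which is costly on a high-velocity stream. That cost is one reason a dedicated estimation model, such as those the dissertation proposes, is of interest.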
This research presents a major advancement in real-time data clustering, offering robust, scalable, and adaptive techniques that enhance clustering accuracy for dynamic, high-velocity data streams. The findings contribute to various fields, including Artificial Intelligence, Internet of Things (IoT), medical data analysis, and environmental monitoring, providing a foundation for intelligent data-driven decision-making systems.