Clustering Big Data with Mixed Features

Date: Thursday, December 10th, 2020
Location: Zoom (the link will be posted soon)
Time: 12:00 pm WET

Speaker

Joshua Tobin, from the School of Computer Science & Statistics, Trinity College Dublin

Abstract

Clustering large, mixed data is a central problem in data mining. Many approaches adopt the idea of k-means, and hence are sensitive to initialisation, detect only spherical clusters, and require the number of clusters, which is usually unknown, to be specified a priori. We develop a new clustering algorithm for large data of mixed type, aiming to improve the applicability and efficiency of the peak-finding technique. The improvements are threefold: (1) the new algorithm is applicable to mixed data; (2) the algorithm is capable of detecting outliers and clusters of relatively low density; (3) the algorithm is capable of determining the correct number of clusters. The computational complexity of the algorithm is greatly reduced by applying a fast k-nearest neighbours method and by scaling down to component sets. We present experimental results to verify that our algorithm works well in practice.
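
For readers unfamiliar with the peak-finding idea mentioned in the abstract, the following is a minimal sketch of a generic density-peak clustering on mixed-type records. It is not the speaker's algorithm and not the CPFcluster API. The names `mixed_distance` and `peak_finding` are hypothetical, the distance (Euclidean on numeric fields plus a 0/1 mismatch on other fields) is an assumption chosen for illustration, distances are computed by brute force rather than with a fast k-nearest neighbours method, and the number of clusters is passed in by hand rather than determined automatically.

```python
import math


def mixed_distance(a, b):
    """Hypothetical mixed-type distance (an assumption for illustration):
    squared difference on numeric fields, 0/1 mismatch on all other fields."""
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            total += (x - y) ** 2
        else:
            total += 0.0 if x == y else 1.0
    return math.sqrt(total)


def peak_finding(points, k, n_clusters):
    """Generic density-peak sketch; requires k < len(points)."""
    n = len(points)
    # Brute-force pairwise distances, O(n^2); fine for a toy example only.
    dist = [[mixed_distance(p, q) for q in points] for p in points]

    # Density estimate: inverse distance to the k-th nearest neighbour.
    # Index 0 of the sorted row is the point itself (distance 0).
    density = [1.0 / (sorted(row)[k] + 1e-12) for row in dist]

    # Visit points from highest to lowest density.
    order = sorted(range(n), key=lambda i: -density[i])
    parent = [-1] * n   # nearest neighbour of higher density
    delta = [0.0] * n   # distance to that neighbour
    delta[order[0]] = math.inf  # the global peak has no denser neighbour
    for rank in range(1, n):
        i = order[rank]
        j = min(order[:rank], key=lambda h: dist[i][h])
        parent[i], delta[i] = j, dist[i][j]

    # Peaks: points that are both dense and far from any denser point.
    centres = sorted(range(n), key=lambda i: -density[i] * delta[i])[:n_clusters]
    labels = [-1] * n
    for c, i in enumerate(centres):
        labels[i] = c

    # Every other point inherits the label of its denser neighbour,
    # which has already been labelled because of the visiting order.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels


# Toy mixed-type data: two numeric fields and one categorical field.
points = [
    (0.0, 0.0, "a"), (0.1, 0.0, "a"), (0.0, 0.1, "a"), (0.1, 0.1, "a"),
    (5.0, 5.0, "b"), (5.1, 5.0, "b"), (5.0, 5.1, "b"), (5.1, 5.1, "b"),
]
labels = peak_finding(points, k=2, n_clusters=2)
print(labels)
```

Because each point simply follows its nearest denser neighbour, clusters found this way need not be spherical, which is one reason peak-finding is contrasted with k-means above.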

Supplementary Materials

Paper: The paper is available on arXiv here.

Code: The associated Python library, CPFcluster, is available here.

Registration

Registration is free and open to everyone.
Please click here to register.
Further details on this webinar will be posted in the coming days.

Webinar Video
