(2024). Model-based clustering of categorical data based on the Hamming distance [journal article]. In Journal of the American Statistical Association. Retrieved from https://hdl.handle.net/10446/304866
Model-based clustering of categorical data based on the Hamming distance
Argiento, Raffaele;
2024-01-01
Abstract
A model-based approach is developed for clustering categorical data with no natural ordering. The proposed method uses the Hamming distance to define a family of probability mass functions for the data. The elements of this family then serve as kernels of a finite mixture model with an unknown number of components. Conjugate Bayesian inference is derived for the parameters of the Hamming distribution model. The mixture is framed in a Bayesian nonparametric setting, and a transdimensional blocked Gibbs sampler is developed to provide full Bayesian inference on the number of clusters, their structure, and the group-specific parameters, which simplifies computation relative to customary reversible jump algorithms. When the number of components is fixed, the proposed model includes a parsimonious latent class model as a special case. Model performance is assessed via a simulation study and reference datasets, showing improvements in clustering recovery over existing approaches. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.

| File | File size | Format |
|---|---|---|
| Model-Based Clustering of Categorical Data Based on the Hamming Distance.pdf (access: archive managers only; version: publisher's version; license: Aisberg default license) | 1.96 MB | Adobe PDF |
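
The abstract does not state the form of the Hamming-based probability mass function, so the following is only a minimal sketch under an assumed parameterization: a kernel proportional to exp(-d_H(x, c) / sigma), where d_H is the Hamming distance to a center c and sigma is a single positive scale shared by all attributes. The function names `hamming_distance` and `hamming_pmf`, the shared scale, and the example values are illustrative assumptions and are not taken from the article.

```python
import math
from itertools import product


def hamming_distance(x, center):
    """Count the attributes on which x and center disagree."""
    if len(x) != len(center):
        raise ValueError("x and center must have the same length")
    return sum(a != b for a, b in zip(x, center))


def hamming_pmf(x, center, sigma, n_categories):
    """Assumed kernel: p(x) proportional to exp(-d_H(x, center) / sigma).

    The normalizing constant factorizes over attributes: an attribute with
    m levels contributes 1 (match) plus (m - 1) * exp(-1 / sigma) (mismatches).
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    w = math.exp(-1.0 / sigma)
    norm = math.prod(1.0 + (m - 1) * w for m in n_categories)
    return math.exp(-hamming_distance(x, center) / sigma) / norm


# Hypothetical example: three unordered attributes with 3, 2, and 4 levels.
n_categories = [3, 2, 4]
center = (0, 1, 2)
sigma = 0.5

# Enumerate the full support to check that the probabilities sum to one.
support = list(product(*(range(m) for m in n_categories)))
total = sum(hamming_pmf(x, center, sigma, n_categories) for x in support)
mode = max(support, key=lambda x: hamming_pmf(x, center, sigma, n_categories))
```

Under this assumed form, the mass is highest at the center and decays with each mismatching attribute, so smaller values of `sigma` concentrate the distribution more tightly around `center`. A mixture of such kernels, one center and scale per component, is the kind of model the abstract describes.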

