Model-based clustering of high-dimensional incomplete data via contaminated-normal mixtures

Leila Shahriari, Mehrdad Naderi*, Mohsen Khosravi

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

With the steady advancement of data collection technologies, practical users of statistical methods increasingly turn to the mixture of factor analysers (MFA) for model-based clustering and dimensionality reduction. However, as the number of measurements grows, so does the likelihood of missing data and outliers, which can lead to biased parameter estimates, reduced stability and robustness, and ultimately inaccurate inferences. This paper presents a new variant of the MFA model that can accommodate missing data and mild outliers. The main assumption of the proposed model is that the latent factors and idiosyncratic errors jointly follow a contaminated-normal distribution, which incorporates parameters for automatic outlier detection. We develop ECM and AECM algorithms to compute maximum likelihood (ML) parameter estimates. Asymptotic standard errors of the parameters are derived using an information-based approach. Several simulation experiments are conducted to examine the asymptotic properties of the ML estimators and assess the model’s ability to mitigate the influence of missing data and outliers. We further illustrate the model’s practical applicability in social data analysis and image reconstruction, using cost-of-living data and the Barbara image as case studies. Software implementing the presented methodology is available at https://github.com/leila-shahriari/CNMFA-Model.
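
To make the outlier-detection idea concrete, the sketch below uses the textbook parameterization of a single contaminated-normal component, f(x) = alpha * N(x; mu, Sigma) + (1 - alpha) * N(x; mu, eta * Sigma) with eta > 1, and computes the posterior probability that an observation is a "good" (non-outlying) point. This is an illustration only and is not taken from the authors' software: the function name `prob_good`, the symbols `alpha` and `eta`, and the 0.5 flagging threshold are assumptions, and the paper's own notation and missing-data handling may differ.

```python
import math


def prob_good(delta_sq, d, alpha, eta):
    """Posterior probability that a point is 'good' under a contaminated normal.

    delta_sq : squared Mahalanobis distance (x - mu)' Sigma^{-1} (x - mu)
    d        : dimension of x
    alpha    : prior proportion of good points (assumed in (0, 1))
    eta      : covariance inflation factor for outliers (assumed > 1)
    """
    # log of [(1 - alpha) N(x; mu, eta*Sigma)] / [alpha N(x; mu, Sigma)];
    # the normalizing constants cancel except for the eta^(-d/2) factor.
    log_ratio = (
        math.log1p(-alpha)
        - math.log(alpha)
        - 0.5 * d * math.log(eta)
        + 0.5 * delta_sq * (1.0 - 1.0 / eta)
    )
    # Guard against overflow in math.exp for extreme outliers.
    if log_ratio > 700.0:
        return 0.0
    return 1.0 / (1.0 + math.exp(log_ratio))


# Hypothetical usage: a point at the centre versus a point far from it.
for delta_sq in (0.0, 72.0):
    p = prob_good(delta_sq, d=2, alpha=0.9, eta=10.0)
    label = "outlier" if p < 0.5 else "good"
    print(f"delta_sq={delta_sq:5.1f}  P(good)={p:.3g}  -> {label}")
```

Under this parameterization the probability depends on the observation only through its Mahalanobis distance, which is why a fitted component can flag mild outliers without a separate detection step.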

Original language: English
Number of pages: 33
Journal: Advances in Data Analysis and Classification
Early online date: 1 Nov 2025
Publication status: E-pub ahead of print - 1 Nov 2025

Keywords

  • Contaminated-normal distribution
  • Dimension reduction
  • Factor analysis
  • Heavy-tailed mixtures
  • Missing at random
