The rapid growth of astronomical imaging data from next-generation surveys calls for automated, scalable approaches to galaxy morphology classification that do not depend on the manual labels required by supervised methods. We present an unsupervised multimodal deep learning framework that integrates ConvNeXt-derived visual embeddings with quantitative morphological parameters (concentration, asymmetry, smoothness, Gini, and M20) to uncover natural taxonomic structure within galaxy populations. Using a rigorously cleaned and physically verified sample of 4950 galaxies from the Sloan Digital Sky Survey, we built a PyTorch-based Multimodal Autoencoder (MAE) that compresses the combined features into a dense 64-dimensional bottleneck, addressing the dimensionality imbalance between the high-dimensional visual embeddings and the five morphological parameters. Clustering was performed exclusively within this latent space using a probabilistic Gaussian Mixture Model (GMM). An explicit ablation study confirmed that the multimodal architecture improves cluster cohesion relative to either modality in isolation. We further assessed the astrophysical integrity of the unsupervised clusters through a new proxy external validation against classical heuristic constraints, obtaining a baseline alignment of 52.7%. Using GMM log-likelihoods, we isolated extreme physical anomalies (limiting the noise fraction to 2.0%), producing a physically coherent taxonomy that maps onto Early-Type, Late-Type, and Interacting systems. Processing took ~27.6 ms per galaxy, demonstrating strong scalability for upcoming large-scale surveys such as LSST and Roman. This study establishes a foundation for unsupervised morphology analysis at survey scale, advancing our understanding of galaxy evolution through multimodal deep representation learning.
Selim et al. (Sat,) studied this question.
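The log-likelihood-based anomaly isolation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a diagonal-covariance mixture, a toy 2-dimensional latent space in place of the 64-dimensional bottleneck, and hand-set mixture parameters instead of a fitted GMM. The function names (`diag_gaussian_logpdf`, `gmm_log_likelihood`, `flag_anomalies`) and all numerical values are hypothetical and chosen only for illustration.

```python
import math
import random


def diag_gaussian_logpdf(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """log p(x) under the mixture, using log-sum-exp for stability."""
    terms = [
        math.log(w) + diag_gaussian_logpdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def flag_anomalies(points, weights, means, variances, fraction=0.02):
    """Flag the `fraction` of points with the lowest mixture log-likelihood."""
    scores = [gmm_log_likelihood(p, weights, means, variances) for p in points]
    n_flag = max(1, round(fraction * len(points)))
    order = sorted(range(len(points)), key=lambda i: scores[i])
    return sorted(order[:n_flag]), scores


# Hypothetical three-component mixture in a toy 2-D latent space (assumption).
weights = [1 / 3, 1 / 3, 1 / 3]
means = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
variances = [[0.25, 0.25]] * 3

# 98 synthetic inliers drawn near the component means, plus 2 planted outliers.
random.seed(0)
points = [
    [random.gauss(m, 0.5) for m in means[i % 3]] for i in range(98)
]
points += [[20.0, 20.0], [-15.0, 10.0]]

flagged, scores = flag_anomalies(points, weights, means, variances, fraction=0.02)
print(flagged)  # indices of the two planted outliers: [98, 99]
```

Thresholding on a fixed fraction of the lowest log-likelihoods is one simple way to obtain a bounded noise fraction such as the 2.0% quoted above; in practice the scores would come from the GMM fitted in the learned latent space rather than from hand-set parameters.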