
Workshop on Multiblock Data Analysis and Related Methods

 

 JANUARY 17, 2025

CONSERVATOIRE NATIONAL DES ARTS ET METIERS

INTERNATIONAL ORGANIZING COMMITTEE
Ndèye Niang (France), Alba Martínez-Ruiz (Chile), Arthur Tenenhaus (France), Rosaria Lombardo (Italy), Mohamed Hanafi (France), Laurent Le Brusquet (France)

Local Time: Paris – France, CET

The Laboratoire des Signaux & Systèmes at CentraleSupélec, the Conservatoire National des Arts et Métiers, the DataIA Institute, the Société Francophone de Classification (SFC), and the Latin American Regional Section of the International Association for Statistical Computing (IASC-LARS) are pleased to invite you to a one-day workshop dedicated to exploring advancements in the analysis of multiblock data and related methods. This event will bring together researchers and practitioners to discuss a range of methodological and applied topics, including structural equation modeling, tensor analysis, multi-way contingency tables, and clustering approaches for multiblock data.

The workshop will feature presentations by leading experts who will share cutting-edge developments and insights in the field. This is a unique opportunity to exchange ideas, explore applications, and foster collaborations around multiblock methods and related concepts.

The workshop has been supported by the Assistance Publique – Hôpitaux de Paris (APHP), the SFC via the CNAM, CentraleSupélec, and the DataIA Institute, and endorsed by the University of Campania “Luigi Vanvitelli”, the Oniris VetAgroBio Nantes École Nationale, the Italian Statistical Society (SIS), and the International Association for Statistical Computing (IASC).

Participation is free of charge, with compulsory registration. For organizational purposes, please register using the Registration Form (https://framaforms.org/workshop-multiblock-data-analysis-and-related-methods-1733767359) by January 10, 2025.

We look forward to your participation!

AGENDA

8.00 – 8.45
Welcome reception

8.45 – 9.00
Welcome words and opening ceremony

9.00 – 9.45
Structural equation modelling with both factors and components
Heungsun Hwang, McGill University, Canada

9.45 – 10.30
Regularized approximate factor analysis for exploratory SEM of (high-dimensional) multiblock data
Katrijn Van Deun, Tilburg University, The Netherlands

10.30 – 11.15
Coffee Break

11.15 – 12.00
A network approach to Joint Dimension Reduction of a set of data tables
Mohamed Hanafi, Oniris VetAgroBio, France

12.00 – 14.00
Lunch

14.00 – 14.45
Variants of Three-way Correspondence Analysis: Analysing gender differences in patients with type 1 diabetes and cardiovascular complications
Rosaria Lombardo, University of Campania “Luigi Vanvitelli”, Italy
Eric J. Beh, University of Wollongong, Australia

14.45 – 15.10
Sparse and integrative principal component analysis for Multiview data
Luo Xiao, North Carolina State University, USA

15.10 – 15.30
Tensor multiblock logistic regression
Alexandre Selvestrel, Paris-Saclay University CentraleSupélec, France

15.30 – 16.15
Coffee Break

16.15 – 17.00
Distance-based learning for mixed-type data
Alfonso Iodice D’Enza, University of Naples Federico II, Italy

17.00 – 17.45
An overview of current trends in multi-view clustering
Mohamed Nadif, Université Paris Cité, France

17.45 – 18.00
Closing words

Structural equation modelling with both factors and components

Heungsun Hwang
Department of Psychology
McGill University, Canada

Abstract. Structural equation modelling (SEM) is widely used to examine theory-driven relationships between constructs, such as self-esteem, depression, socioeconomic status, etc. Constructs are abstract concepts that are not directly measurable and are represented by entities linked to empirical data or observed variables in statistical models. This allows researchers to test hypotheses about their relationships. In SEM, constructs have been represented as factors (also known as latent variables) or as weighted composites of observed variables, referred to as components.

As psychology and many other sciences become interdisciplinary, there is an increasing need to simultaneously consider distinct types of constructs to understand human behaviour and cognition from more diverse perspectives. Some constructs can be better represented as factors, while others can be better represented as components. For instance, researchers are increasingly interested in the influences of genetic variation and/or altered brain activities on the variation of psychological constructs in cognition, personality, or mental disorders. Psychological constructs are typically considered factors, while genetic or imaging constructs, such as genes and brain regions, can be regarded as components.

Existing SEM methods are not suitable for estimating models that include both factors and components. Therefore, I recently proposed an SEM method, termed integrated generalized structured component analysis (IGSCA), to estimate such models. I will discuss the conceptual background of IGSCA and demonstrate its potential in real data applications with an investigation of the effects of multiple genes on depression severity. I will also briefly discuss ongoing extensions of the method and illustrate how to use it with the free, user-friendly software GSCA Pro (https://www.gscapro.com/).

Short Bio. Dr. Heungsun Hwang is a Professor of Psychology at McGill University, where he also completed his Ph.D. in Quantitative Psychology. His research focuses on the development and application of advanced quantitative analytics for measuring and analyzing human characteristics, behaviours, and processes. Currently, he is engaged in integrating statistics, psychology, and machine learning to incorporate individuals’ multifaceted information (such as psychological, physiological, imaging, and genetic data) to enhance the understanding and prediction of behavioural and cognitive differences. He has served on the editorial boards of several journals, including Psychometrika, Psychological Science, Behaviormetrika, and the British Journal of Mathematical and Statistical Psychology. Lab website: https://sites.google.com/view/hwanglab.


Regularized approximate factor analysis for exploratory SEM of (high-dimensional) multiblock data

Katrijn Van Deun
Tilburg University, The Netherlands

Abstract. Non-observable constructs such as personality, intelligence, and well-being are at the core of research on human behaviour and cognition. Latent variable methods (e.g., factor analysis, structural equation modelling) are therefore an indispensable tool for research in the social and behavioural sciences. These methods are known to work well when the number of parameters to estimate is relatively small compared to the sample size. However, modern research relies on large collections of multidisciplinary data where several blocks of variables have been measured on the same persons. Currently available latent variable methods are restrictive in their use: they often cannot analyze high-dimensional data and/or do not take the multi-block structure into account.
Here, we propose a regularized latent variable method that addresses these issues by relying on an approximate factor analysis approach and a strong computational framework.

Short Bio. Dr. Katrijn Van Deun is a Professor in Data Science for the Social and Behavioral Sciences at Tilburg University (The Netherlands). Her current research, funded by the Dutch Research Council (NWO), focuses on the development of latent variable methods for complex high-dimensional multi-block data.


A network approach to Joint Dimension Reduction of a set of data tables

Mohamed Hanafi
StatSC, Oniris VetAgroBio, France

Abstract. To handle the dimension reduction task for different classes of multi-block data simultaneously, we propose to model these classes in the form of graphs, introducing the notion of networks between data tables. Several examples are presented to illustrate this new notion and its generic nature. A generalization of the well-known Eckart-Young problem from a single table to a network of tables is formulated, leading to an exploratory method for analyzing a network of tables. The main steps of an ALS-type algorithm for solving this problem are described. An illustration based on real data is presented.

Short Bio. Mohamed Hanafi is a senior researcher and assistant director of StatSC (Oniris-VetAgroBio, Nantes). His work at the interface of chemometrics, statistics, and mathematics is dedicated to multi-block data analysis. For over 25 years, his work has been part of a global and integrated approach, encompassing a broad spectrum of contributions, from conceptualization to application. In this context, he has solved methodological as well as computational issues, leading to major clarifications of the mechanisms of methods and algorithms, thus opening the way to numerous developments.


Variants of Three-way Correspondence Analysis: Analysing gender differences in patients with type 1 diabetes and cardiovascular complications

Rosaria Lombardo* and Eric Beh**
*University of Campania “Luigi Vanvitelli”, Italy
**University of Wollongong, Australia

Abstract. In the context of multi-view data analysis, three-way contingency tables provide a natural representation of datasets structured across multiple perspectives or views. When analyzing associations in a three-way contingency table, Pearson’s three-way chi-squared statistic is a valuable tool. This measure is a specific case of the three-way generalization of the Cressie-Read family of divergence statistics (1984), as introduced by Pardo (1996). This family of three-way divergence statistics also encompasses generalizations of the Freeman-Tukey statistic, the modified chi-squared statistic, and the modified log-likelihood ratio statistic. Variants of three-way correspondence analysis (Carlier and Kroonenberg, 1996; Lombardo, Beh, and Kroonenberg, 2021), based on the family of divergence statistics, are applied to assess the statistical significance of the associations and to visualize the interactions among variable categories. Using data from the Swedish National Diabetes Register, these three-way CA variants explore the gender disparity among patients with type 1 diabetes who have experienced cardiovascular complications (CVCs).
References
Carlier, A., Kroonenberg, P. M.: Biplots and decompositions in two-way and three-way correspondence analysis. Psychometrika, 61, 355–373 (1996)
Cressie, N. A. C., Read, T. R. C.: Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society, Series B 46, 440–464 (1984)
Lombardo, R., Beh, E. J., Kroonenberg, P. M.: Symmetrical and non-symmetrical variants of three-way correspondence analysis for ordered variables. Statistical Science, 36 (4), 542–561 (2021)
Pardo, M. C.: An empirical investigation of Cressie and Read tests for the hypothesis of independence in three-way contingency tables. Kybernetika, 32, 175–183 (1996)
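
The divergence family referred to in the abstract above can be illustrated with a small computation. The following is a minimal sketch, not the speakers' method: it evaluates the Cressie-Read power-divergence statistic for a three-way table of counts under the hypothesis of complete independence. The function name `power_divergence_3way` and the toy counts are hypothetical, all cells are assumed to be strictly positive, and the usual parameterization is assumed, in which lambda = 1 gives Pearson's chi-squared statistic, lambda = 0 the log-likelihood ratio, lambda = -1/2 the Freeman-Tukey statistic, lambda = -1 the modified log-likelihood ratio, and lambda = -2 the modified chi-squared statistic.

```python
import math

def power_divergence_3way(table, lam):
    """Cressie-Read statistic for an I x J x K table of counts, testing
    complete independence. Assumes every cell count is > 0 (hypothetical helper)."""
    I, J, K = len(table), len(table[0]), len(table[0][0])
    cells = [(i, j, k) for i in range(I) for j in range(J) for k in range(K)]
    n = sum(table[i][j][k] for i, j, k in cells)
    # One-way marginal totals for each of the three variables
    row = [sum(table[i][j][k] for j in range(J) for k in range(K)) for i in range(I)]
    col = [sum(table[i][j][k] for i in range(I) for k in range(K)) for j in range(J)]
    tube = [sum(table[i][j][k] for i in range(I) for j in range(J)) for k in range(K)]
    stat = 0.0
    for i, j, k in cells:
        obs = table[i][j][k]
        exp = row[i] * col[j] * tube[k] / n**2  # expected count under independence
        if lam == 0:       # limiting case: log-likelihood ratio
            stat += 2 * obs * math.log(obs / exp)
        elif lam == -1:    # limiting case: modified log-likelihood ratio
            stat += 2 * exp * math.log(exp / obs)
        else:
            stat += 2 / (lam * (lam + 1)) * obs * ((obs / exp) ** lam - 1)
    return stat

# Toy 2 x 2 x 2 table (made-up counts, for illustration only)
toy = [[[10, 20], [30, 40]], [[15, 25], [35, 45]]]
for lam in (1, 0, -0.5, -1, -2):
    print(lam, power_divergence_3way(toy, lam))
```

Setting `lam=1` reproduces the three-way Pearson statistic, which is why the abstract describes it as a special case of the family.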

Short Bio. Rosaria Lombardo earned her PhD in Computational Statistics and Applications from the University of Naples “Federico II” and is currently a Professor of Statistics in the Department of Economics at the University of Campania “Luigi Vanvitelli.” Her research interests encompass multidimensional data analysis, linear and non-linear partial least squares regression, quantification theory, and, in particular, correspondence analysis and data visualization. Since 2016, she has been an elected member of the International Statistical Institute (ISI). Currently, Prof. Lombardo serves on the Executive Committee of the International Association for Statistical Computing (IASC) for the 2023–2025 term. She has authored or co-authored over 120 publications, including four books published by Wiley and Springer. Prof. Lombardo’s international experience includes roles as a visiting researcher and professor at prestigious institutions such as the University of Montpellier (France), the Institute of Mathematical Statistics in Tachikawa (Japan), the University of Okayama (Japan), the University of Leiden (Netherlands), the University of Rotterdam (Netherlands), the University of Newcastle (Australia), and the University of Stellenbosch (South Africa).


Sparse and integrative principal component analysis for multiview data

Luo Xiao
North Carolina State University, USA

Abstract. We consider dimension reduction of multiview data, which are emerging in scientific studies. Formulating multiview data as multivariate data with block structures corresponding to the different views of the data, we estimate top eigenvectors from multiview data that have two-fold sparsity: elementwise sparsity and blockwise sparsity. We propose a Fantope-based optimization criterion with multiple penalties to enforce the desired sparsity patterns, and a denoising step is employed to handle the potential presence of heteroskedastic noise across different data views. An alternating direction method of multipliers (ADMM) algorithm is used for optimization. We derive the ℓ2 convergence of the estimated top eigenvectors and establish their sparsity and support recovery properties. Numerical studies are used to illustrate the proposed method.

Short Bio. Dr. Luo Xiao is an Associate Professor of Statistics at North Carolina State University in the US. He obtained his PhD from the Cornell University Department of Statistics and Data Science in 2012 and completed postdoctoral training at the Johns Hopkins University Bloomberg School of Public Health. He has been a faculty member at NC State since 2015. His research interests focus on developing nonparametric and high-dimensional regression methods with applications in wearable computing and neuroimaging, among other biostatistical fields.


Tensor multiblock logistic regression

Alexandre Selvestrel
Paris-Saclay University CentraleSupélec, France

Abstract. In some applications, data inherently exhibit a multiblock and tensor structure. In such cases, standard classification methods fail to capture the complexity of the data adequately, making multiway and multiblock models more suitable. This work introduces a novel approach called tensor multiblock logistic regression that can efficiently handle such complex structures. This approach allows for a better quantification of the discriminative power of the blocks and provides a separable interpretation of each tensor-block’s modes. The efficiency of the method is assessed on a real liver tumor dataset from Henri Mondor Hospital. This dataset includes liver MRI images of hepatocellular carcinoma and cholangiocarcinoma obtained at four distinct time points, along with clinical variables. The performance of tensor multiblock logistic regression is benchmarked against traditional methods, including lasso and group lasso logistic regression.

Short Bio. Alexandre Selvestrel is a first-year PhD student at the University of Paris-Saclay, in the Laboratoire des Signaux et Systèmes, under the supervision of Laurent Le Brusquet and Arthur Tenenhaus. His research interests concern the development of statistical methods for multiblock and tensor data analysis.


Distance-based learning for mixed-type data

Alfonso Iodice D’Enza
University of Naples Federico II, Italy

Abstract. In many statistical learning methods, measuring similarity or dissimilarity among observations is crucial. The quantification of pairwise dissimilarities, commonly referred to as distance, plays a fundamental role. Both supervised methods, such as K-Nearest Neighbors (K-NN), and unsupervised methods, including clustering techniques like K-means, Partitioning Around Medoids (PAM), and hierarchical linkage, are inherently distance-based. The choice of a distance measure depends on several factors: i) the nature of the attributes involved, ii) the method used to aggregate intra-attribute differences, and iii) whether inter-attribute interactions are considered. For numerical data, distance measures often rely on the magnitude of observed differences, as with the well-known (squared) Euclidean or Manhattan distances. For categorical attributes, the challenge lies in going beyond simple matches or mismatches. While the aforementioned distances are additive and do not account for inter-attribute relationships, the Mahalanobis distance offers a more nuanced approach for numerical data by incorporating such relationships. Several measures analogous to the Mahalanobis distance have also been proposed for categorical data (Van de Velden et al. 2024). When dealing with mixed data, distance computation extends beyond selecting appropriate measures for each attribute type. A further consideration is the commensurability of distances, which ensures balanced contributions from variables of different nature (Van de Velden et al. 2024). In this talk, we will explore the key aspects of distance computations, including their design and implications. Applications to real and synthetic mixed-type data will illustrate the practical impact of these methods across diverse learning tasks.
References
Van de Velden, M., A. Iodice D’Enza, A. Markos, and C. Cavicchia. 2024. “A General Framework for Implementing Distances for Categorical Variables”. Pattern Recognition 153: 110547–48
Van de Velden, M., A. Iodice D’Enza, A. Markos, and C. Cavicchia. 2024. “Unbiased Mixed Variables Distance”. arXiv preprint arXiv:2411.00429
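
To make the additive distances named in the abstract above concrete, here is a minimal sketch in plain Python. It shows the Manhattan and Euclidean distances for numerical attributes, simple matching for categorical attributes, and a Gower-style average for a mixed-type pair. Gower-style scaling is swapped in here as a classical baseline and is an assumption of this example, not the distance proposed by the speaker. All function names and the toy records are hypothetical.

```python
import math

def manhattan(x, y):
    """Sum of absolute differences over numerical attributes."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def simple_matching(x, y):
    """Number of categorical attributes on which x and y disagree."""
    return sum(a != b for a, b in zip(x, y))

def gower(x, y, ranges):
    """Gower-style dissimilarity for one pair of mixed-type records.

    ranges[j] is the observed (positive) range of numerical attribute j,
    or None when attribute j is categorical. Dividing by the range puts
    every attribute's contribution in [0, 1] before averaging.
    """
    total = 0.0
    for a, b, r in zip(x, y, ranges):
        if r is None:
            total += float(a != b)   # categorical: 0 if equal, 1 otherwise
        else:
            total += abs(a - b) / r  # numerical: range-scaled difference
    return total / len(ranges)

# Toy records: (age, income, favourite colour); values are made up
p = (30, 50000.0, "red")
q = (40, 60000.0, "blue")
print(gower(p, q, ranges=(50, 100000.0, None)))
```

Range scaling is one simple way to keep variables of different nature on a comparable footing, which is the kind of balance the abstract calls commensurability. Each attribute still contributes independently, so inter-attribute relationships are ignored.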

Short Bio. Alfonso Iodice D’Enza is an Associate Professor of Statistics at the University of Naples Federico II (Italy). His areas of interest include statistical learning, clustering, dimension reduction, computational statistics, visualisation and statistical software engineering, with applications in behavioural sciences.


An overview of current trends in multi-view clustering

Mohamed Nadif
Université Paris Cité, France

Abstract. With advancements in information acquisition technologies, multi-view data has become ubiquitous. Leveraging such data through clustering that integrates information from multiple views can yield more relevant and complementary results than clustering each view individually. Various models and algorithms, derived from diverse objectives, are valuable for analyzing such data, which can be collected from various sources or represented in multiple ways. Therefore, this presentation reviews different approaches, including generative and discriminative methods, leading from popular to cutting-edge algorithms.

Short Bio. Mohamed Nadif is a professor at Université Paris Cité and a member of the Borelli Center UMR 9010. He heads the AI-DSCy (Artificial Intelligence for Data Science and Cybersecurity) team. His research focuses on machine learning, with a strong emphasis on unsupervised learning through various approaches such as factorization, spectral methods, probabilistic models, and deep learning. As a result, representation learning and clustering are key components of his research. His work has applications in text mining, natural language processing (NLP), recommendation systems, and bioinformatics.
