Seminar on Statistics and Data Science
This seminar series is organized by the research group in statistics and features talks on advances in methods of data analysis, statistical theory, and their applications. The speakers are external guests as well as researchers from other groups at TUM. All talks in the seminar series are listed in the Munich Mathematical Calendar.
The seminar takes place in room 8101.02.110, unless announced otherwise. To stay up to date about upcoming presentations, please join our mailing list. You will receive an email to confirm your subscription.
Upcoming talks
Previous talks
within the last 180 days
23.07.2025 12:15 Oezge Sahin (TU Delft, NL): Effects of covariate discretization on conditional quantiles in bivariate copulas
Clinical data often include a mix of continuous measurements and covariates that have been discretized, typically to protect privacy, meet reporting obligations, or simplify clinical interpretation. This combination, along with the nonlinear and tail-asymmetric dependence frequently observed in clinical data, affects the behavior of regression and variable-selection methods. Copula models, which separate marginal behavior from the dependence structure, provide a principled approach to studying these effects. In this talk, we analyze how discretizing a continuous covariate into equiprobable categories impacts conditional quantiles and likelihoods in bivariate copula models. For the Clayton and Frank families, we derive closed-form anchor points: for a given category, we identify the continuous covariate value at which the conditional quantile under the continuous model matches that of the discretized one. These anchors provide an exact measure of discretization bias, which is small near the center but can be substantial in the tails. Simulations across five copula families show that likelihood-based variable selection may over- or under-weight discretized covariates, depending on the dependence structure. We conclude by comparing, in the same simulation settings, polyserial and Pearson correlations as well as Kendall’s tau-b. Our results have practical implications for copula-based modeling of mixed-type data.
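The kind of comparison described in the abstract can be sketched numerically. The snippet below uses the standard closed-form conditional quantile (inverse h-function) of the Clayton copula; the parameter value, the quantile level, and the choice of the category midpoint as the discretized representative are illustrative assumptions, not the talk's anchor points.

```python
import numpy as np

def clayton_cond_quantile(w, v, theta):
    """Conditional w-quantile of U given V = v under a Clayton copula:
    the inverse of the h-function h(u|v) = dC(u,v)/dv."""
    return ((w * v ** (theta + 1.0)) ** (-theta / (theta + 1.0))
            - v ** (-theta) + 1.0) ** (-1.0 / theta)

theta = 2.0   # illustrative Clayton parameter
w = 0.9       # quantile level

# Continuous covariate values inside the first of four equiprobable categories
v_grid = np.linspace(0.01, 0.24, 50)
q_cont = clayton_cond_quantile(w, v_grid, theta)

# Discretization replaces every v in (0, 0.25] by a single representative;
# here the category midpoint, an illustrative choice rather than the talk's anchor
q_disc = clayton_cond_quantile(w, 0.125, theta)

bias = q_cont - q_disc
print(f"max |bias| inside the category: {np.abs(bias).max():.3f}")
```

The bias vanishes exactly at one covariate value inside the category; the talk's anchor points identify that value in closed form.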
Source
23.07.2025 16:00 Thomas Nagler (LMU Munich): On dimension reduction in conditional dependence models
Inference of the conditional dependence structure is challenging when many covariates are present. In numerous applications, only a low-dimensional projection of the covariates influences the conditional distribution. The smallest subspace that captures this effect is called the central subspace in the literature. We show that inference of the central subspace of a vector random variable Y conditioned on a vector of covariates X can be separated into inference of the marginal central subspaces of the components of Y conditioned on X and of the copula central subspace, which we define in this work. Further discussion addresses sufficient dimension reduction subspaces for conditional association measures. An adaptive nonparametric method is introduced for estimating the central dependence subspaces, achieving parametric convergence rates under mild conditions. Simulation studies illustrate the practical performance of the proposed approach.
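The notion of a central subspace can be made concrete with a classical estimator. The sketch below applies plain sliced inverse regression to simulated data whose central subspace is one-dimensional; it is a textbook method shown for illustration, not the adaptive nonparametric estimator of the talk, and the data-generating model is invented.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 2000, 6
X = rng.normal(size=(n, d))  # identity covariance, so no standardization step
# Y depends on X only through the one-dimensional projection X @ b
b = np.array([1.0, -1.0, 0.0, 0.0, 0.0, 0.0]) / np.sqrt(2.0)
y = np.sin(X @ b) + 0.1 * rng.normal(size=n)

# Sliced inverse regression: sort by y, average X within slices, then take
# the leading eigenvector of the weighted covariance of the slice means
n_slices = 10
order = np.argsort(y)
slices = np.array_split(order, n_slices)
grand = X.mean(axis=0)
means = np.stack([X[s].mean(axis=0) - grand for s in slices])
weights = np.array([len(s) for s in slices]) / n
M = (means.T * weights) @ means
vals, vecs = np.linalg.eigh(M)
b_hat = vecs[:, -1]  # estimated basis of the central subspace

print("alignment |<b, b_hat>| =", abs(b @ b_hat))
```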
Source
22.07.2025 10:00 Junhyung Park (ETH Zürich, CH): Causal Spaces: A Measure-Theoretic Axiomatisation of Causality
While the theory of causality is widely viewed as an extension of probability theory, a view which we share, there is no universally accepted, axiomatic framework for causality analogous to Kolmogorov's measure-theoretic axiomatization of probability theory. Instead, many competing frameworks exist, such as structural causal models or the potential outcomes framework, that mostly have the flavor of statistical models. To fill this gap, we propose the notion of causal spaces, consisting of a probability space along with a collection of transition probability kernels, called causal kernels, which satisfy two simple axioms and encode causal information that probability spaces cannot. The proposed framework is not only rigorously grounded in measure theory, but it also sheds light on long-standing limitations of existing frameworks, including, for example, cycles, latent variables, and stochastic processes. Our hope is that causal spaces will play the same role for the theory of causality that probability spaces play for the theory of probabilities.
Source
09.07.2025 12:15 Nils Sturma (TU München): Identifiability in Sparse Factor Analysis
Factor analysis is a statistical technique that explains correlations among observed random variables with the help of a smaller number of unobserved factors. In traditional full-factor analysis, each observed variable is influenced by every factor. However, many applications exhibit interesting sparsity patterns, in which each observed variable depends only on a subset of the factors. In this talk, we will discuss parameter identifiability of sparse factor analysis models. In particular, we present a sufficient condition for parameter identifiability that generalizes the well-known Anderson-Rubin condition and is tailored to the sparse setup. This is joint work with Mathias Drton, Miriam Kranzlmüller, and Irem Portakal.
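To make the sparse setup concrete, the following small sketch (with an invented loading pattern) builds the implied covariance Sigma = Lambda Lambda' + Psi and exhibits one trivial non-identifiability that any identifiability condition must account for: sign flips of factor columns.

```python
import numpy as np

# A sparse loading matrix: each of six observed variables loads on a subset
# of two factors (pattern and values are illustrative, not from the talk)
Lambda = np.array([[1.0, 0.0],
                   [0.8, 0.0],
                   [0.5, 0.0],
                   [0.0, 1.0],
                   [0.0, 0.7],
                   [0.3, 0.6]])
Psi = np.diag([0.5, 0.4, 0.6, 0.5, 0.3, 0.4])  # idiosyncratic variances

# Implied covariance of the observed variables
Sigma = Lambda @ Lambda.T + Psi

# Identifiability asks: does Sigma determine (Lambda, Psi) up to trivial
# symmetries? Flipping the sign of a factor column leaves Sigma unchanged:
Lambda_flip = Lambda * np.array([1.0, -1.0])
assert np.allclose(Lambda_flip @ Lambda_flip.T + Psi, Sigma)
print("Sigma is invariant under column sign flips")
```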
Source
09.07.2025 13:15 Pratik Misra (TU München): Structural identifiability in graphical continuous Lyapunov models
Graphical continuous Lyapunov models offer a novel framework for the statistical modeling of correlated multivariate data. These models define the covariance matrix through a continuous Lyapunov equation, parameterized by the drift matrix of the underlying dynamic process. In this talk, I will discuss key results on the defining equations of these models and explore the challenge of structural identifiability. Specifically, I will present conditions under which models derived from different directed acyclic graphs (DAGs) are equivalent and provide a transformational characterization of such equivalences. This is based on ongoing work with Carlos Amendola, Tobias Boege, and Ben Hollering.
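The covariance parameterization can be illustrated numerically. The sketch below uses an illustrative stable drift matrix whose sparsity pattern follows a small DAG (1 -> 2 -> 3) and SciPy's Lyapunov solver; the numbers are assumptions for demonstration, not values from the talk.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Drift matrix of a stable process; off-diagonal support follows the DAG 1 -> 2 -> 3
M = np.array([[-1.0, 0.0, 0.0],
              [ 0.5, -1.0, 0.0],
              [ 0.0, 0.7, -1.0]])
C = 2.0 * np.eye(3)  # volatility matrix, a common convention in these models

# The model covariance solves the continuous Lyapunov equation
#   M @ Sigma + Sigma @ M.T + C = 0
Sigma = solve_continuous_lyapunov(M, -C)

# Verify the defining equation holds up to numerical error
residual = M @ Sigma + Sigma @ M.T + C
print("max residual:", np.abs(residual).max())
```

Structural identifiability then asks when two different DAG supports for M can produce the same set of covariance matrices Sigma.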
Source
23.06.2025 12:15 Speaker has cancelled: Qingqing Zhai (Shanghai University, CN): Modeling Complex System Deterioration: From Unit Degradation to Networked Recurrent Failures
This presentation addresses statistical challenges in modeling the deterioration of complex systems, spanning from individual unit degradation to interdependent network failures. First, we introduce statistical degradation data modeling using stochastic processes. Then, we shift to modeling recurrent failures in large-scale infrastructure networks (e.g., water distribution systems). Motivated by 16 years of Scottish Water pipe failure data, we propose the novel Network Gamma-Poisson Autoregressive NHPP (GPAN) model. This two-layer framework captures temporal dynamics via Non-Homogeneous Poisson Processes (NHPPs) with node-specific frailties and spatial dependencies through a gamma-Poisson autoregressive scheme structured by the network's Directed Acyclic Graph (DAG). To overcome computational intractability, a scalable sum-product algorithm based on factor graphs and message passing is developed for efficient inference, enabling application to networks with tens of thousands of nodes. We demonstrate how this approach provides accurate failure predictions, identifies high-risk clusters, and supports operational management and risk assessment. The methodologies presented offer powerful tools for reliability analysis across diverse engineering contexts, from product lifespan prediction to critical infrastructure resilience.
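As a small illustration of the NHPP building block mentioned above, here is a standard Lewis-Shedler thinning simulator in Python; the intensity function and the gamma frailty values are invented for illustration and are not the GPAN model or the Scottish Water fit.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_nhpp(intensity, t_max, lam_max, rng):
    """Simulate event times of a non-homogeneous Poisson process on [0, t_max]
    by Lewis-Shedler thinning, given an upper bound lam_max on the intensity."""
    t, events = 0.0, []
    while True:
        t += rng.exponential(1.0 / lam_max)   # candidate from a rate-lam_max HPP
        if t > t_max:
            return np.array(events)
        if rng.uniform() < intensity(t) / lam_max:  # accept w.p. lambda(t)/lam_max
            events.append(t)

# Illustrative intensity: failures become more frequent as a pipe ages,
# scaled by a node-specific gamma frailty (all values are assumptions)
frailty = rng.gamma(shape=2.0, scale=0.5)
intensity = lambda t: frailty * 0.1 * (1.0 + 0.05 * t)
times = simulate_nhpp(intensity, t_max=100.0,
                      lam_max=frailty * 0.1 * 6.0, rng=rng)
print(len(times), "failures simulated")
```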
Source
04.06.2025 12:15 Gilles Blanchard (Université Paris-Saclay, FR): Estimating a large number of high-dimensional vector means
The problem of simultaneously estimating multiple means from independent samples has a long history in statistics, from the seminal works of Stein, Robbins in the 50s, Efron and Morris in the 70s and up to the present day. This setting can also be seen as an (extremely stylized) instance of the "personalized federated learning" problem, where each user has their own data and target (the mean of their personal distribution), but potentially wants to share some relevant information with "similar" users (though there is no information available a priori about which users are "similar"). In this talk I will concentrate on contributions to the high-dimensional case, where the samples and their means belong to R^d with "large" d.
We consider a weighted aggregation scheme of empirical means of each sample, and study the possible improvement in quadratic risk over the simple empirical means. To make the stylized problem closer to challenges encountered in practice, we allow (a) full heterogeneity of sample sizes; (b) zero a priori knowledge of the structure of the mean vectors; and (c) unknown and possibly heterogeneous sample covariances.
We focus on the role of the effective dimension of the data from a "dimensional asymptotics" point of view, highlighting that the risk improvement of the proposed method satisfies an oracle inequality approaching an adaptive (minimax in a suitable sense) improvement as the effective dimension grows large.
(This is joint work with Jean-Baptiste Fermanian and Hannah Marienwald)
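To give a flavor of weighted aggregation of empirical means, here is a deliberately crude James-Stein-style sketch under strong simplifications (equal sample sizes, known isotropic covariance, shrinkage toward the grand mean); the talk's estimator handles the heterogeneous setting (a)-(c), which this sketch does not.

```python
import numpy as np

rng = np.random.default_rng(1)
B, d, n, sigma = 50, 200, 20, 1.0   # B users, dimension d, n samples each

# True means clustered around a common center, so aggregation can help
centers = rng.normal(0.0, 0.3, size=(B, d))
samples = centers[:, None, :] + rng.normal(0.0, sigma, size=(B, n, d))
xbar = samples.mean(axis=1)          # naive per-user empirical means

# Shrink each mean toward the grand mean; the weight compares the noise
# level sigma^2 d / n to the observed spread of the means (toy heuristic)
grand = xbar.mean(axis=0)
noise = sigma ** 2 * d / n
spread = ((xbar - grand) ** 2).sum(axis=1).mean()
w = max(0.0, 1.0 - noise / spread)
shrunk = grand + w * (xbar - grand)

risk_naive = ((xbar - centers) ** 2).sum(axis=1).mean()
risk_shrunk = ((shrunk - centers) ** 2).sum(axis=1).mean()
print(f"naive risk {risk_naive:.1f} vs shrunk risk {risk_shrunk:.1f}")
```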
Source
21.05.2025 12:15 Michael Muma (TU Darmstadt): The T-Rex Selector: Fast High-Dimensional Variable Selection with False Discovery Rate Control
Providing guarantees on the reproducibility of discoveries is essential when drawing inferences from high-dimensional data. Such data is common in numerous scientific domains: in biomedicine, for example, it is imperative to reliably detect the genes that are truly associated with the survival time of patients diagnosed with a certain type of cancer, while in finance one aims to determine a sparse portfolio that reliably performs index tracking. This talk introduces the Terminating-Random Experiments (T-Rex) selector, a fast multivariate variable selection framework for high-dimensional data. The T-Rex selector provably controls a user-defined target false discovery rate (FDR) while maximizing the number of selected variables. It scales to settings with millions of variables: its computational complexity is linear in the number of variables, making it more than two orders of magnitude faster than, e.g., existing model-X knockoff methods. An easy-to-use open-source R package, TRexSelector, is available on CRAN. The focus of this talk lies on high-dimensional linear regression models, but we also describe extensions to principal component analysis (PCA) and Gaussian graphical models (GGMs).
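The mechanism of calibrating selection against injected random dummy variables can be caricatured in a few lines. This is a deliberately simplified single-run sketch of the dummy idea, with correlation screening in place of the actual solver and a single stopping threshold; it is not the T-Rex selector itself, which aggregates many early-terminated random experiments (see the TRexSelector package on CRAN).

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, s = 200, 100, 5                       # samples, real variables, true actives
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:s] = 2.0
y = X @ beta + rng.normal(size=n)

# Append as many random dummy variables as real ones
dummies = rng.normal(size=(n, p))
Xa = np.hstack([X, dummies])

# Rank variables by absolute correlation with y and terminate as soon as
# T dummies have entered: real variables picked before that point are kept
T = 1
order = np.argsort(-np.abs(Xa.T @ y))       # strongest association first
selected, dummies_seen = [], 0
for j in order:
    if j >= p:                              # a dummy entered the path
        dummies_seen += 1
        if dummies_seen >= T:
            break
    else:
        selected.append(j)

print("selected real variables:", sorted(selected))
```

The dummies act as a built-in null reference: how early they enter the selection path tells us how far down the ranking we can trust.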
Source
14.05.2025 12:15 Luciana Dalla Valle (University of Torino, IT): Approximate Bayesian conditional copulas
According to Sklar’s theorem, any multidimensional absolutely continuous distribution function can be uniquely represented as a copula, which captures the dependence structure among the vector components. In real data applications, the interest of the analyses often lies in specific functionals of the dependence, which quantify aspects of it in a few numerical values. A broad literature exists on such functionals; however, extensions to include covariates are still limited. This is mainly due to the lack of unbiased estimators of the conditional copula, especially when one does not have enough information to select the copula model. Several Bayesian methods to approximate the posterior distribution of functionals of the dependence varying according to covariates are presented and compared; the main advantage of the investigated methods is that they use nonparametric models, avoiding the selection of the copula, which is usually a delicate aspect of copula modelling. These methods are compared in simulation studies and in two realistic applications, from civil engineering and astrophysics.
Source
14.05.2025 16:15 Rajen Shah (University of Cambridge, UK): Robustness in Semiparametric Statistics
Given that all models are wrong, it is important to understand the performance of methods when the settings for which they have been designed are not met, and to modify them where possible so they are robust to these sorts of departures from the ideal. We present two examples with this broad goal in mind.
We first look at a classical case of model misspecification in (linear) mixed-effect models for grouped data. Existing approaches estimate linear model parameters through weighted least squares, with optimal weights (given by the inverse covariance of the response, conditional on the covariates) typically estimated by maximizing a (restricted) likelihood from random effects modelling or by using generalized estimating equations. We introduce a new ‘sandwich loss’ whose population minimizer coincides with the weights of these approaches when the parametric forms for the conditional covariance are well-specified, but can yield arbitrarily large improvements when they are not.
The starting point of our second vignette is the recognition that semiparametric efficient estimation can be hard to achieve in practice: estimators that are in theory efficient may require unattainable levels of accuracy for the estimation of complex nuisance functions. As a consequence, estimators deployed on real datasets are often chosen in a somewhat ad hoc fashion and may suffer high variance. We study this gap between theory and practice in the context of a broad collection of semiparametric regression models that includes the generalized partially linear model. We advocate using estimators that are robust in the sense that they enjoy root-n consistency uniformly over a sufficiently rich class of distributions characterized by certain conditional expectations being estimable by user-chosen machine learning methods. We show that even asking for locally uniform estimation within such a class narrows down possible estimators to those parametrized by certain weight functions and develop a new random forest-based estimation scheme to estimate the optimal weights. We demonstrate the effectiveness of the resulting estimator in a variety of semiparametric settings on simulated and real-world data.
Source
For talks more than 180 days ago please have a look at the Munich Mathematical Calendar (filter: "Oberseminar Statistics and Data Science").