Seminar on Statistics and Data Science
This seminar series is organized by the research group in statistics and features talks on advances in methods of data analysis, statistical theory, and their applications. The speakers are external guests as well as researchers from other groups at TUM. All talks in the seminar series are listed in the Munich Mathematical Calendar.
The seminar takes place in room 8101.02.110, unless announced otherwise. To stay up to date about upcoming presentations, please join our mailing list. You will receive an email to confirm your subscription.
Upcoming talks
Previous talks
within the last 180 days
23.07.2025 12:15 Oezge Sahin (TU Delft, NL): Effects of covariate discretization on conditional quantiles in bivariate copulas
Clinical data often include a mix of continuous measurements and covariates that have been discretized, typically to protect privacy, meet reporting obligations, or simplify clinical interpretation. This combination, along with the nonlinear and tail-asymmetric dependence frequently observed in clinical data, affects the behavior of regression and variable-selection methods. Copula models, which separate marginal behavior from the dependence structure, provide a principled approach to studying these effects. In this talk, we analyze how discretizing a continuous covariate into equiprobable categories impacts conditional quantiles and likelihoods in bivariate copula models. For the Clayton and Frank families, we derive closed-form anchor points: for a given category, we identify the continuous covariate value at which the conditional quantile under the continuous model matches that of the discretized one. These anchors provide an exact measure of discretization bias, which is small near the center but can be substantial in the tails. Simulations across five copula families show that likelihood-based variable selection may over- or under-weight discretized covariates, depending on the dependence structure. We conclude by comparing, in the same simulation settings, polyserial and Pearson correlations as well as Kendall’s tau-b. Our results have practical implications for copula-based modeling of mixed-type data.
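The kind of comparison described in the abstract can be sketched numerically. The snippet below uses the standard closed-form conditional quantile (inverse h-function) of the Clayton copula; the parameter value, the quantile level, and the choice of the category midpoint as the discretized representative are illustrative assumptions, not the talk's anchor points.

```python
import numpy as np

def clayton_cond_quantile(w, v, theta):
    """Conditional w-quantile of U given V = v under a Clayton copula:
    the inverse of the h-function h(u|v) = dC(u,v)/dv."""
    return ((w * v ** (theta + 1.0)) ** (-theta / (theta + 1.0))
            - v ** (-theta) + 1.0) ** (-1.0 / theta)

theta = 2.0   # illustrative Clayton parameter
w = 0.9       # quantile level

# Continuous covariate values inside the first of four equiprobable categories
v_grid = np.linspace(0.01, 0.24, 50)
q_cont = clayton_cond_quantile(w, v_grid, theta)

# Discretization replaces every v in (0, 0.25] by a single representative;
# here the category midpoint, an illustrative choice rather than the talk's anchor
q_disc = clayton_cond_quantile(w, 0.125, theta)

bias = q_cont - q_disc
print(f"max |bias| inside the category: {np.abs(bias).max():.3f}")
```

The bias vanishes exactly at one covariate value inside the category; the talk's anchor points identify that value in closed form.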
Source
23.07.2025 16:00 Thomas Nagler (LMU Munich): On dimension reduction in conditional dependence models
Inference of the conditional dependence structure is challenging when many covariates are present. In numerous applications, only a low-dimensional projection of the covariates influences the conditional distribution. The smallest subspace that captures this effect is called the central subspace in the literature. We show that inference of the central subspace of a vector random variable Y conditioned on a vector of covariates X can be separated into inference of the marginal central subspaces of the components of Y conditioned on X and of the copula central subspace, which we define in this work. Further discussion addresses sufficient dimension reduction subspaces for conditional association measures. An adaptive nonparametric method is introduced for estimating the central dependence subspaces, achieving parametric convergence rates under mild conditions. Simulation studies illustrate the practical performance of the proposed approach.
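The notion of a central subspace can be made concrete with a classical estimator. The sketch below applies plain sliced inverse regression to simulated data whose central subspace is one-dimensional; it is a textbook method shown for illustration, not the adaptive nonparametric estimator of the talk, and the data-generating model is invented.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 2000, 6
X = rng.normal(size=(n, d))  # identity covariance, so no standardization step
# Y depends on X only through the one-dimensional projection X @ b
b = np.array([1.0, -1.0, 0.0, 0.0, 0.0, 0.0]) / np.sqrt(2.0)
y = np.sin(X @ b) + 0.1 * rng.normal(size=n)

# Sliced inverse regression: sort by y, average X within slices, then take
# the leading eigenvector of the weighted covariance of the slice means
n_slices = 10
order = np.argsort(y)
slices = np.array_split(order, n_slices)
grand = X.mean(axis=0)
means = np.stack([X[s].mean(axis=0) - grand for s in slices])
weights = np.array([len(s) for s in slices]) / n
M = (means.T * weights) @ means
vals, vecs = np.linalg.eigh(M)
b_hat = vecs[:, -1]  # estimated basis of the central subspace

print("alignment |<b, b_hat>| =", abs(b @ b_hat))
```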
Source
22.07.2025 10:00 Junhyung Park (ETH Zürich, CH): Causal Spaces: A Measure-Theoretic Axiomatisation of Causality
While the theory of causality is widely viewed as an extension of probability theory, a view which we share, there is no universally accepted, axiomatic framework for causality analogous to Kolmogorov's measure-theoretic axiomatization of probability theory. Instead, many competing frameworks exist, such as structural causal models or the potential outcomes framework, that mostly have the flavor of statistical models. To fill this gap, we propose the notion of causal spaces, consisting of a probability space along with a collection of transition probability kernels, called causal kernels, which satisfy two simple axioms and encode causal information that probability spaces cannot. The proposed framework is not only rigorously grounded in measure theory, but it also sheds light on long-standing limitations of existing frameworks, including, for example, cycles, latent variables, and stochastic processes. Our hope is that causal spaces will play the same role for the theory of causality that probability spaces play for the theory of probabilities.
Source
09.07.2025 12:15 Nils Sturma (TU München): Identifiability in Sparse Factor Analysis
Factor analysis is a statistical technique that explains correlations among observed random variables with the help of a smaller number of unobserved factors. In traditional full-factor analysis, each observed variable is influenced by every factor. However, many applications exhibit interesting sparsity patterns, in which each observed variable depends only on a subset of the factors. In this talk, we will discuss parameter identifiability of sparse factor analysis models. In particular, we present a sufficient condition for parameter identifiability that generalizes the well-known Anderson-Rubin condition and is tailored to the sparse setup. This is joint work with Mathias Drton, Miriam Kranzlmüller, and Irem Portakal.
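To make the sparse setup concrete, the following small sketch (with an invented loading pattern) builds the implied covariance Sigma = Lambda Lambda' + Psi and exhibits one trivial non-identifiability that any identifiability condition must account for: sign flips of factor columns.

```python
import numpy as np

# A sparse loading matrix: each of six observed variables loads on a subset
# of two factors (pattern and values are illustrative, not from the talk)
Lambda = np.array([[1.0, 0.0],
                   [0.8, 0.0],
                   [0.5, 0.0],
                   [0.0, 1.0],
                   [0.0, 0.7],
                   [0.3, 0.6]])
Psi = np.diag([0.5, 0.4, 0.6, 0.5, 0.3, 0.4])  # idiosyncratic variances

# Implied covariance of the observed variables
Sigma = Lambda @ Lambda.T + Psi

# Identifiability asks: does Sigma determine (Lambda, Psi) up to trivial
# symmetries? Flipping the sign of a factor column leaves Sigma unchanged:
Lambda_flip = Lambda * np.array([1.0, -1.0])
assert np.allclose(Lambda_flip @ Lambda_flip.T + Psi, Sigma)
print("Sigma is invariant under column sign flips")
```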
Source
09.07.2025 13:15 Pratik Misra (TU München): Structural identifiability in graphical continuous Lyapunov models
Graphical continuous Lyapunov models offer a novel framework for the statistical modeling of correlated multivariate data. These models define the covariance matrix through a continuous Lyapunov equation, parameterized by the drift matrix of the underlying dynamic process. In this talk, I will discuss key results on the defining equations of these models and explore the challenge of structural identifiability. Specifically, I will present conditions under which models derived from different directed acyclic graphs (DAGs) are equivalent and provide a transformational characterization of such equivalences. This is based on ongoing work with Carlos Amendola, Tobias Boege, and Ben Hollering.
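The covariance parameterization can be illustrated numerically. The sketch below uses an illustrative stable drift matrix whose sparsity pattern follows a small DAG (1 -> 2 -> 3) and SciPy's Lyapunov solver; the numbers are assumptions for demonstration, not values from the talk.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Drift matrix of a stable process; off-diagonal support follows the DAG 1 -> 2 -> 3
M = np.array([[-1.0, 0.0, 0.0],
              [ 0.5, -1.0, 0.0],
              [ 0.0, 0.7, -1.0]])
C = 2.0 * np.eye(3)  # volatility matrix, a common convention in these models

# The model covariance solves the continuous Lyapunov equation
#   M @ Sigma + Sigma @ M.T + C = 0
Sigma = solve_continuous_lyapunov(M, -C)

# Verify the defining equation holds up to numerical error
residual = M @ Sigma + Sigma @ M.T + C
print("max residual:", np.abs(residual).max())
```

Structural identifiability then asks when two different DAG supports for M can produce the same set of covariance matrices Sigma.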
Source
23.06.2025 12:15 Speaker has cancelled: Qingqing Zhai (Shanghai University, CN): Modeling Complex System Deterioration: From Unit Degradation to Networked Recurrent Failures
This presentation addresses statistical challenges in modeling the deterioration of complex systems, spanning from individual unit degradation to interdependent network failures. First, we introduce statistical degradation data modeling using stochastic processes. Then, we shift to modeling recurrent failures in large-scale infrastructure networks (e.g., water distribution systems). Motivated by 16 years of Scottish Water pipe failure data, we propose the novel Network Gamma-Poisson Autoregressive NHPP (GPAN) model. This two-layer framework captures temporal dynamics via Non-Homogeneous Poisson Processes (NHPPs) with node-specific frailties and spatial dependencies through a gamma-Poisson autoregressive scheme structured by the network's Directed Acyclic Graph (DAG). To overcome computational intractability, a scalable sum-product algorithm based on factor graphs and message passing is developed for efficient inference, enabling application to networks with tens of thousands of nodes. We demonstrate how this approach provides accurate failure predictions, identifies high-risk clusters, and supports operational management and risk assessment. The methodologies presented offer powerful tools for reliability analysis across diverse engineering contexts, from product lifespan prediction to critical infrastructure resilience.
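As a small illustration of the NHPP building block mentioned above, here is a standard Lewis-Shedler thinning simulator in Python; the intensity function and the gamma frailty values are invented for illustration and are not the GPAN model or the Scottish Water fit.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_nhpp(intensity, t_max, lam_max, rng):
    """Simulate event times of a non-homogeneous Poisson process on [0, t_max]
    by Lewis-Shedler thinning, given an upper bound lam_max on the intensity."""
    t, events = 0.0, []
    while True:
        t += rng.exponential(1.0 / lam_max)   # candidate from a rate-lam_max HPP
        if t > t_max:
            return np.array(events)
        if rng.uniform() < intensity(t) / lam_max:  # accept w.p. lambda(t)/lam_max
            events.append(t)

# Illustrative intensity: failures become more frequent as a pipe ages,
# scaled by a node-specific gamma frailty (all values are assumptions)
frailty = rng.gamma(shape=2.0, scale=0.5)
intensity = lambda t: frailty * 0.1 * (1.0 + 0.05 * t)
times = simulate_nhpp(intensity, t_max=100.0,
                      lam_max=frailty * 0.1 * 6.0, rng=rng)
print(len(times), "failures simulated")
```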
Source
04.06.2025 12:15 Gilles Blanchard (Université Paris-Saclay, FR): Estimating a large number of high-dimensional vector means
The problem of simultaneously estimating multiple means from independent samples has a long history in statistics, from the seminal works of Stein, Robbins in the 50s, Efron and Morris in the 70s and up to the present day. This setting can also be seen as an (extremely stylized) instance of the "personalized federated learning" problem, where each user has their own data and target (the mean of their personal distribution), but potentially wants to share some relevant information with "similar" users (though there is no information available a priori about which users are "similar"). In this talk I will concentrate on contributions to the high-dimensional case, where the samples and their means belong to R^d with "large" d.
We consider a weighted aggregation scheme of empirical means of each sample, and study the possible improvement in quadratic risk over the simple empirical means. To make the stylized problem closer to challenges encountered in practice, we allow (a) full heterogeneity of sample sizes; (b) zero a priori knowledge of the structure of the mean vectors; and (c) unknown and possibly heterogeneous sample covariances.
We focus on the role of the effective dimension of the data from a "dimensional asymptotics" point of view, highlighting that the risk improvement of the proposed method satisfies an oracle inequality approaching an adaptive (minimax in a suitable sense) improvement as the effective dimension grows large.
(This is joint work with Jean-Baptiste Fermanian and Hannah Marienwald)
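To give a flavor of weighted aggregation of empirical means, here is a deliberately crude James-Stein-style sketch under strong simplifications (equal sample sizes, known isotropic covariance, shrinkage toward the grand mean); the talk's estimator handles the heterogeneous setting (a)-(c), which this sketch does not.

```python
import numpy as np

rng = np.random.default_rng(1)
B, d, n, sigma = 50, 200, 20, 1.0   # B users, dimension d, n samples each

# True means clustered around a common center, so aggregation can help
centers = rng.normal(0.0, 0.3, size=(B, d))
samples = centers[:, None, :] + rng.normal(0.0, sigma, size=(B, n, d))
xbar = samples.mean(axis=1)          # naive per-user empirical means

# Shrink each mean toward the grand mean; the weight compares the noise
# level sigma^2 d / n to the observed spread of the means (toy heuristic)
grand = xbar.mean(axis=0)
noise = sigma ** 2 * d / n
spread = ((xbar - grand) ** 2).sum(axis=1).mean()
w = max(0.0, 1.0 - noise / spread)
shrunk = grand + w * (xbar - grand)

risk_naive = ((xbar - centers) ** 2).sum(axis=1).mean()
risk_shrunk = ((shrunk - centers) ** 2).sum(axis=1).mean()
print(f"naive risk {risk_naive:.1f} vs shrunk risk {risk_shrunk:.1f}")
```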
Source
21.05.2025 12:15 Michael Muma (TU Darmstadt): The T-Rex Selector: Fast High-Dimensional Variable Selection with False Discovery Rate Control
Providing guarantees on the reproducibility of discoveries is essential when drawing inferences from high-dimensional data. Such data is common in numerous scientific domains: in biomedicine, for example, it is imperative to reliably detect the genes that are truly associated with the survival time of patients diagnosed with a certain type of cancer, while in finance one aims to determine a sparse portfolio that reliably performs index tracking. This talk introduces the Terminating-Random Experiments (T-Rex) selector, a fast multivariate variable selection framework for high-dimensional data. The T-Rex selector provably controls a user-defined target false discovery rate (FDR) while maximizing the number of selected variables. It scales to settings with millions of variables: its computational complexity is linear in the number of variables, making it more than two orders of magnitude faster than, e.g., existing model-X knockoff methods. An easy-to-use open-source R package, TRexSelector, is available on CRAN. The focus of this talk lies on high-dimensional linear regression models, but we also describe extensions to principal component analysis (PCA) and Gaussian graphical models (GGMs).
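The mechanism of calibrating selection against injected random dummy variables can be caricatured in a few lines. This is a deliberately simplified single-run sketch of the dummy idea, with correlation screening in place of the actual solver and a single stopping threshold; it is not the T-Rex selector itself, which aggregates many early-terminated random experiments (see the TRexSelector package on CRAN).

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, s = 200, 100, 5                       # samples, real variables, true actives
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:s] = 2.0
y = X @ beta + rng.normal(size=n)

# Append as many random dummy variables as real ones
dummies = rng.normal(size=(n, p))
Xa = np.hstack([X, dummies])

# Rank variables by absolute correlation with y and terminate as soon as
# T dummies have entered: real variables picked before that point are kept
T = 1
order = np.argsort(-np.abs(Xa.T @ y))       # strongest association first
selected, dummies_seen = [], 0
for j in order:
    if j >= p:                              # a dummy entered the path
        dummies_seen += 1
        if dummies_seen >= T:
            break
    else:
        selected.append(j)

print("selected real variables:", sorted(selected))
```

The dummies act as a built-in null reference: how early they enter the selection path tells us how far down the ranking we can trust.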
Source
14.05.2025 12:15 Luciana Dalla Valle (University of Torino, IT): Approximate Bayesian conditional copulas
According to Sklar’s theorem, any multidimensional absolutely continuous distribution function can be uniquely represented as a copula, which captures the dependence structure among the vector components. In real data applications, the interest of the analyses often lies in specific functionals of the dependence, which quantify aspects of it in a few numerical values. A broad literature exists on such functionals; however, extensions to include covariates are still limited. This is mainly due to the lack of unbiased estimators of the conditional copula, especially when one does not have enough information to select the copula model. Several Bayesian methods to approximate the posterior distribution of functionals of the dependence varying according to covariates are presented and compared; the main advantage of the investigated methods is that they use nonparametric models, avoiding the selection of the copula, which is usually a delicate aspect of copula modelling. These methods are compared in simulation studies and in two realistic applications, from civil engineering and astrophysics.
Source
14.05.2025 16:15 Rajen Shah (University of Cambridge, UK): Robustness in Semiparametric Statistics
Given that all models are wrong, it is important to understand the performance of methods when the settings for which they have been designed are not met, and to modify them where possible so they are robust to these sorts of departures from the ideal. We present two examples with this broad goal in mind.
We first look at a classical case of model misspecification in (linear) mixed-effect models for grouped data. Existing approaches estimate linear model parameters through weighted least squares, with optimal weights (given by the inverse covariance of the response, conditional on the covariates) typically estimated by maximizing a (restricted) likelihood from random effects modelling or by using generalized estimating equations. We introduce a new ‘sandwich loss’ whose population minimizer coincides with the weights of these approaches when the parametric forms for the conditional covariance are well-specified, but can yield arbitrarily large improvements when they are not.
The starting point of our second vignette is the recognition that semiparametric efficient estimation can be hard to achieve in practice: estimators that are in theory efficient may require unattainable levels of accuracy for the estimation of complex nuisance functions. As a consequence, estimators deployed on real datasets are often chosen in a somewhat ad hoc fashion and may suffer high variance. We study this gap between theory and practice in the context of a broad collection of semiparametric regression models that includes the generalized partially linear model. We advocate using estimators that are robust in the sense that they enjoy root-n consistency uniformly over a sufficiently rich class of distributions characterized by certain conditional expectations being estimable by user-chosen machine learning methods. We show that even asking for locally uniform estimation within such a class narrows down possible estimators to those parametrized by certain weight functions and develop a new random forest-based estimation scheme to estimate the optimal weights. We demonstrate the effectiveness of the resulting estimator in a variety of semiparametric settings on simulated and real-world data.
Source
For talks more than 180 days ago please have a look at the Munich Mathematical Calendar (filter: "Oberseminar Statistics and Data Science").