Seminar on Statistics and Data Science

This seminar series is organized by the research group in statistics and features talks on advances in methods of data analysis, statistical theory, and their applications. The speakers are external guests as well as researchers from other groups at TUM. All talks in the seminar series are listed in the Munich Mathematical Calendar.

The seminar takes place in room 8101.02.110, unless announced otherwise. To stay up to date on upcoming presentations, please join our mailing list. You will receive an email to confirm your subscription.

Upcoming talks

06.12.2023 12:00 Richard Guo (University of Cambridge): Harnessing Extra Randomness: Replicability, Flexibility and Causality

Many modern statistical procedures are randomized in the sense that their output is a random function of the data. For example, many procedures employ data splitting, which randomly divides the dataset into disjoint parts for separate purposes. Despite their flexibility and popularity, data splitting and other constructions of randomized procedures have obvious drawbacks. First, two analyses of the same dataset may lead to different results due to the extra randomness introduced. Second, randomized procedures typically lose statistical power because the entire sample is not fully utilized.

To address these drawbacks, in this talk I will study how to properly combine the results from multiple realizations (such as multiple data splits) of a randomized procedure. I will introduce rank-transformed subsampling as a general method for delivering large-sample inference on the combined result under minimal assumptions. I will illustrate the method with three applications: (1) a “hunt-and-test” procedure for detecting cancer subtypes using high-dimensional gene expression data, (2) testing the hypothesis of no direct effect in a sequentially randomized trial, and (3) calibrating cross-fit “double machine learning” confidence intervals. For these problems, our method is able to derandomize and improve power. Moreover, in contrast to existing approaches for combining p-values, our method enjoys type-I error control that asymptotically approaches the nominal level. This new development opens up the possibility of designing procedures that explicitly randomize and derandomize: extra randomness is introduced to make the problem easier before being marginalized out.

This talk is based on joint work with Prof. Rajen Shah.

Bio: Richard Guo is a research associate in the Statistical Laboratory at the University of Cambridge, mentored by Prof. Rajen Shah. Previously, he was the Richard M. Karp Research Fellow in the 2022 causality program at the Simons Institute for the Theory of Computing. He received his PhD in Statistics from the University of Washington in 2021, advised by Thomas Richardson. His research interests include graphical models, causal inference, semiparametric methods, and the replicability of data analysis. Dr. Guo will start as a tenure-track assistant professor in Biostatistics at the University of Washington in 2024.
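The first drawback can be seen in a minimal toy sketch (not the speaker's method; the dataset, split sizes, and test are made up for illustration): running the same "hunt on one half, test on the other" procedure with different random splits of one dataset yields different p-values.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.2, size=200)  # toy sample with a small mean shift

def split_and_test(data, seed):
    """Choose a test direction on one half, then run a one-sided
    z-test on the other half (a toy hunt-and-test procedure)."""
    idx = np.random.default_rng(seed).permutation(len(data))
    train, test = data[idx[:100]], data[idx[100:]]
    sign = 1.0 if train.mean() >= 0 else -1.0        # "hunt" step
    z = math.sqrt(len(test)) * sign * test.mean() / test.std(ddof=1)
    return 0.5 * math.erfc(z / math.sqrt(2))         # one-sided p-value

pvals = [split_and_test(data, seed) for seed in range(20)]
print(f"p-values over 20 splits: {min(pvals):.3f} .. {max(pvals):.3f}")
```

The spread between the smallest and largest p-value is exactly the extra randomness the talk proposes to marginalize out.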

06.12.2023 13:00 Stefan Bauer (TUM): Learning Causal Representations: Explainable AI for Structured Exploration

Deep neural networks have achieved outstanding success in many tasks, ranging from computer vision to natural language processing and robotics. However, such models still pale in their ability to understand the world around us and to generalize and adapt to new tasks or environments. One possible solution to this problem is causal models, since they can reason about the connections between causal variables and the effect of intervening on them. This talk will introduce the fundamental concepts of causal inference, connections and synergies with deep learning, as well as practical applications and advances in sustainability and AI for science.

Previous talks

within the last 180 days

15.11.2023 12:15 Simon Buchholz (Max Planck Institute for Intelligent Systems, Tübingen): Identifiability and Robustness in Causal Representation Learning

Many datasets for modern machine learning consist of high-dimensional observations that are generated from some low-dimensional latent variables. While recent advances in deep learning allow us to sample from distributions of almost arbitrary complexity, the recovery of the ground-truth latent variables is still challenging even in simple settings. We study this problem through the lens of identifiability, i.e., when can we, at least theoretically, hope to recover the latent structure up to certain symmetries? We will present a general identifiability result for interventional data and a contrastive algorithm to find the latent variables. In the second part, we study the robustness of identifiability results to misspecification, one challenge for practical applications of representation learning. This talk is based on joint work with Goutham Rajendran, Elan Rosenfeld, Bryon Aragam, Bernhard Schölkopf, and Pradeep Ravikumar.

Bio: Simon Buchholz received his PhD in mathematics from the University of Bonn, where he was advised by Stefan Mueller. Currently he is a postdoctoral researcher with Bernhard Schölkopf in the Department of Empirical Inference at the Max Planck Institute for Intelligent Systems in Tübingen, where he works on problems in causal representation learning.

06.11.2023 12:15 Jordan Bryan (University of North Carolina): Application of least squares principles to water quality monitoring in North Carolina

Motivated by applications to water quality monitoring using fluorescence spectroscopy, we develop the source apportionment model for high dimensional profiles of dissolved organic matter (DOM). We describe simple methods to estimate the parameters of a linear source apportionment model, and we show how the estimates are related to those of ordinary and generalized least squares. Using this least squares framework, we analyze the variability of the estimates, and we propose predictors for missing elements of a DOM profile. We demonstrate the practical utility of our results on fluorescence spectroscopy data collected from the Neuse River in North Carolina.
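The least-squares backbone of source apportionment can be sketched in a few lines (the source profiles, contributions, and noise level below are made up; this is not the authors' model or data): an observed profile is modeled as a nonnegative combination of source spectra, and the contributions are recovered by ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical source profiles: each column is one DOM source's spectrum
# over 200 wavelength channels.
S = np.abs(rng.normal(size=(200, 3)))
w_true = np.array([0.5, 0.3, 0.2])             # true source contributions
y = S @ w_true + 0.01 * rng.normal(size=200)   # observed profile with noise

# OLS estimate of the contributions: w_hat = argmin_w ||y - S w||^2
w_hat, *_ = np.linalg.lstsq(S, y, rcond=None)
print(np.round(w_hat, 3))
```

Generalized least squares replaces the squared norm with a norm weighted by the inverse noise covariance; the missing-element predictors in the talk build on the same decomposition.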

06.11.2023 13:15 Andreas Gerhardus (Deutsches Zentrum für Luft und Raumfahrt, Jena): Novel developments in causal graphical models for time series

In this talk, we begin with a motivation for and brief introduction to causal graphical modeling of time series. We then discuss two recent works in this area. First, a complete characterization of a class of graphical models for describing lag-resolved causal relationships in the presence of latent confounders. This characterization sheds new light on existing time series causal discovery algorithms and shows that there is room for stronger identifiability results than previously thought. Second, a method for projecting infinite time series graphs with time-invariant edges to finite marginal graphs. We argue that the construction of these marginal graphs is a big step towards a method-agnostic generalization of causal effect identifiability results to time series.

27.09.2023 09:00 Peter Bühlmann, Vanessa Didelez, Mathias Drton, Robin Evans, Niels Richard Hansen, Dominik Janzing, Niki Kilbertus, Giusi Moffa, Ricardo Silva, Stijn Vansteelandt, Kun Zhang: Miniworkshop on Graphical Models and Causality

**September 27, 2023**
09:00-09:45 Stijn Vansteelandt (Ghent University)
09:45-10:30 Vanessa Didelez (Leibniz Institute for Prevention Research and Epidemiology - BIPS)
break
11:00-11:45 Peter Bühlmann (ETH Zürich)
11:45-12:30 Dominik Janzing (Amazon Research)
lunch
14:00-14:45 Giusi Moffa (University of Basel)
14:45-15:30 Ricardo Silva (University College London)

**September 28, 2023**
10:00-10:45 Kun Zhang (Carnegie Mellon University)
10:45-11:30 Robin Evans (University of Oxford)
break
11:45-12:30 Niels Richard Hansen (University of Copenhagen)
lunch
14:00-14:45 Niki Kilbertus (Helmholtz / TUM)
14:45-15:30 Mathias Drton (TUM)

17.07.2023 14:00 Han Li (The University of Melbourne, Victoria): Boosted mortality models with age and spatial shrinkage

This paper extends the technique of gradient boosting in mortality forecasting. The two novel contributions are to use stochastic mortality models as weak learners in gradient boosting rather than trees, and to include a penalty that shrinks the forecasts of mortality in adjacent age groups and nearby geographical regions closer together. The proposed method demonstrates superior forecasting performance based on US male mortality data from 1969 to 2019. The boosted model with age-based shrinkage yields the most accurate national-level mortality forecast. For state-level forecasts, spatial shrinkage provides further improvement in accuracy in addition to the benefits achieved by age-based shrinkage. This additional improvement can be attributed to data sharing across states with both large and small populations in adjacent regions, as well as states which share common risk factors.
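The boosting template underlying this approach is standard L2 gradient boosting: repeatedly fit a weak learner to the current residuals and add a shrunken version of its fit. The sketch below uses a low-degree polynomial as a stand-in weak learner on made-up data; the paper's stochastic mortality models and age/spatial shrinkage penalty are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 200)
y = np.sin(4 * x) + 0.1 * rng.normal(size=200)  # toy target series

def fit_weak(x, r, degree=3):
    """Weak learner: low-degree polynomial fit to the residuals
    (placeholder for a stochastic mortality model)."""
    return np.polyval(np.polyfit(x, r, degree), x)

pred = np.zeros_like(y)
for _ in range(100):
    resid = y - pred                   # negative gradient of squared loss
    pred += 0.1 * fit_weak(x, resid)   # learning rate 0.1 damps each step

print(f"MSE after boosting: {np.mean((y - pred) ** 2):.4f}")
```

Because each boosting step only fits residuals, swapping the weak learner (trees, mortality models, penalized fits) leaves the outer loop unchanged, which is what makes the paper's substitution natural.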

22.06.2023 12:15 Harry Joe (University of British Columbia): Vine copula regression for observational studies

If explanatory variables and a response variable of interest are simultaneously observed, then multivariate models based on vine pair-copula constructions can be fit, and inferences can be based on the conditional distribution of the response variable given the explanatory variables. For applications, several practical issues arise when implementing this idea. Topics include: (a) inclusion of categorical predictors; (b) a right-censored response variable; (c) for a pair with one ordinal and one continuous variable, diagnostics for copula choice and assessment of copula fit; (d) use of the empirical beta copula; (e) performance metrics for prediction/classification and sensitivity to the choice of vine structure and of pair-copulas on the edges of the vine; (f) a weighted log-likelihood for an ordinal response variable; (g) comparisons with linear regression methods.

14.06.2023 13:15 Marcel Wienöbst (Universität zu Lübeck): Linear-Time Algorithms for Front-Door Adjustment in Causal Graphs

Causal effect estimation from observational data is a fundamental task in empirical sciences. It becomes particularly challenging when unobserved confounders are involved in a system. This presentation provides an introduction to front-door adjustment – a classic technique which, using observed mediators, allows one to identify causal effects even in the presence of unobserved confounding. Focusing on the algorithmic aspects, this talk presents recent results for finding front-door adjustment sets in linear time in the size of the causal graph. Link to technical report:
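For intuition, the front-door adjustment formula itself can be evaluated directly on a small discrete example. With treatment X, mediator M, and outcome Y, it reads P(y | do(x)) = Σ_m P(m | x) Σ_x' P(y | m, x') P(x'). The probabilities below are made up, and this illustrates only the formula, not the talk's linear-time algorithms for finding adjustment sets.

```python
import numpy as np

p_x = np.array([0.6, 0.4])                # P(X = x)
p_m_given_x = np.array([[0.8, 0.2],       # P(M = m | X = x), rows indexed by x
                        [0.3, 0.7]])
p_y1_given_mx = np.array([[0.1, 0.5],     # P(Y = 1 | M = m, X = x), rows by m
                          [0.4, 0.9]])

def p_y1_do_x(x):
    """Front-door formula for P(Y = 1 | do(X = x)) in the binary toy model."""
    total = 0.0
    for m in (0, 1):
        # Inner sum marginalizes the treatment back in: sum_x' P(y|m,x') P(x')
        inner = sum(p_y1_given_mx[m, xp] * p_x[xp] for xp in (0, 1))
        total += p_m_given_x[x, m] * inner
    return total

print(p_y1_do_x(0), p_y1_do_x(1))
```

Note that P(Y | M, X) and P(M | X) are all estimable from observational data, which is why the mediator lets us bypass the unobserved confounder between X and Y.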

12.06.2023 15:00 Jacob Bien (University of Southern California, Los Angeles): Generalized Data Thinning Using Sufficient Statistics

Sample splitting is one of the most tried-and-true tools in the data scientist toolbox. It breaks a data set into two independent parts, allowing one to perform valid inference after an exploratory analysis or after training a model. A recent paper (Neufeld, et al. 2023) provided a remarkable alternative to sample splitting, which the authors showed to be attractive in situations where sample splitting is not possible. Their method, called convolution-closed data thinning, proceeds very differently from sample splitting, and yet it also produces two statistically independent data sets from the original. In this talk, we will show that sufficiency is the key underlying principle that makes their approach possible. This insight leads naturally to a new framework, which we call generalized data thinning. This generalization unifies both sample splitting and convolution-closed data thinning as different applications of the same procedure. Furthermore, we show that this generalization greatly widens the scope of distributions where thinning is possible. This work is a collaboration with Ameer Dharamshi, Anna Neufeld, Keshav Motwani, Lucy Gao, and Daniela Witten.
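A standard convolution-closed example (a textbook fact, not code from the paper) is Poisson thinning: if X ~ Poisson(λ) and X1 | X ~ Binomial(X, ε), then X1 ~ Poisson(ελ) and X2 = X − X1 ~ Poisson((1 − ε)λ) are independent, so one draw of X yields two independent "copies" of the data.

```python
import numpy as np

rng = np.random.default_rng(2)
lam, eps, n = 10.0, 0.5, 200_000

x = rng.poisson(lam, size=n)   # original Poisson data
x1 = rng.binomial(x, eps)      # thin each count: X1 | X ~ Binomial(X, eps)
x2 = x - x1                    # remainder

# X1 ~ Poisson(eps*lam), X2 ~ Poisson((1-eps)*lam), and they are independent.
print(x1.mean(), x2.mean(), np.corrcoef(x1, x2)[0, 1])
```

Unlike sample splitting, every observation contributes to both halves, which is what makes thinning attractive when, say, there is only one observation per unit.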

07.06.2023 12:15 Manfred Denker (Penn State University): Monte Carlo estimation of multiple stochastic integrals and its statistical applications

Multiple stochastic integrals with respect to Brownian motion are a classical topic, while their counterparts with respect to stable processes have attracted far less interest. Their distributions can be simulated using U-statistics. This will be discussed in the first part of the talk. On the other hand, this representation allows for statistical applications to observations with slowly decaying tail distributions. I shall present some simulations and give an application from neuroscience.

07.06.2023 13:15 Alexis Derumigny (Delft University of Technology): Conditional empirical copula processes and generalized measures of association

We study the weak convergence of conditional empirical copula processes indexed by general families of conditioning events that have nonzero probabilities. We also study the case where the conditioning events are chosen in a data-driven way. The validity of several bootstrap schemes is established, including the exchangeable bootstrap. We define general multivariate measures of association, possibly given some fixed or random conditioning events. By applying our theoretical results, we prove the asymptotic normality of the estimators of such measures. We illustrate our results with financial data.

05.06.2023 15:30 Fang Han (University of Washington, Seattle): Chatterjee's rank correlation: what is new?

This talk will provide an overview of the recent progress made in exploring Sourav Chatterjee's newly introduced rank correlation. The objective is to elaborate on its practical utility and present several new findings pertaining to (a) the asymptotic normality and limiting variance of Chatterjee's rank correlation, (b) its statistical efficiency for testing independence, and (c) the issue of its bootstrap inconsistency. Notably, the presentation will reveal that Chatterjee's rank correlation is root-n consistent, asymptotically normal, but bootstrap inconsistent - an unusual phenomenon in the literature.
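For reference, in the no-ties case Chatterjee's coefficient has the simple closed form ξ_n = 1 − 3 Σ_{i=1}^{n−1} |r_{i+1} − r_i| / (n² − 1), where the pairs are sorted by x and r_i is the rank of the i-th y-value in that order. A minimal sketch (illustrative data, not from the talk):

```python
import numpy as np

def chatterjee_xi(x, y):
    """Chatterjee's rank correlation (assumes no ties in x or y)."""
    order = np.argsort(x)                      # sort the pairs by x
    r = np.argsort(np.argsort(y[order])) + 1   # ranks of y in that order
    n = len(x)
    return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n ** 2 - 1)

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
print(chatterjee_xi(x, x ** 2))                 # functional dependence: near 1
print(chatterjee_xi(x, rng.normal(size=2000)))  # independence: near 0
```

The non-monotone example y = x² is the point of the coefficient: classical rank correlations such as Spearman's would be near zero there, while ξ_n approaches 1 for any measurable functional relationship.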

For talks more than 180 days ago please have a look at the Munich Mathematical Calendar (filter: "Oberseminar Statistics and Data Science").