Seminar on Statistics and Data Science

This seminar series is organized by the research group in mathematical statistics and features talks on advances in methods of data analysis, statistical theory, and their applications.
The speakers are external guests as well as researchers from other groups at TUM.

All talks in the seminar series are listed in the Munich Mathematical Calendar.

 

The seminar takes place in room BC1 2.01.10 under the current rules and simultaneously via Zoom. To stay up to date on upcoming presentations, please join our mailing list. You will receive an email to confirm your subscription.

Zoom link

Join the seminar. Please enter the session under your real name. The session will start roughly 10 minutes before the talk.

 

Upcoming talks

01.06.2022 12:15 Jack Kuipers (ETH Zürich): Efficient sampling for Bayesian networks and benchmarking their structure learning

Bayesian networks are probabilistic graphical models widely employed to understand dependencies in high-dimensional data, and even to facilitate causal discovery. Learning the underlying network structure, which is encoded as a directed acyclic graph (DAG), is highly challenging, mainly due to the vast number of possible networks in combination with the acyclicity constraint, and a plethora of algorithms have been developed for this task. Efforts have focused on two fronts: constraint-based methods that perform conditional independence tests to exclude edges, and score-and-search approaches that explore the DAG space with greedy or MCMC schemes. We synthesize these two fields in a novel hybrid method that reduces the complexity of Bayesian MCMC approaches to that of a constraint-based method. This enables full Bayesian model averaging for much larger Bayesian networks and offers significant improvements in structure learning. To facilitate the benchmarking of different methods, we further present a novel automated workflow for producing scalable, reproducible, and platform-independent benchmarks of structure learning algorithms. It is interfaced via a simple config file, which makes it accessible to all users, while the code is designed in a fully modular fashion to enable researchers to contribute additional methodologies. We demonstrate the applicability of this workflow for learning Bayesian networks in typical data scenarios. References: doi:10.1080/10618600.2021.2020127 and arXiv:2107.03863

15.06.2022 12:15 Harry Joe (University of British Columbia, CAN): Comparison of dependence graphs based on different functions of correlation matrices

t.b.a.

22.06.2022 12:15 Han Li (University of Melbourne, AUS): Joint Extremes in Temperature and Mortality: A Bivariate POT Approach

This research project contributes to insurance risk management by modeling extreme climate risk and extreme mortality risk in an integrated manner via extreme value theory (EVT). We conduct an empirical study using monthly temperature and death data and find that the joint extremes in cold weather and old-age death counts exhibit the strongest level of dependence. Based on the estimated bivariate generalized Pareto distribution, we quantify the extremal dependence between death counts and temperature indexes. Methodologically, we employ the bivariate peaks over threshold (POT) approach, which is readily applicable to a wide range of topics in extreme risk management.
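
As a rough illustration of the first step of a bivariate POT analysis, the sketch below selects threshold exceedances from paired observations. It is not the speaker's implementation: it assumes the common multivariate convention that an observation counts as an exceedance when at least one component exceeds its threshold, and the function name, thresholds, and sample values are all hypothetical.

```python
def bivariate_exceedances(pairs, u_temp, u_deaths):
    """Keep observations where at least one component exceeds its threshold
    and return them as excesses over the threshold vector (u_temp, u_deaths).

    Assumption: "exceedance in at least one component" is the convention used;
    the talk may define exceedances differently.
    """
    return [
        (temp - u_temp, deaths - u_deaths)
        for temp, deaths in pairs
        if temp > u_temp or deaths > u_deaths
    ]


# Made-up values for illustration only: (temperature index, death count).
sample = [(1.0, 100), (3.5, 90), (2.0, 130), (4.0, 140), (0.5, 80)]
excesses = bivariate_exceedances(sample, u_temp=3.0, u_deaths=120)
print(excesses)
```

A bivariate generalized Pareto distribution would then be fitted to the retained excesses; that fitting step is omitted here.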

22.06.2022 13:15 Hans Manner (University of Graz, AT): Testing the equality of changepoints (joint with Siegfried Hörmann, TU Graz)

Testing for the presence of changepoints and determining their location is a common problem in time series analysis. Applying changepoint procedures to multivariate data results in higher power and more precise location estimates, both in online and offline detection. However, this requires that all changepoints occur at the same time. We study the problem of testing the equality of changepoint locations. One approach is to treat common breaks as a common feature and test whether an appropriate linear combination of the data can cancel the breaks. We propose how to determine such a linear combination and derive the asymptotic distribution of the resulting CUSUM and MOSUM statistics. We also study the power of the test under local alternatives and provide simulation results on its finite-sample performance. Finally, we suggest a clustering algorithm to group variables into clusters that are co-breaking.
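
For readers unfamiliar with CUSUM statistics, the following minimal sketch computes the textbook CUSUM process for a single change in the mean of a univariate series and takes its maximizer as the changepoint estimate. It is not the test proposed in the talk: the function name is hypothetical, the statistic is left unscaled by the (long-run) standard deviation, and no critical values are computed.

```python
def cusum_changepoint(x):
    """Return (k_hat, max_stat) for the textbook CUSUM statistic
    max_k |S_k - (k/n) * S_n| / sqrt(n), where S_k is the k-th partial sum.

    Assumption: a single mean change; the statistic is not standardized by a
    variance estimate, which a real test would require.
    """
    n = len(x)
    mean = sum(x) / n
    best_k, best_stat = 0, 0.0
    partial = 0.0
    for k in range(1, n):
        partial += x[k - 1]  # S_k, sum of the first k observations
        stat = abs(partial - k * mean) / n ** 0.5
        if stat > best_stat:
            best_k, best_stat = k, stat
    return best_k, best_stat


# Noise-free toy series with a mean shift after the 10th observation.
series = [0.0] * 10 + [1.0] * 10
k_hat, stat = cusum_changepoint(series)
print(k_hat, stat)
```

In the co-breaking setting of the abstract, such a statistic would be applied to a linear combination of the component series; if the changepoints coincide, a suitable combination cancels the break and the statistic stays small.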

06.07.2022 12:15 Anastasios Panagiotelis (University of Sydney, AUS): Anomaly detection with kernel density estimation on manifolds

Manifold learning can be used to obtain a low-dimensional representation of the underlying manifold given the high-dimensional data. However, kernel density estimates of the low-dimensional embedding with a fixed bandwidth fail to account for the way manifold learning algorithms distort the geometry of the underlying Riemannian manifold. We propose a novel kernel density estimator for any manifold learning embedding by introducing the estimated Riemannian metric of the manifold as the variable bandwidth matrix for each point. The geometric information of the manifold guarantees a more accurate density estimation of the true manifold, which subsequently could be used for anomaly detection. To compare our proposed estimator with a fixed-bandwidth kernel density estimator, we run two simulations, one with 2-D metadata mapped into a 3-D Swiss roll or twin peaks shape and one with a 5-D semi-hypersphere mapped into a 100-D space, and demonstrate that the proposed estimator can improve the density estimates given a good manifold learning embedding and has higher rank correlations between the true and estimated manifold density. A Shiny app in R is also developed for various simulation scenarios. The proposed method is applied to density estimation in statistical manifolds of electricity usage with the Irish smart meter data. This demonstrates our estimator's capability to fix the distortion of the manifold geometry and to be further used for anomaly detection in high-dimensional data.

Previous talks

25.05.2022 12:15 Oksana Chernova (Nationale Taras-Schewtschenko-Universität Kiew, Ukraine): Estimation in Cox proportional hazards model with measurement errors

The Cox proportional hazards model is a semiparametric regression model that can be used in medical research, engineering or insurance for investigating the association between the survival time (the so-called lifetime) of an object and predictor variables. We investigate the Cox proportional hazards model for right-censored data, where the baseline hazard rate belongs to an unbounded set of nonnegative Lipschitz functions, with fixed constant, and the vector of regression parameters belongs to a compact parameter set, and in addition, the time-independent covariates are subject to measurement errors. We construct a simultaneous estimator of the baseline hazard rate and regression parameter, present asymptotic results and discuss goodness-of-fit tests.

11.05.2022 16:15 Florentina Bunea (Cornell University): Surprises in topic model estimation and new Wasserstein document-distance calculations

Topic models have been and continue to be an important modeling tool for an ensemble of independent multinomial samples with shared commonality. Although applications of topic models span many disciplines, the jargon used to define them stems from text analysis. In keeping with the standard terminology, one has access to a corpus of n independent documents, each utilizing words from a given dictionary of size p. One draws N words from each document and records their respective counts, thereby representing the corpus as a collection of n samples from independent, p-dimensional, multinomial distributions, each having a different, document-specific, true word probability vector Π. The topic model assumption is that each Π is a mixture of K discrete distributions that are common to the corpus, with document-specific mixture weights. The corpus is assumed to cover K topics that are not directly observable, and each of the K mixture components corresponds to conditional probabilities of words, given a topic. The vector of the K mixture weights, per document, is viewed as a document-specific topic distribution T, and is thus expected to be sparse, as most documents will only cover a few of the K topics of the corpus. Despite the large body of work on learning topic models, the estimation of sparse topic distributions, of unknown sparsity, especially when the mixture components are not known and are estimated from the same corpus, is not well understood and will be the focus of this talk. We provide estimators of T, with sharp theoretical guarantees, valid in many practically relevant situations, including the scenario p >> N (short documents, sparse data) and unknown K. Moreover, the results are valid when dimensions p and K are allowed to grow with the sample sizes N and n. When the mixture components are known, we propose MLE estimation of the sparse vector T, the analysis of which has been open until now.
The surprising result, and a remarkable property of the MLE in these models, is that, under appropriate conditions, and without further regularization, it can be exactly sparse, and contain the true zero pattern of the target. When the mixture components are not known, we exhibit computationally fast and rate-optimal estimators for them, and propose a quasi-MLE estimator of T, shown to retain the properties of the MLE. The practical implication of our sharp, finite-sample, rate analyses of the MLE and quasi-MLE is that having short documents can be compensated for, in terms of estimation precision, by having a large corpus. Our main application is to the estimation of Wasserstein distances between document-generating distributions. We propose, estimate and analyze Wasserstein distances between alternative probabilistic document representations, at the word and topic level, respectively. The effectiveness of the proposed Wasserstein distance estimates, and the contrast with the more commonly used Word Mover's Distance between empirical frequency estimates, is illustrated by an analysis of an IMDb movie reviews data set. Brief Bio: Florentina Bunea obtained her Ph.D. in Statistics at the University of Washington, Seattle. She is now a Professor of Statistics in the Department of Statistics and Data Science, and she is affiliated with the Center for Applied Mathematics and the Department of Computer Science, at Cornell University. She is a fellow of the Institute of Mathematical Statistics, and she is or has been part of numerous editorial boards such as JRSS-B, JASA, Bernoulli, and the Annals of Statistics. Her work has been continuously funded by the US National Science Foundation. Her most recent research interests include latent space models, topic models, and optimal transport in high dimensions.

10.05.2022 12:45 Marten Wegkamp (Cornell University, Ithaca, New York): Optimal Discriminant Analysis in High-Dimensional Latent Factor Models

In high-dimensional classification problems, a commonly used approach is to first project the high-dimensional features into a lower-dimensional space, and base the classification on the resulting lower-dimensional projections. In this talk, we formulate a latent-variable model with a hidden low-dimensional structure to justify this two-step procedure and to guide which projection to choose. We propose a computationally efficient classifier that takes certain principal components (PCs) of the observed features as projections, with the number of retained PCs selected in a data-driven way. A general theory is established for analyzing such two-step classifiers based on any low-dimensional projections. We derive explicit rates of convergence of the excess risk of the proposed PC-based classifier. The obtained rates are further shown to be optimal up to logarithmic factors in the minimax sense. Our theory allows, but does not require, the lower dimension to grow with the sample size and the feature dimension to exceed the sample size. Simulations support our theoretical findings. This is joint work with Xin Bing (Department of Statistical Sciences, University of Toronto).

03.05.2022 13:00 Anna-Laura Sattelberger (Max Planck Institute for Mathematics in the Sciences, Leipzig): Bayesian Integrals on Toric Varieties

Toric varieties have a strong combinatorial flavor: those algebraic varieties are described in terms of a fan. Based on joint work with M. Borinsky, B. Sturmfels, and S. Telen (https://arxiv.org/abs/2204.06414), I explain how to understand toric varieties as probability spaces. Bayesian integrals for discrete statistical models that are parameterized by a toric variety can be computed by a tropical sampling method. Our methods are motivated by the study of Feynman integrals and positive geometries in particle physics.

23.02.2022 17:00 Kailun Zhu (TU Delft): Regular vines with strongly chordal pattern of conditional independence

In this talk, the relationship between strongly chordal graphs and m-saturated vines (regular vines with certain nodes removed or assigned the independence copula) is proved. Moreover, an algorithm to construct an m-saturated vine structure corresponding to a strongly chordal graph is provided. When the underlying data is sparse, this approach leads to improvements in the estimation process compared to current heuristic methods. Furthermore, due to the reduction of model complexity, it is possible to evaluate all vine structures as well as to fit non-simplified vines.