06.05.2026 16:15 Daniela M. Witten (University of Washington, Seattle): Data Thinning and beyond
Contemporary data analysis pipelines often involve the use and reuse of data. For instance, a scientist may explore a dataset to select an interesting hypothesis, and then wish to test this hypothesis with the same data. From a statistical perspective, this double use of data is highly problematic: it induces dependence between the hypothesis generation and testing stages, which complicates inference. Failure to account for this dependence renders classical inference techniques invalid.
I will present "data thinning", a set of strategies for obtaining independent training and test sets so that the former can be used to select a hypothesis, and the latter to test it. Data thinning enables valid selective inference in settings for which no solutions were previously available. However, it is also restrictive, in the sense that it requires strong distributional assumptions. Therefore, I will also present two strategies inspired by data thinning that enable valid post-selection inference without such assumptions. The first strategy thins summary statistics of the data, rather than the data itself, in order to take advantage of the asymptotic properties of those statistics. The second strategy generates training and test sets that are not independent, and then orthogonalizes the latter with respect to the former in order to conduct valid inference.
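A minimal sketch of the idea, using the classical Poisson case (an illustration, not necessarily the construction covered in the talk): if X ~ Poisson(λ) and X_train | X ~ Binomial(X, ε), then X_train ~ Poisson(ελ) and X_test = X − X_train ~ Poisson((1 − ε)λ), with X_train and X_test independent. The function name `poisson_thin` and the parameter values below are illustrative choices.

```python
import numpy as np

def poisson_thin(x, eps, rng):
    """Split Poisson counts x into two independent parts.

    x_train | x ~ Binomial(x, eps), so marginally
    x_train ~ Poisson(eps * lam) and x_test ~ Poisson((1 - eps) * lam),
    and the two parts are independent.
    """
    x_train = rng.binomial(x, eps)  # binomial thinning of each count
    x_test = x - x_train
    return x_train, x_test

rng = np.random.default_rng(0)
lam, eps, n = 10.0, 0.5, 100_000
x = rng.poisson(lam, size=n)
x_train, x_test = poisson_thin(x, eps, rng)

# Empirical check: means near eps*lam and (1-eps)*lam, correlation near 0.
print(x_train.mean(), x_test.mean())
print(np.corrcoef(x_train, x_test)[0, 1])
```

The train counts can then be used to select a hypothesis (e.g. pick an interesting feature) and the test counts to test it with classical tools, since the two parts are genuinely independent.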
20.05.2026 12:15 Veronica Vinciotti (University of Trento, IT): t.b.a.
t.b.a.
24.06.2026 12:15 Saber Salehkaleybar (Leiden University, NL): t.b.a.
t.b.a.
01.07.2026 12:15 Fang Han (University of Washington, Seattle): t.b.a.
t.b.a.