# Research Area B2

# High dimensionality and data analysis

## Research Area Leaders: Michael Griebel, Alois Kneip

## PIs: Michael Griebel, Alois Kneip

## Contributions by Christian Bayer, Christoph Breunig, Jürgen Dölz, Joachim Freyberger, Jochen Garcke, Dominik Liebl, Christian Rieger

## Topics and goals

The efficient treatment of large sets of high-dimensional data by machine learning methods is a major challenge in big data applications. This project focuses on exploiting the low intrinsic dimensionality of nominally high-dimensional data sets to develop numerical and statistical techniques for the transformation and subsequent analysis of these sets. Our research will be driven by applications in econometrics, macroeconomics, biostatistics, and engineering.

## State of the art, our expertise

**Function-valued data analysis**. An integral part of our research will be the statistical analysis of high-dimensional, and especially function-valued, data. Examples are wealth distributions over time, daily electricity prices as functions of demand, or psychometric biomarker trajectories. Here, function values may have been recorded directly or must be recovered from discrete, noisy observations, so the analysis faces a multitude of challenging problems. For example, under different setups, Kneip et al. [KSS12] and Vogt et al. [BLV15] used functional factor models to analyze economic panel data. Since high-dimensional multivariate data can often be interpreted as discretized observations of individual trajectories of a continuous stochastic process, functional regression is closely related to high-dimensional multivariate regression. This point of view offers a way to deal with heavily correlated input variables, a notorious difficulty in sparse regression. Kneip et al. [KPS16] showed that in functional regression, points in the domain that have a significant impact on a response variable can be identified with high accuracy by decorrelation based on difference functionals quantifying local covariation; see Figure (a).
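The decorrelating effect of a difference functional can be illustrated with a small computation. The sketch below is not taken from [KPS16]: it assumes a centered second difference Z(t) = X(t) - (X(t - δ) + X(t + δ))/2 as the functional and standard Brownian motion as the underlying process, and all function names are hypothetical. Under these assumptions, covariances follow exactly from the kernel min(s, t), so no simulation is needed.

```python
# Illustrative sketch only. Assumptions (not from the source): the difference
# functional is a centered second difference, and X is standard Brownian motion.

def brownian_cov(s, t):
    """Covariance of standard Brownian motion: Cov(X(s), X(t)) = min(s, t)."""
    return min(s, t)

def second_difference_cov(cov, s, t, delta):
    """Cov(Z(s), Z(t)) for Z(u) = X(u) - (X(u - delta) + X(u + delta)) / 2."""
    weights = [(-0.5, -delta), (1.0, 0.0), (-0.5, delta)]
    return sum(a * b * cov(s + ds, t + dt)
               for a, ds in weights
               for b, dt in weights)

def correlation(cov_fn, s, t):
    """Correlation between the values at s and t implied by a covariance function."""
    return cov_fn(s, t) / (cov_fn(s, s) * cov_fn(t, t)) ** 0.5

s, t, delta = 0.5, 0.8, 0.1

# Correlation of the raw trajectory values X(s) and X(t).
raw_corr = correlation(brownian_cov, s, t)

# Correlation of the differenced values Z(s) and Z(t).
diff_corr = correlation(
    lambda u, v: second_difference_cov(brownian_cov, u, v, delta), s, t
)

print(f"raw correlation:         {raw_corr:.4f}")
print(f"differenced correlation: {diff_corr:.4f}")
```

In this toy setting the raw values at the two points are strongly correlated, whereas the differenced values are uncorrelated once the points are at least 2δ apart, because they then depend on non-overlapping increments of the process.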

**Cluster analysis**. Clustering algorithms are a prominent tool to process large-scale data sets. Virtually all methods for clustering functional data rely on a nonparametric regression model with fixed design points. Recently, Vogt–Linton [VL17] developed a thresholding approach which applies to both fixed and random design models under general conditions. Although there are results on the convergence behavior of most clustering algorithms, statistical confidence statements about the estimation error produced by them are usually not available. In recent work with Schmid, Vogt developed a sequential testing procedure for the number of clusters which allows for rigorous error control.

**Grid-based learning and dimensionality reduction**. While kernel-based learning approaches like support vector regression are very popular, their computational costs usually scale quadratically in the amount of data. Furthermore, if the kernel is given as an infinite series expansion, it has to be truncated properly; see [GRZ15]. Bohn–Griebel showed that sparse grids provide a good alternative for big data problems. In [BG17], they analyzed the regression error, complementing the results of Smale et al. for reproducing kernel Hilbert spaces. Even for highly nonlinear problems, data often stem from a low-dimensional manifold in the ambient space. The identification of this intrinsic structure with dimension-adaptive sparse grids was successfully tackled by Bohn–Garcke–Griebel [BGG16]. Deep neural networks as studied by Bengio et al. can also detect the intrinsic manifold in big data problems. However, it is not straightforward to justify their success mathematically. First approaches in this direction were investigated for instance by Mallat et al. and Bölcskei et al. for convolutional networks.
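To make the appeal of sparse grids in higher dimensions concrete, the following sketch compares the number of points of a full tensor-product grid with that of a regular sparse grid. It assumes the textbook construction without boundary points, in which the hierarchical subspace with level multi-index l contributes 2^(|l|_1 - d) points and the sparse grid of level n keeps the subspaces with |l|_1 ≤ n + d - 1. The function names are illustrative, and this is not the specific discretization analyzed in [BG17].

```python
from itertools import product

def sparse_grid_size(d, n):
    """Number of interior points of a regular sparse grid of level n in d dimensions.

    Assumes the standard construction: the subspace with levels (l_1, ..., l_d),
    l_i >= 1, has 2**(l_1 + ... + l_d - d) points and is kept if
    l_1 + ... + l_d <= n + d - 1.
    """
    total = 0
    for levels in product(range(1, n + 1), repeat=d):
        if sum(levels) <= n + d - 1:
            total += 2 ** (sum(levels) - d)
    return total

def full_grid_size(d, n):
    """Number of interior points of the full tensor-product grid with mesh width 2**(-n)."""
    return (2 ** n - 1) ** d

# Brute-force enumeration is only meant for small d and n.
level = 5
for dim in (1, 2, 3, 4):
    print(dim, sparse_grid_size(dim, level), full_grid_size(dim, level))
```

For d = 2 and n = 3, for example, the sparse grid has 17 points compared with 49 for the full grid, and the gap widens quickly as the dimension grows.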

**Data mining on small data sets**. Many machine learning approaches, like deep neural networks, rely on the availability of a large number of labeled data points. However, this is often unrealistic for engineering or scientific applications. Garcke et al. [AGH+16] showed that for physical processes, the available mathematical models, e.g., partial differential equations, and numerical simulations thereof can be used as additional data sources to identify invariants and problem-adapted distance terms; see Figure (b). The approach of Bachmayr–Dahmen [BD16] shows how to solve such potentially very high-dimensional partial differential equations with adaptive algorithms using nonlinear combinations of sparse and low-rank approximations. The authors analyzed the complexity of computational cost with respect to the guaranteed error in the energy norm. The resulting methods can be applied, for example, to a broad class of uncertainty quantification problems. So far, these results assume full knowledge of all problem data.

**Macroeconomic models**. In macroeconomic models of heterogeneous agents, both the numerical and the statistical techniques outlined above can be applied. To model how heterogeneity and inequality feed back on, and interact with, aggregate variables, one needs to describe the evolution of the distributions of agents across wealth, income, and portfolio positions. Here, global solution techniques reach their limits when it comes to multi-dimensional models with endogenous heterogeneity as in Bayer–Tjaden [BT16]. Mixed techniques solve the individual economic agents’ decision problem globally, but approximate the aggregate dynamics locally in terms of difference equations. Due to the curse of dimensionality, the treatment of high-dimensional heterogeneity constitutes a major problem. Assuming a fixed copula to model the effect of an increase in idiosyncratic income risks on the aggregate economy, Bayer–Lütticke–Pham-Dao–Tjaden recently applied sparse grids to approximate the dynamics of distributions.

## Research program

**Function-valued data analysis**. We aim to develop a framework generalizing functional and sparsity-based regression in order to establish new techniques combining Lasso with the decorrelation approach in [KPS16]. A challenging goal will be the efficient statistical analysis of functional data exhibiting so-called phase variation. Indeed, many data sets in biomedicine or speech recognition, for example, consist of functions possessing a common pattern of peaks and valleys with random time and amplitude variation. Recently, Wagner–Kneip showed the limitations of existing methods and proposed an algorithm for identifying a suitable nonlinear subspace characterizing the data. However, crucial theoretical and methodological questions remain to be resolved and will be a central research theme. In this context, we will investigate algorithms for manifold learning, as developed for instance by Garcke and Griebel, which provide a promising approach in this direction.

**Cluster analysis**. Virtually all of the proposed methods for clustering nonparametric functions heavily depend on a number of bandwidth or smoothing parameters, whose choice may strongly influence the results. An important issue is to develop techniques for clustering nonparametric functions which are free of smoothing parameters. We aim to achieve this with the help of ideas from statistical multiscale testing. Another challenge is to devise methods with rigorous statistical error control for general clusters. The recent results by Vogt–Schmid are restricted to the case of convex spherical clusters. Markedly different techniques are needed to deal with more general cluster shapes. We will approach this problem with the help of kernel methods as used in kernel k-means clustering, for instance.
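As a point of reference for the kernel-based direction mentioned above, the following is a minimal sketch of generic kernel k-means, written as Lloyd-type iterations in feature space that use only kernel evaluations. It is a textbook variant, not the method to be developed in this research area and not a procedure with statistical error control. The Gaussian kernel, the parameter `gamma`, the initial labels, and all function names are assumptions made for illustration.

```python
import math

def rbf_kernel_matrix(points, gamma):
    """Gaussian kernel matrix K[i][j] = exp(-gamma * ||x_i - x_j||^2)."""
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(p, q)))
             for q in points] for p in points]

def kernel_kmeans(K, labels, n_clusters, max_iter=100):
    """Reassign points to the nearest cluster mean in feature space until stable."""
    n = len(K)
    labels = list(labels)
    for _ in range(max_iter):
        members = [[i for i in range(n) if labels[i] == c]
                   for c in range(n_clusters)]
        # Squared norm of each cluster mean: (1/|C|^2) * sum_{j,l in C} K[j][l].
        within = [sum(K[j][l] for j in m for l in m) / len(m) ** 2 if m else math.inf
                  for m in members]
        new_labels = []
        for i in range(n):
            dists = []
            for c, m in enumerate(members):
                if not m:
                    # An empty cluster cannot attract points.
                    dists.append(math.inf)
                    continue
                cross = sum(K[i][j] for j in m) / len(m)
                # Squared feature-space distance from point i to the mean of cluster c.
                dists.append(K[i][i] - 2.0 * cross + within[c])
            new_labels.append(min(range(n_clusters), key=dists.__getitem__))
        if new_labels == labels:
            break
        labels = new_labels
    return labels

# Toy data: two well-separated groups on the real line, deliberately bad start.
points = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
K = rbf_kernel_matrix(points, gamma=1.0)
result = kernel_kmeans(K, labels=[0, 1, 0, 1, 0, 1], n_clusters=2)
print(result)
```

The toy data only illustrate the mechanics. Because the algorithm works with the kernel matrix alone, the cluster shapes it can separate are governed by the chosen kernel and its parameters rather than by convexity in the original coordinates.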

**Grid-based learning and dimensionality reduction**. Our goal is to establish a theoretical framework based on the analysis of Bohn–Griebel [BG17] for more general regression algorithms. To this end, enhancing the stability analysis to more complex bases is crucial. To compare our results to methods based on truncated multiscale kernels, we will rely on the work of RA C4 concerning the stable construction of such bases. In order to detect the intrinsic nonlinear structure of complex data sets, we will build dimensionality reduction algorithms which minimize the number of relevant coordinates of a sampled function by domain transformations. This will allow for an efficient use of multilevel tensor-product spaces like sparse grids. Since this approach represents a special type of manifold learning method, its application to the high-dimensional and also function-valued setting, as studied by Kneip and Vogt, and to the setting of shape space manifolds, as investigated in RA B3, will be of special interest. We also aim for more flexible algorithms using compositions of kernel translates to automatically learn dimensionality reduction maps. Since these compositions can be understood as neural networks, we will study the impact of our analysis on the convergence theory of deep learning.

**Data mining on small data sets**. We aim to extend the approach of identifying invariants and problem-adapted distance measures from numerical simulation data to situations where strongly varying environmental influences are present, such as in sensor signals from wind turbines. A key objective is to identify robust features that allow us to detect changes in the physical parameters of the underlying system, even if it is subject to constantly changing deterministic and stochastic excitations. However, the numerical treatment of such complex models can become very cost-intensive. To overcome this problem, the further development of adaptive methods to work with limited and uncertain input data, as pursued by Bachmayr, will be crucial. In this context, we will consider a combination with hierarchical Bayesian models to incorporate complex regularizing information and obtain a well-posed problem. A particular aim is to understand and manage the complexity of computational cost of adaptive solvers in such data-driven settings, which also plays an important role in IRU D1. As recent results by Bachmayr et al. [BCDM17] show, their performance crucially depends on an appropriate choice of coordinates in high dimensions.

**Macroeconomic models**. A macroeconomic application of our research, complementing the microeconomic studies in RA C2, lies in the empirical and theoretical analysis of policies and shocks through their impacts on wealth and income distributions. This requires the development of filtering techniques for mixed-frequency functional data where, for example, the income-marginal of the wealth-income distribution is observed more regularly than the joint distribution, but less frequently than the mean. Combining the expertise of the contributors to this RA, we will develop adaptive methods for dimensionality reduction to study the macroeconomic effects of changes in the liquidity of housing markets, taking into account the age structure of households.

## Summary

In many applications, an efficient analysis has to identify low-dimensional manifolds characterizing the intrinsic structure of the data and then exploit it by modeling the relation between the input data, its low-dimensional representation, and the desired output variable. This requires new methodologies at the intersection of numerical and statistical analysis. We aim for efficient data mining and dimensionality reduction methods, which aid our understanding of high-dimensional state-of-the-art problems and lead to novel algorithms. There are strong links to RA C4, where multiscale approximations are investigated, and to IRU D1, where large molecular simulation data bases are of particular importance. Furthermore, the methods and results of this research area will be relevant to the machine learning aspects of IRUs D3–5.

## Bibliography

[AGH+16] A. Aguilera, R. Grunzke, D. Habich, J. Luong, D. Schollbach, U. Markwardt, and J. Garcke. Advancing a gateway infrastructure for wind turbine data analysis. *J. Grid Computing, 14(4):499–514, 2016*.

[BCDM17] M. Bachmayr, A. Cohen, R. DeVore, and G. Migliorati. Sparse polynomial approximation of parametric elliptic PDEs. Part II: lognormal coefficients. *ESAIM Math. Model. Numer. Anal., 51:341–363, 2017*.

[BD16] M. Bachmayr and W. Dahmen. Adaptive low-rank methods: Problems on Sobolev spaces. *SIAM J. Numer. Anal., 54:744–796, 2016*.

[BG17] B. Bohn and M. Griebel. Error estimates for multivariate regression on discretized function spaces. *SIAM J. Numer. Anal., 55(4):1843–1866, 2017*.

[BGG16] B. Bohn, J. Garcke, and M. Griebel. A sparse grid based method for generative dimensionality reduction of high-dimensional data. *J. Comput. Phys., 309:1–17, 2016*.

[BLV15] L. Boneva, O. Linton, and M. Vogt. A semiparametric model for heterogeneous panel data with fixed effects. *J. Econometrics, 188:327–345, 2015*.

[BT16] C. Bayer and V. Tjaden. Large open economies and fixed costs of capital adjustment. *Rev. Econ. Dyn., 21:125–146, 2016*.

[GRZ15] M. Griebel, C. Rieger, and B. Zwicknagl. Multiscale approximation and reproducing kernel Hilbert space methods. *SIAM J. Numer. Anal., 53(2):852–873, 2015*.

[KPS16] A. Kneip, D. Poss, and P. Sarda. Functional linear regression with points of impact. *Ann. Statist., 44:1–30, 2016*.

[KSS12] A. Kneip, R. Sickles, and W. Song. A new panel data treatment for heterogeneity in time trends. *Econometric Theory, 28:590–628, 2012*.

[VL17] M. Vogt and O. Linton. Classification of non-parametric regression functions in longitudinal data models. *J. R. Stat. Soc. Ser. B. Stat. Methodol., 79:5–27, 2017*.