NOML
Nonlinear Optimization and Machine Learning I

Invited Session
Time Slot: Wednesday Afternoon
Room: 003
Chair: Matteo Lapucci

Implicit bias of gradient descent for learning linear neural networks

Time: 17:30

Holger Rauhut (RWTH Aachen University, Chair for Mathematics of Information Processing)

Deep neural networks are usually trained by minimizing a non-convex loss functional via (stochastic) gradient descent methods. A puzzling empirical observation is that learning neural networks with a number of parameters exceeding the number of training examples often leads to zero loss, i.e., the network exactly interpolates the data. Nevertheless, it generalizes very well to unseen data, which is in stark contrast to intuition from classical statistics, which would predict overfitting. A current working hypothesis is that the chosen optimization algorithm has a significant influence on the selection of the learned network. In fact, in this overparameterized context there are many global minimizers, so that the optimization method induces an implicit bias on the computed solution. It seems that gradient descent methods and their stochastic variants favor networks of low complexity (in a suitable sense, yet to be fully understood) and, hence, appear to be very well suited for large classes of real data.

Initial attempts at understanding the implicit bias phenomenon consider the simplified setting of linear networks, i.e., (deep) factorizations of matrices. This has revealed a surprising relation to the field of low-rank matrix recovery (a variant of compressive sensing), in the sense that gradient descent favors low-rank matrices in certain situations. Moreover, restricting further to diagonal matrices, or equivalently factorizing the entries of a vector to be recovered, shows a connection to compressive sensing and l1-minimization.

After giving a general introduction to these topics, the talk will concentrate on results by the speaker on the convergence of gradient flows and gradient descent for learning linear neural networks and on the implicit bias towards low-rank and sparse solutions.
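
For illustration, the following is a minimal numerical sketch of the diagonal (sparse) case mentioned above. It is not code from the talk, just a toy example under simple assumptions: an underdetermined linear system is overparameterized as x = u*u - v*v and solved by plain gradient descent from a small initialization, and among the many interpolating solutions the iterates approach a sparse one, echoing the connection to l1-minimization.

    # Toy illustration (not the speaker's code): implicit bias of gradient
    # descent in the diagonal / sparse setting. The unknown vector is
    # overparameterized as x = u*u - v*v and fitted to an underdetermined
    # linear system by plain gradient descent from a small initialization.
    import numpy as np

    rng = np.random.default_rng(1)
    m, n, s = 30, 60, 3                       # measurements, dimension, sparsity

    x_true = np.zeros(n)
    x_true[rng.choice(n, s, replace=False)] = rng.choice([-1.0, 1.0], s)
    A = rng.standard_normal((m, n)) / np.sqrt(m)
    y = A @ x_true                            # noiseless, underdetermined system

    alpha, lr = 1e-6, 0.02                    # the small init is what creates the bias
    u = alpha * np.ones(n)
    v = alpha * np.ones(n)

    for _ in range(20000):
        x = u * u - v * v                     # Hadamard overparameterization
        grad_x = A.T @ (A @ x - y)            # gradient of 0.5*||A x - y||^2 w.r.t. x
        u, v = u - lr * 2 * u * grad_x, v + lr * 2 * v * grad_x

    x_gd = u * u - v * v
    x_min_norm = np.linalg.pinv(A) @ y        # minimum l2-norm interpolator (dense)

    print("support of x_true:           ", np.flatnonzero(np.abs(x_true) > 1e-3))
    print("largest entries of GD result:", np.sort(np.argsort(-np.abs(x_gd))[:s]))
    print("l1 norms: gd %.2f, min-norm %.2f, true %.2f"
          % (np.abs(x_gd).sum(), np.abs(x_min_norm).sum(), np.abs(x_true).sum()))

The comparison with the minimum l2-norm interpolator is the point of the sketch: both solutions fit the data exactly, but the gradient descent iterate concentrates on a few coordinates, much like an l1-minimizer would.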

Contextual Decision Trees

Time: 17:50

Tommaso Aldinucci (Dipartimento di Ingegneria dell’Informazione, Università degli Studi di Firenze), Enrico Civitelli, Leonardo Di Gangi

Focusing on Random Forests, we propose a multi-armed contextual bandit recommendation framework for the feature-based selection of a single shallow tree from the learned ensemble. The trained system, which works on top of the Random Forest, dynamically identifies the base predictor responsible for providing the final output. In this way, we obtain local interpretations by observing the rules of the recommended tree. The experiments carried out reveal that our dynamic method is superior to an independently fitted CART decision tree and comparable to the whole black-box Random Forest in terms of predictive performance.
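
As a rough illustration of this kind of pipeline (a hypothetical sketch, not the authors' implementation), the snippet below trains a Random Forest of shallow trees and then a LinUCB-style contextual bandit whose arms are the individual trees, whose context is the feature vector, and whose reward is 1 when the selected tree classifies a sample correctly; each test sample is then routed to a single interpretable tree.

    # Hypothetical sketch: route each sample to one shallow tree of a Random
    # Forest via a LinUCB-style contextual bandit (arms = trees, context =
    # feature vector, reward = 1 if the chosen tree is correct).
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    scaler = StandardScaler().fit(X_tr)
    X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

    # Shallow trees keep every individual predictor human-readable
    forest = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0)
    forest.fit(X_tr, y_tr)

    d, n_arms, ucb_alpha = X.shape[1], len(forest.estimators_), 0.5
    A = np.stack([np.eye(d) for _ in range(n_arms)])   # per-arm ridge matrices
    b = np.zeros((n_arms, d))

    # One online bandit pass over the training set
    for x, label in zip(X_tr, y_tr):
        theta = np.stack([np.linalg.solve(A[a], b[a]) for a in range(n_arms)])
        bonus = np.sqrt(np.einsum('i,aij,j->a', x, np.linalg.inv(A), x))
        arm = int(np.argmax(theta @ x + ucb_alpha * bonus))
        reward = float(forest.estimators_[arm].predict(x[None])[0] == label)
        A[arm] += np.outer(x, x)
        b[arm] += reward * x

    # At prediction time each sample is answered by a single shallow tree,
    # whose decision path provides a local explanation of the output.
    theta = np.stack([np.linalg.solve(A[a], b[a]) for a in range(n_arms)])
    chosen = np.argmax(X_te @ theta.T, axis=1)
    pred = np.array([forest.estimators_[a].predict(x[None])[0]
                     for a, x in zip(chosen, X_te)])
    print("routed single-tree accuracy:", (pred == y_te).mean())
    print("full forest accuracy:       ", forest.score(X_te, y_te))

The bandit rule, the dataset and all parameter choices here are placeholders; the point is only the division of labor: the forest supplies the candidate predictors, and a feature-conditioned selector picks which one answers each query.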

A Multi-Objective Programming Problem for Sparse Optimization with Application in SVM Feature Selection

Time: 18:10

Behzad Pirouz (DIMES, Università della Calabria), Manlio Gaudioso

We propose a novel Multi-Objective Optimization (MO) model for sparse optimization based on the polyhedral k-norm. We put particular emphasis on the application of sparse optimization to Feature Selection for Support Vector Machine (SVM) classification. In our MO model, the first objective is the classic SVM classification error, while the second one is the number of significant features, that is, the number of nonzero components of the normal vector to the separating hyperplane. The model we consider is the continuous relaxation of an MINLP model. The goal is to compute the Pareto optimal solutions, leaving it to the decision-maker to evaluate the trade-off and the best compromise between the two objectives.

We present some numerical results obtained using MO algorithms for our model. They are compared with those of several existing models based on the l1 and l0 norms, in both single- and multi-objective format.
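
A hedged sketch of a bi-objective formulation in this spirit (not necessarily the authors' exact model) can be written with the polyhedral k-norm ||w||_[k], the sum of the k largest absolute components of w: since ||w||_0 <= k exactly when ||w||_1 - ||w||_[k] = 0, that difference serves as a continuous sparsity surrogate alongside the SVM hinge loss.

    % Illustrative bi-objective SVM feature-selection model (assumptions as above)
    \begin{align*}
      \min_{w,\,b,\,\xi}\quad & \Bigl( \sum_{i=1}^{m} \xi_i ,\;\; \|w\|_1 - \|w\|_{[k]} \Bigr) \\
      \text{s.t.}\quad & y_i \bigl( w^\top x_i + b \bigr) \ge 1 - \xi_i, \qquad
        \xi_i \ge 0, \qquad i = 1,\dots,m,
    \end{align*}
    % where $\|w\|_{[k]}$ is the polyhedral (vector) $k$-norm, i.e. the sum of the
    % $k$ largest absolute components of $w$; the second objective vanishes exactly
    % when $w$ has at most $k$ nonzero entries.

Scanning a weighted combination of the two objectives, or constraining one while minimizing the other, then traces out candidate Pareto points from which the decision-maker can pick the preferred accuracy/sparsity compromise.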