Naftali Tishby, Hebrew University of Jerusalem
In the past several years we have developed a comprehensive theory of large-scale learning with Deep Neural Networks (DNNs) optimized with Stochastic Gradient Descent (SGD). The theory is built on three theoretical components: (1) Rethinking the standard (PAC-like) distribution-independent worst-case generalization bounds, turning them into problem-dependent typical (in the information-theoretic sense) bounds that are independent of the model architecture.
(2) The Information Plane theorem: for large-scale typical learning, the tradeoff between sample complexity and accuracy is characterized by only two numbers: the mutual information that the representation (a layer in the network) maintains on the input patterns, and the mutual information each layer has on the desired output label. The information-theoretic optimal tradeoff between these encoder and decoder information values is given by the Information Bottleneck (IB) bound for the rule-specific input-output distribution. (3) The layers of the DNN reach this optimal bound via standard SGD training, in high (input and layer) dimension.
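The two information-plane coordinates named in (2) can be made concrete with a small numerical sketch. The example below is an illustration only, not taken from the talk: the toy distribution (four equiprobable inputs, a binary label) and the names `mutual_information` and `information_plane_point` are hypothetical. It computes the encoder information I(X;T) and the decoder information I(T;Y) for two hand-written encoders p(t|x).

```python
import numpy as np


def mutual_information(joint):
    """Mutual information (in bits) of a discrete joint distribution given as a 2-D array."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()                 # normalize defensively
    p_rows = joint.sum(axis=1, keepdims=True)   # marginal over columns, shape (n, 1)
    p_cols = joint.sum(axis=0, keepdims=True)   # marginal over rows, shape (1, m)
    mask = joint > 0                            # skip zero cells (0 * log 0 = 0)
    product = p_rows @ p_cols                   # product of marginals, shape (n, m)
    return float(np.sum(joint[mask] * np.log2(joint[mask] / product[mask])))


def information_plane_point(p_x, p_y_given_x, p_t_given_x):
    """Return (I(X;T), I(T;Y)) for a representation T that depends on X only."""
    p_xt = p_x[:, None] * p_t_given_x                    # joint p(x, t)
    p_ty = p_t_given_x.T @ (p_x[:, None] * p_y_given_x)  # joint p(t, y), summed over x
    return mutual_information(p_xt), mutual_information(p_ty)


# Hypothetical toy problem: 4 equiprobable inputs, label y = x // 2.
p_x = np.full(4, 0.25)
p_y_given_x = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

# Encoder A copies the input (no compression).
identity_encoder = np.eye(4)
# Encoder B keeps only the label-relevant bit of the input.
coarse_encoder = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

point_identity = information_plane_point(p_x, p_y_given_x, identity_encoder)
point_coarse = information_plane_point(p_x, p_y_given_x, coarse_encoder)

print("identity encoder (I(X;T), I(T;Y)):", point_identity)
print("coarse encoder   (I(X;T), I(T;Y)):", point_coarse)
```

In this toy case both encoders retain the full 1 bit of information about the label, but the coarse encoder does so while keeping 1 bit rather than 2 bits about the input. That is the kind of compression-versus-prediction tradeoff the IB bound quantifies.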
In this talk I will briefly review these results and discuss two new surprising outcomes of this theory: (1) the computational benefit of the hidden layers, and (2) the emerging understanding of the features encoded by each layer, which follows from the convergence to the IB bound.
Based on joint work with Noga Zaslavsky, Ravid Ziv, and Amichai Painsky.
Naftali Tishby is the Ruth & Stan Flinkman Professor in Brain Research at the Hebrew University of Jerusalem, where he is a member of The Benin School of Computer Science and Engineering and The Edmond and Lily Safra Center for Brain Sciences. Educated as a physicist, he has made profound contributions to problems ranging from chemical reaction dynamics to speech recognition, and from natural language processing to the dynamics of real neural networks in the brain. In the late 1980s Tishby and colleagues recast learning in neural networks as a statistical physics problem, and went on to discover that learning in large networks could show phase transitions, as exposure to increasing numbers of examples “cools” the parameters of the network into a range of values that provides qualitatively better performance. Most recently he has emerged as one of the leading figures in efforts to understand the success of deep learning, and this will be the topic of his seminar.
We have set aside two hours, in the hopes of encouraging greater interaction and discussion.
Sponsored by the Initiative for the Theoretical Sciences, and by the CUNY doctoral programs in Physics and Biology.
Supported in part by the Center for the Physics of Biological Function, a joint effort of The Graduate Center and Princeton University.