Sparse Matrix Factorization
Behnam Neyshabur¹ and Rina Panigrahy²
arXiv:1311.3315v3 [cs.LG] 13 May 2014
Toyota Technological Institute at Chicago
Abstract. We investigate the problem of factoring a matrix into several sparse matrices and propose an algorithm for this under randomness and sparsity assumptions. This problem can be viewed as a simplification of the deep learning problem where finding a factorization corresponds to finding edges in different layers and also values of hidden units. We prove that under certain assumptions on a sparse linear deep network with $n$ nodes in each layer, our algorithm is able to recover the structure of the network and values of top layer hidden units for depths up to $\tilde{O}(n^{1/6})$. We further discuss the relation among sparse matrix factorization, deep learning, sparse recovery and dictionary learning.
Keywords: Sparse Matrix Factorization, Dictionary Learning, Sparse Encoding, Deep Learning
In this paper we study the following matrix factorization problem. The sparsity π(X) of a matrix X is the number of non-zero entries in X.
Problem 1 (Sparse Matrix-Factorization). Given an input matrix $Y$, factorize it as $Y = X_1 X_2 \cdots X_s$ so as to minimize the total sparsity $\sum_{i=1}^{s} \pi(X_i)$.
The above problem is a simplification of the non-linear version of the problem that is directly related to learning using deep networks.
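To make the objective of Problem 1 concrete, the following minimal sketch computes $\pi(X)$ and the total sparsity of a factorization. The helper names (`sparsity`, `matmul`, `total_sparsity`) and the 4×4 matrices are hypothetical and chosen only for illustration; they do not come from the paper, and this is not the paper's algorithm.

```python
def sparsity(X):
    """pi(X): number of non-zero entries of a matrix given as a list of rows."""
    return sum(1 for row in X for v in row if v != 0)


def matmul(A, B):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def total_sparsity(factors):
    """Objective of Problem 1: the sum of pi(X_i) over all factors."""
    return sum(sparsity(X) for X in factors)


# Illustrative factors (an assumption, not data from the paper):
# X1 has a single non-zero column, X2 has a single non-zero row.
X1 = [[1, 0, 0, 0] for _ in range(4)]
X2 = [[1, 1, 1, 1]] + [[0, 0, 0, 0] for _ in range(3)]

Y = matmul(X1, X2)                 # the all-ones 4x4 matrix
print(sparsity(Y))                 # non-zeros in the dense product
print(total_sparsity([X1, X2]))    # non-zeros summed over the factors
```

In this toy case the product $Y$ is fully dense while the two factors together have far fewer non-zero entries, which is the kind of factorization the objective rewards.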
Problem 2 (Non-linear Sparse Matrix-Factorization). Given matrix $Y$, minimize $\sum_{i=1}^{s} \pi(X_i)$ such that $\sigma(X_1 \cdot \sigma(X_2 \cdot \sigma(\cdots X_s))) = Y$, where $\sigma(x)$ is the sign function ($+1$ if $x > 0$, $-1$ if $x < 0$, and $0$ otherwise) and $\sigma$ applied to a matrix applies the sign function to each entry. Here the entries of $Y$ are $0, \pm 1$.
Connection to Deep Learning and Compression: The above problem is related to learning using deep networks (see ), which are generalizations of neural networks. They are layered networks of nodes connected by edges between successive layers; each node applies a non-linear operation (usually a sigmoid or a perceptron) to the weighted combination of inputs along its edges. Given the non-linear sigmoid function and the deep layered structure, they can express any circuit. The weights of the edges in a deep network with $s$ layers may be represented by the matrices $X_1, \ldots, X_s$. If we use the sign function instead of the step function, the computation in the neural network corresponds exactly to computing $Y = \sigma(X_1 \cdot \sigma(X_2 \cdot \sigma(\cdots X_s)))$. Here $X_s$ corresponds to the matrix of inputs at the top layer.
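The nested expression $\sigma(X_1 \cdot \sigma(X_2 \cdot \sigma(\cdots X_s)))$ can be evaluated from the innermost factor outward. The sketch below is one way to do that, under the assumption that $X_s$ is used directly as the top-layer input and $\sigma$ is applied after each multiplication. The function names and the 2×2 matrices are hypothetical illustrations, not taken from the paper.

```python
def sign(x):
    """Sign function: +1 if x > 0, -1 if x < 0, and 0 otherwise."""
    return (x > 0) - (x < 0)


def sign_matrix(M):
    """Apply the sign function to every entry of a list-of-rows matrix."""
    return [[sign(v) for v in row] for row in M]


def matmul(A, B):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def forward(factors):
    """Evaluate sigma(X1 . sigma(X2 . sigma(... Xs))).

    Assumption: the last factor Xs holds the top-layer inputs and is used
    as-is; each earlier factor is multiplied in and followed by sigma.
    """
    H = factors[-1]
    for X in reversed(factors[:-1]):
        H = sign_matrix(matmul(X, H))
    return H


# Illustrative weights and inputs (assumptions, not data from the paper).
X1 = [[2, 0], [1, 3]]
X2 = [[1, 1], [0, -1]]
X3 = [[1, -1], [0, 1]]   # top-layer inputs

Y = forward([X1, X2, X3])
print(Y)  # every entry is 0, +1 or -1
```

Whatever the magnitudes of the weights, the output entries are always in $\{0, \pm 1\}$, which matches the requirement on $Y$ in Problem 2.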
There has been a strong resurgence in the study of deep networks, resulting in major breakthroughs in the field of machine learning by Hinton and others [7,11,5]. Some of the best state-of-the-art methods use deep networks for several applications including speech, handwriting, and image recognition [8,6,15]. Traditional neural networks were typically used for supervised learning and are trained using the gradient-descent-style back-propagation algorithm. More recent variants use unsupervised learning for pre-training, where the deep network can be viewed as a generative model for the observed data $Y$. The goal then is to learn from $Y$ the network structure and the inputs that are encoded by the matrices $X_1, \ldots, X_s$. In one variant called Deep Boltzmann Machines, each layer is a Restricted Boltzmann Machine (RBM), which is reversible in the sense that inputs can be produced from outputs by inverting the network. Auto-encoders are another variant used to learn deep structures in the data. One of the main differences between auto-encoders and RBMs is that in an RBM, the edge weights used for generating the observed data are the same as those used for recovering the hidden variables, i.e. the encoding and decoding functions are the same; auto-encoders, however, allow different encoders and decoders. Some studies have shown...
References:
3. Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine
13. R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. Journal of Machine Learning Research, 5:448–455, 2009.
15. Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. Regularization of neural networks using DropConnect. ICML, 2013.
16. P. M. Wood. Universality and the circular law for sparse random matrices. The Annals of Applied Probability, 22(3):1266–1300, 2012.