Nitish Srivastava Department of Computer Science University of Toronto firstname.lastname@example.org
Ruslan Salakhutdinov Department of Statistics and Computer Science University of Toronto email@example.com
A Deep Boltzmann Machine is described for learning a generative model of data that consists of multiple and diverse input modalities. The model can be used to extract a uniﬁed representation that fuses modalities together. We ﬁnd that this representation is useful for classiﬁcation and information retrieval tasks. The model works by learning a probability density over the space of multimodal inputs. It uses states of latent variables as representations of the input. The model can extract this representation even when some modalities are absent by sampling from the conditional distribution over them and ﬁlling them in. Our experimental results on bi-modal data consisting of images and text show that the Multimodal DBM can learn a good generative model of the joint space of image and text inputs that is useful for information retrieval from both unimodal and multimodal queries. We further demonstrate that this model signiﬁcantly outperforms SVMs and LDA on discriminative tasks. Finally, we compare our model to other deep learning methods, including autoencoders and deep belief networks, and show that it achieves noticeable gains.
Information in the real world comes through multiple input channels. Images are associated with captions and tags, videos contain visual and audio signals, sensory perception includes simultaneous inputs from visual, auditory, motor and haptic pathways. Each modality is characterized by very distinct statistical properties which make it difﬁcult to ignore the fact that they come from different input channels. Useful representations can be learned about such data by fusing the modalities into a joint representation that captures the real-world ‘concept’ that the data corresponds to. For example, we would like a probabilistic model to correlate the occurrence of the words ‘beautiful sunset’ and the visual properties of an image of a beautiful sunset and represent them jointly, so that the model assigns high probability to one conditioned on the other. Before we describe our model in detail, it is useful to note why such a model is required. Different modalities typically carry different kinds of information. For example, people often caption an image to say things that may not be obvious from the image itself, such as the name of the person, place, or a particular object in the picture. Unless we do multimodal learning, it would not be possible to discover a lot of useful information about the world (for example, ‘what do beautiful sunsets look like?’). We cannot afford to have discriminative models for each such concept and must extract this information from unlabeled data. In a multimodal setting, data consists of multiple input modalities, each modality having a different kind of representation and correlational structure. For example, text is usually represented as discrete sparse word count vectors, whereas an image is represented using pixel intensities or outputs of feature extractors which are real-valued and dense. This makes it much harder to discover relationships across modalities than relationships among features in the same modality. There is a lot of structure in the input but it is difﬁcult to discover the highly non-linear relationships that exist 1
Figure 1: Left: Examples of text generated from a DBM by sampling from P (vtxt |vimg , θ). Right: Examples of images retrieved using features generated from a DBM by sampling from P (vimg |vtxt , θ).
between low-level features across different modalities. Moreover, these observations are typically very noisy and may have missing values. A good multimodal learning model must satisfy certain properties. The joint representation must be such...