Jessica Ray (MIT Lincoln Laboratory), 2014

In this talk, we compare the implementation of deep learning networks [1] on traditional x86 processors with the implementation on NVIDIA Tesla K20 GPU Accelerators, for the purposes of training Restricted Boltzmann Machines [2] and deep network back propagation in a large-vocabulary speech recognition task (automatic transcription of TED talks). Two GPU implementations are compared: 1) a high-level implementation using Theano [3] and 2) a native implementation using low-level CUDA BLAS libraries. We describe the scaling properties of these implementations in comparison to a baseline batched x86 implementation as a function of training data size. We also explore the development time tradeoffs for each of the implementations.
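The talk itself does not include code, but a minimal sketch may help illustrate the contrast between the two GPU approaches it compares: in Theano, the visible-to-hidden pass of an RBM is written as a symbolic expression that the library compiles into GPU kernels (essentially a GEMM plus elementwise ops), whereas a native implementation would issue the equivalent GEMM directly through CUDA BLAS calls such as cublasSgemm. The layer sizes, batch size, and variable names below are illustrative assumptions, not figures from the talk.

```python
import numpy as np
import theano
import theano.tensor as T

# Visible-to-hidden pass of an RBM: p(h = 1 | v) = sigmoid(v W + b).
# Theano compiles this expression graph for the GPU; a native CUDA BLAS
# implementation would perform the same matrix multiply via cublasSgemm.
rng = np.random.RandomState(0)
n_visible, n_hidden, batch = 1024, 2048, 256   # illustrative sizes only

W = theano.shared(rng.randn(n_visible, n_hidden).astype('float32'), name='W')
b = theano.shared(np.zeros(n_hidden, dtype='float32'), name='b')

v = T.matrix('v')                               # one minibatch of visible units
h_prob = T.nnet.sigmoid(T.dot(v, W) + b)        # hidden-unit activation probabilities
up_pass = theano.function([v], h_prob)

minibatch = rng.rand(batch, n_visible).astype('float32')
print(up_pass(minibatch).shape)                 # (256, 2048)
```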