IJCAI-2020


Compressed Communication for Large-scale Distributed Deep Learning — A Tutorial

Tutorial Venue

International Joint Conference on Artificial Intelligence (IJCAI 2020), Yokohama, Japan

Tutorial Dates

11-13 July, 2020

Tutorial Slides

Please follow this Link to download the presentation slides.

Presenters

El Houcine Bergou, houcine.bergou@kaust.edu.sa

Aritra Dutta, aritra.dutta@kaust.edu.sa, Personal Website

Panos Kalnis, panos.kalnis@kaust.edu.sa, Personal Website

King Abdullah University of Science and Technology (KAUST)

Description

We survey compressed communication methods for distributed deep learning and discuss the theoretical background, as well as practical deployment on TensorFlow and PyTorch. We also present a quantitative comparison of the training speed and model accuracy of compressed communication methods on popular deep neural network models and datasets.

Abstract

Recent advances in machine learning and the availability of huge corpora of digital data have resulted in an explosive growth of DNN model sizes; consequently, the required computational resources have dramatically increased. As a result, distributed learning is becoming the de facto norm. However, scaling various systems to support fast DNN training on large clusters of compute nodes is challenging. Recent works have identified that most distributed training workloads are communication-bound. To remedy the network bottleneck, various compression techniques have emerged, including sparsification and quantization of the communicated gradients, as well as low-rank methods. Despite the potential gains, researchers and practitioners face a daunting task when choosing an appropriate compression technique. The reason is that training speed and model accuracy depend on multiple factors such as the actual framework used for the implementation, the communication library, the network bandwidth, and the characteristics of the model, to name a few.

In this tutorial, we will provide an overview of the state-of-the-art gradient compression methods for distributed deep learning. We will present the theoretical background and convergence guarantees of the most representative sparsification, quantization, and low-rank compression methods. We will also discuss their practical implementation on TensorFlow and PyTorch with different communication libraries, such as Horovod, OpenMPI, and NCCL. Additionally, we will present a quantitative comparison of the most popular gradient compression techniques in terms of training speed and model accuracy, for a variety of deep neural network models and datasets. We aim to provide a comprehensive theoretical and practical background that will allow researchers and practitioners to utilize the appropriate compression methods in their projects.

Outline of the tutorial

The tutorial is divided into several parts:

What is distributed training?

A distributed optimization problem minimizes a function \min_{x\in \mathbb{R}^d} f(x)= \frac{1}{n}\sum_{i=1}^n f_i(x), where n is the number of workers. Each worker has a local copy of the model and access to a partition of the training data. The workers jointly update the model parameters x\in\mathbb{R}^d, where d is the number of parameters. Typically, f_i is an expectation over a random variable that samples the data. This setting is also known as distributed data-parallel training. The following figure shows two instances of distributed training.

[(a) Centralized distributed SGD with n workers and a unique master/parameter server. (b) An example of decentralized distributed SGD with n machines forming a ring topology.]
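
To make the setup concrete, below is a minimal sketch of one data-parallel SGD step in PyTorch, assuming the process group has already been initialized elsewhere; `model`, `criterion`, and `optimizer` are placeholder objects, not part of the tutorial material.

```python
# Sketch of one step of centralized data-parallel SGD: each of the n workers
# computes a local stochastic gradient on its data partition, the gradients
# are averaged across workers, and every worker applies the same update to x.
import torch
import torch.distributed as dist

def data_parallel_step(model, batch, target, criterion, optimizer):
    optimizer.zero_grad()
    loss = criterion(model(batch), target)
    loss.backward()                                   # local gradient of f_i
    world_size = dist.get_world_size()                # n workers
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world_size)                   # (1/n) * sum_i grad f_i(x)
    optimizer.step()                                  # identical update on all workers
    return loss.item()
```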

How is distributed training performed?

In distributed data-parallel training, each computing node (worker) has a local copy of the DNN model. The following figure shows how distributed training is performed at node i. [(a) DNN architecture at node i. (b) Gradient compression mechanism for one of the layers of a DNN.]
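
As a rough illustration of where compression plugs in at node i, the sketch below attaches a gradient hook to every parameter so that each layer's gradient can be processed as soon as backpropagation produces it; `compress` and `decompress` are placeholder callables (for instance, the Top-k operator sketched later on this page), not a specific library API.

```python
import torch

def attach_layerwise_compression(model, compress, decompress):
    """Attach a gradient hook to every parameter so its layer-wise gradient is
    compressed as soon as backpropagation produces it (wait-free backprop).
    `compress`/`decompress` are placeholders for any compression operator Q."""
    def hook(grad):
        payload = compress(grad)
        # In a real system the compressed payload is communicated here
        # (e.g., via allgather); this sketch simply decompresses locally.
        return decompress(payload, grad.shape)

    for p in model.parameters():
        if p.requires_grad:
            p.register_hook(hook)
```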

What is the bottleneck? What is the remedy?

The parameters of modern DNNs belong to a high-dimensional space; as a result, the gradient vectors are high dimensional as well. As the DNN architecture shows, during backpropagation each node calculates the layer-wise gradients. These large gradient vectors are exchanged among the workers through the network, and the aggregated values are sent back to the workers. This process is repeated until convergence. Gradient communication therefore involves large amounts of data, and the network bandwidth becomes the bottleneck.

To alleviate this problem, many recent works propose compressing the communicated gradients to reduce the transferred data volume. This tutorial focuses on gradient compression; parameter compression is not our interest and is orthogonal to our work. Formally, we define the gradient compression mechanism as follows:

For a formal definition of the compression operator Q, we refer the readers to this paper.
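
As one concrete, illustrative instance of such an operator Q, the following sketch implements Top-k sparsification: only the k largest-magnitude entries of the gradient, together with their indices, are kept for communication. The names and signatures are ours, not those of any particular library.

```python
import math
import torch

def topk_compress(grad, ratio=0.01):
    """Top-k sparsification: keep the k largest-magnitude gradient entries."""
    k = max(1, int(grad.numel() * ratio))
    flat = grad.flatten()
    _, idx = torch.topk(flat.abs(), k)
    return flat[idx], idx                 # only (values, indices) are communicated

def topk_decompress(payload, shape):
    """Rebuild a dense gradient from the communicated (values, indices) pair."""
    values, idx = payload
    flat = torch.zeros(math.prod(shape), dtype=values.dtype, device=values.device)
    flat[idx] = values
    return flat.view(shape)
```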

Classification of Compression

We identify four main classes of compressors in the literature: quantization, sparsification, hybrid methods that combine the two, and low-rank methods.

We refer to the following table for a comprehensive overview of gradient compression techniques, although we do not claim it is exhaustive.
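
For contrast with the sparsification sketch above, here is a hedged sketch of a quantization-class compressor in the spirit of signSGD: each gradient entry is reduced to its sign, together with a single scaling factor, so each coordinate costs roughly one bit. This is only an illustration of the class, not the exact operator from any cited paper.

```python
import torch

def sign_compress(grad):
    """Quantization-style compressor (signSGD-like): keep only the sign of each
    entry plus one scaling factor per tensor."""
    scale = grad.abs().mean()             # single float sent alongside the signs
    return scale, torch.sign(grad)

def sign_decompress(payload, shape):
    scale, signs = payload
    return (scale * signs).view(shape)
```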

Is layer-wise compression better than full-model compression?

Compression methods can reduce the communicated data volume and provide convergence guarantees (under standard assumptions). However, there is a discrepancy between the theoretical analysis and the practical implementation of existing compression methods. To the best of our knowledge, the theoretical analysis of every prior gradient compression method assumes that compression is applied to the gradient vector of the entire model. However, from the perspective of existing implementations and our experience implementing compression methods, we observe that in all state-of-the-art deep learning toolkits, such as TensorFlow and PyTorch, compression is applied layer by layer to the DNN (as illustrated in the next figure) to enable wait-free backpropagation, where the gradients of each layer are sent as soon as they are available.

[(a) Layer-wise training vs (b) entire model training]
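
The distinction matters because a compressor applied to each layer separately does not, in general, produce the same result as the same compressor applied to the concatenated full-model gradient (Top-k, for example, selects different entries). The sketch below contrasts the two, reusing hypothetical `compress`/`decompress` callables such as the Top-k pair above.

```python
import torch

def compress_layerwise(grads, compress, decompress):
    """Apply the compressor to each layer's gradient tensor independently,
    as deep learning toolkits do in practice (wait-free backprop)."""
    return [decompress(compress(g), g.shape) for g in grads]

def compress_full_model(grads, compress, decompress):
    """Apply the compressor once to the concatenated gradient of the entire
    model, as most theoretical analyses assume."""
    flat = torch.cat([g.flatten() for g in grads])
    dense = decompress(compress(flat), flat.shape)
    out, offset = [], 0
    for g in grads:
        out.append(dense[offset:offset + g.numel()].view(g.shape))
        offset += g.numel()
    return out
```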

A Unified Framework

We now propose a unified, general framework for compressed, communication-efficient SGD.
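
As a hedged sketch of what such a framework could look like in code, the training step below is parameterized by an arbitrary compression operator (Top-k, sign quantization, low-rank, ...), so the communication strategy can be swapped without touching the rest of the loop. The `Compressor` interface and `compressed_step` function are illustrative names of ours, not part of any particular library.

```python
import torch
import torch.distributed as dist

class Compressor:
    """Interface of a generic compression operator Q (illustrative)."""
    def compress(self, tensor):            # returns a compact payload
        raise NotImplementedError
    def decompress(self, payload, shape):  # rebuilds a dense tensor
        raise NotImplementedError

def compressed_step(model, batch, target, criterion, optimizer, compressor):
    """One step of compressed, communication-efficient SGD (sketch)."""
    optimizer.zero_grad()
    loss = criterion(model(batch), target)
    loss.backward()
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is None:
            continue
        payload = compressor.compress(p.grad)         # Q(gradient): less data on the wire
        # A real implementation exchanges `payload` between workers here
        # (allgather for sparse payloads, allreduce for dense ones).
        dense = compressor.decompress(payload, p.grad.shape)
        dist.all_reduce(dense, op=dist.ReduceOp.SUM)  # aggregate across workers
        p.grad.copy_(dense.div_(world_size))
    optimizer.step()
    return loss.item()
```

Any compressor that implements the two-method interface (for example, the Top-k or sign sketches above) can be plugged into this loop unchanged.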

Selected References