1. Model parallelism

Model parallelism distributes a model across multiple GPUs. There are several ways to split a model, but the typical method distributes the model layers across GPUs. On the forward pass, the first GPU processes a batch of data with its layers and passes the resulting activations to the next group of layers on the next GPU. On the backward pass, gradients flow in the opposite direction, from the GPU holding the final layers back to the GPU holding the first layers.

Model parallelism is a useful strategy for training models that are too large to fit into the memory of a single GPU. However, GPU utilization is unbalanced because only one GPU is active at a time while the others wait. Passing results between GPUs also adds communication overhead, which can become a bottleneck.

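My understanding: the model's layers are assigned in order to different GPUs, which gets around the memory limit of a single GPU. However, the GPUs work one after another like stations on an assembly line, so GPU utilization is low.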
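The layer-wise split described above can be sketched in plain Python. This is only a toy illustration, not a real multi-GPU implementation: the "layers" are ordinary functions, the "devices" are just list indices, and `split_into_stages` and `forward` are hypothetical helper names.

```python
# Hypothetical toy model: each "layer" is a plain Python function.
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]


def split_into_stages(layers, num_devices):
    """Group consecutive layers into stages; stage i would live on device i.

    Uses ceiling division, so it may produce fewer stages than devices
    when the layers do not divide evenly.
    """
    per_stage = -(-len(layers) // num_devices)  # ceiling division
    return [layers[i:i + per_stage] for i in range(0, len(layers), per_stage)]


def forward(stages, x):
    """Run the stages strictly one after another.

    Returns the output and a timeline listing which device was busy
    at each step (exactly one device per step).
    """
    timeline = []
    for device_id, stage in enumerate(stages):
        for layer in stage:
            x = layer(x)
        timeline.append(device_id)  # all other devices are idle in this step
    return x, timeline


stages = split_into_stages(layers, num_devices=2)
output, timeline = forward(stages, 5)
utilization = 1 / len(stages)  # one busy device per step

print(output, timeline, utilization)
```

Because the timeline never has more than one device working in a step, utilization in this toy setup is 1 divided by the number of stages, which is the imbalance the section describes.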

2. Data parallelism

Data parallelism evenly distributes data across multiple GPUs. Each GPU holds a full copy of the model and concurrently processes its portion of the data. At the end of each step, the results from all GPUs are synchronized and combined.

Data parallelism significantly reduces training time by processing data in parallel, and it is scalable to the number of GPUs available. However, synchronizing results from each GPU can add overhead.

PyTorch offers two implementations of data parallelism: DataParallel (DP) and DistributedDataParallel (DDP).

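My understanding: the full model (structure and parameters) is copied to every GPU, and the data is split and processed in parallel. This is fast, but it does not solve the memory problem, and combining the results can add significant overhead.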
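As a rough sketch of the "synchronize and combine" step, the snippet below assumes the combined result is the mean of the per-replica gradients, which is how gradient averaging is commonly done. The one-parameter model `y = w * x`, the data, and the helper names `grad_mse` and `shard` are all made up for illustration.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def shard(data, num_replicas):
    """Split data into equally sized contiguous shards, one per replica."""
    size = len(data) // num_replicas
    return [data[i * size:(i + 1) * size] for i in range(num_replicas)]


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5            # every replica holds the same copy of the parameter
num_replicas = 2   # stands in for the number of GPUs

# Each replica computes a gradient on its own shard of the batch.
local_grads = [
    grad_mse(w, shard_x, shard_y)
    for shard_x, shard_y in zip(shard(xs, num_replicas), shard(ys, num_replicas))
]

# Synchronization step: average the local gradients (an all-reduce with mean).
synced_grad = sum(local_grads) / num_replicas

# With equally sized shards this matches the gradient over the full batch.
full_grad = grad_mse(w, xs, ys)
print(local_grads, synced_grad, full_grad)
```

The averaged gradient equals the full-batch gradient here only because the shards have equal size; every replica then applies the same update, so the model copies stay identical.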

3. Pipeline parallelism

Pipeline parallelism is conceptually very similar to model parallelism, but it’s more efficient because it reduces the amount of idle GPU time. Instead of making each GPU wait until the previous one has finished a whole batch, pipeline parallelism splits each mini-batch into micro-batches. As soon as a GPU finishes one micro-batch, it passes the result to the next GPU and starts on the following micro-batch. This way, the GPUs process different parts of the data concurrently instead of waiting for another GPU to finish an entire mini-batch.

Pipeline parallelism shares the advantages of model parallelism while improving GPU utilization and reducing idle time. It can be more complex to set up, however, because models may need to be rewritten as a sequence of nn.Sequential modules. It also isn’t possible to eliminate idle time completely, because the last forward pass must also wait for the backward pass to finish.
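To make the effect of micro-batches concrete, here is a small scheduling sketch. It is a simplification under stated assumptions: it models only the forward pass, every stage takes exactly one time step per micro-batch, and `forward_schedule` and `utilization` are hypothetical helper names.

```python
def forward_schedule(num_stages, num_micro_batches):
    """Build schedule[t][s]: the micro-batch index stage s works on at step t.

    An entry of None means the stage is idle at that step. Stage s can
    start micro-batch m only after stage s - 1 has finished it, so it
    processes micro-batch m at step t = s + m.
    """
    total_steps = num_stages + num_micro_batches - 1
    schedule = []
    for t in range(total_steps):
        row = []
        for s in range(num_stages):
            m = t - s
            row.append(m if 0 <= m < num_micro_batches else None)
        schedule.append(row)
    return schedule


def utilization(schedule):
    """Fraction of (step, stage) slots in which a stage is busy."""
    slots = [slot for row in schedule for slot in row]
    return sum(slot is not None for slot in slots) / len(slots)


# One micro-batch behaves like plain layer-wise model parallelism.
naive = forward_schedule(num_stages=4, num_micro_batches=1)
# Splitting the same mini-batch into 8 micro-batches fills most of the idle slots.
pipelined = forward_schedule(num_stages=4, num_micro_batches=8)

print(utilization(naive))      # 4 busy slots out of 16
print(utilization(pipelined))  # 32 busy slots out of 44
```

Even in the pipelined case some slots stay empty while the pipeline fills and drains, which is the residual idle time the text mentions.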

4. Tensor parallelism

Tensor parallelism distributes large tensor computations across multiple GPUs. The tensors are sliced horizontally or vertically and each slice is processed by a separate GPU. Each GPU performs its calculations on its tensor slice and the results are synchronized at the end to reconstruct the final result.

Tensor parallelism is effective for training large models that don’t fit into the memory of a single GPU. It is also faster and more efficient because each GPU can process its tensor slice in parallel, and it can be combined with other parallelism methods. Like other parallelism methods though, tensor parallelism adds communication overhead between GPUs.
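The slicing idea can be illustrated with a matrix multiplication whose weight matrix is split by columns. This is a minimal sketch using nested Python lists instead of GPU tensors; the matrices and the helpers `matmul`, `split_columns`, and `concat_columns` are made up for the example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(matrix, num_parts):
    """Slice a matrix vertically into num_parts equally wide column blocks."""
    width = len(matrix[0]) // num_parts
    return [[row[i * width:(i + 1) * width] for row in matrix] for i in range(num_parts)]


def concat_columns(parts):
    """Glue column blocks back together, row by row."""
    return [sum((part[r] for part in parts), []) for r in range(len(parts[0]))]


x = [[1, 2], [3, 4]]                # input activations, shape 2 x 2
w = [[1, 0, 2, 1], [0, 1, 1, 3]]    # weight matrix, shape 2 x 4

# Each "GPU" holds one column block of w and computes its slice of the output.
weight_shards = split_columns(w, num_parts=2)
partial_outputs = [matmul(x, shard) for shard in weight_shards]

# Synchronization step: gather the slices to rebuild the full result.
y_parallel = concat_columns(partial_outputs)
y_reference = matmul(x, w)
print(y_parallel, y_reference)
```

With a column split the partial outputs are concatenated. If the weight matrix were split by rows instead, each device would need the matching slice of the input, and the partial outputs would be summed rather than concatenated.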

5. Hybrid parallelism

Parallelism methods can be combined to achieve even greater memory savings and more efficiently train models with billions of parameters.
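One way to picture a combination is as a grid of GPUs with one axis per method, so that the total GPU count is the product of the data-, pipeline-, and tensor-parallel sizes. The sketch below assumes a particular rank ordering (tensor-parallel index varying fastest); real frameworks choose their own layout, and `rank_to_coords` is a hypothetical helper.

```python
def rank_to_coords(rank, dp_size, pp_size, tp_size):
    """Map a flat GPU rank to (data, pipeline, tensor) parallel indices.

    Assumed layout: the tensor-parallel index varies fastest, then the
    pipeline stage, then the data-parallel replica.
    """
    world_size = dp_size * pp_size * tp_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} outside world size {world_size}")
    tp_idx = rank % tp_size
    pp_idx = (rank // tp_size) % pp_size
    dp_idx = rank // (tp_size * pp_size)
    return dp_idx, pp_idx, tp_idx


# 8 GPUs arranged as 2 data-parallel replicas x 2 pipeline stages x 2 tensor shards.
coords = [rank_to_coords(r, dp_size=2, pp_size=2, tp_size=2) for r in range(8)]
for rank, (dp_idx, pp_idx, tp_idx) in enumerate(coords):
    print(f"rank {rank}: replica {dp_idx}, stage {pp_idx}, tensor shard {tp_idx}")
```

In this layout each GPU stores only one tensor shard of one pipeline stage, which is where the extra memory savings come from.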

6. About the device_map parameter

  • device_map (str or dict[str, Union[int, str, torch.device]] or int or torch.device, optional): A map that specifies where each submodule should go. It doesn’t need to be refined to each parameter/buffer name; once a given module name is inside, every submodule of it will be sent to the same device. If we only pass the device (e.g., "cpu", "cuda:1", "mps", or a GPU ordinal rank like 1) on which the model will be allocated, the device map will map the entire model to this device. Passing device_map = 0 means put the whole model on GPU 0.

    To have Accelerate compute the most optimized device_map automatically, set device_map="auto". For more information about each option, see the "Designing a device map" guide in the Accelerate documentation.
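The rule that every submodule follows its parent's entry can be sketched as a prefix lookup. This is not the library's actual code: `resolve_device` is a hypothetical helper, and the module names in the example map are made up to resemble a typical transformer layout.

```python
def resolve_device(name, device_map):
    """Return the device of the closest enclosing module listed in device_map.

    Walks from the full dotted name up through its parents, so an entry
    for "model.layers.1" also covers "model.layers.1.self_attn.q_proj.weight".
    Matching whole dotted components keeps "model.layers.1" from
    accidentally matching "model.layers.10".
    """
    parts = name.split(".")
    for end in range(len(parts), 0, -1):
        prefix = ".".join(parts[:end])
        if prefix in device_map:
            return device_map[prefix]
    raise KeyError(f"no device_map entry covers {name!r}")


# Hypothetical map: module names are illustrative, not taken from a real model.
device_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 1,
    "lm_head": "cpu",
}

print(resolve_device("model.layers.1.self_attn.q_proj.weight", device_map))
print(resolve_device("lm_head.weight", device_map))
```

A dict like this would be passed as the device_map argument of from_pretrained. Passing a single device such as 0 or "cpu" instead places the whole model on that device, and "auto" leaves the placement to Accelerate.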

See: https://huggingface.co/docs/transformers/v4.53.3/en/main_classes/model#transformers.PreTrainedModel.from_pretrained

See: https://huggingface.co/docs/transformers/v4.53.3/en/perf_train_gpu_many
