Parallelism methods for distributed training

weifengma-wish · published 2025-08-13 16:54:17

1. Model parallelism

Model parallelism distributes a model across multiple GPUs. There are several ways to split a model, but the typical method assigns consecutive groups of layers to different GPUs. On the forward pass, the first GPU processes a batch of data and passes its output activations to the next group of layers on the next GPU. On the backward pass, gradients flow in the opposite direction, from the GPU holding the final layers back to the GPU holding the first layers.

Model parallelism is a useful strategy for training models that are too large to fit into the memory of a single GPU. However, GPU utilization is low because only one GPU is active at a time while the others wait. Passing results between GPUs also adds communication overhead, which can become a bottleneck.

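Takeaway: the model's layers are assigned in order to different GPUs, which works around the memory limit of a single GPU. However, the GPUs work one after another, like stations on an assembly line, so GPU utilization is low.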
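
The layer split and the "one GPU at a time" behaviour described above can be sketched in plain Python. This is only an illustration under stated assumptions: `assign_layers` and `naive_forward_timeline` are hypothetical helper names, no real GPUs are involved, and every stage is assumed to take one time step.

```python
def assign_layers(num_layers, num_gpus):
    """Split layer indices 0..num_layers-1 into contiguous groups, one per GPU."""
    base, extra = divmod(num_layers, num_gpus)
    stages, start = [], 0
    for gpu in range(num_gpus):
        # The first `extra` GPUs take one additional layer each.
        size = base + (1 if gpu < extra else 0)
        stages.append(list(range(start, start + size)))
        start += size
    return stages


def naive_forward_timeline(num_gpus):
    """One row per time step; exactly one GPU is busy in each step."""
    return [["busy" if gpu == step else "idle" for gpu in range(num_gpus)]
            for step in range(num_gpus)]


stages = assign_layers(num_layers=10, num_gpus=4)
timeline = naive_forward_timeline(num_gpus=4)

# Fraction of (time step, GPU) slots in which a GPU is doing work.
busy_slots = sum(row.count("busy") for row in timeline)
utilization = busy_slots / (len(timeline) * 4)

print(stages)
print(utilization)
```

With four GPUs, each one is busy in only one of the four forward steps, which is the low utilization the text refers to.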

2. Data parallelism

Data parallelism evenly distributes data across multiple GPUs. Each GPU holds a full copy of the model and concurrently processes its own portion of the data. At the end of each step, the gradients from all GPUs are synchronized and combined so that every copy applies the same update.

Data parallelism significantly reduces training time by processing data in parallel, and it scales with the number of GPUs available. However, synchronizing results from each GPU can add overhead.

In PyTorch, there are two implementations of data parallelism, DataParallel (DP) and DistributedDataParallel (DDP).

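Takeaway: the full model (structure and parameters) is copied onto every GPU, and the data is split and processed in parallel. This is fast, but it does not solve the memory problem, and combining the results can add significant overhead.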
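
As a rough illustration of the synchronization step, the sketch below splits one batch across two simulated replicas, computes a gradient on each shard, and averages the results. It assumes a one-parameter linear model with a mean-squared-error loss and equal-sized shards; `mse_grad` and `all_reduce_mean` are hypothetical names, and the averaging function merely stands in for a real all-reduce.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values):
    """Stand-in for an all-reduce that averages one value per replica."""
    return sum(values) / len(values)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5                      # every replica starts from the same weight
num_replicas = 2
shard = len(xs) // num_replicas

# Each replica sees only its own slice of the batch.
local_grads = [
    mse_grad(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
    for r in range(num_replicas)
]

synced_grad = all_reduce_mean(local_grads)   # what every replica applies
full_grad = mse_grad(w, xs, ys)              # single-GPU reference

print(local_grads, synced_grad, full_grad)
```

Because the shards are the same size, the averaged gradient matches the gradient a single GPU would compute on the whole batch, so all replicas stay identical after the update.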

3. Pipeline parallelism

Pipeline parallelism is conceptually very similar to model parallelism, but it’s more efficient because it reduces the amount of idle GPU time. Instead of waiting for each GPU to finish processing a whole batch of data, pipeline parallelism splits the batch into micro-batches. As soon as one micro-batch is finished, it is passed to the next GPU. This way, each GPU can concurrently process part of the data without waiting for the previous GPU to completely finish processing a mini-batch of data.

Pipeline parallelism shares the advantages of model parallelism while improving GPU utilization and reducing idle time. It can be more complex to set up, however, because models may need to be rewritten as a sequence of nn.Sequential modules. It also isn’t possible to eliminate idle time completely, because the last forward pass must still wait for the backward pass to finish.
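
To make the effect of micro-batches concrete, here is a minimal sketch of a forward-only pipeline schedule. It assumes every stage takes one time step per micro-batch and ignores the backward pass and communication cost; `forward_schedule` and `idle_fraction` are hypothetical helpers, not part of any library.

```python
def forward_schedule(num_stages, num_microbatches):
    """Rows are time steps, columns are stages (GPUs).

    Each cell holds the index of the micro-batch being processed,
    or None when that stage is idle in that step.
    """
    total_steps = num_stages + num_microbatches - 1
    schedule = []
    for step in range(total_steps):
        row = []
        for stage in range(num_stages):
            # Stage s can start micro-batch m only after stage s-1 finished it.
            mb = step - stage
            row.append(mb if 0 <= mb < num_microbatches else None)
        schedule.append(row)
    return schedule


def idle_fraction(schedule):
    """Share of (time step, stage) slots in which a stage has nothing to do."""
    slots = sum(len(row) for row in schedule)
    idle = sum(row.count(None) for row in schedule)
    return idle / slots


no_pipelining = idle_fraction(forward_schedule(num_stages=4, num_microbatches=1))
with_pipelining = idle_fraction(forward_schedule(num_stages=4, num_microbatches=8))
print(no_pipelining, with_pipelining)
```

Under these assumptions the idle share is (stages - 1) / (micro-batches + stages - 1): more micro-batches shrink it, but it never reaches zero, which matches the caveat above.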

4. Tensor parallelism

Tensor parallelism distributes large tensor computations across multiple GPUs. The tensors are sliced horizontally (by rows) or vertically (by columns), and each slice is processed by a separate GPU. Each GPU performs its calculations on its tensor slice, and the results are synchronized at the end to reconstruct the final result.

Tensor parallelism is effective for training large models that don’t fit into the memory of a single GPU. It can also speed up computation because each GPU processes its tensor slice in parallel, and it can be combined with other parallelism methods. Like other parallelism methods, though, tensor parallelism adds communication overhead between GPUs.
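
The two slicing directions can be checked on a small matrix product. The sketch below uses plain Python lists in place of GPU tensors; `matmul`, `split_columns`, and `split_rows` are hypothetical helpers written for this example, and each list of partial results stands in for the work done on separate GPUs.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def split_columns(m, parts):
    """Slice a matrix vertically into `parts` equal column blocks."""
    size = len(m[0]) // parts
    return [[row[p * size:(p + 1) * size] for row in m] for p in range(parts)]


def split_rows(m, parts):
    """Slice a matrix horizontally into `parts` equal row blocks."""
    size = len(m) // parts
    return [m[p * size:(p + 1) * size] for p in range(parts)]


x = [[1, 2, 3, 4], [5, 6, 7, 8]]          # activations, 2 x 4
w = [[1, 0], [0, 1], [1, 1], [2, -1]]     # weights, 4 x 2
reference = matmul(x, w)                  # single-device result

# Column split: each "GPU" gets all of x and one column block of w;
# the partial outputs are concatenated (an all-gather in practice).
col_partials = [matmul(x, w_block) for w_block in split_columns(w, 2)]
col_result = [sum((part[i] for part in col_partials), [])
              for i in range(len(x))]

# Row split: each "GPU" gets one row block of w and the matching column
# block of x; the partial outputs are summed (an all-reduce in practice).
row_partials = [matmul(x_block, w_block)
                for x_block, w_block in zip(split_columns(x, 2), split_rows(w, 2))]
row_result = [[sum(part[i][j] for part in row_partials)
               for j in range(len(w[0]))]
              for i in range(len(x))]

print(reference, col_result, row_result)
```

Both splits reproduce the single-device result; they differ in which communication step is needed at the end, which is where the overhead mentioned above comes from.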

5. Hybrid parallelism

Parallelism methods can be combined to achieve even greater memory savings and more efficiently train models with billions of parameters.
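
One way to picture a combination is as a grid: the total number of GPUs is the product of the data-, pipeline-, and tensor-parallel group sizes, and each GPU rank has one coordinate along each axis. The sketch below assumes a particular ordering (tensor-parallel index varies fastest), which is an illustrative choice, not something the source specifies; `rank_to_coords` is a hypothetical helper.

```python
def rank_to_coords(rank, dp_size, pp_size, tp_size):
    """Map a flat GPU rank to (dp, pp, tp) coordinates.

    Assumed layout: tp varies fastest, then pp, then dp.
    """
    world_size = dp_size * pp_size * tp_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} outside world of size {world_size}")
    tp = rank % tp_size
    pp = (rank // tp_size) % pp_size
    dp = rank // (tp_size * pp_size)
    return (dp, pp, tp)


# 8 GPUs arranged as 2 data-parallel replicas x 2 pipeline stages x 2 tensor slices.
grid = [rank_to_coords(r, dp_size=2, pp_size=2, tp_size=2) for r in range(8)]
print(grid)
```

In this layout, ranks that share the same `dp` and `pp` coordinates hold slices of the same layers, while ranks that differ only in `dp` hold identical model pieces and see different data.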

6. A note on one parameter: device_map

device_map (str or dict[str, Union[int, str, torch.device]] or int or torch.device, optional) — A map that specifies where each submodule should go. It doesn’t need to be refined to each parameter/buffer name, once a given module name is inside, every submodule of it will be sent to the same device. If we only pass the device (e.g., "cpu", "cuda:1", "mps", or a GPU ordinal rank like 1) on which the model will be allocated, the device map will map the entire model to this device. Passing device_map = 0 means put the whole model on GPU 0.

To have Accelerate compute the most optimized device_map automatically, set device_map="auto". For more information about each option, see "Designing a device map" in the Accelerate documentation.

See: https://huggingface.co/docs/transformers/v4.53.3/en/main_classes/model#transformers.PreTrainedModel.from_pretrained

See: https://huggingface.co/docs/transformers/v4.53.3/en/perf_train_gpu_many.
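
The rule that a module name in device_map covers all of its submodules can be illustrated with a small lookup function. This is a sketch of the idea only, not the library's implementation: `resolve_device` is a hypothetical helper, the module names in `example_map` are made up for the example, and treating the empty string as "the whole model" is an assumption about how a single-device map is represented.

```python
def resolve_device(device_map, param_name):
    """Return the device for `param_name` using the longest matching module name."""
    best = None
    for module_name in device_map:
        covers = (
            module_name == ""                          # assumed: "" means the whole model
            or param_name == module_name
            or param_name.startswith(module_name + ".")
        )
        if covers and (best is None or len(module_name) > len(best)):
            best = module_name
    if best is None:
        raise KeyError(f"no device_map entry covers {param_name!r}")
    return device_map[best]


# Hypothetical module names; a real model defines its own.
example_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 1,
    "lm_head": "cpu",
}

print(resolve_device(example_map, "model.layers.1.self_attn.q_proj.weight"))
print(resolve_device(example_map, "lm_head.weight"))
print(resolve_device({"": 0}, "model.layers.7.mlp.up_proj.weight"))

# With transformers installed, the automatic option from the text is passed as:
#   AutoModelForCausalLM.from_pretrained(checkpoint_name, device_map="auto")
# where checkpoint_name is whichever model you are loading.
```

Matching on the module name plus a trailing dot keeps an entry for "model.layers.1" from accidentally covering "model.layers.10".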
