Deep learning
AllReduce on large data
Collective communication is known to perform poorly on small payloads: issuing many small transfers is less efficient than packing the small pieces into one large buffer and performing a single collective operation.
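As a hedged illustration, the sketch below (assuming mpi4py and NumPy are available) contrasts issuing one Allreduce per small tensor with fusing the tensors into a single flat buffer and issuing one Allreduce; the tensor sizes and count are made up.

```python
# A minimal sketch of fusing many small allreduces into one large allreduce.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Many small tensors, e.g. per-layer gradients (sizes are illustrative).
small_tensors = [np.random.rand(1024).astype(np.float32) for _ in range(100)]

# Option A: one collective per tensor -- pays the per-call latency 100 times.
for t in small_tensors:
    comm.Allreduce(MPI.IN_PLACE, t, op=MPI.SUM)

# Option B: flatten into a single buffer and issue one collective.
flat = np.concatenate(small_tensors)
comm.Allreduce(MPI.IN_PLACE, flat, op=MPI.SUM)

# Scatter the fused result back into the original tensors.
offset = 0
for t in small_tensors:
    t[:] = flat[offset:offset + t.size]
    offset += t.size
```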
Gradient Bucketing
Gradient bucketing is motivated by the observation that collective communications are more efficient on large tensors.
Instead of launching a dedicated AllReduce immediately when each gradient tensor becomes available, we can achieve higher throughput and lower latency by waiting for a short period of time and bucketing multiple gradients into one AllReduce operation.
The optimal bucket size needs to be measured for each use case.
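A simplified sketch of the idea follows, assuming PyTorch 2.1+ with `torch.distributed` already initialized. This is not DDP's actual implementation (PyTorch's `DistributedDataParallel` exposes the cap through its `bucket_cap_mb` argument), and the cap used here is an arbitrary illustrative value.

```python
# Sketch: append gradients to a bucket as they become ready and launch one
# all_reduce per full bucket instead of one per tensor.
import torch
import torch.distributed as dist

BUCKET_CAP_BYTES = 25 * 1024 * 1024  # assumed cap; tune per use case

class GradientBucketer:
    def __init__(self, model):
        self.pending, self.pending_bytes = [], 0
        for p in model.parameters():
            if p.requires_grad:
                # Fires once p.grad has been accumulated (PyTorch >= 2.1).
                p.register_post_accumulate_grad_hook(self._on_grad_ready)

    def _on_grad_ready(self, param):
        self.pending.append(param.grad)
        self.pending_bytes += param.grad.numel() * param.grad.element_size()
        if self.pending_bytes >= BUCKET_CAP_BYTES:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        # Fuse the bucket into one flat tensor and reduce it once.
        flat = torch.cat([g.view(-1) for g in self.pending])
        dist.all_reduce(flat, op=dist.ReduceOp.SUM)
        flat /= dist.get_world_size()
        # Copy the averaged values back into the original gradients.
        offset = 0
        for g in self.pending:
            g.copy_(flat[offset:offset + g.numel()].view_as(g))
            offset += g.numel()
        self.pending, self.pending_bytes = [], 0
```

After `loss.backward()` returns, one final `flush()` call handles the last, partially filled bucket.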
Overlap computation with communication
As soon as the backward computation for a layer (or a bucket of layers) finishes, its AllReduce can be launched, so the communication overlaps with the backward computation of the remaining layers.
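One way to realize this overlap, sketched under the assumption that `torch.distributed` is initialized, is to launch each bucket's AllReduce asynchronously (`async_op=True`) as soon as it is ready and only wait on the returned handles after `backward()` returns; the helper names below are made up.

```python
# Sketch: launch collectives during backward, wait for them afterwards.
import torch
import torch.distributed as dist

handles = []

def launch_async_allreduce(bucket_flat: torch.Tensor):
    # Called as soon as a bucket of gradients is ready during backward.
    work = dist.all_reduce(bucket_flat, op=dist.ReduceOp.SUM, async_op=True)
    handles.append((work, bucket_flat))

def wait_all(world_size: int):
    # Called after loss.backward() returns: drain outstanding collectives.
    for work, flat in handles:
        work.wait()
        flat /= world_size
    handles.clear()
```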
Prioritize gradients
In order to start the next training iteration early, we can prioritize gradient synchronizations and parameter updates based on the forward order instead of the backward order. This means gradient buckets containing the initial layers should receive higher priorities than those in the final layers.
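A hypothetical sketch of this priority scheme: each bucket keeps the index it has in forward (model-definition) order, and finished collectives are drained lowest index first, so the parameters the next forward pass touches earliest are updated earliest. The callback names (`on_bucket_ready`, `optimizer_step_fn`) are illustrative, not a real API.

```python
# Sketch: drain pending AllReduce handles in forward order, not arrival order.
import heapq

pending = []  # min-heap keyed by forward-order bucket index

def on_bucket_ready(bucket_idx, work, flat_grads, params):
    # Backward produces buckets roughly in reverse layer order; we still key
    # them by their forward-order index (indices are unique per iteration).
    heapq.heappush(pending, (bucket_idx, work, flat_grads, params))

def sync_and_update(optimizer_step_fn):
    while pending:
        idx, work, flat, params = heapq.heappop(pending)
        work.wait()                      # finish this bucket's AllReduce
        optimizer_step_fn(params, flat)  # update the earliest layers first
```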
MPI
Broadcast
Common Way
With 500 cores, experiments show that the broadcast latency grows roughly linearly with the data size.
| data size | 8M | 80M | 200M | 400M | 800M |
|---|---|---|---|---|---|
| time | 0.074 s | 0.21 s | 0.464 s | 0.94 s | 1.96 s |
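The measurement behind a table like this might look like the following sketch (assuming mpi4py and NumPy); the 8M buffer size is just one of the data points above.

```python
# Sketch: time a plain MPI broadcast of a fixed-size buffer from rank 0.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
nbytes = 8 * 1024 * 1024          # e.g. the 8M data point above
buf = np.zeros(nbytes, dtype=np.uint8)
if comm.rank == 0:
    buf[:] = 1

comm.Barrier()                    # align ranks before timing
t0 = MPI.Wtime()
comm.Bcast(buf, root=0)           # broadcast over all processes
comm.Barrier()
if comm.rank == 0:
    print(f"Bcast of {nbytes} bytes took {MPI.Wtime() - t0:.3f} s")
```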
inter-node
With 500 cores, the experiment again shows that the broadcast latency grows roughly linearly with the data size.
| data size | 8M | 80M | 200M | 400M | 800M |
|---|---|---|---|---|---|
| time | 0.02 s | 0.14 s | 0.358 s | 0.761 s | 1.55 s |
inter-node then in-node
There are two ways to broadcast data to all processes:
- Approach 1: broadcast across nodes first, then broadcast within each node
- Approach 2: broadcast directly over all processes
Experiments on the supercomputing platform show that for small data sizes the two approaches take roughly the same time, with no clear difference, but once the data size reaches 400M, Approach 1 becomes more expensive than Approach 2. A possible optimization is to broadcast the data only across nodes and then let the processes within a node obtain it through shared memory.
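A hedged sketch of Approach 1 with mpi4py: `Split_type(MPI.COMM_TYPE_SHARED)` yields one communicator per node, the node leaders broadcast among themselves, and each node then distributes the data internally; the buffer size is illustrative.

```python
# Sketch: two-level broadcast -- inter-node among leaders, then intra-node.
import numpy as np
from mpi4py import MPI

world = MPI.COMM_WORLD

# Processes sharing a node end up in the same communicator.
node_comm = world.Split_type(MPI.COMM_TYPE_SHARED)

# One leader (node-local rank 0) per node joins the inter-node communicator.
is_leader = node_comm.rank == 0
leader_comm = world.Split(0 if is_leader else MPI.UNDEFINED, world.rank)

buf = np.zeros(8 * 1024 * 1024, dtype=np.uint8)
if world.rank == 0:
    buf[:] = 1

# Step 1: broadcast across nodes (leaders only).
if is_leader:
    leader_comm.Bcast(buf, root=0)

# Step 2: broadcast within each node; a shared-memory window
# (MPI.Win.Allocate_shared) could replace this intra-node copy.
node_comm.Bcast(buf, root=0)
```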