Class: Ignis::Collective::Algorithms::DoubleBinaryTree
- Inherits:
-
Object
- Object
- Ignis::Collective::Algorithms::DoubleBinaryTree
- Defined in:
- lib/nvruby/collective/algorithms/double_binary_tree.rb
Overview
Double Binary Tree Algorithm for heterogeneous GPU counts
When GPU count is not a power of 2, a single binary tree has imbalanced load. The double binary tree uses two overlapping trees to balance communication.
Key insight: GPU i participates in both trees but at different levels.
-
Tree 1: Standard binary tree rooted at 0
-
Tree 2: Binary tree rooted at N/2 (or closest power of 2)
This achieves near-optimal latency even for non-power-of-2 GPU counts.
Defined Under Namespace
Classes: TreeNode
Instance Attribute Summary collapse
-
#gpu_ids ⇒ Array<Integer>
readonly
GPU IDs.
-
#n_gpus ⇒ Integer
readonly
Number of GPUs.
-
#transport_selector ⇒ TransportSelector
readonly
Transport selector.
Instance Method Summary collapse
-
#broadcast(buffer:, buffers:, size:, root:, streams:) ⇒ void
Broadcast using double tree (for fault tolerance / load balancing).
-
#build! ⇒ void
Build the double binary tree structure.
-
#initialize(gpu_ids:, transport_selector:) ⇒ DoubleBinaryTree
constructor
A new instance of DoubleBinaryTree.
-
#reduce(buffers:, sizes:, dtype:, op:, root:, streams:) ⇒ void
Reduce using double tree.
Constructor Details
#initialize(gpu_ids:, transport_selector:) ⇒ DoubleBinaryTree
Returns a new instance of DoubleBinaryTree.
31 32 33 34 35 36 37 |
# File 'lib/nvruby/collective/algorithms/double_binary_tree.rb', line 31 def initialize(gpu_ids:, transport_selector:) @gpu_ids = gpu_ids.dup.freeze @n_gpus = gpu_ids.size @transport_selector = transport_selector @tree1 = nil @tree2 = nil end |
Instance Attribute Details
#gpu_ids ⇒ Array<Integer> (readonly)
Returns GPU IDs.
21 22 23 |
# File 'lib/nvruby/collective/algorithms/double_binary_tree.rb', line 21 def gpu_ids @gpu_ids end |
#n_gpus ⇒ Integer (readonly)
Returns Number of GPUs.
24 25 26 |
# File 'lib/nvruby/collective/algorithms/double_binary_tree.rb', line 24 def n_gpus @n_gpus end |
#transport_selector ⇒ TransportSelector (readonly)
Returns Transport selector.
27 28 29 |
# File 'lib/nvruby/collective/algorithms/double_binary_tree.rb', line 27 def transport_selector @transport_selector end |
Instance Method Details
#broadcast(buffer:, buffers:, size:, root:, streams:) ⇒ void
This method returns an undefined value.
Broadcast using double tree (for fault tolerance / load balancing)
58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 |
# File 'lib/nvruby/collective/algorithms/double_binary_tree.rb', line 58 def broadcast(buffer:, buffers:, size:, root:, streams:) return if @n_gpus == 1 build! if @tree1.nil? # For non-power-of-2, use split strategy if power_of_2?(@n_gpus) # Standard single tree broadcast broadcast_tree(@tree1, buffers, size, root, streams) else # Split data: half via tree1, half via tree2 half_size = size / 2 # Tree 1 handles first half broadcast_tree_partial(@tree1, buffers, 0, half_size, root, streams) # Tree 2 handles second half broadcast_tree_partial(@tree2, buffers, half_size, half_size, root, streams) end end |
#build! ⇒ void
This method returns an undefined value.
Build the double binary tree structure
41 42 43 44 45 46 47 48 |
# File 'lib/nvruby/collective/algorithms/double_binary_tree.rb', line 41 def build! @tree1 = build_binary_tree(root_rank: 0, tree_id: 1) # For tree 2, use a different root to balance load # Root is at opposite end of the array tree2_root = @n_gpus / 2 @tree2 = build_binary_tree(root_rank: tree2_root, tree_id: 2) end |
#reduce(buffers:, sizes:, dtype:, op:, root:, streams:) ⇒ void
This method returns an undefined value.
Reduce using double tree
88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 |
# File 'lib/nvruby/collective/algorithms/double_binary_tree.rb', line 88 def reduce(buffers:, sizes:, dtype:, op:, root:, streams:) return if @n_gpus == 1 build! if @tree1.nil? if power_of_2?(@n_gpus) reduce_tree(@tree1, buffers, sizes[0], dtype, op, root, streams) else # Split reduction via both trees, combine at root half_size = sizes[0] / 2 elem_size = dtype_elem_size(dtype) half_count = half_size / elem_size # Allocate temp buffer at root for tree2 partial result temp_buffer = allocate_buffer(@gpu_ids[root], half_size) begin # Tree 1: reduce first half to root reduce_tree_partial(@tree1, buffers, 0, half_size, dtype, op, root, streams) # Tree 2: reduce second half to tree2's root, then send to actual root tree2_root = @n_gpus / 2 reduce_tree_partial(@tree2, buffers, half_size, half_size, dtype, op, tree2_root, streams) # If tree2 root != actual root, transfer if tree2_root != root transport = @transport_selector.select_transport(@gpu_ids[tree2_root], @gpu_ids[root]) if transport.is_a?(Transport::P2PTransport) src = ptr_offset(buffers[tree2_root], half_size) dst = ptr_offset(buffers[root], half_size) transport.copy_async(dst, src, half_size, get_stream_ptr(streams[root])) end end synchronize_all_streams!(streams) ensure free_buffer(temp_buffer, @gpu_ids[root]) end end end |