Class: Ignis::Collective::Algorithms::DoubleBinaryTree

Inherits:
Object
  • Object
show all
Defined in:
lib/nvruby/collective/algorithms/double_binary_tree.rb

Overview

Double Binary Tree Algorithm for heterogeneous GPU counts

When GPU count is not a power of 2, a single binary tree has imbalanced load. The double binary tree uses two overlapping trees to balance communication.

Key insight: GPU i participates in both trees but at different levels.

  • Tree 1: Standard binary tree rooted at 0

  • Tree 2: Binary tree rooted at N/2 (or closest power of 2)

This achieves near-optimal latency even for non-power-of-2 GPU counts.

Defined Under Namespace

Classes: TreeNode

Instance Attribute Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(gpu_ids:, transport_selector:) ⇒ DoubleBinaryTree

Returns a new instance of DoubleBinaryTree.

Parameters:

  • gpu_ids (Array<Integer>)

    GPU device IDs

  • transport_selector (TransportSelector)

    Transport selector



31
32
33
34
35
36
37
# File 'lib/nvruby/collective/algorithms/double_binary_tree.rb', line 31

def initialize(gpu_ids:, transport_selector:)
  @gpu_ids = gpu_ids.dup.freeze
  @n_gpus = gpu_ids.size
  @transport_selector = transport_selector
  @tree1 = nil
  @tree2 = nil
end

Instance Attribute Details

#gpu_idsArray<Integer> (readonly)

Returns GPU IDs.

Returns:

  • (Array<Integer>)

    GPU IDs



21
22
23
# File 'lib/nvruby/collective/algorithms/double_binary_tree.rb', line 21

def gpu_ids
  @gpu_ids
end

#n_gpusInteger (readonly)

Returns Number of GPUs.

Returns:

  • (Integer)

    Number of GPUs



24
25
26
# File 'lib/nvruby/collective/algorithms/double_binary_tree.rb', line 24

def n_gpus
  @n_gpus
end

#transport_selectorTransportSelector (readonly)

Returns Transport selector.

Returns:



27
28
29
# File 'lib/nvruby/collective/algorithms/double_binary_tree.rb', line 27

def transport_selector
  @transport_selector
end

Instance Method Details

#broadcast(buffer:, buffers:, size:, root:, streams:) ⇒ void

This method returns an undefined value.

Broadcast using double tree (for fault tolerance / load balancing)

Parameters:

  • buffer (FFI::Pointer)

    Source buffer on root

  • buffers (Array<FFI::Pointer>)

    Destination buffers on all GPUs

  • size (Integer)

    Buffer size in bytes

  • root (Integer)

    Root rank

  • streams (Array<FFI::Pointer>)

    CUDA streams



58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
# File 'lib/nvruby/collective/algorithms/double_binary_tree.rb', line 58

def broadcast(buffer:, buffers:, size:, root:, streams:)
  return if @n_gpus == 1
  
  build! if @tree1.nil?

  # For non-power-of-2, use split strategy
  if power_of_2?(@n_gpus)
    # Standard single tree broadcast
    broadcast_tree(@tree1, buffers, size, root, streams)
  else
    # Split data: half via tree1, half via tree2
    half_size = size / 2
    
    # Tree 1 handles first half
    broadcast_tree_partial(@tree1, buffers, 0, half_size, root, streams)
    
    # Tree 2 handles second half
    broadcast_tree_partial(@tree2, buffers, half_size, half_size, root, streams)
  end
end

#build!void

This method returns an undefined value.

Build the double binary tree structure



41
42
43
44
45
46
47
48
# File 'lib/nvruby/collective/algorithms/double_binary_tree.rb', line 41

def build!
  @tree1 = build_binary_tree(root_rank: 0, tree_id: 1)
  
  # For tree 2, use a different root to balance load
  # Root is at opposite end of the array
  tree2_root = @n_gpus / 2
  @tree2 = build_binary_tree(root_rank: tree2_root, tree_id: 2)
end

#reduce(buffers:, sizes:, dtype:, op:, root:, streams:) ⇒ void

This method returns an undefined value.

Reduce using double tree

Parameters:

  • buffers (Array<FFI::Pointer>)

    Source buffers

  • sizes (Array<Integer>)

    Buffer sizes

  • dtype (Symbol)

    Data type

  • op (Symbol)

    Reduction operation

  • root (Integer)

    Root rank

  • streams (Array<FFI::Pointer>)

    CUDA streams



88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
# File 'lib/nvruby/collective/algorithms/double_binary_tree.rb', line 88

def reduce(buffers:, sizes:, dtype:, op:, root:, streams:)
  return if @n_gpus == 1
  
  build! if @tree1.nil?

  if power_of_2?(@n_gpus)
    reduce_tree(@tree1, buffers, sizes[0], dtype, op, root, streams)
  else
    # Split reduction via both trees, combine at root
    half_size = sizes[0] / 2
    elem_size = dtype_elem_size(dtype)
    half_count = half_size / elem_size

    # Allocate temp buffer at root for tree2 partial result
    temp_buffer = allocate_buffer(@gpu_ids[root], half_size)

    begin
      # Tree 1: reduce first half to root
      reduce_tree_partial(@tree1, buffers, 0, half_size, dtype, op, root, streams)
      
      # Tree 2: reduce second half to tree2's root, then send to actual root
      tree2_root = @n_gpus / 2
      reduce_tree_partial(@tree2, buffers, half_size, half_size, dtype, op, tree2_root, streams)
      
      # If tree2 root != actual root, transfer
      if tree2_root != root
        transport = @transport_selector.select_transport(@gpu_ids[tree2_root], @gpu_ids[root])
        if transport.is_a?(Transport::P2PTransport)
          src = ptr_offset(buffers[tree2_root], half_size)
          dst = ptr_offset(buffers[root], half_size)
          transport.copy_async(dst, src, half_size, get_stream_ptr(streams[root]))
        end
      end
      
      synchronize_all_streams!(streams)
    ensure
      free_buffer(temp_buffer, @gpu_ids[root])
    end
  end
end