Class: Rubino::Tools::SummarizeFileTool

Inherits:
Base
  • Object
Defined in:
lib/rubino/tools/summarize_file_tool.rb

Overview

Summarizes a large text file WITHOUT pulling its bytes into the main agent context. The file is chunked and map-reduced through the `summarize` auxiliary LLM; only the final summary string returns to the caller. This is the in-house realization of the “summarization subagent” pattern: the raw document (30k lines, say) lives only in the aux calls, so it never bloats the primary prompt. A bloated prompt is what pushes time-to-first-token past the provider’s stream idle-timeout and gets a run cut mid-stream.

Algorithm (LangChain/OpenAI-cookbook map-reduce):

1. MAP: split the file into ~CHUNK_BYTES chunks and summarize each.
2. REDUCE: combine the chunk summaries; if the combined text still
           overflows a chunk, group and re-summarize recursively (capped).
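
The two steps above can be sketched as follows. This is a minimal illustration, not the tool's code: `MapReduceSketch`, `chunk_text`, `reduce_summaries`, and `fake_summarize` are hypothetical names, the "LLM" is a stub that simply truncates its input, and `CHUNK_BYTES` is shrunk to 40 so the recursion is visible on a small input.

```ruby
# Sketch only: every name in this module is hypothetical.
module MapReduceSketch
  CHUNK_BYTES      = 40 # tiny on purpose; the tool documents 24_000
  GROUP_SIZE       = 5
  REDUCE_DEPTH_CAP = 4

  module_function

  # Stand-in for the auxiliary LLM: "summarizes" by truncating.
  def fake_summarize(text, limit = 12)
    text.strip[0, limit]
  end

  # Split on line boundaries into pieces of at most `size` bytes.
  # A single line longer than `size` still becomes its own oversized chunk.
  def chunk_text(text, size = CHUNK_BYTES)
    chunks = []
    current = +""
    text.each_line do |line|
      if !current.empty? && current.bytesize + line.bytesize > size
        chunks << current
        current = +""
      end
      current << line
    end
    chunks << current unless current.empty?
    chunks
  end

  # REDUCE: if the joined summaries fit in one chunk (or the depth cap is
  # reached), summarize them once; otherwise group, summarize, and recurse.
  def reduce_summaries(summaries, depth = 0)
    combined = summaries.join("\n")
    if combined.bytesize <= CHUNK_BYTES || depth >= REDUCE_DEPTH_CAP
      return fake_summarize(combined, 30)
    end

    grouped = summaries.each_slice(GROUP_SIZE).map { |g| fake_summarize(g.join("\n")) }
    reduce_summaries(grouped, depth + 1)
  end

  def summarize(text)
    summaries = chunk_text(text).map { |chunk| fake_summarize(chunk) } # MAP
    reduce_summaries(summaries)
  end
end

text = (1..20).map { |n| "line #{n}: some content here\n" }.join
puts MapReduceSketch.summarize(text)
```

The depth cap guarantees termination even if a summarizer returned text as long as its input; in that case the final pass summarizes whatever has been combined so far.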

Constant Summary

CHUNK_BYTES = 24_000

~6k tokens/chunk at 4 bytes/token. Leaves room for the prompt and the chunk’s own summary inside a modest context window.

MAX_FILE_BYTES = 8_000_000

Refuse absurdly large inputs rather than fan out hundreds of LLM calls.

REDUCE_DEPTH_CAP = 4

Bound the reduce recursion so a pathological fan-in can’t loop forever.

GROUP_SIZE = 5

AUX_TASK = "summarize"

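As a rough worked example of what these constants imply, the arithmetic below estimates the token budget per chunk and the number of map calls for a file right at the size cap. It assumes exact byte-sized chunks (real chunking may differ) and assumes that each reduce round folds `GROUP_SIZE` summaries into one; neither assumption is confirmed by the source shown here.

```ruby
chunk_bytes     = 24_000     # CHUNK_BYTES
max_file_bytes  = 8_000_000  # MAX_FILE_BYTES
group_size      = 5          # GROUP_SIZE
bytes_per_token = 4          # heuristic quoted in the CHUNK_BYTES comment

tokens_per_chunk = chunk_bytes / bytes_per_token
max_map_calls    = (max_file_bytes.to_f / chunk_bytes).ceil

# Assumption: each reduce round folds `group_size` summaries into one.
rounds = 0
remaining = max_map_calls
while remaining > 1
  remaining = (remaining.to_f / group_size).ceil
  rounds += 1
end

puts "tokens per chunk: #{tokens_per_chunk}"
puts "map calls for a file at the cap: #{max_map_calls}"
puts "grouping rounds to reach one summary: #{rounds}"
```

Under these assumptions a file at the cap needs 334 map calls, and grouping by five reaches a single summary in four rounds, which would fit within a `REDUCE_DEPTH_CAP` of 4.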
Instance Attribute Summary

Attributes inherited from Base

#cancel_token, #read_tracker, #stream_chunk

Instance Method Summary

Methods inherited from Base

#cancellation_requested?, #config_key, #emit_chunk, #risky?, #to_tool_definition, workspace_root, workspace_roots

Instance Attribute Details

#aux_client=(value) ⇒ Object

Test seam: inject a stub LLM client. Production lazily builds the real AuxiliaryClient, which routes to the `auxiliary.summarize` config.



# File 'lib/rubino/tools/summarize_file_tool.rb', line 30

def aux_client=(value)
  @aux_client = value
end
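
The seam pattern described above (injected client in tests, lazily built client in production) can be illustrated with a self-contained sketch. The classes `SeamDemo` and `StubClient` and the `complete` method are hypothetical; the real AuxiliaryClient interface is not shown in this page, so the method the tool actually calls on its client may differ.

```ruby
# Sketch only: SeamDemo, StubClient, and #complete are hypothetical names.
class SeamDemo
  # Public writer is the test seam, as with aux_client= above.
  attr_writer :aux_client

  def summarize(text)
    aux_client.complete("Summarize: #{text}")
  end

  private

  # Lazily build the real client only when nothing was injected.
  def aux_client
    @aux_client ||= build_real_client
  end

  def build_real_client
    raise NotImplementedError, "the real client is out of scope for this sketch"
  end
end

# Records every prompt and always answers with a canned reply.
class StubClient
  attr_reader :prompts

  def initialize(reply)
    @reply = reply
    @prompts = []
  end

  def complete(prompt)
    @prompts << prompt
    @reply
  end
end

stub = StubClient.new("short summary")
demo = SeamDemo.new
demo.aux_client = stub
puts demo.summarize("a very long document")
puts stub.prompts.size
```

Because the stub is injected before the first use, the lazy builder is never reached, so a test written this way makes no network calls.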

Instance Method Details

#call(arguments) ⇒ Object



# File 'lib/rubino/tools/summarize_file_tool.rb', line 65

def call(arguments)
  file_path = arguments["file_path"] || arguments[:file_path]
  focus     = (arguments["focus"] || arguments[:focus]).to_s.strip
  focus     = "the key facts, structure, decisions, and any errors" if focus.empty?
  max_words = (arguments["max_words"] || arguments[:max_words] || 500).to_i.clamp(50, 4000)

  return "Error: file_path is required" if file_path.nil? || file_path.to_s.empty?

  expanded = File.expand_path(file_path)
  return "Error: File not found: #{file_path}" unless File.exist?(expanded)
  return "Error: Not a regular file: #{file_path}" unless File.file?(expanded)

  size = File.size(expanded)
  return "#{file_path} is empty — nothing to summarize." if size.zero?
  if binary?(expanded)
    return "Error: #{file_path} looks binary. Read it with the `read_attachment` tool " \
           "(it converts documents to text in-process and summarizes oversized output), " \
           "rather than summarizing raw bytes."
  end
  if size > MAX_FILE_BYTES
    return "Error: #{file_path} is #{size / 1_000_000}MB, over the " \
           "#{MAX_FILE_BYTES / 1_000_000}MB summarize cap. Split it (e.g. with split/sed) " \
           "or grep to the relevant section, then summarize that."
  end

  chunks = chunk_file(expanded)
  return "#{file_path} is empty — nothing to summarize." if chunks.empty?

  summaries = chunks.each_with_index.map do |chunk, i|
    raise Rubino::Interrupted if cancellation_requested?

    emit_chunk("summarizing chunk #{i + 1}/#{chunks.size}…\n")
    map_summarize(chunk, focus)
  end

  summary = reduce(summaries, focus, max_words)
  {
    output: summary,
    metrics: "#{chunks.size} chunk#{"s" if chunks.size != 1} → summary"
  }
rescue Rubino::Interrupted
  raise
rescue StandardError => e
  "Error summarizing #{file_path}: #{e.message}"
end
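
Two details of `#call` are easy to miss: arguments are accepted under string or symbol keys with defaults and clamping applied, and the return value is a Hash on success but a plain String for errors and empty files. The sketch below mirrors the argument handling from the listing and adds a hypothetical `render_result` helper showing how a caller might handle both return shapes; `normalize_summarize_args` is also a hypothetical name.

```ruby
# Mirrors the argument handling at the top of #call (sketch, not the tool itself).
def normalize_summarize_args(arguments)
  focus = (arguments["focus"] || arguments[:focus]).to_s.strip
  focus = "the key facts, structure, decisions, and any errors" if focus.empty?
  max_words = (arguments["max_words"] || arguments[:max_words] || 500).to_i.clamp(50, 4000)
  {
    file_path: arguments["file_path"] || arguments[:file_path],
    focus: focus,
    max_words: max_words
  }
end

# Hypothetical caller-side helper: #call returns a Hash on success and a
# String for error or "empty file" messages.
def render_result(result)
  case result
  when Hash   then "#{result[:output]} (#{result[:metrics]})"
  when String then result
  end
end

p normalize_summarize_args({ "file_path" => "notes.txt", "max_words" => 10 })
p normalize_summarize_args({ file_path: "notes.txt", focus: "  ", max_words: "100000" })
puts render_result({ output: "Three sections cover setup.", metrics: "2 chunks → summary" })
puts render_result("Error: file_path is required")
```

Note that a non-numeric `max_words` such as `"abc"` converts to 0 via `to_i` and is then clamped up to 50 rather than rejected.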

#description ⇒ Object



# File 'lib/rubino/tools/summarize_file_tool.rb', line 36

def description
  "Summarize a large text file WITHOUT loading it into this conversation. " \
    "The file is read and map-reduced by a separate summarization model; only the " \
    "final summary returns here, so the raw bytes never enter context. " \
    "PREFER this over `read` whenever you need the gist of a big document — converted " \
    "PDFs, logs, transcripts, anything more than a few hundred lines. For binary docs " \
    "(PDF/DOCX/XLSX/PPTX) use the `read_attachment` tool, which converts them to text " \
    "in-process and summarizes oversized output automatically. " \
    "Use `focus` to steer what the summary must preserve."
end

#input_schema ⇒ Object



# File 'lib/rubino/tools/summarize_file_tool.rb', line 47

def input_schema
  {
    type: "object",
    properties: {
      file_path: { type: "string", description: "Absolute or relative path to a text file" },
      focus: { type: "string",
               description: "What the summary must preserve, e.g. 'chapter titles and page numbers' or 'API errors with timestamps'. Optional." },
      max_words: { type: "integer",
                   description: "Approximate length of the final summary in words (default 500)." }
    },
    required: %w[file_path]
  }
end
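
For illustration, here is an arguments hash that satisfies this schema and one that does not, checked with a deliberately minimal helper. `schema_problems` is a hypothetical function, not part of Rubino and not a full JSON Schema validator, and the path `log/app.log` is made up.

```ruby
# Trimmed copy of the schema above (descriptions omitted).
schema = {
  type: "object",
  properties: {
    file_path: { type: "string" },
    focus:     { type: "string" },
    max_words: { type: "integer" }
  },
  required: %w[file_path]
}

# Hypothetical helper: checks required keys and basic types only.
def schema_problems(schema, args)
  ruby_types = { "string" => String, "integer" => Integer }
  args = args.transform_keys(&:to_s)
  problems = schema[:required].reject { |key| args.key?(key) }
                              .map { |key| "missing #{key}" }
  args.each do |key, value|
    spec = schema[:properties][key.to_sym]
    next if spec.nil? # properties outside the schema are not checked here

    unless value.is_a?(ruby_types.fetch(spec[:type]))
      problems << "#{key} should be #{spec[:type]}"
    end
  end
  problems
end

good = { "file_path" => "log/app.log", "focus" => "API errors with timestamps", "max_words" => 300 }
bad  = { "focus" => "anything", "max_words" => "300" }

p schema_problems(schema, good)
p schema_problems(schema, bad)
```

The schema is stricter than `#call` itself: `#call` coerces `max_words` with `to_i`, so a string such as `"300"` that fails this check would still be accepted at runtime.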

#name ⇒ Object



# File 'lib/rubino/tools/summarize_file_tool.rb', line 32

def name
  "summarize_file"
end

#risk_level ⇒ Object



# File 'lib/rubino/tools/summarize_file_tool.rb', line 61

def risk_level
  :low
end