Class: LlmDocsBuilder::TextCompressor

Inherits:
Object
Defined in:
lib/llm_docs_builder/text_compressor.rb

Overview

Advanced text compression techniques for reducing token count

Provides text compression methods that go beyond basic markdown cleanup, including stopword removal, duplicate content detection, and sentence deduplication. Because these methods can change meaning or hurt readability, use them carefully.

Examples:

Basic usage

compressor = LlmDocsBuilder::TextCompressor.new
compressed = compressor.compress("Your text here", remove_stopwords: true)

Constant Summary

STOPWORDS =

Common English stopwords that can be safely removed from documentation. Excludes words that might be important in technical contexts (like “not”, “no”).

%w[
  a an the this that these those
  is am are was were be being been
  have has had do does did
  will would shall should may might must can could
  i me my mine we us our ours
  you your yours
  he him his she her hers it its
  they them their theirs
  what which who whom whose where when why how
  all both each few more most other some such
  and or but if then else
  at by for from in into of on to with
  as so than
  very really quite
  there here
  about above across after against along among around because before behind below
  beneath beside besides between beyond during except inside near off since through
  throughout under until up upon within without
].freeze
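
As a minimal sketch of how a list like this is typically consulted, the snippet below performs a case-insensitive membership check. `SAMPLE_STOPWORDS` and `stopword?` are hypothetical names, the list is only a small subset of the constant above, and using a `Set` for lookup is an assumption made here for illustration rather than something the library is known to do.

```ruby
require 'set'

# Hypothetical subset of the STOPWORDS list above, for illustration only.
SAMPLE_STOPWORDS = %w[a an the is are of to with very].to_set.freeze

# Case-insensitive membership check (hypothetical helper, not part of the library).
def stopword?(word)
  SAMPLE_STOPWORDS.include?(word.downcase)
end

puts stopword?('The')  # true
puts stopword?('not')  # false: negations are deliberately absent from the list
```

Keeping "not" and "no" out of the list matters because dropping them would invert the meaning of a sentence such as "do not run this in production".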

Constructor Details

#initialize(options = {}) ⇒ TextCompressor

Initialize a new text compressor

Parameters:

  • options (Hash) (defaults to: {})

    compression options

Options Hash (options):

  • :custom_stopwords (Array<String>)

    additional stopwords to remove (defaults to: [])

  • :preserve_technical (Boolean)

    preserve technical terms and code (defaults to: true)



# File 'lib/llm_docs_builder/text_compressor.rb', line 47

def initialize(options = {})
  @options = {
    preserve_technical: true,
    custom_stopwords: []
  }.merge(options)
end
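
The defaults-then-merge pattern used by the constructor can be sketched on its own as follows. `DEFAULT_OPTIONS` and `build_options` are hypothetical names introduced only for this illustration; the values mirror the hash literal shown above.

```ruby
# Hypothetical stand-in for the defaults set in #initialize.
DEFAULT_OPTIONS = { preserve_technical: true, custom_stopwords: [] }.freeze

# Caller-supplied keys win; missing keys fall back to the defaults.
def build_options(options = {})
  DEFAULT_OPTIONS.merge(options)
end

opts = build_options(custom_stopwords: %w[basically simply])
puts opts[:preserve_technical]        # true
puts opts[:custom_stopwords].inspect  # ["basically", "simply"]
```

`Hash#merge` does not convert key types, so options are expected to use symbol keys; a string key such as `'custom_stopwords'` would be added alongside the symbol key instead of overriding it.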

Instance Attribute Details

#options ⇒ Hash (readonly)

Returns compression options.

Returns:

  • (Hash)

    compression options



# File 'lib/llm_docs_builder/text_compressor.rb', line 40

def options
  @options
end

Instance Method Details

#compress(content, methods = {}) ⇒ String

Compress text using configured methods

Parameters:

  • content (String)

    text to compress

  • methods (Hash) (defaults to: {})

    compression methods to apply

Options Hash (methods):

  • :remove_stopwords (Boolean)

    remove common filler words

  • :remove_duplicates (Boolean)

    remove duplicate paragraphs

Returns:

  • (String)

    compressed text



# File 'lib/llm_docs_builder/text_compressor.rb', line 61

def compress(content, methods = {})
  result = content.dup

  result = remove_stopwords(result) if methods[:remove_stopwords]
  result = remove_duplicate_paragraphs(result) if methods[:remove_duplicates]

  result
end
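
The method above is a flag-gated pipeline: each step runs only when its key is truthy, and stopword removal always runs before duplicate removal. A minimal sketch of that shape, using hypothetical stand-in steps (`STEPS` and `run_pipeline` are illustrative names, and the lambdas are simplified placeholders, not the library's implementations):

```ruby
# Hypothetical, simplified stand-ins for the two compression steps.
STEPS = {
  remove_stopwords: ->(text) { text.gsub(/\b(?:a|an|the)\b\s*/, '') },
  remove_duplicates: ->(text) { text.split(/\n\s*\n/).uniq.join("\n\n") }
}.freeze

# Apply each enabled step in insertion order, leaving the input string untouched.
def run_pipeline(content, methods = {})
  STEPS.reduce(content.dup) do |result, (name, step)|
    methods[name] ? step.call(result) : result
  end
end

text = "the cat sat\n\nthe cat sat"
puts run_pipeline(text, remove_duplicates: true)                          # the cat sat
puts run_pipeline(text, remove_stopwords: true, remove_duplicates: true)  # cat sat
```

With no flags set, the input comes back unchanged, which matches the behavior of calling `compress` with an empty `methods` hash.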

#remove_duplicate_paragraphs(content) ⇒ String

Remove duplicate paragraphs from text

Detects and removes paragraphs that duplicate an earlier paragraph once whitespace and letter case are normalized; the first occurrence is kept. Documentation often repeats concepts across different sections.

Parameters:

  • content (String)

    text to process

Returns:

  • (String)

    text with duplicate paragraphs removed



# File 'lib/llm_docs_builder/text_compressor.rb', line 138

def remove_duplicate_paragraphs(content)
  # Split into paragraphs (separated by blank lines)
  paragraphs = content.split(/\n\s*\n/)

  # Track seen paragraphs with normalized comparison
  seen = {}
  unique_paragraphs = []

  paragraphs.each do |para|
    # Normalize for comparison (remove extra whitespace, lowercase)
    normalized = para.gsub(/\s+/, ' ').strip.downcase

    # Skip empty paragraphs
    next if normalized.empty?

    # Keep only the first occurrence of each normalized paragraph
    unless seen[normalized]
      seen[normalized] = true
      unique_paragraphs << para
    end
  end

  unique_paragraphs.join("\n\n")
end
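
To make the normalization step concrete, here is a standalone sketch of the same technique applied to a small input. `dedupe_paragraphs` is a hypothetical helper written for this example; it uses a `Set` in place of the hash above, which is a stylistic substitution and not the library's code.

```ruby
require 'set'

# Hypothetical standalone version of the paragraph deduplication shown above.
def dedupe_paragraphs(text)
  seen = Set.new
  text.split(/\n\s*\n/).select do |para|
    # Comparison key: collapsed whitespace, trimmed, lowercased.
    key = para.gsub(/\s+/, ' ').strip.downcase
    # Set#add? returns nil when the key is already present.
    !key.empty? && seen.add?(key)
  end.join("\n\n")
end

input = "Install the gem.\n\nINSTALL   the gem.\n\nRun the tests."
puts dedupe_paragraphs(input).inspect
# "Install the gem.\n\nRun the tests."
```

Only the comparison key is normalized. The surviving paragraph keeps its original casing and spacing, and paragraphs that differ by even one word are treated as distinct.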

#remove_stopwords(content) ⇒ String

Deprecated.

This is an aggressive optimization that may affect readability. Use with caution and test results carefully.

Remove stopwords from text while preserving technical content

Removes common English stopwords that don’t carry significant meaning. Fenced code blocks and inline code are preserved, as are capitalized words and any line that is a markdown header, is a list item, or contains a link.

Parameters:

  • content (String)

    text to process

Returns:

  • (String)

    text with stopwords removed



# File 'lib/llm_docs_builder/text_compressor.rb', line 80

def remove_stopwords(content)
  # Preserve code blocks by temporarily replacing them
  code_blocks = {}
  code_counter = 0

  # Extract and preserve fenced code blocks
  content = content.gsub(/^```.*?^```/m) do |match|
    placeholder = "___CODE_BLOCK_#{code_counter}___"
    code_blocks[placeholder] = match
    code_counter += 1
    placeholder
  end

  # Extract and preserve inline code
  content = content.gsub(/`[^`]+`/) do |match|
    placeholder = "___INLINE_CODE_#{code_counter}___"
    code_blocks[placeholder] = match
    code_counter += 1
    placeholder
  end

  # Get combined stopwords list
  stopwords_list = STOPWORDS + options[:custom_stopwords]

  # Process each line
  content = content.split("\n").map do |line|
    # Skip markdown headers, lists, and links
    if line.match?(/^#+\s/) || line.match?(/^[\*\-]\s/) || line.match?(/\[[^\]]+\]\([^)]+\)/)
      line
    else
      # Remove stopwords from regular text
      words = line.split(/\b/)
      words.map do |word|
        # Drop lowercase stopwords; capitalized words are always kept
        if stopwords_list.include?(word.downcase) && !word.match?(/^[A-Z]/)
          ''
        else
          word
        end
      end.join
    end
  end.join("\n")

  # Restore code blocks (block form so backslashes in the code are inserted literally)
  code_blocks.each do |placeholder, original|
    content = content.gsub(placeholder) { original }
  end

  content
end
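
The per-line filter is easier to follow on a concrete sentence. The sketch below isolates that step; `strip_stopwords` is a hypothetical helper and the stopword list is a small sample, not the full constant. It also shows that the surrounding whitespace tokens are left in place, so removed words leave runs of spaces behind. The `squeeze` call at the end is an optional cleanup suggested here, not something the method above does.

```ruby
# Hypothetical helper mirroring the per-line filter in #remove_stopwords.
def strip_stopwords(line, stopwords)
  # split(/\b/) yields alternating word and non-word tokens.
  line.split(/\b/).reject do |token|
    stopwords.include?(token.downcase) && !token.match?(/^[A-Z]/)
  end.join
end

stopwords = %w[a is of the this]
raw = strip_stopwords('This is a test of the parser', stopwords)
puts raw.inspect               # "This   test   parser"
puts raw.squeeze(' ').inspect  # "This test parser"
```

"This" survives even though "this" is in the list, because the capitalization check keeps any word that starts with an uppercase letter, including ordinary sentence-initial words.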