Class: LlmDocsBuilder::TextCompressor

Inherits: Object

Defined in: lib/llm_docs_builder/text_compressor.rb

Overview

Advanced text compression techniques for reducing token count

Provides aggressive text compression methods, including stopword removal, duplicate content detection, and sentence deduplication. These methods go further than basic markdown cleanup and should be used carefully.

Examples:

Basic usage

compressor = LlmDocsBuilder::TextCompressor.new
compressed = compressor.compress("Your text here", remove_stopwords: true)

Constant Summary

STOPWORDS =

Common English stopwords that can be safely removed from documentation. Excludes words that might be important in technical contexts (like “not”, “no”).

%w[
  a an the this that these those
  is am are was were be being been
  have has had do does did
  will would shall should may might must can could
  i me my mine we us our ours
  you your yours
  he him his she her hers it its
  they them their theirs
  what which who whom whose where when why how
  all both each few more most other some such
  and or but if then else
  at by for from in into of on to with
  as so than
  very really quite
  there here
  about above across after against along among around because before behind below
  beneath beside besides between beyond during except inside near off since through
  throughout under until up upon within without
].freeze
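
The reason for keeping negations out of the list can be illustrated with a small sketch. This is not the library's code: the three-word list is an abbreviated, hypothetical subset of the constant above, and `strip_words` is a hypothetical helper that uses plain whitespace tokenization.

```ruby
# Hypothetical helper (not part of the library): drops listed words
# using simple whitespace tokenization.
def strip_words(sentence, stopwords)
  sentence.split.reject { |word| stopwords.include?(word.downcase) }.join(' ')
end

# Abbreviated, hypothetical subset of STOPWORDS (assumption for illustration).
sample_stopwords = %w[do the is]
instruction = 'do not delete the file'

# With "not" left out of the list, the instruction keeps its meaning.
puts strip_words(instruction, sample_stopwords)             # not delete file

# If "not" were treated as a stopword, the meaning would flip.
puts strip_words(instruction, sample_stopwords + %w[not])   # delete file
```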

Instance Attribute Summary

#options ⇒ Hash readonly

Compression options.

Instance Method Summary

#compress(content, methods = {}) ⇒ String

Compress text using configured methods.

#initialize(options = {}) ⇒ TextCompressor (constructor)

Initialize a new text compressor.

#remove_duplicate_paragraphs(content) ⇒ String

Remove duplicate paragraphs from text.

#remove_stopwords(content) ⇒ String (deprecated)

Deprecated. This is an aggressive optimization that may affect readability. Use with caution and test results carefully.

Constructor Details

#initialize(options = {}) ⇒ `TextCompressor`

Initialize a new text compressor

Parameters:

options (Hash) (defaults to: {}) —

compression options

Options Hash (options):

:custom_stopwords (Array<String>) —

additional stopwords to remove
:preserve_technical (Boolean) —

preserve technical terms and code

# File 'lib/llm_docs_builder/text_compressor.rb', line 47

def initialize(options = {})
  @options = {
    preserve_technical: true,
    custom_stopwords: []
  }.merge(options)
end
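
The default handling in the constructor relies on `Hash#merge`. A minimal sketch of that behavior follows; `default_options` and `build_options` are hypothetical names used only for illustration, not methods of the class.

```ruby
# Hypothetical stand-in for the defaults set in #initialize.
def default_options
  { preserve_technical: true, custom_stopwords: [] }
end

# Hash#merge returns a new hash: keys given in overrides win,
# and every other default is kept.
def build_options(overrides = {})
  default_options.merge(overrides)
end

options = build_options(custom_stopwords: %w[basically simply])
puts options[:preserve_technical].inspect # true
puts options[:custom_stopwords].inspect   # ["basically", "simply"]
```

The defaults use symbol keys, so a string key such as `'custom_stopwords'` would be stored next to the default instead of replacing it.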

Instance Attribute Details

#options ⇒ `Hash` (readonly)

Returns compression options.

Returns:

(Hash) —

compression options




# File 'lib/llm_docs_builder/text_compressor.rb', line 40

def options
  @options
end

Instance Method Details

#compress(content, methods = {}) ⇒ `String`

Compress text using configured methods

Parameters:

content (String) —

text to compress
methods (Hash) (defaults to: {}) —

compression methods to apply

Options Hash (methods):

:remove_stopwords (Boolean) —

remove common filler words
:remove_duplicates (Boolean) —

remove duplicate paragraphs

Returns:

(String) —

compressed text

# File 'lib/llm_docs_builder/text_compressor.rb', line 61

def compress(content, methods = {})
  result = content.dup

  result = remove_stopwords(result) if methods[:remove_stopwords]
  result = remove_duplicate_paragraphs(result) if methods[:remove_duplicates]

  result
end
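
Both steps are opt-in, and stopword removal always runs before deduplication. The sketch below mirrors that control flow with deliberately trivial stand-ins; `sketch_steps` and `compress_sketch` are hypothetical names, and the one-word "stopword" filter is an assumption made to keep the example short.

```ruby
# Hypothetical, deliberately trivial stand-ins for the two real steps.
def sketch_steps
  {
    remove_stopwords: ->(text) { text.gsub(/\bthe\s+/, '') },
    remove_duplicates: ->(text) { text.split(/\n\s*\n/).uniq.join("\n\n") }
  }
end

def compress_sketch(content, methods = {})
  result = content.dup
  # Ruby hashes keep insertion order, so stopwords run before deduplication.
  sketch_steps.each do |name, step|
    result = step.call(result) if methods[name]
  end
  result
end

text = "Install the gem.\n\nInstall gem."

# No flags: the text comes back unchanged.
puts compress_sketch(text) == text # true

# Deduplication alone sees two different paragraphs.
puts compress_sketch(text, remove_duplicates: true) == text # true

# With both flags, stopword removal makes the paragraphs identical first.
puts compress_sketch(text, remove_stopwords: true, remove_duplicates: true)
# Install gem.
```

Because of this ordering, enabling both flags can remove paragraphs that differ only in their stopwords.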

#remove_duplicate_paragraphs(content) ⇒ `String`

Remove duplicate paragraphs from text

Detects and removes paragraphs that repeat an earlier paragraph, ignoring differences in whitespace and letter case; the first occurrence is kept. Documentation often repeats concepts across different sections.

Parameters:

content (String) —

text to process

Returns:

(String) —

text with duplicate paragraphs removed

# File 'lib/llm_docs_builder/text_compressor.rb', line 138

def remove_duplicate_paragraphs(content)
  # Split into paragraphs (separated by blank lines)
  paragraphs = content.split(/\n\s*\n/)

  # Track seen paragraphs with normalized comparison
  seen = {}
  unique_paragraphs = []

  paragraphs.each do |para|
    # Normalize for comparison (remove extra whitespace, lowercase)
    normalized = para.gsub(/\s+/, ' ').strip.downcase

    # Skip empty paragraphs
    next if normalized.empty?

    # Keep only the first occurrence of each normalized paragraph
    unless seen[normalized]
      seen[normalized] = true
      unique_paragraphs << para
    end
  end

  unique_paragraphs.join("\n\n")
end
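
The normalization step determines what counts as a duplicate. A compact sketch of the same idea using `Set` is shown here; `dedupe_paragraphs` is a hypothetical name, and this is an illustration under the same normalization rules, not the library's implementation.

```ruby
require 'set'

# Hypothetical helper: keeps the first paragraph for each normalized key.
def dedupe_paragraphs(text)
  seen = Set.new
  kept = text.split(/\n\s*\n/).select do |para|
    key = para.gsub(/\s+/, ' ').strip.downcase
    # Set#add? returns nil when the key is already present.
    !key.empty? && seen.add?(key)
  end
  kept.join("\n\n")
end

doc = "Install the gem.\n\ninstall   the GEM.\n\nRun the build."
puts dedupe_paragraphs(doc)
# Install the gem.
#
# Run the build.
```

Only whitespace and case are normalized, so a paragraph that differs by a single character of punctuation is treated as distinct.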

#remove_stopwords(content) ⇒ `String`

Deprecated.

This is an aggressive optimization that may affect readability. Use with caution and test results carefully.

Remove stopwords from text while preserving technical content

Removes common English stopwords that don’t carry significant meaning. Fenced code blocks, inline code, capitalized words, and lines that are markdown headers, list items, or contain links are left unchanged.

Parameters:

content (String) —

text to process

Returns:

(String) —

text with stopwords removed

# File 'lib/llm_docs_builder/text_compressor.rb', line 80

def remove_stopwords(content)
  # Preserve code blocks by temporarily replacing them
  code_blocks = {}
  code_counter = 0

  # Extract and preserve fenced code blocks
  content = content.gsub(/^```.*?^```/m) do |match|
    placeholder = "___CODE_BLOCK_#{code_counter}___"
    code_blocks[placeholder] = match
    code_counter += 1
    placeholder
  end

  # Extract and preserve inline code
  content = content.gsub(/`[^`]+`/) do |match|
    placeholder = "___INLINE_CODE_#{code_counter}___"
    code_blocks[placeholder] = match
    code_counter += 1
    placeholder
  end

  # Get combined stopwords list
  stopwords_list = STOPWORDS + options[:custom_stopwords]

  # Process each line
  content = content.split("\n").map do |line|
    # Leave markdown headers, list items, and lines containing links untouched
    if line.match?(/^#+\s/) || line.match?(/^[\*\-]\s/) || line.match?(/\[[^\]]+\]\([^)]+\)/)
      line
    else
      # Remove stopwords from regular text
      words = line.split(/\b/)
      words.map do |word|
        # Drop lowercase stopwords; keep capitalized words (sentence starts, proper nouns)
        if stopwords_list.include?(word.downcase) && !word.match?(/^[A-Z]/)
          ''
        else
          word
        end
      end.join
    end
  end.join("\n")

  # Restore code blocks (block form, so backslashes in code are not read as backreferences)
  code_blocks.each do |placeholder, original|
    content = content.gsub(placeholder) { original }
  end

  content
end
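
The mask, filter, and restore technique used above can be reduced to a short sketch for inline code only. The names `sample_stopwords` and `strip_outside_inline_code` are hypothetical, and the four-word list is an abbreviated subset chosen for illustration.

```ruby
# Abbreviated, hypothetical stopword subset (assumption for illustration).
def sample_stopwords
  %w[the a is of]
end

def strip_outside_inline_code(text)
  saved = {}

  # 1. Mask inline code so the word filter never sees it.
  masked = text.gsub(/`[^`]+`/) do |code|
    key = "___INLINE_CODE_#{saved.size}___"
    saved[key] = code
    key
  end

  # 2. Drop lowercase stopwords; whitespace and punctuation tokens pass through.
  filtered = masked.split(/\b/).reject do |token|
    sample_stopwords.include?(token.downcase) && !token.match?(/\A[A-Z]/)
  end.join

  # 3. Restore the masked code. The block form inserts the text literally.
  saved.each { |key, code| filtered = filtered.gsub(key) { code } }
  filtered
end

puts strip_outside_inline_code('Call `the_method` on the object')
# Call `the_method` on  object
```

In this sketch a removed word leaves its surrounding spaces behind, so the output can contain doubled spaces, and a capitalized stopword such as "The" at the start of a sentence is kept.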
