Class: LlmDocsBuilder::TextCompressor
- Inherits: Object
  - Object
    - LlmDocsBuilder::TextCompressor
- Defined in: lib/llm_docs_builder/text_compressor.rb
Overview
Advanced text compression techniques for reducing token count
Provides text compression methods that go beyond basic markdown cleanup, including stopword removal, duplicate content detection, and sentence deduplication. Because these methods can change how the text reads, use them carefully.
Constant Summary
- STOPWORDS =
Common English stopwords that can be safely removed from documentation. Excludes words that might be important in technical contexts (like “not”, “no”).
    %w[
      a an the this that these those
      is am are was were be being been
      have has had do does did
      will would shall should may might must can could
      i me my mine we us our ours you your yours
      he him his she her hers it its they them their theirs
      what which who whom whose where when why how
      all both each few more most other some such
      and or but if then else
      at by for from in into of on to with as so than
      very really quite there here
      about above across after against along among around
      because before behind below beneath beside besides between beyond
      during except inside near off since through throughout
      under until up upon within without
    ].freeze
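How the list is applied can be sketched in a few lines. The following is a minimal illustration, assuming the matching rules visible in the `remove_stopwords` source (split on word boundaries, compare in lowercase, keep capitalized words). `strip_stopwords` and its short default list are hypothetical names used only for this sketch, not part of the library.

```ruby
# Hypothetical helper; the default list is a small illustrative subset,
# not the full STOPWORDS constant.
def strip_stopwords(line, stopwords = %w[a an the is of to in])
  # Splitting on \b keeps words, spaces, and punctuation as separate tokens.
  line.split(/\b/).map do |token|
    # A token is dropped only if it is a stopword and not capitalized.
    removable = stopwords.include?(token.downcase) && !token.match?(/^[A-Z]/)
    removable ? '' : token
  end.join
end

puts strip_stopwords('The cache is a map of keys to values.')
```

Two consequences follow from these rules: a capitalized stopword such as a sentence-initial "The" is kept, and each removed word leaves its surrounding spaces behind, so runs of spaces can remain in the output.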
Instance Attribute Summary
- #options ⇒ Hash (readonly)
  Compression options.
Instance Method Summary
- #compress(content, methods = {}) ⇒ String
  Compress text using configured methods.
- #initialize(options = {}) ⇒ TextCompressor (constructor)
  Initialize a new text compressor.
- #remove_duplicate_paragraphs(content) ⇒ String
  Remove duplicate paragraphs from text.
- #remove_stopwords(content) ⇒ String (deprecated)
  Deprecated. This is an aggressive optimization that may affect readability. Use with caution and test results carefully.
Constructor Details
#initialize(options = {}) ⇒ TextCompressor
Initialize a new text compressor
    # File 'lib/llm_docs_builder/text_compressor.rb', line 47
    def initialize(options = {})
      @options = {
        preserve_technical: true,
        custom_stopwords: []
      }.merge(options)
    end
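The constructor uses a defaults-then-merge pattern: keys supplied by the caller override the defaults, and keys left out keep their default values. A minimal sketch of that behavior with plain hashes follows; `build_options` is a hypothetical stand-in for the constructor body, not a library method.

```ruby
# Hypothetical stand-in for the option handling in #initialize.
def build_options(options = {})
  defaults = { preserve_technical: true, custom_stopwords: [] }
  # Hash#merge returns a new hash; keys present in `options` win.
  defaults.merge(options)
end

p build_options
p build_options(custom_stopwords: %w[basically simply])
```

Since `remove_stopwords` compares each word in lowercase against the list, custom stopwords should presumably be given in lowercase; an entry such as `"Basically"` would never match.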
Instance Attribute Details
#options ⇒ Hash (readonly)
Returns compression options.
    # File 'lib/llm_docs_builder/text_compressor.rb', line 40
    def options
      @options
    end
Instance Method Details
#compress(content, methods = {}) ⇒ String
Compress text using configured methods
    # File 'lib/llm_docs_builder/text_compressor.rb', line 61
    def compress(content, methods = {})
      result = content.dup

      result = remove_stopwords(result) if methods[:remove_stopwords]
      result = remove_duplicate_paragraphs(result) if methods[:remove_duplicates]

      result
    end
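The `methods` hash acts as a set of flags: a step runs only when its key is truthy, stopword removal runs before duplicate removal, and unrecognized keys are silently ignored. The sketch below mirrors that control flow with two simplified stand-in steps. `compress_sketch`, `drop_the`, and `dedupe` are hypothetical names, and the stand-ins are far cruder than the real methods.

```ruby
# Flag-gated pipeline: each step runs only when its key is truthy in `methods`.
def compress_sketch(content, methods = {})
  # Simplified stand-in steps (assumptions made for this sketch only).
  drop_the = ->(text) { text.gsub(/\bthe\s+/i, '') }
  dedupe   = ->(text) { text.split(/\n\s*\n/).uniq.join("\n\n") }

  result = content.dup
  result = drop_the.call(result) if methods[:remove_stopwords]
  result = dedupe.call(result) if methods[:remove_duplicates]
  result
end

text = "Install the gem.\n\nInstall the gem."
puts compress_sketch(text)                           # no flags: text is unchanged
puts compress_sketch(text, remove_duplicates: true)  # second paragraph dropped
puts compress_sketch(text, remove_stopwords: true, remove_duplicates: true)
```

Note that a misspelled key such as `remove_duplicate: true` does nothing and raises no error, so it is worth checking the output when enabling a step.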
#remove_duplicate_paragraphs(content) ⇒ String
Remove duplicate paragraphs from text
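Detects and removes repeated paragraphs, which are common in documentation that restates concepts across sections. Paragraphs are compared after collapsing whitespace and lowercasing, so near-duplicates are detected only when they differ in spacing, line wrapping, or letter case. The first occurrence is kept in its original form.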
    # File 'lib/llm_docs_builder/text_compressor.rb', line 138
    def remove_duplicate_paragraphs(content)
      # Split into paragraphs (separated by blank lines)
      paragraphs = content.split(/\n\s*\n/)

      # Track seen paragraphs with normalized comparison
      seen = {}
      unique_paragraphs = []

      paragraphs.each do |para|
        # Normalize for comparison (remove extra whitespace, lowercase)
        normalized = para.gsub(/\s+/, ' ').strip.downcase

        # Skip empty paragraphs
        next if normalized.empty?

        # Check if we've seen this or similar paragraph
        unless seen[normalized]
          seen[normalized] = true
          unique_paragraphs << para
        end
      end

      unique_paragraphs.join("\n\n")
    end
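To make the comparison rule concrete, here is a small standalone sketch of the same normalize-then-deduplicate technique. `normalize_paragraph` and `dedupe_paragraphs` are hypothetical helper names introduced for this example; the logic follows the listing above but is not the library's code.

```ruby
# Collapse whitespace and lowercase so formatting-only differences compare equal.
def normalize_paragraph(para)
  para.gsub(/\s+/, ' ').strip.downcase
end

# Keep the first occurrence of each paragraph, judged by its normalized form.
def dedupe_paragraphs(content)
  seen = {}
  content.split(/\n\s*\n/).select do |para|
    key = normalize_paragraph(para)
    next false if key.empty? || seen[key]

    seen[key] = true
  end.join("\n\n")
end

doc = "Run bundle install.\n\nrun   bundle\ninstall.\n\nRun bundle install now."
puts dedupe_paragraphs(doc)
```

The second paragraph is dropped because it differs from the first only in case and whitespace. The third is kept, since any difference in wording makes the normalized strings unequal.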
#remove_stopwords(content) ⇒ String
Deprecated. This is an aggressive optimization that may affect readability. Use with caution and test the results carefully.
Remove stopwords from text while preserving technical content
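Removes common English stopwords that don’t carry significant meaning. Fenced code blocks and inline code are set aside and restored unchanged. Lines that are markdown headers or list items, or that contain a link, are left untouched, and capitalized words are never removed.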
    # File 'lib/llm_docs_builder/text_compressor.rb', line 80
    def remove_stopwords(content)
      # Preserve code blocks by temporarily replacing them
      code_blocks = {}
      code_counter = 0

      # Extract and preserve fenced code blocks
      content = content.gsub(/^```.*?^```/m) do |match|
        placeholder = "___CODE_BLOCK_#{code_counter}___"
        code_blocks[placeholder] = match
        code_counter += 1
        placeholder
      end

      # Extract and preserve inline code
      content = content.gsub(/`[^`]+`/) do |match|
        placeholder = "___INLINE_CODE_#{code_counter}___"
        code_blocks[placeholder] = match
        code_counter += 1
        placeholder
      end

      # Get combined stopwords list
      stopwords_list = STOPWORDS + options[:custom_stopwords]

      # Process each line
      content = content.split("\n").map do |line|
        # Skip markdown headers, lists, and links
        if line.match?(/^#+\s/) || line.match?(/^[\*\-]\s/) || line.match?(/\[[^\]]+\]\([^)]+\)/)
          line
        else
          # Remove stopwords from regular text
          words = line.split(/\b/)
          words.map do |word|
            # Preserve the word if it's not a stopword or if we should preserve technical terms
            if stopwords_list.include?(word.downcase) && !word.match?(/^[A-Z]/) # Don't remove capitalized words
              ''
            else
              word
            end
          end.join
        end
      end.join("\n")

      # Restore code blocks
      code_blocks.each do |placeholder, original|
        content = content.gsub(placeholder, original)
      end

      content
    end
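The code-preservation step above is an instance of a general protect, transform, restore technique. The following standalone sketch shows that technique in isolation, assuming a single placeholder format. `with_code_protected` and the `___CODE_n___` placeholder are hypothetical and differ from the names used in the library source.

```ruby
# Replace code spans with placeholders, transform the rest, then restore them.
def with_code_protected(content)
  saved = {}
  counter = 0
  stash = lambda do |match|
    key = "___CODE_#{counter}___"
    counter += 1
    saved[key] = match
    key
  end

  # Fenced blocks first, so backticks inside them are not treated as inline code.
  protected_text = content.gsub(/^```.*?^```/m, &stash).gsub(/`[^`]+`/, &stash)

  transformed = yield(protected_text)

  # The block form of gsub inserts the saved text literally. A string
  # replacement would interpret backslash sequences inside the saved code.
  transformed.gsub(/___CODE_\d+___/) { |key| saved.fetch(key, key) }
end

text = 'Run the `echo the end` command.'
puts with_code_protected(text) { |t| t.gsub(/\bthe\s+/i, '') }
```

In this sketch the restore step uses the block form of `String#gsub` on purpose: with a string replacement, sequences such as `\\` or `\1` in the stored code would be treated as replacement escapes rather than copied verbatim.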