Module: Mindee::PDF::PDFTools
- Defined in: lib/mindee/pdf/pdf_tools.rb
Overview
Collection of miscellaneous PDF operations, as well as some monkey-patching for Origami.
Class Method Summary
- .add_content_to_page(page, xobject_name, width, height) ⇒ Object
  Adds a content stream to the specified PDF page to display an image XObject.
- .create_xobject(image) ⇒ Origami::Graphics::ImageXObject
  Creates an image XObject from the provided image.
- .determine_colorspace(image) ⇒ Symbol
  Determines the colorspace for an image based on its metadata.
- .determine_filter(image) ⇒ Symbol
  Determines the appropriate filter for an image based on its properties.
- .pdf_header?(io_stream, maximum_offset: 500) ⇒ bool
  Checks whether a stream contains a PDF header near the beginning.
- .process_image_xobject(image_data, image_quality, width, height) ⇒ Origami::Graphics::ImageXObject
  Processes an image into an image XObject for PDF embedding.
- .set_page_dimensions(page, width, height) ⇒ Object
  Sets the dimensions for the specified PDF page.
- .set_xobject_properties(xobject, image) ⇒ Object
  Sets properties on the provided image XObject based on image metadata.
- .source_text?(pdf_data) ⇒ bool
  Checks whether the file has source_text.
- .stream_has_text?(stream) ⇒ bool
  Checks a PDF's stream content for text operators. See https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf, pages 243-251.
Instance Method Summary
- #to_io_stream(params = {}) ⇒ StringIO
  Converts the current PDF document into a binary-encoded StringIO stream.
Class Method Details
.add_content_to_page(page, xobject_name, width, height) ⇒ Object
Adds a content stream to the specified PDF page to display an image XObject.
# File 'lib/mindee/pdf/pdf_tools.rb', line 159
def self.add_content_to_page(page, xobject_name, width, height)
  content = "q\n#{width} 0 0 #{height} 0 0 cm\n/#{xobject_name} Do\nQ\n"
  content_stream = Origami::Stream.new(content)
  page.Contents = content_stream
end
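To make the generated operators concrete, here is a minimal sketch that builds the same string without Origami. `image_content_stream` is a hypothetical helper name, not part of the library. In PDF content-stream syntax, `q`/`Q` save and restore the graphics state, `cm` scales the unit square to `width` by `height`, and `Do` paints the named XObject.

```ruby
# Hypothetical helper (not part of Mindee): returns the operator string that
# add_content_to_page wraps in an Origami::Stream.
def image_content_stream(xobject_name, width, height)
  "q\n#{width} 0 0 #{height} 0 0 cm\n/#{xobject_name} Do\nQ\n"
end

# 'Im0' and the 612 x 792 size are illustrative values only.
puts image_content_stream('Im0', 612, 792)
```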
.create_xobject(image) ⇒ Origami::Graphics::ImageXObject
Creates an image XObject from the provided image.
Converts the given image to a binary stream using Mindee's image utilities, then creates an Origami::Graphics::ImageXObject with a JPEG filter.
# File 'lib/mindee/pdf/pdf_tools.rb', line 110
def self.create_xobject(image)
  image_io = Mindee::Image::ImageUtils.image_to_stringio(image)
  Origami::Graphics::ImageXObject.from_image_file(image_io, 'jpg')
end
.determine_colorspace(image) ⇒ Symbol
Determines the colorspace for an image based on its metadata.
# File 'lib/mindee/pdf/pdf_tools.rb', line 144
def self.determine_colorspace(image)
  colorspace = image.data['colorspace']
  case colorspace
  when 'CMYK' then :DeviceCMYK
  when 'Gray', 'PseudoClass Gray' then :DeviceGray
  else :DeviceRGB
  end
end
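The mapping can be tried without an image library. The sketch below restates it against a stand-in object; `ColorStub` and `colorspace_for` are hypothetical names, and the only assumption about the real image is that `data` returns a Hash with a 'colorspace' key, as the source above uses it.

```ruby
# Stand-in for an image object; only the `data` hash matters here.
ColorStub = Struct.new(:data)

# Same mapping as determine_colorspace, restated so the sketch runs on its own.
def colorspace_for(image)
  case image.data['colorspace']
  when 'CMYK' then :DeviceCMYK
  when 'Gray', 'PseudoClass Gray' then :DeviceGray
  else :DeviceRGB
  end
end

p colorspace_for(ColorStub.new({ 'colorspace' => 'CMYK' })) # :DeviceCMYK
p colorspace_for(ColorStub.new({ 'colorspace' => 'Gray' })) # :DeviceGray
p colorspace_for(ColorStub.new({}))                         # :DeviceRGB
```

Any value other than the three listed strings, including a missing key, falls through to :DeviceRGB.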
.determine_filter(image) ⇒ Symbol
Determines the appropriate filter for an image based on its properties.
# File 'lib/mindee/pdf/pdf_tools.rb', line 131
def self.determine_filter(image)
  filter = image.data['properties']['filter']
  case filter
  when %r{Zip}i then :FlateDecode
  when %r{LZW}i then :LZWDecode
  else :DCTDecode
  end
end
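A small sketch of how the case-insensitive matching behaves, using a stand-in object. `FilterStub` and `filter_for` are hypothetical names; the nested 'properties' hash mirrors the lookup in the source above, and the sample filter strings are made up for illustration.

```ruby
# Stand-in for an image object exposing a nested metadata hash.
FilterStub = Struct.new(:data)

# Same selection logic as determine_filter, restated to run on its own.
def filter_for(image)
  case image.data['properties']['filter']
  when /Zip/i then :FlateDecode
  when /LZW/i then :LZWDecode
  else :DCTDecode
  end
end

p filter_for(FilterStub.new({ 'properties' => { 'filter' => 'zip' } })) # :FlateDecode
p filter_for(FilterStub.new({ 'properties' => { 'filter' => 'LZW' } })) # :LZWDecode
p filter_for(FilterStub.new({ 'properties' => {} }))                    # :DCTDecode
```

Note that the lookup assumes a 'properties' key is present: if `data['properties']` is nil, indexing it raises NoMethodError before the `case` is reached.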
.pdf_header?(io_stream, maximum_offset: 500) ⇒ bool
Checks whether a stream contains a PDF header near the beginning.
# File 'lib/mindee/pdf/pdf_tools.rb', line 61
def self.pdf_header?(io_stream, maximum_offset: 500)
  initial_pos = nil
  initial_pos = io_stream.pos if io_stream.respond_to?(:pos)
  io_stream.seek(0)
  io_stream.gets('%PDF-')
  !(io_stream.eof? || io_stream.pos > maximum_offset)
rescue TypeError, IOError, SystemCallError
  false
ensure
  io_stream.seek(initial_pos) if !initial_pos.nil? && io_stream.respond_to?(:seek)
end
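The core of the check can be exercised with plain StringIO objects. The sketch below restates only the detection step; `header_within?` is a hypothetical name, and unlike the library method it does not restore the stream position or rescue I/O errors.

```ruby
require 'stringio'

# Illustrative restatement of the detection step only.
def header_within?(io, maximum_offset: 500)
  io.seek(0)
  io.gets('%PDF-') # reads up to and including the first "%PDF-", or to EOF
  !(io.eof? || io.pos > maximum_offset)
end

pdf     = StringIO.new("%PDF-1.7\n1 0 obj\n")
late    = StringIO.new(('x' * 600) + "%PDF-1.7\n")
not_pdf = StringIO.new('plain text')

p header_within?(pdf)     # true:  marker ends at byte 5
p header_within?(late)    # false: marker ends at byte 605, past the limit
p header_within?(not_pdf) # false: no marker, so the read hits EOF
```

Because hitting EOF counts as "not found", a stream that ends immediately after the `%PDF-` marker is also reported as false.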
.process_image_xobject(image_data, image_quality, width, height) ⇒ Origami::Graphics::ImageXObject
Processes an image into an image XObject for PDF embedding.
# File 'lib/mindee/pdf/pdf_tools.rb', line 182
def self.process_image_xobject(image_data, image_quality, width, height)
  compressed_data = Image::ImageCompressor.compress_image(
    image_data,
    quality: image_quality,
    max_width: width,
    max_height: height
  )

  new_image = Origami::Graphics::ImageXObject.new
  new_image.data = compressed_data
  new_image.Width = width
  new_image.Height = height
  new_image.ColorSpace = :DeviceRGB
  new_image.BitsPerComponent = 8

  new_image
end
.set_page_dimensions(page, width, height) ⇒ Object
Sets the dimensions for the specified PDF page.
# File 'lib/mindee/pdf/pdf_tools.rb', line 170
def self.set_page_dimensions(page, width, height)
  page[:MediaBox] = [0, 0, width, height]
  page[:CropBox] = [0, 0, width, height]
end
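As a rough illustration, the same two assignments can be applied to a plain Hash standing in for the page object. `apply_dimensions` is a hypothetical name, and the Hash is only a stand-in for a real Origami page. Both boxes get the same rectangle, anchored at the origin.

```ruby
# Illustrative stand-in: a Hash plays the role of the page dictionary.
def apply_dimensions(page, width, height)
  page[:MediaBox] = [0, 0, width, height]
  page[:CropBox] = [0, 0, width, height]
  page
end

# 612 x 792 is US Letter in default PDF units (1/72 inch).
page = apply_dimensions({}, 612, 792)
p page[:MediaBox]
p page[:CropBox]
```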
.set_xobject_properties(xobject, image) ⇒ Object
Sets properties on the provided image XObject based on image metadata.
# File 'lib/mindee/pdf/pdf_tools.rb', line 119
def self.set_xobject_properties(xobject, image)
  xobject.dictionary[:BitsPerComponent] = 8
  xobject.dictionary[:Filter] = determine_filter(image)
  xobject.dictionary[:Width] = image[:width]
  xobject.dictionary[:Height] = image[:height]
  xobject.dictionary[:ColorSpace] = determine_colorspace(image)
end
.source_text?(pdf_data) ⇒ bool
Checks whether the file has source_text. Returns false if the file isn't a PDF.
# File 'lib/mindee/pdf/pdf_tools.rb', line 76
def self.source_text?(pdf_data)
  return false unless pdf_header?(pdf_data)

  begin
    pdf_data.rewind
    pdf = Origami::PDF.read(pdf_data)

    pdf.each_page do |page|
      next unless page[:Contents]

      contents = page[:Contents].solve
      contents = [contents] unless contents.is_a?(Origami::Array)

      contents.each do |stream_ref|
        stream = stream_ref.solve
        return true if stream_has_text?(stream)
      end
    end

    false
  end

  false
rescue Origami::InvalidPDFError
  false
end
.stream_has_text?(stream) ⇒ bool
Checks a PDF's stream content for text operators. See https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf, pages 243-251.
# File 'lib/mindee/pdf/pdf_tools.rb', line 49
def self.stream_has_text?(stream)
  data = stream.data
  return false if data.nil? || data.empty?

  text_operators = ['Tc', 'Tw', 'Th', 'TL', 'Tf', 'Tk', 'Tr', 'Tm', 'T*', 'Tj', 'TJ', "'", '"']
  text_operators.any? { |op| data.include?(op) }
end
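The check is a plain substring search over the decoded stream data, which the following sketch illustrates with stand-in objects. `StreamStub` and `text_operators?` are hypothetical names and the sample streams are hand-written; the last case shows that a lone quote character is enough to match, so a positive result is best read as "may contain text".

```ruby
# Stand-in for a stream object; only `data` is needed.
StreamStub = Struct.new(:data)

TEXT_OPERATORS = ['Tc', 'Tw', 'Th', 'TL', 'Tf', 'Tk', 'Tr', 'Tm', 'T*', 'Tj', 'TJ', "'", '"'].freeze

# Same substring test as stream_has_text?, restated to run on its own.
def text_operators?(stream)
  data = stream.data
  return false if data.nil? || data.empty?

  TEXT_OPERATORS.any? { |op| data.include?(op) }
end

text_page  = StreamStub.new("BT\n/F1 12 Tf\n72 720 Td\n(Hello) Tj\nET\n")
image_page = StreamStub.new("q\n612 0 0 792 0 0 cm\n/Im0 Do\nQ\n")
commented  = StreamStub.new("% scanner's output\nq\nQ\n")

p text_operators?(text_page)            # true:  contains Tf and Tj
p text_operators?(image_page)           # false: only q, cm, Do, Q
p text_operators?(StreamStub.new(nil))  # false: no data
p text_operators?(commented)            # true:  the apostrophe matches "'"
```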
Instance Method Details
#to_io_stream(params = {}) ⇒ StringIO
Converts the current PDF document into a binary-encoded StringIO stream.
# File 'lib/mindee/pdf/pdf_tools.rb', line 19
def to_io_stream(params = {})
  options = {
    delinearize: true,
    recompile: true,
    decrypt: false,
    noindent: nil,
  }
  options.update(params)

  if frozen? # incompatible flags with frozen doc (signed)
    options[:recompile] = nil
    options[:rebuild_xrefs] = nil
    options[:noindent] = nil
    options[:obfuscate] = false
  end
  load_all_objects unless @loaded

  intents_as_pdfa1 if options[:intent].to_s =~ %r{pdf[/-]?A1?/i}
  delinearize! if options[:delinearize] && linearized?
  compile(options) if options[:recompile]

  io_stream = StringIO.new(output(options))
  io_stream.set_encoding Encoding::BINARY
  io_stream
end
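The option handling can be sketched in isolation from Origami. Everything named here is illustrative: `effective_options` is a hypothetical helper, the local name `options` is an assumption, and the `frozen:` keyword stands in for the document's own `frozen?` check. The sketch shows that caller-supplied params override the defaults, that a frozen (signed) document then forces several flags off, and that the resulting StringIO is tagged as binary.

```ruby
require 'stringio'

# Illustrative only: mirrors the default/override/frozen sequence.
def effective_options(params = {}, frozen: false)
  options = { delinearize: true, recompile: true, decrypt: false, noindent: nil }
  options.update(params)
  if frozen # flags incompatible with a frozen (signed) document
    options[:recompile] = nil
    options[:rebuild_xrefs] = nil
    options[:noindent] = nil
    options[:obfuscate] = false
  end
  options
end

p effective_options({ recompile: false })[:recompile]              # false
p effective_options({ recompile: true }, frozen: true)[:recompile] # nil

# The last step: wrap the serialized bytes and mark the stream as binary.
io = StringIO.new(+"%PDF-1.7\n")
io.set_encoding(Encoding::BINARY)
p io.external_encoding == Encoding::BINARY
```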