Module: ParseKit

Defined in:
lib/parsekit.rb,
lib/parsekit/error.rb,
lib/parsekit/parser.rb,
lib/parsekit/version.rb

Overview

ParseKit is a Ruby document parsing toolkit with PDF and OCR support.

Defined Under Namespace

Classes: Parser

Constant Summary

Supported file formats and their extensions:

SUPPORTED_FORMATS = {
  pdf: ['.pdf'],
  docx: ['.docx'],
  xlsx: ['.xlsx'],
  xls: ['.xls'],
  pptx: ['.pptx'],
  png: ['.png'],
  jpeg: ['.jpg', '.jpeg'],
  tiff: ['.tiff', '.tif'],
  bmp: ['.bmp'],
  json: ['.json'],
  xml: ['.xml', '.html'],
  text: ['.txt', '.md', '.csv']
}.freeze
VERSION = "0.2.0"
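
SUPPORTED_FORMATS maps each format to a list of extensions, so finding the format for a given extension means scanning the table. The following is a minimal sketch of inverting it into an extension-keyed index. The table is copied locally so the snippet runs without the gem, and EXTENSION_INDEX is a hypothetical name that is not part of ParseKit.

```ruby
# Local copy of the documented table, so this runs without the gem installed.
SUPPORTED_FORMATS = {
  pdf: ['.pdf'],
  docx: ['.docx'],
  xlsx: ['.xlsx'],
  xls: ['.xls'],
  pptx: ['.pptx'],
  png: ['.png'],
  jpeg: ['.jpg', '.jpeg'],
  tiff: ['.tiff', '.tif'],
  bmp: ['.bmp'],
  json: ['.json'],
  xml: ['.xml', '.html'],
  text: ['.txt', '.md', '.csv']
}.freeze

# Hypothetical helper (not part of ParseKit): maps extension => format symbol.
EXTENSION_INDEX = SUPPORTED_FORMATS.each_with_object({}) do |(format, extensions), index|
  extensions.each { |ext| index[ext] = format }
end.freeze

p EXTENSION_INDEX['.jpeg'] # :jpeg
p EXTENSION_INDEX['.html'] # :xml
p EXTENSION_INDEX.size     # 17
```

Note that '.html' resolves to :xml because the table groups both extensions under the same format.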

Class Method Details

.detect_format(filename) ⇒ Symbol

Detect file format from filename/extension

Parameters:

  • filename (String, nil)

    The filename to check

Returns:

  • (Symbol)

    The detected format, or :unknown



# File 'lib/parsekit.rb', line 78

def detect_format(filename)
  return :unknown if filename.nil? || filename.empty?
  
  ext = File.extname(filename).downcase
  return :unknown if ext.empty?
  
  SUPPORTED_FORMATS.each do |format, extensions|
    return format if extensions.include?(ext)
  end
  
  :unknown
end
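
The edge cases of the lookup above are easiest to see by running it. This is a sketch only: FormatSketch is a hypothetical stand-in module and FORMATS is a trimmed copy of SUPPORTED_FORMATS, both introduced so the example runs without the gem. The lookup uses Hash#find instead of an early return from each, which should behave the same way.

```ruby
# Hypothetical stand-in module mirroring the documented method.
module FormatSketch
  # Trimmed copy of SUPPORTED_FORMATS, for illustration only.
  FORMATS = {
    pdf: ['.pdf'],
    jpeg: ['.jpg', '.jpeg'],
    text: ['.txt', '.md', '.csv']
  }.freeze

  def self.detect_format(filename)
    return :unknown if filename.nil? || filename.empty?

    # File.extname keeps the leading dot; downcase makes the match case-insensitive.
    ext = File.extname(filename).downcase
    return :unknown if ext.empty?

    # Hash#find returns the first [format, extensions] pair that matches, or nil.
    match = FORMATS.find { |_format, extensions| extensions.include?(ext) }
    match ? match.first : :unknown
  end
end

p FormatSketch.detect_format('Report.PDF')      # :pdf
p FormatSketch.detect_format('photos/cat.jpeg') # :jpeg
p FormatSketch.detect_format('README')          # :unknown
p FormatSketch.detect_format('.pdf')            # :unknown
p FormatSketch.detect_format(nil)               # :unknown
```

A bare '.pdf' is reported as :unknown because File.extname treats a leading dot as part of the base name and returns an empty string.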

.native_version ⇒ String

Get the native library version

Returns:

  • (String)

    Version of the native library



# File 'lib/parsekit.rb', line 93

def native_version
  version
rescue StandardError
  "unknown"
end
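
The method-level rescue means callers get the string "unknown" rather than an exception when the version call fails. A minimal sketch of that fallback follows, using a hypothetical VersionSketch module whose version method simulates a failing native call; how the real binding fails is an assumption here.

```ruby
# Hypothetical stand-in module; not part of ParseKit.
module VersionSketch
  # Simulates a native version call that fails with a RuntimeError.
  def self.version
    raise 'native extension not loaded'
  end

  # Same shape as the documented method: fall back to "unknown" on failure.
  def self.native_version
    version
  rescue StandardError
    'unknown'
  end
end

p VersionSketch.native_version # "unknown"
```

Because the clause rescues StandardError, errors outside that hierarchy, such as LoadError, would still propagate to the caller.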

.parse(input, options = {}) ⇒ String

Convenience method to parse input directly (for text)

Parameters:

  • input (String)

    The input string to parse

  • options (Hash) (defaults to: {})

    Optional configuration options

Options Hash (options):

  • :encoding (String)

    Input encoding (default: UTF-8)

Returns:

  • (String)

    The parsed result



# File 'lib/parsekit.rb', line 48

def parse(input, options = {})
  Parser.new(options).parse(input)
end

.parse_bytes(data, options = {}) ⇒ String

Parse binary data

Parameters:

  • data (String, Array)

    Binary data to parse

  • options (Hash) (defaults to: {})

    Optional configuration options

Returns:

  • (String)

    The extracted text



# File 'lib/parsekit.rb', line 56

def parse_bytes(data, options = {})
  # Convert string to bytes if needed
  byte_data = data.is_a?(String) ? data.bytes : data
  Parser.new(options).parse_bytes(byte_data)
end
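
The conversion step accepts either a String or an Array of byte values and hands the parser an Array in both cases. The sketch below isolates that step; normalize_bytes is a hypothetical helper name used only for illustration.

```ruby
# Hypothetical helper mirroring the conversion in parse_bytes.
def normalize_bytes(data)
  # String#bytes returns an Array of Integers (0..255); arrays pass through unchanged.
  data.is_a?(String) ? data.bytes : data
end

p normalize_bytes('%PDF')           # [37, 80, 68, 70]
p normalize_bytes([37, 80, 68, 70]) # [37, 80, 68, 70]
```

In practice the String would typically come from reading a file in binary mode, for example with File.binread, before being passed to parse_bytes.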

.supported_formats ⇒ Array<String>

Get supported file formats

Returns:

  • (Array<String>)

    List of supported file extensions



# File 'lib/parsekit.rb', line 64

def supported_formats
  Parser.supported_formats
end

.supports_file?(path) ⇒ Boolean

Check if a file format is supported

Parameters:

  • path (String)

    File path to check

Returns:

  • (Boolean)

    True if the file format is supported



# File 'lib/parsekit.rb', line 71

def supports_file?(path)
  Parser.new.supports_file?(path)
end