Class: Canon::Comparison::FormatDetector

Inherits:
Object
Defined in:
lib/canon/comparison/format_detector.rb

Overview

Service for auto-detecting document formats

Detects the format of XML, HTML, JSON, and YAML documents, already-parsed Ruby objects (Hash, Array), and plain strings. Detection of strings goes through a cache so the string heuristics are not re-run unnecessarily.

Examples:

Detect format from a string

FormatDetector.detect("<root>content</root>") # => :xml

Detect format from an object

FormatDetector.detect(Moxml::Document.new) # => :xml

Constant Summary

FORMATS =

Supported format types

%i[xml html json yaml ruby_object string].freeze

Class Method Details

.detect(obj) ⇒ Symbol

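Detect the format of an object based on its class. Strings are passed on to .detect_string; an object of any unsupported class raises Canon::Error.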

Parameters:

  • obj (Object)

    Object to detect format of

Returns:

  • (Symbol)

    Format type (:xml, :html, :json, :yaml, :ruby_object, :string)



# File 'lib/canon/comparison/format_detector.rb', line 24

def detect(obj)
  case obj
  when Moxml::Node, Moxml::Document
    :xml
  when Nokogiri::HTML::DocumentFragment, Nokogiri::HTML5::DocumentFragment
    # HTML DocumentFragments
    :html
  when Nokogiri::XML::DocumentFragment
    # XML DocumentFragments - check if it's actually HTML
    obj.document&.html? ? :html : :xml
  when Nokogiri::XML::Document, Nokogiri::XML::Node
    # Check if it's HTML by looking at the document type
    obj.html? ? :html : :xml
  when Nokogiri::HTML::Document, Nokogiri::HTML5::Document
    :html
  when String
    detect_string(obj)
  when Hash, Array
    # Raw Ruby objects (from parsed JSON/YAML)
    :ruby_object
  else
    raise Canon::Error, "Unknown format for object: #{obj.class}"
  end
end
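
The branch order in the `case` above matters because `when` matches subclasses too: Nokogiri's HTML fragment classes inherit from its XML fragment class, so the HTML branch has to come first. The following is a minimal sketch of that dispatch pattern using stand-in classes, so it runs without Nokogiri or Moxml. All `Sketch*` names and `sketch_dispatch` are hypothetical and not part of the library.

```ruby
# Hypothetical stand-ins for a parent/child class pair such as
# an XML fragment class and the HTML fragment class derived from it.
class SketchXmlFragment; end
class SketchHtmlFragment < SketchXmlFragment; end

# Hypothetical error class standing in for the library's own error type.
class SketchError < StandardError; end

def sketch_dispatch(obj)
  case obj
  when SketchHtmlFragment
    # Subclass first: listed after the parent, this branch would never run.
    :html
  when SketchXmlFragment
    :xml
  when Hash, Array
    # Already-parsed data structures
    :ruby_object
  else
    raise SketchError, "Unknown format for object: #{obj.class}"
  end
end

puts sketch_dispatch(SketchHtmlFragment.new)
puts sketch_dispatch(SketchXmlFragment.new)
puts sketch_dispatch({ "a" => 1 })
```

Swapping the first two `when` branches would make every HTML fragment report `:xml`, since `SketchXmlFragment === obj` is also true for instances of the subclass.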

.detect_string(str) ⇒ Symbol

Detect the format of a string with caching

Parameters:

  • str (String)

    String to detect format of

Returns:

  • (Symbol)

    Format type



# File 'lib/canon/comparison/format_detector.rb', line 53

def detect_string(str)
  # Use cache for format detection
  Cache.fetch(:format_detect, Cache.key_for_format_detection(str)) do # rubocop:disable Lint/UselessDefaultValueArgument
    detect_string_uncached(str)
  end
end
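
The fetch-with-block pattern used above can be sketched in isolation. The internals of the library's `Cache` are not shown on this page, so `SketchCache`, its per-category hash store, and the SHA-256 key scheme below are all assumptions made for illustration only.

```ruby
require "digest"

# Hypothetical stand-in for a cache with a fetch(category, key) { ... } interface.
module SketchCache
  # One hash per category, created on first use.
  @store = Hash.new { |hash, category| hash[category] = {} }

  class << self
    # Return the stored value for (category, key); on a miss, run the block
    # and store its result.
    def fetch(category, key)
      bucket = @store[category]
      return bucket[key] if bucket.key?(key)

      bucket[key] = yield
    end

    # Assumed key scheme: a digest keeps keys small even for large documents.
    def key_for(str)
      Digest::SHA256.hexdigest(str)
    end
  end
end

calls = 0
detect = lambda do |str|
  SketchCache.fetch(:format_detect, SketchCache.key_for(str)) do
    calls += 1 # counts how often the (stand-in) heuristic actually runs
    str.strip.start_with?("{", "[") ? :json : :string
  end
end

first = detect.call('{"a": 1}')
second = detect.call('{"a": 1}')
puts "#{first} #{second} heuristic runs: #{calls}"
```

In this sketch the second call with identical content returns the stored symbol without running the block again.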

.detect_string_uncached(str) ⇒ Symbol

Detect the format of a string without caching

Parameters:

  • str (String)

    String to detect format of

Returns:

  • (Symbol)

    Format type



# File 'lib/canon/comparison/format_detector.rb', line 64

def detect_string_uncached(str)
  # Convert to UTF-8 for consistent handling if possible
  # This handles cases like UTF-16 encoded XML that would otherwise fail string operations
  str_utf8 = if ["UTF-16", "UTF-16BE",
                 "UTF-16LE"].include?(str.encoding.name)
               begin
                 str.encode("UTF-8", str.encoding, invalid: :replace,
                                                   undef: :replace, replace: "?")
               rescue EncodingError
                 str.dup.force_encoding("BINARY").encode("UTF-8")
               end
             else
               str
             end

  trimmed = str_utf8.strip

  # YAML indicators
  return :yaml if trimmed.start_with?("---")
  return :yaml if trimmed.match?(/^[a-zA-Z_]\w*:\s/)

  # JSON indicators
  return :json if trimmed.start_with?("{", "[")

  # HTML indicators
  return :html if trimmed.start_with?("<!DOCTYPE html", "<html",
                                      "<HTML")

  # XML indicators - must start with < and end with >
  return :xml if trimmed.start_with?("<") && trimmed.end_with?(">")

  # Default to plain string for everything else
  :string
end
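
Because the checks run in a fixed order (YAML, JSON, HTML, XML, then plain string), some inputs are classified in ways that may be surprising. The sketch below restates the heuristics from the source above as a standalone method so it runs without the canon gem. `sketch_detect` is a hypothetical name, and the UTF-16 conversion step is left out for brevity.

```ruby
# Standalone restatement of the string heuristics; not the library method itself.
def sketch_detect(str)
  trimmed = str.strip

  # YAML: document marker, or "key: " at the start of ANY line
  # (in Ruby, ^ anchors at every line start, not only the string start).
  return :yaml if trimmed.start_with?("---")
  return :yaml if trimmed.match?(/^[a-zA-Z_]\w*:\s/)

  # JSON: object or array opener
  return :json if trimmed.start_with?("{", "[")

  # HTML: only these exact, case-sensitive prefixes
  return :html if trimmed.start_with?("<!DOCTYPE html", "<html", "<HTML")

  # XML: anything else wrapped in angle brackets
  return :xml if trimmed.start_with?("<") && trimmed.end_with?(">")

  :string
end

samples = [
  "<root>content</root>",         # plain XML
  "<html><body/></html>",         # HTML by root element
  "<p>hello</p>",                 # HTML fragment without an <html> root
  "<!doctype html><html></html>", # lowercase doctype misses the HTML prefixes
  '{"a": 1}',                     # JSON object
  "name: canon",                  # YAML mapping
  "<root>\nnote: text\n</root>",  # XML with a "key: " line inside
  "hello world"                   # no indicator matches
]

samples.each { |sample| puts "#{sample.inspect} => #{sketch_detect(sample)}" }
```

Under these rules an HTML fragment such as `<p>hello</p>` and a lowercase `<!doctype html>` document are both reported as `:xml`, and any multi-line input containing a line that starts with `word: ` is reported as `:yaml` before the JSON, HTML, or XML checks are reached.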