Class: Canon::Comparison::FormatDetector

Inherits:
Object
Defined in:
lib/canon/comparison/format_detector.rb

Overview

Service that auto-detects the format of a document.

Detects XML, HTML, JSON, YAML, and plain-text strings, recognizes parsed Moxml and Nokogiri objects, and reports Hash and Array values as Ruby objects. String detection results are cached to avoid repeated work.

Examples:

Detect format from a string

FormatDetector.detect("<root>content</root>") # => :xml

Detect format from an object

FormatDetector.detect(Moxml::Document.new) # => :xml

Constant Summary

Supported format types

FORMATS = %i[xml html json yaml ruby_object string].freeze

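Class Method Summary

  • .detect(obj) ⇒ Symbol

    Detect the format of an object

  • .detect_string(str) ⇒ Symbol

    Detect the format of a string with caching

  • .detect_string_uncached(str) ⇒ Symbol

    Detect the format of a string without caching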

Class Method Details

.detect(obj) ⇒ Symbol

Detect the format of an object

Parameters:

  • obj (Object)

    Object to detect format of

Returns:

  • (Symbol)

    Format type (:xml, :html, :json, :yaml, :ruby_object, :string)

Raises:

  • (Canon::Error)

    If the object's class is not one of the supported types



# File 'lib/canon/comparison/format_detector.rb', line 24

def detect(obj)
  if XmlBackend.moxml?
    case obj
    when Moxml::Node, Moxml::Document
      :xml
    when String
      detect_string(obj)
    when Hash, Array
      :ruby_object
    else
      raise Canon::Error, "Unknown format for object: #{obj.class}"
    end
  else
    case obj
    when Moxml::Node, Moxml::Document
      :xml
    when Nokogiri::HTML::DocumentFragment, Nokogiri::HTML5::DocumentFragment
      :html
    when Nokogiri::XML::DocumentFragment
      obj.document&.html? ? :html : :xml
    when Nokogiri::XML::Document, Nokogiri::XML::Node
      obj.html? ? :html : :xml
    when Nokogiri::HTML::Document, Nokogiri::HTML5::Document
      :html
    when String
      detect_string(obj)
    when Hash, Array
      :ruby_object
    else
      raise Canon::Error, "Unknown format for object: #{obj.class}"
    end
  end
end
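
The dispatch above matches on the object's class and raises for anything it does not recognize. The following is a minimal sketch of that pattern, not Canon's code: SketchDetector and SketchError are hypothetical stand-ins, the Moxml and Nokogiri branches are left out so that it runs with the standard library only, and the String branch returns a placeholder instead of calling detect_string.

```ruby
# Hypothetical stand-in for Canon::Error.
class SketchError < StandardError; end

# Hypothetical stand-in for the class-based dispatch in .detect.
module SketchDetector
  def self.detect(obj)
    case obj
    when String
      :string # placeholder; the real method delegates to detect_string
    when Hash, Array
      :ruby_object
    else
      raise SketchError, "Unknown format for object: #{obj.class}"
    end
  end
end

SketchDetector.detect({ "a" => 1 }) # => :ruby_object
SketchDetector.detect([1, 2])       # => :ruby_object

# Unsupported classes raise, so callers that accept arbitrary input
# may want to rescue and fall back.
begin
  SketchDetector.detect(42)
rescue SketchError => e
  e.message # => "Unknown format for object: Integer"
end
```

Because case/when compares with ===, subclasses of the listed classes match the same branch.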

.detect_string(str) ⇒ Symbol

Detect the format of a string with caching

Parameters:

  • str (String)

    String to detect format of

Returns:

  • (Symbol)

    Format type



# File 'lib/canon/comparison/format_detector.rb', line 62

def detect_string(str)
  # Use cache for format detection
  Cache.fetch(:format_detect, Cache.key_for_format_detection(str)) do # rubocop:disable Lint/UselessDefaultValueArgument
    detect_string_uncached(str)
  end
end
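
The caching step can be illustrated with a small in-memory memoizer. This is a sketch under assumptions, not Canon's Cache: SketchCache, its key_for helper, and the use of a SHA-256 digest as the key are all hypothetical, and the detection block is reduced to a single check.

```ruby
require "digest"

# Hypothetical in-memory cache, grouped by category; Canon's Cache may differ.
module SketchCache
  @store = Hash.new { |hash, category| hash[category] = {} }

  class << self
    # Return the cached value for (category, key), computing it once via the block.
    def fetch(category, key)
      bucket = @store[category]
      return bucket[key] if bucket.key?(key)

      bucket[key] = yield
    end

    # Assumed key scheme: a digest of the content keeps keys small.
    def key_for(str)
      Digest::SHA256.hexdigest(str)
    end
  end
end

calls = 0
detect = lambda do |str|
  SketchCache.fetch(:format_detect, SketchCache.key_for(str)) do
    calls += 1 # counts how often the uncached path runs
    str.strip.start_with?("{", "[") ? :json : :string
  end
end

detect.call('{"a": 1}') # => :json (computed)
detect.call('{"a": 1}') # => :json (served from the cache)
calls                   # => 1
```

Keying on a digest rather than on the string itself is one possible design choice; it trades a hashing pass for smaller keys when documents are large.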

.detect_string_uncached(str) ⇒ Symbol

Detect the format of a string without caching

Parameters:

  • str (String)

    String to detect format of

Returns:

  • (Symbol)

    Format type



# File 'lib/canon/comparison/format_detector.rb', line 73

def detect_string_uncached(str)
  # Convert to UTF-8 for consistent handling if possible
  # This handles cases like UTF-16 encoded XML that would otherwise fail string operations
  str_utf8 = if ["UTF-16", "UTF-16BE",
                 "UTF-16LE"].include?(str.encoding.name)
               begin
                 str.encode("UTF-8", str.encoding, invalid: :replace,
                                                   undef: :replace, replace: "?")
               rescue EncodingError
                 str.dup.force_encoding("BINARY").encode("UTF-8")
               end
             else
               str
             end

  trimmed = str_utf8.strip

  # YAML indicators
  return :yaml if trimmed.start_with?("---")
  return :yaml if trimmed.match?(/^[a-zA-Z_]\w*:\s/)

  # JSON indicators
  return :json if trimmed.start_with?("{", "[")

  # HTML indicators
  return :html if trimmed.start_with?("<!DOCTYPE html", "<html",
                                      "<HTML")

  # XML indicators - must start with < and end with >
  return :xml if trimmed.start_with?("<") && trimmed.end_with?(">")

  # Default to plain string for everything else
  :string
end
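
The checks run in a fixed order and return on the first match, so the order decides ambiguous inputs. The sketch below copies the checks from the listing above into a standalone, hypothetical sketch_detect method (the UTF-16 conversion step is omitted) so they can be tried without Canon.

```ruby
# Hypothetical standalone copy of the heuristics; not part of Canon's API.
def sketch_detect(str)
  trimmed = str.strip

  return :yaml if trimmed.start_with?("---")
  return :yaml if trimmed.match?(/^[a-zA-Z_]\w*:\s/)
  return :json if trimmed.start_with?("{", "[")
  return :html if trimmed.start_with?("<!DOCTYPE html", "<html", "<HTML")
  return :xml if trimmed.start_with?("<") && trimmed.end_with?(">")

  :string
end

sketch_detect("---\nname: canon")                # => :yaml
sketch_detect("name: canon")                     # => :yaml
sketch_detect('{"name": "canon"}')               # => :json
sketch_detect("<!DOCTYPE html><html></html>")    # => :html
sketch_detect("<root>content</root>")            # => :xml
sketch_detect("just some text")                  # => :string

# In Ruby, ^ matches at the start of every line, not only at the start
# of the string, so a "key: value" line inside markup wins over XML.
sketch_detect("<root>\nnote: text\n</root>")     # => :yaml

# The HTML prefixes are case-sensitive, so a lowercase doctype falls
# through to the XML check.
sketch_detect("<!doctype html><p>hi</p>")        # => :xml
```

Callers who need stricter results for multi-line markup may prefer to pass an already parsed object to .detect rather than a string.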