Class: Canon::Comparison::FormatDetector

Inherits:
Object
Defined in:
lib/canon/comparison/format_detector.rb

Overview

Format detection service for auto-detecting document formats

Detects the format of XML, HTML, JSON, YAML, and plain-text strings, as well as already-parsed document objects and raw Ruby hashes and arrays. Results of string detection are cached for performance.

Examples:

Detect format from a string

FormatDetector.detect("<root>content</root>") # => :xml

Detect format from an object

FormatDetector.detect(Moxml::Document.new) # => :xml

Constant Summary

FORMATS =

Supported format types

%i[xml html json yaml ruby_object string].freeze

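Class Method Summary

  • .detect(obj) ⇒ Symbol

    Detect the format of an object

  • .detect_string(str) ⇒ Symbol

    Detect the format of a string with caching

  • .detect_string_uncached(str) ⇒ Symbol

    Detect the format of a string without caching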

Class Method Details

.detect(obj) ⇒ Symbol

Detect the format of an object

Parameters:

  • obj (Object)

    Object to detect format of

Returns:

  • (Symbol)

    Format type (:xml, :html, :json, :yaml, :ruby_object, :string)

Raises:

  • (Canon::Error)

    If the object's class is not one of the recognized types



# File 'lib/canon/comparison/format_detector.rb', line 24

def detect(obj)
  case obj
  when Moxml::Node, Moxml::Document
    :xml
  when Nokogiri::HTML::DocumentFragment, Nokogiri::HTML5::DocumentFragment
    # HTML DocumentFragments
    :html
  when Nokogiri::XML::DocumentFragment
    # XML DocumentFragments - check if it's actually HTML
    obj.document&.html? ? :html : :xml
  when Nokogiri::XML::Document, Nokogiri::XML::Node
    # Check if it's HTML by looking at the document type
    obj.html? ? :html : :xml
  when Nokogiri::HTML::Document, Nokogiri::HTML5::Document
    :html
  when String
    detect_string(obj)
  when Hash, Array
    # Raw Ruby objects (from parsed JSON/YAML)
    :ruby_object
  else
    raise Canon::Error, "Unknown format for object: #{obj.class}"
  end
end
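
Ruby's `case`/`when` tests branches from top to bottom with `===`, and `Class#===` also matches instances of subclasses, so the order of the `when` clauses above matters. This is presumably why the HTML fragment branch comes before the XML fragment branch. The sketch below illustrates the effect with hypothetical stand-in classes (`XmlFragment`, `HtmlFragment`) rather than the real Nokogiri or Moxml types, and uses `ArgumentError` in place of `Canon::Error`:

```ruby
# Hypothetical stand-ins for a parent/child class hierarchy.
class XmlFragment; end
class HtmlFragment < XmlFragment; end

# Specific-first ordering: the subclass branch is reachable.
def classify(obj)
  case obj
  when HtmlFragment then :html
  when XmlFragment  then :xml
  when Hash, Array  then :ruby_object
  else raise ArgumentError, "Unknown format for object: #{obj.class}"
  end
end

# General-first ordering: the HtmlFragment branch can never match,
# because XmlFragment === obj is already true for subclass instances.
def classify_general_first(obj)
  case obj
  when XmlFragment  then :xml
  when HtmlFragment then :html
  end
end

classify(HtmlFragment.new)               # => :html
classify_general_first(HtmlFragment.new) # => :xml
```

A branch listed after one of its superclasses only stays harmless when the superclass branch performs its own check, as `obj.html?` does in the method above.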

.detect_string(str) ⇒ Symbol

Detect the format of a string with caching

Parameters:

  • str (String)

    String to detect format of

Returns:

  • (Symbol)

    Format type (:xml, :html, :json, :yaml, or :string)



# File 'lib/canon/comparison/format_detector.rb', line 53

def detect_string(str)
  # Use cache for format detection
  Cache.fetch(:format_detect, Cache.key_for_format_detection(str)) do
    detect_string_uncached(str)
  end
end
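
The fetch-or-compute pattern used above can be sketched as follows. `SketchCache` and its digest-based key scheme are assumptions made for illustration only; the source does not show how Canon's `Cache` module or `Cache.key_for_format_detection` are implemented.

```ruby
require "digest"

# Hypothetical in-memory cache; Canon's real Cache module may differ.
module SketchCache
  # One bucket (a Hash) per category, created on first use.
  @store = Hash.new { |hash, category| hash[category] = {} }

  class << self
    # Return the cached value for (category, key), computing it on a miss.
    def fetch(category, key)
      bucket = @store[category]
      return bucket[key] if bucket.key?(key)

      bucket[key] = yield
    end

    # Assumed key scheme: a digest of the content, so that long
    # documents are not stored verbatim as hash keys.
    def key_for(str)
      Digest::SHA256.hexdigest(str)
    end
  end
end

calls = 0
2.times do
  SketchCache.fetch(:format_detect, SketchCache.key_for("<root/>")) do
    calls += 1 # runs only on the first iteration
    :xml
  end
end
```

With a cache of this shape, repeated detection of the same string skips the heuristics entirely after the first call.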

.detect_string_uncached(str) ⇒ Symbol

Detect the format of a string without caching

Parameters:

  • str (String)

    String to detect format of

Returns:

  • (Symbol)

    Format type (:xml, :html, :json, :yaml, or :string)



# File 'lib/canon/comparison/format_detector.rb', line 64

def detect_string_uncached(str)
  trimmed = str.strip

  # YAML indicators
  return :yaml if trimmed.start_with?("---")
  return :yaml if trimmed.match?(/^[a-zA-Z_]\w*:\s/)

  # JSON indicators
  return :json if trimmed.start_with?("{", "[")

  # HTML indicators
  return :html if trimmed.start_with?("<!DOCTYPE html", "<html",
                                      "<HTML")

  # XML indicators - must start with < and end with >
  return :xml if trimmed.start_with?("<") && trimmed.end_with?(">")

  # Default to plain string for everything else
  :string
end
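
Because the checks run in a fixed order and return on the first match, the YAML heuristics take precedence over everything else. The sketch below copies the heuristics into a standalone method under a hypothetical name (`sniff_format`) so they can be tried without loading the gem. It also shows one consequence of the `^` anchor, which in Ruby matches at the start of every line rather than only at the start of the string:

```ruby
# Standalone copy of the heuristics above, under a hypothetical name.
def sniff_format(str)
  trimmed = str.strip
  return :yaml if trimmed.start_with?("---")
  return :yaml if trimmed.match?(/^[a-zA-Z_]\w*:\s/)
  return :json if trimmed.start_with?("{", "[")
  return :html if trimmed.start_with?("<!DOCTYPE html", "<html", "<HTML")
  return :xml if trimmed.start_with?("<") && trimmed.end_with?(">")

  :string
end

sniff_format("name: canon\n")        # => :yaml
sniff_format('{"name": "canon"}')    # => :json
sniff_format("<html><body/></html>") # => :html
sniff_format("<root>content</root>") # => :xml
sniff_format("just some text")       # => :string

# A "key: value" line anywhere in the input triggers the YAML branch,
# even when the document as a whole looks like XML.
sniff_format("<root>\nnote: text\n</root>") # => :yaml
```

Callers that already know the format of their input may prefer to pass it explicitly rather than rely on these heuristics.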