Module: Lutaml::Xml::DocTypeExtractor

Included in:
Adapter::NokogiriAdapter, Adapter::OgaAdapter, Adapter::OxAdapter
Defined in:
lib/lutaml/xml/doctype_extractor.rb

Overview

Extracts DOCTYPE information from raw XML strings

This module provides a shared method to extract DOCTYPE declarations from raw XML strings when the XML library doesn’t directly expose this information (as is the case with Moxml/Oga and Ox).

Nokogiri provides native access to DOCTYPE via `parsed.internal_subset`, so it doesn’t need this extraction method.

This logic is identical in the Oga and Ox adapters and has been extracted here to avoid duplicating it.
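The sharing pattern described above can be sketched as a plain Ruby mixin. In this sketch, `DocTypeExtractorSketch` and `FakeAdapter` are hypothetical stand-ins rather than the real Lutaml classes, and the method body is copied from the source listing for this module. It assumes the adapter keeps the raw XML string available.

```ruby
# Hypothetical stand-in for Lutaml::Xml::DocTypeExtractor; the method body
# mirrors the source listing of this module.
module DocTypeExtractorSketch
  def extract_doctype_from_xml(xml)
    if xml =~ /<!DOCTYPE\s+(\S+)(?:\s+(PUBLIC|SYSTEM)\s+"([^"]+)"(?:\s+"([^"]+)")?)?\s*>/
      {
        name: $1,
        public_id: ($2 == "PUBLIC" ? $3 : nil),
        system_id: ($2 == "PUBLIC" ? $4 : $3),
      }
    end
  end
end

# Hypothetical adapter: it holds on to the raw XML because the underlying
# parser is assumed not to expose the DOCTYPE itself.
class FakeAdapter
  include DocTypeExtractorSketch

  def initialize(raw_xml)
    @raw_xml = raw_xml
  end

  # Delegates to the shared extraction method from the mixin.
  def doctype
    extract_doctype_from_xml(@raw_xml)
  end
end

p FakeAdapter.new('<!DOCTYPE note SYSTEM "note.dtd"><note/>').doctype
```

Any class that includes the module gains the same instance method, which is how several adapters can share one implementation.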

Instance Method Summary

Instance Method Details

#extract_doctype_from_xml(xml) ⇒ Hash?

Extracts DOCTYPE information from a raw XML string.

Parses the DOCTYPE declaration using a regex pattern to extract:

  • Document type name

  • Public identifier (if PUBLIC doctype)

  • System identifier (external DTD location)
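To make the mapping from regex groups to these three fields concrete, the following sketch prints the raw capture groups. The pattern is copied from the source listing; `DOCTYPE_PATTERN` and `doctype_captures` are hypothetical names introduced only for illustration.

```ruby
# Pattern copied from the source listing of extract_doctype_from_xml.
DOCTYPE_PATTERN = /<!DOCTYPE\s+(\S+)(?:\s+(PUBLIC|SYSTEM)\s+"([^"]+)"(?:\s+"([^"]+)")?)?\s*>/

# Hypothetical helper (not part of the module): returns the raw capture
# groups as an array, or nil when the pattern does not match.
def doctype_captures(xml)
  match = DOCTYPE_PATTERN.match(xml)
  match && match.captures
end

public_xml = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
system_xml = '<!DOCTYPE note SYSTEM "note.dtd">'

# Group 1 is the name, group 2 the keyword, groups 3 and 4 the quoted literals.
p doctype_captures(public_xml)
# => ["html", "PUBLIC", "-//W3C//DTD XHTML 1.0//EN", "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"]
p doctype_captures(system_xml)
# => ["note", "SYSTEM", "note.dtd", nil]
```

With `PUBLIC`, the first literal is the public identifier and the second is the system identifier; with `SYSTEM`, the first literal is the system identifier, which is why the method switches on the keyword.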

Examples:

Parsing a PUBLIC DOCTYPE

xml = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
info = extract_doctype_from_xml(xml)
# => {name: "html", public_id: "-//W3C//DTD XHTML 1.0//EN", system_id: "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"}

Parsing a SYSTEM DOCTYPE

xml = '<!DOCTYPE note SYSTEM "note.dtd">'
info = extract_doctype_from_xml(xml)
# => {name: "note", public_id: nil, system_id: "note.dtd"}

Parameters:

  • xml (String)

    the raw XML string

Returns:

  • (Hash, nil)

    DOCTYPE info hash or nil if no DOCTYPE found

    • :name [String] the document type name

    • :public_id [String, nil] the public identifier (PUBLIC only)

    • :system_id [String, nil] the system identifier
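Because the return value can be nil, and `:public_id` and `:system_id` can each be nil inside the hash, callers need to handle both levels. A minimal sketch, assuming the method is available at top level (its body is copied from the source listing) and using a hypothetical `describe_doctype` helper:

```ruby
# Method body copied from the source listing, defined at top level so the
# sketch runs on its own.
def extract_doctype_from_xml(xml)
  if xml =~ /<!DOCTYPE\s+(\S+)(?:\s+(PUBLIC|SYSTEM)\s+"([^"]+)"(?:\s+"([^"]+)")?)?\s*>/
    {
      name: $1,
      public_id: ($2 == "PUBLIC" ? $3 : nil),
      system_id: ($2 == "PUBLIC" ? $4 : $3),
    }
  end
end

# Hypothetical helper: summarises the DOCTYPE while tolerating a nil result
# and a nil :system_id.
def describe_doctype(xml)
  info = extract_doctype_from_xml(xml)
  return "no DOCTYPE" if info.nil?

  external = info[:system_id] || "none"
  "#{info[:name]} (system_id: #{external})"
end

puts describe_doctype("<note/>")                           # no DOCTYPE
puts describe_doctype("<!DOCTYPE html>")                   # html (system_id: none)
puts describe_doctype('<!DOCTYPE note SYSTEM "note.dtd">') # note (system_id: note.dtd)
```

A DOCTYPE with no external identifier, such as `<!DOCTYPE html>`, still produces a hash; only `:name` is populated.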



# File 'lib/lutaml/xml/doctype_extractor.rb', line 39

def extract_doctype_from_xml(xml)
  # Match DOCTYPE declaration using regex
  if xml =~ /<!DOCTYPE\s+(\S+)(?:\s+(PUBLIC|SYSTEM)\s+"([^"]+)"(?:\s+"([^"]+)")?)?\s*>/
    {
      name: $1,
      public_id: ($2 == "PUBLIC" ? $3 : nil),
      system_id: ($2 == "PUBLIC" ? $4 : $3),
    }
  end
end
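The pattern in the listing above accepts only double-quoted literals and does not account for an internal subset, although XML allows both single-quoted literals and a bracketed internal subset in a DOCTYPE declaration. The sketch below, which copies the method body from the listing, suggests that such inputs return nil with the regex as written; the sample inputs are illustrative.

```ruby
# Method body copied from the source listing above.
def extract_doctype_from_xml(xml)
  if xml =~ /<!DOCTYPE\s+(\S+)(?:\s+(PUBLIC|SYSTEM)\s+"([^"]+)"(?:\s+"([^"]+)")?)?\s*>/
    {
      name: $1,
      public_id: ($2 == "PUBLIC" ? $3 : nil),
      system_id: ($2 == "PUBLIC" ? $4 : $3),
    }
  end
end

# Single-quoted system literal: valid XML, but the pattern expects double quotes.
single_quoted = "<!DOCTYPE note SYSTEM 'note.dtd'>"

# Internal subset in brackets: valid XML, but the pattern has no branch for it.
with_subset = '<!DOCTYPE note [<!ELEMENT note (#PCDATA)>]>'

p extract_doctype_from_xml(single_quoted) # => nil
p extract_doctype_from_xml(with_subset)   # => nil
```

Callers that may see these forms should treat a nil result as "not recognised by this pattern" rather than as proof that the document has no DOCTYPE.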