Module: MultiXML::Parsers::LibxmlSax Private

Extended by:
MultiXML::Parser
Defined in:
lib/multi_xml/parsers/libxml_sax.rb

Overview

This module is part of a private API. You should avoid using this module if possible, as it may be removed or changed in the future.

SAX-based parser using LibXML (faster for large documents)

Defined Under Namespace

Classes: Handler

Constant Summary

ParseError = ::LibXML::XML::Error

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or changed in the future.

Exception class raised on LibXML parse failure

Class Method Summary

Methods included from MultiXML::Parser

parse_error

Class Method Details

.attribute_names(tag) ⇒ Array<String>

This method is part of a private API. You should avoid using this method if possible, as it may be removed or changed in the future.

Extract non-xmlns attribute names from a start tag

Parameters:

  • tag (String)

    Start tag source

Returns:

  • (Array<String>)

    attribute names



# File 'lib/multi_xml/parsers/libxml_sax.rb', line 53

def attribute_names(tag)
  tag.scan(/\s([a-zA-Z_][\w.-]*(?::[a-zA-Z_][\w.-]*)?)\s*=/).flatten.reject do |name|
    name == "xmlns" || name.start_with?("xmlns:")
  end
end
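
To make the scan above concrete, here is a minimal standalone sketch that reuses the same regular expression outside the module. The names `ATTRIBUTE_NAME` and `attribute_names_sketch` are hypothetical and exist only for this illustration; they are not part of MultiXML.

```ruby
# Illustrative copy of the pattern used by attribute_names (hypothetical constant name).
ATTRIBUTE_NAME = /\s([a-zA-Z_][\w.-]*(?::[a-zA-Z_][\w.-]*)?)\s*=/

# Hypothetical stand-in for the private method, so the sketch runs without MultiXML.
def attribute_names_sketch(tag)
  tag.scan(ATTRIBUTE_NAME).flatten.reject do |name|
    # Drop namespace declarations: the default one and prefixed ones.
    name == "xmlns" || name.start_with?("xmlns:")
  end
end

tag = '<item xmlns:a="urn:a" a:id="1" id="2" xmlns="urn:default">'
puts attribute_names_sketch(tag).inspect # prefixed and plain names survive, xmlns declarations do not
```

Because the scan is purely textual, text inside a quoted attribute value that looks like ` name=` is also reported as an attribute name. In this module that can only make the collision check more cautious, since the result is used to decide whether to fall back to the DOM parser.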

.dom_fallback?(source, namespaces) ⇒ Boolean

This method is part of a private API. You should avoid using this method if possible, as it may be removed or changed in the future.

Determine whether libxml_sax must fall back to the DOM parser

Parameters:

  • source (String)

    XML source

  • namespaces (Symbol)

    Namespace handling mode

Returns:

  • (Boolean)

    true when DOM parsing is required



# File 'lib/multi_xml/parsers/libxml_sax.rb', line 65

def dom_fallback?(source, namespaces)
  namespaces != :strip || stripped_attribute_collision?(source)
end
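
The decision above reduces to a small truth table: SAX is used only when the mode is `:strip` and no stripped attribute names collide. The sketch below is a hypothetical stand-in that takes the collision result as an argument so it runs without MultiXML or libxml-ruby. The `:preserve` symbol is only a placeholder for "any mode other than `:strip`" and is an assumption, not a mode confirmed by this page.

```ruby
# Hypothetical stand-in for dom_fallback?; the collision check result is passed in.
def dom_fallback_sketch?(namespaces, collision)
  namespaces != :strip || collision
end

# Print which backend each combination would select.
[[:strip, false], [:strip, true], [:preserve, false], [:preserve, true]].each do |mode, collision|
  backend = dom_fallback_sketch?(mode, collision) ? "DOM" : "SAX"
  puts "#{mode.inspect}, collision=#{collision} -> #{backend}"
end
```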

.parse(xml, namespaces: :strip) ⇒ Hash

This method is part of a private API. You should avoid using this method if possible, as it may be removed or changed in the future.

Parse XML from a string or IO object

Parameters:

  • xml (String, IO)

    XML content

  • namespaces (Symbol) (defaults to: :strip)

    Namespace handling mode

Returns:

  • (Hash)

    Parsed XML as a hash

Raises:

  • (LibXML::XML::Error)

    if XML is malformed



# File 'lib/multi_xml/parsers/libxml_sax.rb', line 27

def parse(xml, namespaces: :strip)
  source = xml.respond_to?(:read) ? xml.read : xml.to_s
  return {} if source.empty?

  return parse_with_dom(source, namespaces) if dom_fallback?(source, namespaces)

  parse_with_sax(source, namespaces)
end
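
The first two lines of `parse` normalize the input before any backend is chosen. The following sketch isolates that step. `read_source_sketch` is a hypothetical helper name used only here, and the sketch does not call LibXML at all.

```ruby
require "stringio"

# Hypothetical helper mirroring the input normalization in parse:
# anything that responds to #read is read, everything else is stringified.
def read_source_sketch(xml)
  xml.respond_to?(:read) ? xml.read : xml.to_s
end

puts read_source_sketch("<a/>").inspect               # String input is used as-is
puts read_source_sketch(StringIO.new("<a/>")).inspect # IO-like input is read in full
puts read_source_sketch(nil).inspect                  # nil.to_s is "", so parse would return {}
```

One consequence of this design is that the whole document is held in memory as a String before parsing, even on the SAX path, because the collision check scans the complete source.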

.parse_with_dom(source, namespaces) ⇒ Hash

This method is part of a private API. You should avoid using this method if possible, as it may be removed or changed in the future.

Parse via the DOM libxml backend

Parameters:

  • source (String)

    XML source

  • namespaces (Symbol)

    Namespace handling mode

Returns:

  • (Hash)

    Parsed XML as a hash



# File 'lib/multi_xml/parsers/libxml_sax.rb', line 75

def parse_with_dom(source, namespaces)
  Libxml.parse(StringIO.new(source), namespaces: namespaces)
end

.parse_with_sax(source, namespaces) ⇒ Hash

This method is part of a private API. You should avoid using this method if possible, as it may be removed or changed in the future.

Parse via libxml-ruby's SAX parser

Parameters:

  • source (String)

    XML source

  • namespaces (Symbol)

    Namespace handling mode

Returns:

  • (Hash)

    Parsed XML as a hash



# File 'lib/multi_xml/parsers/libxml_sax.rb', line 85

def parse_with_sax(source, namespaces)
  LibXML::XML::Error.set_handler(&LibXML::XML::Error::QUIET_HANDLER)
  handler = Handler.new(namespaces)
  parser = ::LibXML::XML::SaxParser.io(StringIO.new(source))
  parser.callbacks = handler
  parser.parse
  handler.result
end

.stripped_attribute_collision?(source) ⇒ Boolean

This method is part of a private API. You should avoid using this method if possible, as it may be removed or changed in the future.

Detect whether a start tag has attributes that collide after stripping

Parameters:

  • source (String)

    XML source

Returns:

  • (Boolean)

    true when stripped attribute locals collide



# File 'lib/multi_xml/parsers/libxml_sax.rb', line 41

def stripped_attribute_collision?(source)
  source.scan(%r{<(?![!?/])[^>]*>}m).any? do |tag|
    local_names = attribute_names(tag).map { |name| name.split(":", 2).last }
    local_names.uniq.length < local_names.length
  end
end
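
As a worked illustration of the collision check, the sketch below combines the tag scan with the attribute-name scan in one self-contained script. `ATTR_NAME` and `collision_sketch?` are hypothetical names chosen for this example, not identifiers from MultiXML.

```ruby
# Illustrative copy of the attribute-name pattern (hypothetical constant name).
ATTR_NAME = /\s([a-zA-Z_][\w.-]*(?::[a-zA-Z_][\w.-]*)?)\s*=/

# Hypothetical stand-in for stripped_attribute_collision?.
def collision_sketch?(source)
  # Match start tags only: comments, declarations, processing instructions,
  # and closing tags are excluded by the negative lookahead.
  source.scan(%r{<(?![!?/])[^>]*>}m).any? do |tag|
    names = tag.scan(ATTR_NAME).flatten.reject do |name|
      name == "xmlns" || name.start_with?("xmlns:")
    end
    # Strip the namespace prefix, keeping only the local part.
    local_names = names.map { |name| name.split(":", 2).last }
    local_names.uniq.length < local_names.length
  end
end

colliding = '<item xmlns:a="urn:a" a:id="1" id="2"/>' # a:id and id both become "id"
distinct  = '<item xmlns:a="urn:a" a:id="1" name="2"/>'

puts collision_sketch?(colliding)
puts collision_sketch?(distinct)
```

When a collision like the first case is detected, `dom_fallback?` routes the document to the DOM backend instead of the SAX handler.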