Class: Uniword::Validation::SchemaRegistry

Inherits:
Object
Defined in:
lib/uniword/validation/schema_registry.rb

Overview

Maps XML namespace URIs to XSD schema files and loads schemas for validation.

Uses Moxml for namespace detection (parsing XML to find declared namespaces) and Nokogiri::XML::Schema for actual XSD validation.

The registry knows which OOXML namespace URIs correspond to which XSD files bundled in data/schemas/.

Examples:

Detect namespaces and load schema for a part

registry = SchemaRegistry.new
ns_uris = registry.detect_namespaces(xml_string)
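unknown = registry.unknown_namespaces(ns_uris)
xsd_path = registry.primary_schema_for_part("word/document.xml")
schema = registry.load_schema(xsd_path)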
errors = schema.validate(Nokogiri::XML(xml_string))

Constant Summary

NAMESPACE_XSD_MAP =

Map of namespace URI => XSD file (relative to data/schemas/)

{
  # Base WordprocessingML (ISO 29500)
  "http://schemas.openxmlformats.org/wordprocessingml/2006/main" =>
    "microsoft/wml-2010.xsd",

  # Microsoft versioned extensions
  "http://schemas.microsoft.com/office/word/2010/wordml" =>
    "microsoft/wml-2010.xsd",
  "http://schemas.microsoft.com/office/word/2012/wordml" =>
    "microsoft/wml-2012.xsd",
  "http://schemas.microsoft.com/office/word/2015/wordml/symex" =>
    "microsoft/wml-symex-2015.xsd",
  "http://schemas.microsoft.com/office/word/2016/wordml/cid" =>
    "microsoft/wml-cid-2016.xsd",
  "http://schemas.microsoft.com/office/word/2018/wordml" =>
    "microsoft/wml-2018.xsd",
  "http://schemas.microsoft.com/office/word/2018/wordml/cex" =>
    "microsoft/wml-cex-2018.xsd",
  "http://schemas.microsoft.com/office/word/2020/wordml/sdtdatahash" =>
    "microsoft/wml-sdtdatahash-2020.xsd",

  # Markup Compatibility
  "http://schemas.openxmlformats.org/markup-compatibility/2006" =>
    "mce/mc.xsd",

  # OPC schemas
  "http://schemas.openxmlformats.org/package/2006/content-types" =>
    "ecma/opc-contentTypes.xsd",
  "http://schemas.openxmlformats.org/package/2006/relationships" =>
    "ecma/opc-relationships.xsd",
}.freeze
WORDML_PARTS =

Parts that use WordprocessingML schemas

%w[
  word/document.xml
  word/styles.xml
  word/settings.xml
  word/fontTable.xml
  word/numbering.xml
  word/footnotes.xml
  word/endnotes.xml
  word/comments.xml
].freeze
HEADER_FOOTER_PATTERN =

Pattern for header and footer parts

%r{\Aword/(header|footer)\d*\.xml\z}
THEME_PATTERN =

Pattern for theme parts

%r{\Aword/theme/theme\d+\.xml\z}
RELS_PATTERN =

Pattern for relationship parts

%r{_rels/.*\.rels\z}
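
As an illustration of what these patterns accept, the sketch below copies the three regular expressions from this page into local variables and checks a few part names against them. The variable names and the sample part names are assumptions chosen for illustration, not taken from a real package.

```ruby
# Local copies of the patterns documented above (not loaded from Uniword).
header_footer_pattern = %r{\Aword/(header|footer)\d*\.xml\z}
theme_pattern = %r{\Aword/theme/theme\d+\.xml\z}
rels_pattern = %r{_rels/.*\.rels\z}

# Illustrative part names; these are assumptions, not real package contents.
samples = [
  "word/header1.xml",
  "word/footer.xml",
  "word/theme/theme1.xml",
  "word/theme/theme.xml",
  "word/_rels/document.xml.rels",
  "_rels/.rels",
]

# Map each sample to the names of the patterns it matches.
matches = samples.to_h do |name|
  hits = []
  hits << :header_footer if name.match?(header_footer_pattern)
  hits << :theme if name.match?(theme_pattern)
  hits << :rels if name.match?(rels_pattern)
  [name, hits]
end

matches.each { |name, hits| puts "#{name}: #{hits.inspect}" }
```

Here `\d*` lets word/footer.xml match without a number, while `\d+` requires at least one digit in a theme part name. RELS_PATTERN is not anchored at the start, so it matches relationship parts at any depth.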

Constructor Details

#initialize(schemas_dir: nil) ⇒ SchemaRegistry

Returns a new instance of SchemaRegistry.



# File 'lib/uniword/validation/schema_registry.rb', line 78

def initialize(schemas_dir: nil)
  @schemas_dir = schemas_dir || default_schemas_dir
  @moxml = Moxml.new(:nokogiri)
  @schema_cache = {}
end

Instance Attribute Details

#schemas_dir ⇒ Object (readonly)

Returns the value of attribute schemas_dir.



# File 'lib/uniword/validation/schema_registry.rb', line 76

def schemas_dir
  @schemas_dir
end

Instance Method Details

#detect_ignorable(xml_content) ⇒ Array<String>

Detect mc:Ignorable prefixes from XML content.

Parameters:

  • xml_content (String)

    Raw XML

Returns:

  • (Array<String>)

    Namespace prefixes listed in mc:Ignorable



# File 'lib/uniword/validation/schema_registry.rb', line 100

def detect_ignorable(xml_content)
  doc = @moxml.parse(xml_content)
  root = doc.root
  ignorable = root["Ignorable"] ||
    root["mc:Ignorable"]
  return [] unless ignorable

  ignorable.split(/\s+/).reject(&:empty?)
rescue StandardError => e
  Uniword.logger&.debug { "Ignorable detection failed: #{e.message}" }
  []
end
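
The final split-and-reject step can be looked at in isolation. The sketch below applies the same expression to a hypothetical mc:Ignorable attribute value (the value itself is an assumption for illustration). The reject call guards against the empty string that String#split with a regular expression produces when the value starts with whitespace.

```ruby
# Hypothetical attribute value with leading whitespace and uneven spacing.
ignorable = " w14 w15  wp14"

# Splitting on a regex keeps a leading empty field, so the first element
# of this array is an empty string.
raw = ignorable.split(/\s+/)

# Same expression as in detect_ignorable: drop the empty entries.
prefixes = raw.reject(&:empty?)

puts raw.inspect
puts prefixes.inspect
```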

#detect_namespaces(xml_content) ⇒ Array<String>

Detect namespace URIs from XML content.

Parameters:

  • xml_content (String)

    Raw XML

Returns:

  • (Array<String>)

    Namespace URIs declared on root element



# File 'lib/uniword/validation/schema_registry.rb', line 88

def detect_namespaces(xml_content)
  doc = @moxml.parse(xml_content)
  doc.root.namespaces.map(&:uri)
rescue StandardError => e
  Uniword.logger&.debug { "Namespace detection failed: #{e.message}" }
  []
end

#known_namespace?(uri) ⇒ Boolean

Check if a namespace URI has a known XSD schema.

Parameters:

  • uri (String)

    Namespace URI

Returns:

  • (Boolean)


# File 'lib/uniword/validation/schema_registry.rb', line 181

def known_namespace?(uri)
  NAMESPACE_XSD_MAP.key?(uri)
end

#load_schema(xsd_relative_path) ⇒ Nokogiri::XML::Schema

Load and cache an XSD schema for validation.

Parameters:

  • xsd_relative_path (String)

    Path relative to schemas_dir

Returns:

  • (Nokogiri::XML::Schema)

    Compiled schema



# File 'lib/uniword/validation/schema_registry.rb', line 138

def load_schema(xsd_relative_path)
  return @schema_cache[xsd_relative_path] if @schema_cache.key?(xsd_relative_path)

  xsd_path = File.join(schemas_dir, xsd_relative_path)
  unless File.exist?(xsd_path)
    raise ArgumentError,
          "XSD schema not found: #{xsd_path}"
  end

  # Nokogiri::XML::Schema resolves relative imports from CWD.
  # We chdir to the schema file's own directory so imports like
  # "word12.xsd" (from microsoft/wml-2010.xsd) resolve correctly.
  schema = nil
  original_dir = Dir.pwd
  schema_dir = File.dirname(xsd_path)
  begin
    Dir.chdir(schema_dir)
    schema = Nokogiri::XML::Schema(File.read(File.basename(xsd_path)))
  ensure
    Dir.chdir(original_dir)
  end

  @schema_cache[xsd_relative_path] = schema
  schema
end
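
The directory switch in this method can also be expressed with the block form of Dir.chdir, which restores the previous working directory when the block exits, including when an exception is raised. The sketch below is an alternative shape, not the library's code: it uses a temporary directory and a placeholder file instead of a real XSD, and `read_from_own_dir` is a hypothetical helper name.

```ruby
require "tmpdir"

# Hypothetical helper (not part of Uniword): read a file with the working
# directory temporarily set to the file's own directory.
read_from_own_dir = lambda do |path|
  Dir.chdir(File.dirname(path)) do
    File.read(File.basename(path))
  end
end

original_dir = Dir.pwd
content = nil

Dir.mktmpdir do |tmp|
  path = File.join(tmp, "placeholder.xsd")
  File.write(path, "<placeholder/>") # stand-in content, not a real schema
  content = read_from_own_dir.call(path)
end

# The block form has already restored the working directory at this point.
restored = (Dir.pwd == original_dir)

puts content
puts restored
```

The working directory is process-wide, so either form affects other threads while the schema is being compiled.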

#primary_schema_for_part(part_name) ⇒ String?

Determine the primary XSD schema for a given XML part.

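For WordprocessingML parts (word/document.xml, word/styles.xml, headers, footers, and so on), returns microsoft/wml-2010.xsd, which imports the base wml.xsd and all extension schemas. For relationship and content-type parts, returns the corresponding OPC schema; for theme parts, returns iso/dml-main.xsd. Returns nil when the part name matches none of these.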

Parameters:

  • part_name (String)

    Path within ZIP (e.g., “word/document.xml”)

Returns:

  • (String, nil)

    XSD path relative to schemas_dir, or nil if no schema is known for the part



# File 'lib/uniword/validation/schema_registry.rb', line 121

def primary_schema_for_part(part_name)
  case part_name
  when "[Content_Types].xml"
    "ecma/opc-contentTypes.xsd"
  when "_rels/.rels", ->(n) { n.match?(RELS_PATTERN) }
    "ecma/opc-relationships.xsd"
  when *WORDML_PARTS, ->(n) { n.match?(HEADER_FOOTER_PATTERN) }
    "microsoft/wml-2010.xsd"
  when ->(n) { n.match?(THEME_PATTERN) }
    "iso/dml-main.xsd"
  end
end
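
The case expression in this method relies on Proc#===, which calls the lambda with the value being matched, so a lambda can sit next to string literals in a when clause. The following sketch shows that mechanism on a reduced copy with only two branches. The local names (`rels_pattern`, `schema_for`) and the sample part names are assumptions for illustration.

```ruby
# Local copy of the relationship pattern documented on this page.
rels_pattern = %r{_rels/.*\.rels\z}

# Reduced copy of the dispatch logic, with two branches only.
schema_for = lambda do |part_name|
  case part_name
  when "[Content_Types].xml"
    "ecma/opc-contentTypes.xsd"
  when ->(n) { n.match?(rels_pattern) } # Proc#=== calls the lambda with part_name
    "ecma/opc-relationships.xsd"
  end
end

content_types = schema_for.call("[Content_Types].xml")
rels = schema_for.call("word/_rels/document.xml.rels")
other = schema_for.call("docProps/app.xml") # no branch matches, so nil

puts content_types.inspect, rels.inspect, other.inspect
```

A part name that matches no branch yields nil, so a caller would need to handle that case, for example by skipping validation for the part.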

#unknown_namespaces(ns_uris) ⇒ Array<String>

Return unknown namespace URIs from a set.

Parameters:

  • ns_uris (Array<String>)

    Namespace URIs to check

Returns:

  • (Array<String>)

    URIs with no bundled XSD



# File 'lib/uniword/validation/schema_registry.rb', line 189

def unknown_namespaces(ns_uris)
  ns_uris.reject { |uri| known_namespace?(uri) }
end

#xsd_map_for_namespaces(ns_uris) ⇒ Hash<String, String>

Map namespace URIs to their corresponding XSD files.

Parameters:

  • ns_uris (Array<String>)

    Namespace URIs to look up

Returns:

  • (Hash<String, String>)

    { uri => xsd_relative_path }



# File 'lib/uniword/validation/schema_registry.rb', line 168

def xsd_map_for_namespaces(ns_uris)
  result = {}
  ns_uris.each do |uri|
    xsd = NAMESPACE_XSD_MAP[uri]
    result[uri] = xsd if xsd
  end
  result
end
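
To show how the lookup methods fit together, the sketch below works on a three-entry excerpt of NAMESPACE_XSD_MAP and mirrors the logic of xsd_map_for_namespaces and unknown_namespaces with plain Hash and Array calls. The variable names and the `http://example.com/custom` URI are assumptions for illustration.

```ruby
# Excerpt of the namespace map documented on this page (local copy).
xsd_map = {
  "http://schemas.openxmlformats.org/wordprocessingml/2006/main" =>
    "microsoft/wml-2010.xsd",
  "http://schemas.microsoft.com/office/word/2010/wordml" =>
    "microsoft/wml-2010.xsd",
  "http://schemas.openxmlformats.org/markup-compatibility/2006" =>
    "mce/mc.xsd",
}.freeze

# Namespaces as they might be declared on a root element (illustrative).
ns_uris = [
  "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
  "http://schemas.microsoft.com/office/word/2010/wordml",
  "http://example.com/custom",
]

# Same result as the each-loop in xsd_map_for_namespaces: known URIs only.
known = ns_uris.each_with_object({}) do |uri, result|
  xsd = xsd_map[uri]
  result[uri] = xsd if xsd
end

# Same logic as unknown_namespaces.
unknown = ns_uris.reject { |uri| xsd_map.key?(uri) }

# Two namespaces share one XSD, so only one file would need loading.
files_to_load = known.values.uniq

puts known.size, unknown.inspect, files_to_load.inspect
```

Deduplicating the XSD paths before calling load_schema is optional, since load_schema caches by relative path, but it makes the set of schemas involved explicit.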