Class: Uniword::Validation::SchemaRegistry
- Inherits: Object
- Hierarchy: Object > Uniword::Validation::SchemaRegistry
- Defined in: lib/uniword/validation/schema_registry.rb
Overview
Maps XML namespace URIs to XSD schema files and loads schemas for validation.
Uses Moxml for namespace detection (parsing XML to find declared namespaces) and Nokogiri::XML::Schema for actual XSD validation.
The registry knows which OOXML namespace URIs correspond to which XSD files bundled in data/schemas/.
Constant Summary
- NAMESPACE_XSD_MAP =
Map of namespace URI => XSD file (relative to data/schemas/)
    {
      # Base WordprocessingML (ISO 29500)
      "http://schemas.openxmlformats.org/wordprocessingml/2006/main" => "microsoft/wml-2010.xsd",
      # Microsoft versioned extensions
      "http://schemas.microsoft.com/office/word/2010/wordml" => "microsoft/wml-2010.xsd",
      "http://schemas.microsoft.com/office/word/2012/wordml" => "microsoft/wml-2012.xsd",
      "http://schemas.microsoft.com/office/word/2015/wordml/symex" => "microsoft/wml-symex-2015.xsd",
      "http://schemas.microsoft.com/office/word/2016/wordml/cid" => "microsoft/wml-cid-2016.xsd",
      "http://schemas.microsoft.com/office/word/2018/wordml" => "microsoft/wml-2018.xsd",
      "http://schemas.microsoft.com/office/word/2018/wordml/cex" => "microsoft/wml-cex-2018.xsd",
      "http://schemas.microsoft.com/office/word/2020/wordml/sdtdatahash" => "microsoft/wml-sdtdatahash-2020.xsd",
      # Markup Compatibility
      "http://schemas.openxmlformats.org/markup-compatibility/2006" => "mce/mc.xsd",
      # OPC schemas
      "http://schemas.openxmlformats.org/package/2006/content-types" => "ecma/opc-contentTypes.xsd",
      "http://schemas.openxmlformats.org/package/2006/relationships" => "ecma/opc-relationships.xsd",
    }.freeze
- WORDML_PARTS =
Parts that use WordprocessingML schemas
    %w[
      word/document.xml
      word/styles.xml
      word/settings.xml
      word/fontTable.xml
      word/numbering.xml
      word/footnotes.xml
      word/endnotes.xml
      word/comments.xml
    ].freeze
- HEADER_FOOTER_PATTERN =
Pattern for header/footer parts
%r{\Aword/(header|footer)\d*\.xml\z}
- THEME_PATTERN =
Pattern for theme parts
%r{\Aword/theme/theme\d+\.xml\z}
- RELS_PATTERN =
Pattern for relationship parts
%r{_rels/.*\.rels\z}
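The three patterns differ in how strictly they anchor and number part names. The sketch below wraps copies of them in hypothetical predicate helpers (the helper names are not part of the library) to show which edge cases match.

```ruby
# Hypothetical helpers wrapping copies of the documented patterns.
def header_or_footer_part?(name)
  name.match?(%r{\Aword/(header|footer)\d*\.xml\z})
end

def theme_part?(name)
  name.match?(%r{\Aword/theme/theme\d+\.xml\z})
end

def rels_part?(name)
  name.match?(%r{_rels/.*\.rels\z})
end

# \d* accepts zero digits, so an unnumbered header matches too.
puts header_or_footer_part?("word/header1.xml")   # true
puts header_or_footer_part?("word/header.xml")    # true

# \d+ requires at least one digit for theme parts.
puts theme_part?("word/theme/theme1.xml")         # true
puts theme_part?("word/theme/theme.xml")          # false

# No leading \A, so .rels files match at any depth in the package.
puts rels_part?("_rels/.rels")                    # true
puts rels_part?("word/_rels/document.xml.rels")   # true
```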
Instance Attribute Summary
- #schemas_dir ⇒ Object (readonly)
  Returns the value of attribute schemas_dir.
Instance Method Summary
- #detect_ignorable(xml_content) ⇒ Array<String>
  Detect mc:Ignorable prefixes from XML content.
- #detect_namespaces(xml_content) ⇒ Array<String>
  Detect namespace URIs from XML content.
- #initialize(schemas_dir: nil) ⇒ SchemaRegistry (constructor)
  A new instance of SchemaRegistry.
- #known_namespace?(uri) ⇒ Boolean
  Check if a namespace URI has a known XSD schema.
- #load_schema(xsd_relative_path) ⇒ Nokogiri::XML::Schema
  Load and cache an XSD schema for validation.
- #primary_schema_for_part(part_name) ⇒ String?
  Determine the primary XSD schema for a given XML part.
- #unknown_namespaces(ns_uris) ⇒ Array<String>
  Return unknown namespace URIs from a set.
- #xsd_map_for_namespaces(ns_uris) ⇒ Hash<String, String>
  Map namespace URIs to their corresponding XSD files.
Constructor Details
#initialize(schemas_dir: nil) ⇒ SchemaRegistry
Returns a new instance of SchemaRegistry.
    # File 'lib/uniword/validation/schema_registry.rb', line 78

    def initialize(schemas_dir: nil)
      @schemas_dir = schemas_dir || default_schemas_dir
      @moxml = Moxml.new(:nokogiri)
      @schema_cache = {}
    end
Instance Attribute Details
#schemas_dir ⇒ Object (readonly)
Returns the value of attribute schemas_dir.
    # File 'lib/uniword/validation/schema_registry.rb', line 76

    def schemas_dir
      @schemas_dir
    end
Instance Method Details
#detect_ignorable(xml_content) ⇒ Array<String>
Detect mc:Ignorable prefixes from XML content.
    # File 'lib/uniword/validation/schema_registry.rb', line 100

    def detect_ignorable(xml_content)
      doc = @moxml.parse(xml_content)
      root = doc.root
      ignorable = root["Ignorable"] || root["mc:Ignorable"]
      return [] unless ignorable

      ignorable.split(/\s+/).reject(&:empty?)
    rescue StandardError => e
      Uniword.logger&.debug { "Ignorable detection failed: #{e.message}" }
      []
    end
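The split step in detect_ignorable can be tried in isolation. The sketch below uses a hypothetical helper, split_ignorable, that mirrors only that one line. It shows why the reject(&:empty?) call matters: splitting on a regular expression keeps an empty leading token when the attribute value starts with whitespace.

```ruby
# Hypothetical helper mirroring the split step in detect_ignorable.
def split_ignorable(value)
  value.split(/\s+/).reject(&:empty?)
end

p split_ignorable("w14 w15 wp14")   # ["w14", "w15", "wp14"]

# Leading whitespace produces an empty first token when splitting on a Regexp.
p " w14  w15".split(/\s+/)          # ["", "w14", "w15"]
p split_ignorable(" w14  w15")      # ["w14", "w15"]
```

Note that the result is a list of namespace prefixes, not URIs.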
#detect_namespaces(xml_content) ⇒ Array<String>
Detect namespace URIs from XML content.
    # File 'lib/uniword/validation/schema_registry.rb', line 88

    def detect_namespaces(xml_content)
      doc = @moxml.parse(xml_content)
      doc.root.namespaces.map(&:uri)
    rescue StandardError => e
      Uniword.logger&.debug { "Namespace detection failed: #{e.message}" }
      []
    end
#known_namespace?(uri) ⇒ Boolean
Check if a namespace URI has a known XSD schema.
    # File 'lib/uniword/validation/schema_registry.rb', line 181

    def known_namespace?(uri)
      NAMESPACE_XSD_MAP.key?(uri)
    end
#load_schema(xsd_relative_path) ⇒ Nokogiri::XML::Schema
Load and cache an XSD schema for validation. Raises ArgumentError if the XSD file does not exist under schemas_dir.
    # File 'lib/uniword/validation/schema_registry.rb', line 138

    def load_schema(xsd_relative_path)
      return @schema_cache[xsd_relative_path] if @schema_cache.key?(xsd_relative_path)

      xsd_path = File.join(schemas_dir, xsd_relative_path)
      unless File.exist?(xsd_path)
        raise ArgumentError, "XSD schema not found: #{xsd_path}"
      end

      # Nokogiri::XML::Schema resolves relative imports from CWD.
      # We chdir to the schema file's own directory so imports like
      # "word12.xsd" (from microsoft/wml-2010.xsd) resolve correctly.
      schema = nil
      original_dir = Dir.pwd
      schema_dir = File.dirname(xsd_path)
      begin
        Dir.chdir(schema_dir)
        schema = Nokogiri::XML::Schema(File.read(File.basename(xsd_path)))
      ensure
        Dir.chdir(original_dir)
      end

      @schema_cache[xsd_relative_path] = schema
      schema
    end
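The chdir-and-restore step in load_schema is a general idiom that can be exercised without Nokogiri. The following is a minimal sketch using only the standard library. The helper names in_directory and demo_relative_read and the file name main.xsd are hypothetical, chosen for illustration.

```ruby
require "tmpdir"

# Hypothetical helper: run a block with the working directory set to +dir+,
# restoring the previous directory even if the block raises.
def in_directory(dir)
  original_dir = Dir.pwd
  begin
    Dir.chdir(dir)
    yield
  ensure
    Dir.chdir(original_dir)
  end
end

# Writes a placeholder file into a temp directory, reads it back through a
# relative path, and reports whether the working directory was restored.
def demo_relative_read
  Dir.mktmpdir do |tmp|
    File.write(File.join(tmp, "main.xsd"), "<schema/>")
    before = Dir.pwd
    content = in_directory(tmp) { File.read("main.xsd") }
    [content, Dir.pwd == before]
  end
end

p demo_relative_read
```

The block form Dir.chdir(dir) { ... } restores the previous directory in the same way. Either way, the working directory is process-wide state, so callers running several threads may want to serialize schema loading.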
#primary_schema_for_part(part_name) ⇒ String?
Determine the primary XSD schema for a given XML part.
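For Word parts (document.xml, styles.xml, headers, footers, etc.), returns microsoft/wml-2010.xsd, which imports the base wml.xsd and all extension schemas. For relationship and content type parts, returns the corresponding OPC schema; for theme parts, returns iso/dml-main.xsd. Returns nil when the part name matches none of these.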
    # File 'lib/uniword/validation/schema_registry.rb', line 121

    def primary_schema_for_part(part_name)
      case part_name
      when "[Content_Types].xml"
        "ecma/opc-contentTypes.xsd"
      when "_rels/.rels", ->(n) { n.match?(RELS_PATTERN) }
        "ecma/opc-relationships.xsd"
      when *WORDML_PARTS, ->(n) { n.match?(HEADER_FOOTER_PATTERN) }
        "microsoft/wml-2010.xsd"
      when ->(n) { n.match?(THEME_PATTERN) }
        "iso/dml-main.xsd"
      end
    end
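The method mixes literal strings and lambdas in the same when clause. This works because case compares with ===, and Proc#=== calls the lambda with the value under test. The sketch below is a hypothetical, cut-down dispatcher (schema_for is not a library method) that shows the mechanism and the nil fallthrough.

```ruby
# Hypothetical two-branch dispatcher illustrating String and lambda matchers
# side by side in a `when` clause.
def schema_for(part_name)
  case part_name
  when "[Content_Types].xml"
    "ecma/opc-contentTypes.xsd"
  when "word/document.xml", ->(n) { n.match?(%r{\Aword/(header|footer)\d*\.xml\z}) }
    "microsoft/wml-2010.xsd"
  end
end

p schema_for("word/document.xml")      # matched by the literal string
p schema_for("word/footer2.xml")       # matched by the lambda
p schema_for("word/media/image1.png")  # no branch matches, so nil
```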
#unknown_namespaces(ns_uris) ⇒ Array<String>
Return unknown namespace URIs from a set.
    # File 'lib/uniword/validation/schema_registry.rb', line 189

    def unknown_namespaces(ns_uris)
      ns_uris.reject { |uri| known_namespace?(uri) }
    end
#xsd_map_for_namespaces(ns_uris) ⇒ Hash<String, String>
Map namespace URIs to their corresponding XSD files.
    # File 'lib/uniword/validation/schema_registry.rb', line 168

    def xsd_map_for_namespaces(ns_uris)
      result = {}
      ns_uris.each do |uri|
        xsd = NAMESPACE_XSD_MAP[uri]
        result[uri] = xsd if xsd
      end
      result
    end
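xsd_map_for_namespaces and unknown_namespaces split one list of URIs into two complementary results. The sketch below reproduces that split in a single pass over a hypothetical two-entry subset of NAMESPACE_XSD_MAP. The helper names sample_xsd_map and split_namespaces and the URI urn:example:custom are illustrative assumptions, not part of the library.

```ruby
# Hypothetical two-entry subset of NAMESPACE_XSD_MAP, for illustration only.
def sample_xsd_map
  {
    "http://schemas.openxmlformats.org/wordprocessingml/2006/main" => "microsoft/wml-2010.xsd",
    "http://schemas.openxmlformats.org/markup-compatibility/2006" => "mce/mc.xsd"
  }
end

# Returns [known URI => XSD hash, array of unknown URIs].
def split_namespaces(ns_uris, map = sample_xsd_map)
  known, unknown = ns_uris.partition { |uri| map.key?(uri) }
  [map.slice(*known), unknown]
end

mapped, unknown = split_namespaces([
  "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
  "urn:example:custom"
])
p mapped
p unknown
```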