Class: Lutaml::Xml::EncodingNormalizer

Inherits:
Object
Defined in:
lib/lutaml/xml/encoding_normalizer.rb

Overview

EncodingNormalizer ensures all XML text content is normalized to UTF-8 internally, regardless of source encoding or adapter used.

This provides:

  • Consistent developer experience across adapters

  • UTF-8 as internal encoding (Ruby’s default)

  • Ability to output in any encoding on serialization
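The points above can be pictured as a round trip: text arrives in some source encoding, is held as UTF-8 internally, and is re-encoded only at output time. The following is a minimal sketch using only Ruby's core `String#encode`; it does not call into Lutaml, and the variable names are purely illustrative.

```ruby
# Plain-Ruby sketch of the round trip; no Lutaml code is involved.

# Bytes as they might arrive from a Shift_JIS document.
source = "手書き英字".encode("Shift_JIS")

# Internal representation: always UTF-8.
internal = source.encode(Encoding::UTF_8)

# Re-encode only when serializing back out.
output = internal.encode("Shift_JIS")

puts internal.encoding        # the internal string is tagged UTF-8
puts output.bytes == source.bytes
```

Keeping a single internal encoding means that string comparison, concatenation, and pattern matching behave the same way whichever adapter produced the text.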

Examples:

Normalize Shift_JIS to UTF-8

content = "手書き英字".encode("Shift_JIS")
normalized = EncodingNormalizer.normalize_to_utf8(content)
normalized.encoding # => Encoding::UTF_8

Class Method Details

.normalize_to_utf8(content, source_encoding: nil) ⇒ String

Normalize text content to UTF-8 for internal consistency

Parameters:

  • content (String)

    Text content from XML adapter

  • source_encoding (String, Encoding, nil) (defaults to: nil)

    Source encoding if known

Returns:

  • (String)

    UTF-8 encoded string. The original content is returned unchanged if it is nil, empty, or already valid UTF-8
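To make the pass-through cases concrete, the sketch below reproduces the two guard checks from the method body in a standalone helper. The name `already_normalized?` is hypothetical and is not part of the library; it only illustrates which inputs would be returned as-is.

```ruby
# Hypothetical helper mirroring the early-return guards; not a Lutaml API.
def already_normalized?(content)
  return true if content.nil? || content.empty?

  content.encoding == Encoding::UTF_8 && content.valid_encoding?
end

utf8_text   = "héllo"                              # valid UTF-8 literal
latin1_text = "héllo".encode("ISO-8859-1")         # different encoding tag
broken_utf8 = "\xFF".b.force_encoding("UTF-8")     # UTF-8 tag, invalid bytes

puts already_normalized?(nil)
puts already_normalized?(utf8_text)
puts already_normalized?(latin1_text)
puts already_normalized?(broken_utf8)
```

Only the last two inputs would go on to the conversion step: one because its encoding tag is not UTF-8, the other because its bytes are not valid UTF-8 despite the tag.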



# File 'lib/lutaml/xml/encoding_normalizer.rb', line 24

def self.normalize_to_utf8(content, source_encoding: nil)
  return content if content.nil? || content.empty?

  # Return content if already valid UTF-8
  if content.encoding == Encoding::UTF_8 && content.valid_encoding?
    return content
  end

  # Determine source encoding
  encoding = resolve_encoding(content, source_encoding)

  # Convert to UTF-8
  content.encode(Encoding::UTF_8, encoding,
                 invalid: :replace,
                 undef: :replace,
                 replace: "?")
rescue Encoding::UndefinedConversionError,
       Encoding::InvalidByteSequenceError
  # Fallback: force UTF-8 encoding and scrub invalid bytes
  content.force_encoding(Encoding::UTF_8).scrub("?")
end
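The two paths in the listing above (transcoding from a resolved source encoding, and the scrub fallback) can be tried in isolation with core Ruby string methods. This is a sketch under stated assumptions: the private `resolve_encoding` helper is not shown on this page, so the source encoding is supplied by hand here, and the sample byte strings are made up for illustration.

```ruby
# Conversion path: bytes tagged with a known source encoding are transcoded.
# The source encoding is chosen by hand; the library's resolve_encoding
# helper is not reproduced here.
latin1 = "caf\xE9".b.force_encoding(Encoding::ISO_8859_1)
converted = latin1.encode(Encoding::UTF_8, Encoding::ISO_8859_1,
                          invalid: :replace,
                          undef: :replace,
                          replace: "?")

# Fallback path: retag the bytes as UTF-8 and replace anything invalid.
# dup is used in this sketch so the original string keeps its own tag.
raw = "abc\xFFdef".b
scrubbed = raw.dup.force_encoding(Encoding::UTF_8).scrub("?")

puts converted
puts scrubbed
puts raw.encoding
```

`String#force_encoding` changes the receiver's encoding tag in place rather than returning a copy, which is why the sketch calls `dup` first; callers who pass frozen or shared strings may want to keep that in mind.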