Class: Moxml::EntityRegistry

Inherits:
Object
Defined in:
lib/moxml/entity_registry.rb

Overview

EntityRegistry maintains a knowledge base of XML entity definitions.

Data source: W3C XML Core WG Character Entities (bundled) www.w3.org/2003/entities/2007/htmlmathml

The W3C entity data is bundled in data/w3c_entities.json and loaded from the gem’s data directory. For development, MOXML_ENTITY_DEFINITIONS_PATH can be set to an external copy.

Per W3C XML Core WG guidance:

  • Character entities are XML internal general entities providing a name for a single Unicode character

  • Standard XML entities (amp, lt, gt, quot, apos) are implicitly declared per XML specification

  • External entity sets (like HTML, MathML) can be referenced via DTD parameter entities

Examples:

Basic usage

registry = Moxml::EntityRegistry.new
registry.declared?("amp")  # => true
registry.codepoint_for_name("amp")  # => 38

Defined Under Namespace

Classes: EntityDataError

Constant Summary

ENTITY_DATA_FILE =

W3C entity data file name

"w3c_entities.json"
STANDARD_CODEPOINTS =

Codepoints of the five predefined XML entities amp, lt, gt, quot, apos (XML spec §4.6)

Set[0x26, 0x3C, 0x3E, 0x22, 0x27].freeze
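For reference, the sketch below renders the five codepoints as the characters they stand for. It is plain Ruby and does not load Moxml; STANDARD_SKETCH and PREDEFINED_SKETCH are names made up for this illustration, and the name-to-codepoint Hash is written out by hand rather than taken from the library.

```ruby
require "set"

# Same literal as STANDARD_CODEPOINTS above, under an illustrative name.
STANDARD_SKETCH = Set[0x26, 0x3C, 0x3E, 0x22, 0x27].freeze

# Hand-written pairing of the predefined entity names with their codepoints;
# this Hash is not a constant of the library.
PREDEFINED_SKETCH = {
  "amp" => 0x26, "lt" => 0x3C, "gt" => 0x3E, "quot" => 0x22, "apos" => 0x27
}.freeze

# Render each codepoint as the character it stands for.
STANDARD_SKETCH.map { |cp| cp.chr(Encoding::UTF_8) }
# => ["&", "<", ">", "\"", "'"]

# Every predefined name resolves to a member of the set.
PREDEFINED_SKETCH.values.all? { |cp| STANDARD_SKETCH.include?(cp) }
# => true
```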


Constructor Details

#initialize(mode: :required, entity_provider: nil) ⇒ EntityRegistry

Returns a new instance of EntityRegistry.

Parameters:

  • mode (Symbol) (defaults to: :required)

    Loading mode: :required, :optional, :disabled, :custom

  • entity_provider (Proc, nil) (defaults to: nil)

    Custom entity provider proc/lambda



# File 'lib/moxml/entity_registry.rb', line 114

def initialize(mode: :required, entity_provider: nil)
  @by_name = {}
  @by_codepoint = Hash.new { |h, k| h[k] = [] }
  @mode = mode
  @entity_provider = entity_provider

  case mode
  when :required
    load_from_entity_data
  when :optional
    load_from_entity_data_optional
  when :custom
    load_custom_entities
  when :disabled
    # Don't load anything - empty registry
  end
end
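The mode dispatch can be tried out with a small stand-in class. This is a sketch under assumptions, not Moxml's implementation: ModeSketch and its BUNDLED Hash are hypothetical, the private loaders are not shown on this page, and the difference between :required and :optional is therefore not modeled. The sketch also assumes that the custom provider returns a name => codepoint Hash. One thing that does follow from the listing above: the case statement has no else branch, so an unrecognized mode leaves the registry empty.

```ruby
# Stand-in for illustration only; not Moxml's implementation.
class ModeSketch
  # Hypothetical replacement for the bundled W3C data.
  BUNDLED = { "amp" => 0x26, "nbsp" => 0xA0 }.freeze

  attr_reader :by_name

  def initialize(mode: :required, entity_provider: nil)
    @by_name = {}

    case mode
    when :required, :optional
      # The real loaders are private and not shown; both are modeled alike here.
      @by_name.merge!(BUNDLED)
    when :custom
      # Assumption: the provider returns a name => codepoint Hash.
      @by_name.merge!(entity_provider.call) if entity_provider
    when :disabled
      # Nothing is loaded.
    end
  end

  def declared?(name)
    @by_name.key?(name)
  end
end

ModeSketch.new.declared?("amp")                   # => true
ModeSketch.new(mode: :disabled).declared?("amp")  # => false

provider = -> { { "copy" => 0xA9 } }
ModeSketch.new(mode: :custom, entity_provider: provider).declared?("copy") # => true

# No branch matches, so the registry stays empty.
ModeSketch.new(mode: :typo).by_name.empty?        # => true
```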

Instance Attribute Details

#by_codepoint ⇒ Hash{Integer => Array<String>} (readonly)

Returns codepoint to entity names mapping.

Returns:

  • (Hash{Integer => Array<String>})

    codepoint to entity names mapping



# File 'lib/moxml/entity_registry.rb', line 110

def by_codepoint
  @by_codepoint
end

#by_name ⇒ Hash{String => Integer} (readonly)

Returns entity name to codepoint mapping.

Returns:

  • (Hash{String => Integer})

    entity name to codepoint mapping



# File 'lib/moxml/entity_registry.rb', line 107

def by_name
  @by_name
end

Class Method Details

.default ⇒ EntityRegistry

Get the default registry instance (lazy loaded)

Returns:

  • (EntityRegistry)



# File 'lib/moxml/entity_registry.rb', line 42

def default
  @default ||= new
end

.entity_data ⇒ Hash{String => String}

Get the raw entity data from the bundled JSON source

Returns:

  • (Hash{String => String})

    entity name to character mapping



# File 'lib/moxml/entity_registry.rb', line 36

def entity_data
  @entity_data ||= load_entity_data
end

.reset ⇒ void

This method returns an undefined value.

Reset the default registry (mainly for testing)



# File 'lib/moxml/entity_registry.rb', line 48

def reset
  @default = nil
  @entity_data = nil
end
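The interplay of default and reset can be sketched with a stand-in class. LazyDefaultSketch is a hypothetical name; it copies only the memoization pattern from the two listings above and leaves out the @entity_data cache that the real reset also clears.

```ruby
# Stand-in for illustration only; mirrors the memoization in .default and .reset.
class LazyDefaultSketch
  class << self
    # Built on first access, then reused.
    def default
      @default ||= new
    end

    # Drop the cached instance so the next call to default builds a new one.
    def reset
      @default = nil
    end
  end
end

first = LazyDefaultSketch.default
first.equal?(LazyDefaultSketch.default)  # => true, the same object is reused

LazyDefaultSketch.reset
first.equal?(LazyDefaultSketch.default)  # => false, a new instance was built
```

In a test suite, calling reset between examples keeps state registered on the shared instance from leaking into later examples.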

Instance Method Details

#clear! ⇒ self

Clear all entities (reset to empty)

Returns:

  • (self)


# File 'lib/moxml/entity_registry.rb', line 236

def clear!
  @by_name = {}
  @by_codepoint = Hash.new { |h, k| h[k] = [] }
  self
end

#codepoint_for_name(name) ⇒ Integer?

Get the Unicode codepoint for an entity name

Parameters:

  • name (String)

    entity name

Returns:

  • (Integer, nil)

    codepoint or nil if not found



# File 'lib/moxml/entity_registry.rb', line 142

def codepoint_for_name(name)
  @by_name[name]
end

#declared?(name) ⇒ Boolean

Check if an entity name is declared

Parameters:

  • name (String)

    entity name (e.g., “amp”, “nbsp”)

Returns:

  • (Boolean)


# File 'lib/moxml/entity_registry.rb', line 135

def declared?(name)
  @by_name.key?(name)
end

#load_all ⇒ self

Load all standard entity sets

Returns:

  • (self)


# File 'lib/moxml/entity_registry.rb', line 229

def load_all
  # All entities are loaded by default from initialize
  self
end

#load_html5 ⇒ self

Load all entities from the W3C HTMLMathML entity set. These entities are already loaded by initialize, so the method body only returns self.

Returns:

  • (self)


# File 'lib/moxml/entity_registry.rb', line 207

def load_html5
  # All entities are loaded by default from initialize
  self
end

#load_iso(_set_name = :iso8879) ⇒ self

Load ISO entity sets (included in HTMLMathML)

Parameters:

  • _set_name (Symbol) (defaults to: :iso8879)

    (ignored, all loaded together)

Returns:

  • (self)


# File 'lib/moxml/entity_registry.rb', line 222

def load_iso(_set_name = :iso8879)
  # All entities are loaded by default from initialize
  self
end

#load_mathml ⇒ self

Load MathML entity set (included in HTMLMathML)

Returns:

  • (self)


# File 'lib/moxml/entity_registry.rb', line 214

def load_mathml
  # All entities are loaded by default from initialize
  self
end

#names_for_codepoint(codepoint) ⇒ Array<String>

Get all entity names for a codepoint

Parameters:

  • codepoint (Integer)

    Unicode codepoint

Returns:

  • (Array<String>)

    entity names mapping to this codepoint



# File 'lib/moxml/entity_registry.rb', line 149

def names_for_codepoint(codepoint)
  @by_codepoint[codepoint]
end

#primary_name_for_codepoint(codepoint) ⇒ String?

Get the primary (preferred) entity name for a codepoint

Parameters:

  • codepoint (Integer)

    Unicode codepoint

Returns:

  • (String, nil)

    primary entity name or nil



# File 'lib/moxml/entity_registry.rb', line 156

def primary_name_for_codepoint(codepoint)
  @by_codepoint[codepoint]&.first
end
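Both lookups read from a Hash created with a default block, which has two consequences that can be checked in plain Ruby. The helper new_codepoint_index below is a hypothetical name that only reproduces the Hash.new call from the constructor; the entity names are illustrative.

```ruby
# Builds an index the same way the constructor does.
def new_codepoint_index
  Hash.new { |hash, key| hash[key] = [] }
end

index = new_codepoint_index
index[0x26] << "amp" << "AMP"

index[0x26]        # => ["amp", "AMP"]
index[0x26].first  # => "amp", the first registered name is the primary one

# An unknown codepoint yields an empty Array rather than nil, so the primary
# name is nil only because Array#first returns nil on an empty Array.
index[0x41]        # => []
index[0x41].first  # => nil

# The default block stores the key as a side effect of the lookup.
index.key?(0x41)   # => true
```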

#register(entities) ⇒ self

Register additional entities

Parameters:

  • entities (Hash{String => Integer})

    name => codepoint mapping

Returns:

  • (self)


# File 'lib/moxml/entity_registry.rb', line 195

def register(entities)
  entities.each do |name, codepoint|
    @by_name[name] = codepoint
    @by_codepoint[codepoint] ||= []
    @by_codepoint[codepoint] << name unless @by_codepoint[codepoint].include?(name)
  end
  self
end
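The effect of register on both indexes can be sketched with a stand-in class. RegisterSketch is a hypothetical name; its register body is copied from the listing above, and the entity names are illustrative. The last lines show a detail that follows from that body: re-registering a name under a different codepoint updates by_name but leaves the old reverse entry in by_codepoint.

```ruby
# Stand-in for illustration only; register is copied from the listing above.
class RegisterSketch
  attr_reader :by_name, :by_codepoint

  def initialize
    @by_name = {}
    @by_codepoint = Hash.new { |h, k| h[k] = [] }
  end

  def register(entities)
    entities.each do |name, codepoint|
      @by_name[name] = codepoint
      @by_codepoint[codepoint] ||= []
      @by_codepoint[codepoint] << name unless @by_codepoint[codepoint].include?(name)
    end
    self
  end
end

registry = RegisterSketch.new

# register returns self, so calls chain; duplicates are not appended twice.
registry.register("nbsp" => 0xA0, "NonBreakingSpace" => 0xA0).register("nbsp" => 0xA0)
registry.by_codepoint[0xA0]  # => ["nbsp", "NonBreakingSpace"]

# Moving a name to another codepoint does not remove the old reverse entry.
registry.register("x" => 1).register("x" => 2)
registry.by_name["x"]        # => 2
registry.by_codepoint[1]     # => ["x"]
registry.by_codepoint[2]     # => ["x"]
```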

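#restorable_codepoints ⇒ Set<Integer>

Returns the set of codepoints that could potentially be restored as entities: every registered codepoint, or STANDARD_CODEPOINTS when the registry is empty. The result is memoized on first call. Used by DocumentBuilder for O(1) fast-path checks.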

Returns:

  • (Set<Integer>)


# File 'lib/moxml/entity_registry.rb', line 184

def restorable_codepoints
  @restorable_codepoints ||= if @by_name.empty?
                               STANDARD_CODEPOINTS
                             else
                               Set.new(@by_name.values).freeze
                             end
end
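The fallback and the memoization can be sketched with a stand-in class. RestorableSketch is a hypothetical name; restorable_codepoints is copied from the listing above, while register is reduced to a merge. Going by the listings on this page, neither register nor clear! resets the cached set, so the sketch also shows the set going stale when it is read before a later registration. Whether that matters in practice depends on call order inside the library, which is not shown here.

```ruby
require "set"

# Stand-in for illustration only; restorable_codepoints is copied from above.
class RestorableSketch
  STANDARD_CODEPOINTS = Set[0x26, 0x3C, 0x3E, 0x22, 0x27].freeze

  def initialize
    @by_name = {}
  end

  # Simplified: only the name => codepoint index is maintained.
  def register(entities)
    @by_name.merge!(entities)
    self
  end

  def restorable_codepoints
    @restorable_codepoints ||= if @by_name.empty?
                                 STANDARD_CODEPOINTS
                               else
                                 Set.new(@by_name.values).freeze
                               end
  end
end

# Empty registry: falls back to the five standard codepoints.
RestorableSketch.new.restorable_codepoints.size  # => 5

# Populated before the first read: the set reflects the registered codepoints.
RestorableSketch.new.register("nbsp" => 0xA0).restorable_codepoints  # contains only 0xA0

# Read first, then registered: the cached set is returned unchanged.
stale = RestorableSketch.new
stale.restorable_codepoints
stale.register("nbsp" => 0xA0)
stale.restorable_codepoints.include?(0xA0)  # => false
```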

#should_restore?(codepoint, config:) ⇒ Boolean

Determine if an entity reference should be restored for a codepoint. Returns false when the registry has no name for the codepoint. Otherwise, standard XML entities are always restored (required by XML spec), and non-standard entities are restored only when restore_entities is enabled in the config.

Parameters:

  • codepoint (Integer)

    Unicode codepoint

  • config (Moxml::Config)

    configuration object

Returns:

  • (Boolean)


# File 'lib/moxml/entity_registry.rb', line 173

def should_restore?(codepoint, config:)
  name = primary_name_for_codepoint(codepoint)
  return false unless name
  return true if standard_entity?(codepoint)

  config.restore_entities
end
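The decision order can be sketched with stand-ins. RestoreSketch and ConfigSketch are hypothetical names; ConfigSketch models only the restore_entities reader of Moxml::Config, and the reverse index is passed in directly instead of being loaded from entity data.

```ruby
require "set"

# Hypothetical stand-in for Moxml::Config; only restore_entities is modeled.
ConfigSketch = Struct.new(:restore_entities)

# Stand-in for illustration only; the three checks follow the listing above.
class RestoreSketch
  STANDARD_CODEPOINTS = Set[0x26, 0x3C, 0x3E, 0x22, 0x27].freeze

  def initialize(by_codepoint)
    @by_codepoint = by_codepoint
  end

  def should_restore?(codepoint, config:)
    name = @by_codepoint.fetch(codepoint, []).first
    return false unless name                                # 1. unknown codepoint
    return true if STANDARD_CODEPOINTS.include?(codepoint)  # 2. predefined entity

    config.restore_entities                                 # 3. config decides
  end
end

registry = RestoreSketch.new({ 0x26 => ["amp"], 0xA0 => ["nbsp"] })
off = ConfigSketch.new(false)
on  = ConfigSketch.new(true)

registry.should_restore?(0x26, config: off)  # => true, standard entity
registry.should_restore?(0xA0, config: off)  # => false, restore_entities disabled
registry.should_restore?(0xA0, config: on)   # => true
registry.should_restore?(0x41, config: on)   # => false, no entity name for "A"

# The name check runs first, so an empty registry restores nothing.
RestoreSketch.new({}).should_restore?(0x26, config: on)  # => false
```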

#standard_entity?(codepoint) ⇒ Boolean

Check if a codepoint is one of the 5 standard XML predefined entities

Parameters:

  • codepoint (Integer)

    Unicode codepoint

Returns:

  • (Boolean)


# File 'lib/moxml/entity_registry.rb', line 163

def standard_entity?(codepoint)
  STANDARD_CODEPOINTS.include?(codepoint)
end