Module: Relaton::Bib::Sanitizer

Defined in:
lib/relaton/bib/sanitizer.rb

Overview

Strips inline markup not in the basicdoc PureTextElement set (plus <p>, <eref>, <xref>, <fn>) from raw marked-up content strings. Disallowed elements are unwrapped: tags removed, inner text kept.

<fn> is admitted beyond strict PureTextElement because bibliographic titles in real Metanorma input routinely carry footnotes (e.g. ISO standard titles with a disclaimer footnote), and downstream consumers, notably relaton-render's own inline-tag allow-list, already accept <fn> as a legitimate child of <title>. Stripping it here would break the round-trip.

OPAQUE elements (currently <stem>) are also allowed, but the sanitiser does not descend into them: their contents are out-of-band inline notation (MathML, AsciiMath, LaTeX) rather than basicdoc markup, and must be preserved verbatim. Without the opaque-skip, the recursive walk would unwrap MathML / AsciiMath elements down to bare text nodes — see #116 for the round-trip-loss symptom.

Constant Summary

ALLOWED =
%w[
  em strong sub sup tt underline strike smallcap br stem
  p eref xref fn
].freeze
OPAQUE =

Elements whose children are non-basicdoc inline notation (MathML, AsciiMath, LaTeX, …) and must be preserved verbatim rather than sanitised against ALLOWED.

%w[stem].freeze
RENAME =
{
  "italic" => "em",
}.freeze
TAG_RX =
%r{<[a-zA-Z/!?]}
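
The constants above can be read together as a small decision table. The sketch below is a dependency-free illustration only: classify is a hypothetical helper, and it assumes that RENAME is applied before the ALLOWED and OPAQUE checks, which this page does not confirm. It also shows which strings TAG_RX matches.

```ruby
ALLOWED = %w[
  em strong sub sup tt underline strike smallcap br stem
  p eref xref fn
].freeze
OPAQUE = %w[stem].freeze
RENAME = { "italic" => "em" }.freeze
TAG_RX = %r{<[a-zA-Z/!?]}

# Hypothetical helper: decide what would happen to an element name.
# Assumption: renaming happens before the allow-list lookup.
def classify(name)
  name = RENAME.fetch(name, name)
  return [:opaque, name] if OPAQUE.include?(name)
  return [:keep, name] if ALLOWED.include?(name)

  [:unwrap, name]
end

p classify("italic") # => [:keep, "em"]
p classify("stem")   # => [:opaque, "stem"]
p classify("span")   # => [:unwrap, "span"]

# TAG_RX matches "<" only when it is followed by a letter, "/", "!" or "?".
p TAG_RX.match?("Plain title, 2 < 3")         # => false
p TAG_RX.match?("Title with <em>markup</em>") # => true
p TAG_RX.match?("<!-- comment -->")           # => true
```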

Class Method Summary

Class Method Details

.sanitize(content) ⇒ Object



# File 'lib/relaton/bib/sanitizer.rb', line 39

def self.sanitize(content)
  return content unless sanitizable?(content)

  fragment = Nokogiri::XML::DocumentFragment.parse(content)
  return content if fragment.errors.any?

  sanitize_children(fragment)
  fragment.children.map { |c| c.to_xml(encoding: "UTF-8") }.join
end