Module: Relaton::Bib::Sanitizer
- Defined in:
- lib/relaton/bib/sanitizer.rb
Overview
Strips inline markup not in the basicdoc PureTextElement set (plus <p>, <eref>, <xref>, <fn>) from raw marked-up content strings. Disallowed elements are unwrapped: tags removed, inner text kept.
<fn> is admitted beyond strict PureTextElement because bibliographic titles in real Metanorma input routinely carry footnotes (e.g. ISO standard titles with a disclaimer footnote), and downstream consumers, notably relaton-render’s own inline-tag allow-list, already accept <fn> as a legitimate child of <title>. Stripping it here would break the round-trip.
OPAQUE elements (currently <stem>) are also allowed, but the sanitiser does not descend into them: their contents are out-of-band inline notation (MathML, AsciiMath, LaTeX) rather than basicdoc markup, and must be preserved verbatim. Without the opaque-skip, the recursive walk would unwrap MathML / AsciiMath elements down to bare text nodes — see #116 for the round-trip-loss symptom.
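The unwrap and opaque-skip rules described in this overview can be illustrated on a toy in-memory tree, without Nokogiri. This is only a sketch of the idea, not the module's code: `ToyElement`, `toy_sanitize` and `toy_to_xml` are hypothetical names, and applying RENAME before the ALLOWED check is an assumption about how that constant is used.

```ruby
# Constants copied from the Constant Summary of this page.
ALLOWED = %w[em strong sub sup tt underline strike smallcap br stem p eref xref fn].freeze
OPAQUE = %w[stem].freeze
RENAME = { "italic" => "em" }.freeze

# Hypothetical toy element; text nodes are represented as plain Strings.
ToyElement = Struct.new(:name, :children)

# Returns a new node list in which disallowed elements are unwrapped.
def toy_sanitize(nodes)
  nodes.flat_map do |node|
    next [node] if node.is_a?(String) # text is always kept

    # Assumption: renaming happens before the allow-list check.
    name = RENAME.fetch(node.name, node.name)
    if OPAQUE.include?(name)
      [ToyElement.new(name, node.children)] # keep verbatim, do not descend
    elsif ALLOWED.include?(name)
      [ToyElement.new(name, toy_sanitize(node.children))]
    else
      toy_sanitize(node.children) # unwrap: drop the tag, keep its content
    end
  end
end

# Minimal serialiser for the toy tree (no attributes, no escaping).
def toy_to_xml(nodes)
  nodes.map do |node|
    next node if node.is_a?(String)

    "<#{node.name}>#{toy_to_xml(node.children)}</#{node.name}>"
  end.join
end

tree = [
  ToyElement.new("span", ["ISO ", ToyElement.new("italic", ["title"])]),
  ToyElement.new("stem", [ToyElement.new("math", [ToyElement.new("mi", ["x"])])]),
]
puts toy_to_xml(toy_sanitize(tree))
# prints: ISO <em>title</em><stem><math><mi>x</mi></math></stem>
```

In this toy model, if "stem" were not treated as opaque, the walk would reach <math> and <mi>, find neither in ALLOWED, and reduce the formula to the bare text "x".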
Constant Summary
- ALLOWED =
%w[ em strong sub sup tt underline strike smallcap br stem p eref xref fn ].freeze
- OPAQUE =
Elements whose children are non-basicdoc inline notation (MathML, AsciiMath, LaTeX, …) and must be preserved verbatim rather than sanitised against ALLOWED.
%w[stem].freeze
- RENAME =
{ "italic" => "em" }.freeze
- TAG_RX =
%r{<[a-zA-Z/!?]}
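TAG_RX matches a "<" immediately followed by a letter, "/", "!" or "?", that is, the start of an opening tag, closing tag, comment/declaration or processing instruction. One plausible use, sketched below, is a cheap pre-check that lets plain strings skip XML parsing. The helper name `looks_like_markup?` is hypothetical; this page does not show how `sanitizable?` is actually implemented.

```ruby
# Copied from the constant list on this page.
TAG_RX = %r{<[a-zA-Z/!?]}

# Hypothetical helper: true when the string contains something tag-like.
def looks_like_markup?(content)
  content.is_a?(String) && content.match?(TAG_RX)
end

p looks_like_markup?("Plain title")         # => false
p looks_like_markup?("a < b and x<3")       # => false ("<" followed by space or digit)
p looks_like_markup?("Water <em>H2O</em>")  # => true
p looks_like_markup?("<!-- note -->")       # => true
p looks_like_markup?(nil)                   # => false
```

The pattern deliberately ignores a bare "<" used as a less-than sign, so ordinary titles containing comparisons would not be treated as markup under this reading.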
Class Method Summary
Class Method Details
.sanitize(content) ⇒ Object
# File 'lib/relaton/bib/sanitizer.rb', line 39

def self.sanitize(content)
  return content unless sanitizable?(content)

  fragment = Nokogiri::XML::DocumentFragment.parse(content)
  return content if fragment.errors.any?

  sanitize_children(fragment)
  fragment.children.map { |c| c.to_xml(encoding: "UTF-8") }.join
end