neu-mods

Northeastern-flavored MODS v3 projection + selection for the DRS, shared by Cerberus (front end) and Atlas (API backend).

It is a Nokogiri-native, dependency-light contract over MODS documents — pure functions over a parsed document, nothing else. No Rails, no persistence, no HTTP. It answers two questions:

  • "Where is X?" Selectors return live Nokogiri nodes, so they serve both the read path (projection reads their text) and the write path (an editor mutates the returned node in place). The node an editor changes is provably the node the projection reads.
  • "What does this project to?" Projection returns plain data (hashes, strings, and arrays; never opaque typed objects) for indexing and display.

It depends on Nokogiri alone — deliberately not the sul-dlss/mods + nom-xml stack (which is sunsetting alongside Stanford's move to Cocina). See the design note in the DRS gap-reports for the full rationale.

Usage

require "neu-mods"

doc = NEU::MODS::Document.parse(xml_string)

# Projection (plain data)
doc.plain_title    # => "What's New - How We Respond to Disaster, Episode 1"
doc.title_parts    # => { non_sort:, subtitle:, title:, part_name:, part_number: }
doc.abstract       # => normalized, paragraph-joined String
doc.topical_subjects # => ["Civil society", ...]   (every <topic>, for the access copy)
doc.keywords       # => [...]   (only the editable attribute-free keyword subjects)
doc.to_h           # => full projection, keyed to Atlas's Metadata::MODS attributes

# Pure title composition (no document needed) — for callers that already hold
# the parts (e.g. Atlas's access-copy model) and must not re-parse XML on read.
NEU::MODS.compose_title(non_sort: "", title: "What's New",
                        part_name: "How We Respond to Disaster", part_number: "Episode 1")
# => "What's New - How We Respond to Disaster, Episode 1"   (== doc.plain_title)

# Selectors (live nodes — for editing)
node = doc.primary_title_info.at_xpath("mods:title", NEU::MODS::NAMESPACE)
node.content = "New Title" unless NEU::MODS.whitespace_equivalent?(node.text, "New Title")
doc.to_xml

# Editable creators (for an "advanced metadata" form): structured read,
# node selection (for replace-on-save), and structure-aware build.
doc.editable_personal_creators   # => [{ given:, family: }]  (plain, Creator role)
doc.editable_corporate_creators  # => [{ name: }]
doc.preserved_names              # => [{ name:, role: }]  (authority-bearing / non-Creator — read-only)
doc.editable_creator_nodes("personal")            # => live <name> nodes to replace
doc.build_personal_name(given: "Jenny", family: "Smith")      # => a plain personal <name> node
doc.build_corporate_name(name: "Northeastern University")     # => a plain corporate <name> node

The "editable creator" set is plain names (no @authority, @authorityURI, or @valueURI) with a Creator role; everything else (authority-controlled or other-role names) is preserved_names, shown read-only. This mirrors the keyword-subject curated-vs-editable split. The role: argument of the build_*_name methods defaults to "Creator" but is parameterised, so a later role-selectable form is non-breaking.

Two normalizers, two jobs

  • NEU::MODS.whitespace_equivalent? / .canonical_ws — the no-op guard: did an edit change anything, or only insignificant whitespace? (Used to avoid minting an unchanged OCFL MODS version.)
  • NEU::MODS.normalize_paragraphs / .normalize — clean curator freetext for the JSON/Solr access copy (dash/smart-punctuation transliteration, control stripping, paragraph handling). The XML preservation copy is never touched.
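
The no-op guard can be sketched in a few lines of plain Ruby. The helper names below are hypothetical and the rule (collapse runs of ASCII whitespace, trim the ends) is an assumption; the gem's `canonical_ws` may treat other characters as insignificant too.

```ruby
# Hypothetical stand-in for canonical_ws: collapse whitespace runs, trim ends.
def canonical_ws_sketch(text)
  text.to_s.gsub(/\s+/, " ").strip
end

# Hypothetical stand-in for whitespace_equivalent?: compare canonical forms.
def whitespace_equivalent_sketch?(left, right)
  canonical_ws_sketch(left) == canonical_ws_sketch(right)
end

# Guard usage: only write (and so only mint a new version) on a real change.
def apply_edit_sketch(current_text, submitted_text)
  return current_text if whitespace_equivalent_sketch?(current_text, submitted_text)

  submitted_text
end
```

The guard compares canonical forms but writes the submitted text verbatim, so canonicalization decides whether to save without altering what is saved.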

Behavior fidelity & known caveats

The projection preserves the behavior of Atlas's prior mods-gem-based extraction, pinned by spec/conformance_spec.rb against work-mods.xml. Two intentional caveats:

  • Name display reproduces the mods gem's display_value_w_date including its quirks (e.g. multiple given nameParts concatenate with no separator), to preserve existing Solr/display output. Cleanups are a deliberate future contract change, not a silent one.
  • Roles & languages read the type="text" term and fall back to the raw code — they are not MARC-relator / ISO-639 translated. Records carrying text forms (the norm) are unaffected; code-only records would differ. Vendoring those lookup tables (or depending on iso-639) is deferred to keep the gem Nokogiri-only and small.

Source convention

Every character-class regex in TextNormalizer is built programmatically from codepoints, so the source stays pure ASCII (no literal smart-quotes/dashes, no raw control bytes). A spec enforces this. Keep it that way.
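
One way to keep such a file pure ASCII is to assemble each character class from integer codepoints at load time. The sketch below assumes hypothetical codepoint lists and helper names; the gem's actual tables and replacement rules are not shown here.

```ruby
# Hypothetical codepoint tables; the real lists in TextNormalizer may differ.
DASH_CODEPOINTS = [0x2013, 0x2014].freeze         # en dash, em dash
SMART_QUOTE_CODEPOINTS = [0x2018, 0x2019].freeze  # left/right single quotation marks

# Build a character class like [..] from integers, so no literal non-ASCII
# character ever appears in the source file.
def char_class(codepoints)
  members = codepoints.map { |cp| Regexp.escape([cp].pack("U")) }.join
  Regexp.new("[#{members}]")
end

DASHES = char_class(DASH_CODEPOINTS)
SMART_QUOTES = char_class(SMART_QUOTE_CODEPOINTS)

# Illustrative transliteration step: dashes to "-", smart single quotes to "'".
def transliterate_sketch(text)
  text.gsub(DASHES, "-").gsub(SMART_QUOTES, "'")
end
```

A spec enforcing the convention could be as small as asserting `File.read(path).ascii_only?` for each source file, where `path` is whichever file is being checked.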

Development

bundle install
bundle exec rspec
bundle exec rubocop

Versioned via the .version file (read by lib/neu/mods/version.rb); released with bundler/gem_tasks (rake release), mirroring atlas_rb.