Module: Oddb2xml::RefdataCleanup

Defined in:
lib/oddb2xml/refdata_cleanup.rb

Overview

Compensates for known data-quality issues in upstream Refdata.Articles.xml before they reach the generated output. Each fix is opt-in and guarded by a heuristic against Swissmedic data so we never alter genuine combination products. See GitHub issue #112 for the catalogue of upstream problems.

Constant Summary

DOSE_TOKEN =
/\d+(?:[.,]\d+)?\s*(?:mg|µg|mcg|g|ml|UI|U\.I\.|IE|%)/i
DOUBLE_DOSE_RE =

Matches “<dose> / <same dose> /”, the templating bug where Refdata emits the strength twice. The backreference \1 matches only when the exact same dose string appears twice, so descriptions that list two different strengths never match. A real combo whose components share a strength (e.g. PHESGO 600 mg / 600 mg / 10 ml) does match the pattern; it is kept safe by the single_substance? guard in fix_double_dose, and the literal match acts only as a backstop for the differing-strength case.

/(#{DOSE_TOKEN})\s*\/\s*\1\s*\/\s*/
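To illustrate how the backreference behaves, here is a minimal sketch that copies the two constants from the listing above and probes them with three strings. The "EXAMPLE" descriptions are made up for illustration and are not taken from Refdata; the PHESGO string is the one quoted in the description above.

```ruby
# Constants copied verbatim from the listing above.
DOSE_TOKEN = /\d+(?:[.,]\d+)?\s*(?:mg|µg|mcg|g|ml|UI|U\.I\.|IE|%)/i
DOUBLE_DOSE_RE = /(#{DOSE_TOKEN})\s*\/\s*\1\s*\/\s*/

# Hypothetical description strings (assumptions, not real Refdata rows).
repeated  = "EXAMPLE Filmtabl 30 mg / 30 mg / 100 Stk"
different = "EXAMPLE Filmtabl 250 mg / 100 mg / 12 Stk"
# Description quoted in the constant's documentation.
phesgo    = "PHESGO 600 mg / 600 mg / 10 ml"

puts DOUBLE_DOSE_RE.match?(repeated)   # true: "30 mg" appears twice
puts DOUBLE_DOSE_RE.match?(different)  # false: \1 cannot match "100 mg"
puts DOUBLE_DOSE_RE.match?(phesgo)     # true: identical strengths also match

# The first capture group holds the dose token that was repeated.
puts repeated[DOUBLE_DOSE_RE, 1]       # 30 mg
```

The third case is why the pattern alone is not enough: a combination with equal strengths looks exactly like the templating bug at the string level.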

Class Method Summary

Class Method Details

.fix_double_dose(desc, swissmedic_substance) ⇒ Object

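Removes the duplicated dose token in mono products. Returns the cleaned description, or desc unchanged (including nil or an empty string) when the pattern does not match or the product is not a single-substance one. Only the first match is rewritten, since the method uses String#sub.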



# File 'lib/oddb2xml/refdata_cleanup.rb', line 27

def self.fix_double_dose(desc, swissmedic_substance)
  return desc if desc.nil? || desc.empty?
  return desc unless DOUBLE_DOSE_RE.match?(desc)
  return desc unless single_substance?(swissmedic_substance)
  desc.sub(DOUBLE_DOSE_RE, '\1 / ')
end
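A usage sketch may help here. It restates the module from the two source listings in this document so the example runs on its own; in a real project you would presumably require the file from lib/ instead. The "EXAMPLE" description is a hypothetical string, while the substance values are the ones quoted in this document.

```ruby
# Self-contained restatement of the module as listed in this document.
module Oddb2xml
  module RefdataCleanup
    DOSE_TOKEN = /\d+(?:[.,]\d+)?\s*(?:mg|µg|mcg|g|ml|UI|U\.I\.|IE|%)/i
    DOUBLE_DOSE_RE = /(#{DOSE_TOKEN})\s*\/\s*\1\s*\/\s*/

    def self.single_substance?(swissmedic_substance)
      return false if swissmedic_substance.nil?
      str = swissmedic_substance.to_s.strip
      return false if str.empty?
      !str.include?(",")
    end

    def self.fix_double_dose(desc, swissmedic_substance)
      return desc if desc.nil? || desc.empty?
      return desc unless DOUBLE_DOSE_RE.match?(desc)
      return desc unless single_substance?(swissmedic_substance)
      desc.sub(DOUBLE_DOSE_RE, '\1 / ')
    end
  end
end

cleanup = Oddb2xml::RefdataCleanup

# Mono product with a repeated strength (hypothetical description): fixed.
mono = cleanup.fix_double_dose("EXAMPLE Filmtabl 30 mg / 30 mg / 100 Stk", "mirtazapinum")
puts mono   # EXAMPLE Filmtabl 30 mg / 100 Stk

# Real combination: the guard leaves the description untouched.
combo = cleanup.fix_double_dose("PHESGO 600 mg / 600 mg / 10 ml", "pertuzumabum, trastuzumabum")
puts combo  # PHESGO 600 mg / 600 mg / 10 ml

# Missing Swissmedic data: treated as "not provably mono", so no change.
unknown = cleanup.fix_double_dose("EXAMPLE Filmtabl 30 mg / 30 mg / 100 Stk", nil)
puts unknown  # EXAMPLE Filmtabl 30 mg / 30 mg / 100 Stk
```

Note the conservative default: when the substance is nil or blank, the description is passed through as-is rather than rewritten.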

.single_substance?(swissmedic_substance) ⇒ Boolean

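A Swissmedic compositions cell like “mirtazapinum” indicates a mono product; “atovaquonum, proguanili hydrochloridum” or “pertuzumabum, trastuzumabum” indicates a real combination. The check is a comma test only: any non-blank value without a comma counts as a single substance, and nil or a blank value returns false.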

Returns:

  • (Boolean)


# File 'lib/oddb2xml/refdata_cleanup.rb', line 18

def self.single_substance?(swissmedic_substance)
  return false if swissmedic_substance.nil?
  str = swissmedic_substance.to_s.strip
  return false if str.empty?
  !str.include?(",")
end