Module: Oddb2xml::RefdataCleanup

Defined in:
lib/oddb2xml/refdata_cleanup.rb

Overview

Compensates for known data-quality issues in upstream Refdata.Articles.xml before they reach the generated output. Each fix is opt-in and guarded by a heuristic against Swissmedic data so we never alter genuine combination products. See GitHub issue #112 for the catalogue of upstream problems.

Constant Summary

DOSE_TOKEN =
/\d+(?:[.,]\d+)?\s*(?:mg|µg|mcg|g|ml|UI|U\.I\.|IE|%)/i
DOUBLE_DOSE_RE =

Matches “<dose> / <same dose> /”, the templating bug where Refdata repeats the strength once. The backreference \1 only matches when the exact same dose string appears twice, which rules out combinations whose strengths differ. A real combination that repeats a strength (e.g. PHESGO 600 mg / 600 mg / 10 ml) still matches this pattern, so it is the single_substance? guard that keeps it safe; the literal match is only a backstop.

/(#{DOSE_TOKEN})\s*\/\s*\1\s*\/\s*/
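To make the matching behaviour concrete, here is a minimal sketch that copies the two constants from this page into a standalone script, so it runs without the gem. The EXAMPLE descriptions are hypothetical strings, not real Refdata records.

```ruby
# Constants copied from this page so the sketch runs without the oddb2xml gem.
DOSE_TOKEN = /\d+(?:[.,]\d+)?\s*(?:mg|µg|mcg|g|ml|UI|U\.I\.|IE|%)/i
DOUBLE_DOSE_RE = /(#{DOSE_TOKEN})\s*\/\s*\1\s*\/\s*/

# Hypothetical descriptions, not real Refdata records.
repeated  = "EXAMPLE Filmtabl 30 mg / 30 mg / 30 Stk"
differing = "EXAMPLE Filmtabl 30 mg / 15 mg / 30 Stk"
phesgo    = "PHESGO 600 mg / 600 mg / 10 ml"

repeated_hit  = DOUBLE_DOSE_RE.match?(repeated)   # same strength twice
differing_hit = DOUBLE_DOSE_RE.match?(differing)  # strengths differ
phesgo_hit    = DOUBLE_DOSE_RE.match?(phesgo)     # real combo, same strength twice
matched_span  = repeated[DOUBLE_DOSE_RE]          # the text that sub would replace

puts repeated_hit, differing_hit, phesgo_hit, matched_span.inspect
```

The pattern matches the first and third strings but not the second. The PHESGO match is why the regex cannot be used on its own and needs the substance check in front of it.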
GALENIC_NORMALISATIONS =

Case #13 (issue #112): a handful of products spell the galenic form out in full (“RINVOQ Retardtabletten 30 mg 28 Stk”) while the Refdata house style abbreviates it everywhere else (“Ret Tabl”, 940 other DE names). Normalise the spelled-out form to the abbreviation so the outliers match the convention. The keys are German-only words (FR/IT use “comprimé …” / “compresse …”), so applying this to FR/IT descriptions is a safe no-op.

{
  /\bRetardtabletten\b/ => "Ret Tabl"
}.freeze

Class Method Summary

Class Method Details

.fix_double_dose(desc, swissmedic_substance) ⇒ Object

Removes the duplicated dose token in mono products. Returns the cleaned description, or the original string if no change applies.



# File 'lib/oddb2xml/refdata_cleanup.rb', line 27

def self.fix_double_dose(desc, swissmedic_substance)
  return desc if desc.nil? || desc.empty?
  return desc unless DOUBLE_DOSE_RE.match?(desc)
  return desc unless single_substance?(swissmedic_substance)
  desc.sub(DOUBLE_DOSE_RE, '\1 / ')
end
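The interplay of the regex and the substance guard can be sketched as follows. RefdataCleanupSketch is a hypothetical stand-in module whose bodies are copied from the listings on this page, so the script runs without the gem; the EXAMPLE description is made up for illustration.

```ruby
# Hypothetical stand-in for Oddb2xml::RefdataCleanup; bodies copied from this page.
module RefdataCleanupSketch
  DOSE_TOKEN = /\d+(?:[.,]\d+)?\s*(?:mg|µg|mcg|g|ml|UI|U\.I\.|IE|%)/i
  DOUBLE_DOSE_RE = /(#{DOSE_TOKEN})\s*\/\s*\1\s*\/\s*/

  def self.single_substance?(swissmedic_substance)
    return false if swissmedic_substance.nil?
    str = swissmedic_substance.to_s.strip
    return false if str.empty?
    !str.include?(",")
  end

  def self.fix_double_dose(desc, swissmedic_substance)
    return desc if desc.nil? || desc.empty?
    return desc unless DOUBLE_DOSE_RE.match?(desc)
    return desc unless single_substance?(swissmedic_substance)
    desc.sub(DOUBLE_DOSE_RE, '\1 / ') # keep one copy of the dose
  end
end

doubled = "EXAMPLE Filmtabl 30 mg / 30 mg / 30 Stk" # hypothetical description

mono    = RefdataCleanupSketch.fix_double_dose(doubled, "mirtazapinum")
combo   = RefdataCleanupSketch.fix_double_dose("PHESGO 600 mg / 600 mg / 10 ml",
                                               "pertuzumabum, trastuzumabum")
unknown = RefdataCleanupSketch.fix_double_dose(doubled, nil)

puts mono    # => EXAMPLE Filmtabl 30 mg / 30 Stk
puts combo   # unchanged: two substances
puts unknown # unchanged: no Swissmedic composition available
```

Only the mono product is rewritten. When the Swissmedic composition is missing, the description is returned untouched, so the fix errs on the side of doing nothing.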

.normalize_galenic_form(desc) ⇒ Object

Normalises spelled-out German galenic forms to the Refdata house-style abbreviation. Returns the cleaned description, or the original string if no rule applies.



# File 'lib/oddb2xml/refdata_cleanup.rb', line 47

def self.normalize_galenic_form(desc)
  return desc if desc.nil? || desc.empty?
  GALENIC_NORMALISATIONS.reduce(desc) { |result, (re, repl)| result.gsub(re, repl) }
end
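As a rough illustration, the sketch below copies the constant and the method body from this page into a standalone script. The top-level method definition and the French description are assumptions made for the example; only the RINVOQ name comes from the text above.

```ruby
# Copied from this page so the sketch runs without the oddb2xml gem.
GALENIC_NORMALISATIONS = {
  /\bRetardtabletten\b/ => "Ret Tabl"
}.freeze

# Top-level stand-in for Oddb2xml::RefdataCleanup.normalize_galenic_form.
def normalize_galenic_form(desc)
  return desc if desc.nil? || desc.empty?
  # Apply every rule in turn, feeding each result into the next rule.
  GALENIC_NORMALISATIONS.reduce(desc) { |result, (re, repl)| result.gsub(re, repl) }
end

de = normalize_galenic_form("RINVOQ Retardtabletten 30 mg 28 Stk")
fr = normalize_galenic_form("EXAMPLE comprimés retard 30 mg 28 pce") # hypothetical FR name

puts de # => RINVOQ Ret Tabl 30 mg 28 Stk
puts fr # unchanged: no German key matches
```

Because the key has no case-insensitive flag and is anchored with \b, only the exact word “Retardtabletten” is rewritten; other spellings pass through unchanged.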

.single_substance?(swissmedic_substance) ⇒ Boolean

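A Swissmedic compositions cell like “mirtazapinum” indicates a mono product; “atovaquonum, proguanili hydrochloridum” or “pertuzumabum, trastuzumabum” indicates a real combination. The check is purely textual: any comma marks a combination, and a nil or blank cell returns false, so an unknown composition is never treated as a mono product.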

Returns:

  • (Boolean)


# File 'lib/oddb2xml/refdata_cleanup.rb', line 18

def self.single_substance?(swissmedic_substance)
  return false if swissmedic_substance.nil?
  str = swissmedic_substance.to_s.strip
  return false if str.empty?
  !str.include?(",")
end
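A small sketch of the guard in isolation, with the body copied from the listing above into a top-level method (an assumption made so the script runs without the gem). The inputs are the composition strings quoted on this page plus nil and blank cases.

```ruby
# Top-level stand-in for Oddb2xml::RefdataCleanup.single_substance?.
def single_substance?(swissmedic_substance)
  return false if swissmedic_substance.nil?
  str = swissmedic_substance.to_s.strip
  return false if str.empty?
  !str.include?(",") # a comma separates substances in the compositions cell
end

results = {
  "mirtazapinum"                           => single_substance?("mirtazapinum"),
  "atovaquonum, proguanili hydrochloridum" => single_substance?("atovaquonum, proguanili hydrochloridum"),
  "nil"                                    => single_substance?(nil),
  "blank"                                  => single_substance?("   ")
}

results.each { |label, value| puts "#{label}: #{value}" }
```

Only the first input counts as a mono product. Since the check looks for a comma and nothing else, a composition written with a different separator would be classified as a single substance.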