Module: Oddb2xml::RefdataCleanup
- Defined in:
- lib/oddb2xml/refdata_cleanup.rb
Overview
Compensates for known data-quality issues in upstream Refdata.Articles.xml before they reach the generated output. Each fix is opt-in and guarded by a heuristic against Swissmedic data so we never alter genuine combination products. See GitHub issue #112 for the catalogue of upstream problems.
Constant Summary
- DOSE_TOKEN =
  /\d+(?:[.,]\d+)?\s*(?:mg|µg|mcg|g|ml|UI|U\.I\.|IE|%)/i
- DOUBLE_DOSE_RE =
  Matches “<dose> / <same dose> /”, the templating bug where Refdata repeats the strength once. The backreference \1 only matches when the exact same dose string appears twice, so combinations with differing strengths are never touched. Real combinations that repeat a strength (e.g. PHESGO 600 mg / 600 mg / 10 ml) still match the pattern and are protected by the single_substance? guard instead; the literal match is a backstop, not the only safeguard.
  /(#{DOSE_TOKEN})\s*\/\s*\1\s*\/\s*/
- GALENIC_NORMALISATIONS =
  Case #13 (issue #112): a handful of products spell the galenic form out in full (“RINVOQ Retardtabletten 30 mg 28 Stk”) while the Refdata house style abbreviates it everywhere else (“Ret Tabl”, 940 other DE names). Normalise the spelled-out form to the abbreviation so the outliers match the convention. The keys are German-only words (FR/IT use “comprimé …” / “compresse …”), so applying this to FR/IT descriptions is a safe no-op.
  { /\bRetardtabletten\b/ => "Ret Tabl" }.freeze
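A minimal sketch of how the backreference in DOUBLE_DOSE_RE behaves, using the two constants exactly as listed above. The MIRTAZAPIN and BEISPIEL descriptions are made-up strings for illustration, not real Refdata entries.

```ruby
# Constants copied from the summary above.
DOSE_TOKEN = /\d+(?:[.,]\d+)?\s*(?:mg|µg|mcg|g|ml|UI|U\.I\.|IE|%)/i
DOUBLE_DOSE_RE = /(#{DOSE_TOKEN})\s*\/\s*\1\s*\/\s*/

# Hypothetical descriptions, made up for illustration only.
duplicated = "MIRTAZAPIN Filmtabl 30 mg / 30 mg / 100 Stk"
distinct   = "BEISPIEL Filmtabl 15 mg / 30 mg / 100 Stk"

m = DOUBLE_DOSE_RE.match(duplicated)
puts m[1]                             # the captured dose token: 30 mg
puts DOUBLE_DOSE_RE.match?(distinct)  # false: \1 needs the identical string

# The PHESGO example from the documentation repeats its strength,
# so the pattern alone cannot tell it apart from the templating bug.
puts DOUBLE_DOSE_RE.match?("PHESGO 600 mg / 600 mg / 10 ml")  # true
```

The last line shows why the pattern is paired with the single_substance? guard: a regex match on its own is not enough to decide that a description is faulty.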
Class Method Summary
- .fix_double_dose(desc, swissmedic_substance) ⇒ Object
  Removes the duplicated dose token in mono products.
- .normalize_galenic_form(desc) ⇒ Object
  Normalises spelled-out German galenic forms to the Refdata house-style abbreviation.
- .single_substance?(swissmedic_substance) ⇒ Boolean
  A Swissmedic compositions cell like “mirtazapinum” indicates a mono product; “atovaquonum, proguanili hydrochloridum” or “pertuzumabum, trastuzumabum” indicates a real combination.
Class Method Details
.fix_double_dose(desc, swissmedic_substance) ⇒ Object
Removes the duplicated dose token in mono products. Returns the cleaned description, or the original string if no change applies.
    # File 'lib/oddb2xml/refdata_cleanup.rb', line 27
    def self.fix_double_dose(desc, swissmedic_substance)
      return desc if desc.nil? || desc.empty?
      return desc unless DOUBLE_DOSE_RE.match?(desc)
      return desc unless single_substance?(swissmedic_substance)
      desc.sub(DOUBLE_DOSE_RE, '\1 / ')
    end
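The following sketch shows how the three guard clauses interact. So that it runs without the gem installed, it copies the documented constants and methods into a stand-in module; the name RefdataCleanupSketch and the MIRTAZAPIN description are assumptions made for illustration only.

```ruby
# Local stand-in for Oddb2xml::RefdataCleanup (hypothetical name),
# reproducing the constants and methods shown on this page.
module RefdataCleanupSketch
  DOSE_TOKEN = /\d+(?:[.,]\d+)?\s*(?:mg|µg|mcg|g|ml|UI|U\.I\.|IE|%)/i
  DOUBLE_DOSE_RE = /(#{DOSE_TOKEN})\s*\/\s*\1\s*\/\s*/

  def self.single_substance?(swissmedic_substance)
    return false if swissmedic_substance.nil?
    str = swissmedic_substance.to_s.strip
    return false if str.empty?
    !str.include?(",")
  end

  def self.fix_double_dose(desc, swissmedic_substance)
    return desc if desc.nil? || desc.empty?
    return desc unless DOUBLE_DOSE_RE.match?(desc)
    return desc unless single_substance?(swissmedic_substance)
    desc.sub(DOUBLE_DOSE_RE, '\1 / ')
  end
end

# Mono product (made-up description): the repeated strength is collapsed.
mono = RefdataCleanupSketch.fix_double_dose(
  "MIRTAZAPIN Filmtabl 30 mg / 30 mg / 100 Stk", "mirtazapinum"
)
puts mono   # MIRTAZAPIN Filmtabl 30 mg / 100 Stk

# Real combination: the pattern matches, but the guard leaves it alone.
combo = RefdataCleanupSketch.fix_double_dose(
  "PHESGO 600 mg / 600 mg / 10 ml", "pertuzumabum, trastuzumabum"
)
puts combo  # PHESGO 600 mg / 600 mg / 10 ml

# No Swissmedic data: the description is returned unchanged.
unknown = RefdataCleanupSketch.fix_double_dose(
  "MIRTAZAPIN Filmtabl 30 mg / 30 mg / 100 Stk", nil
)
puts unknown
```

Because a missing substance makes single_substance? return false, the method errs on the side of leaving the description as it is whenever it cannot confirm a mono product.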
.normalize_galenic_form(desc) ⇒ Object
Normalises spelled-out German galenic forms to the Refdata house-style abbreviation. Returns the cleaned description, or the original string if no rule applies.
    # File 'lib/oddb2xml/refdata_cleanup.rb', line 47
    def self.normalize_galenic_form(desc)
      return desc if desc.nil? || desc.empty?
      GALENIC_NORMALISATIONS.reduce(desc) { |result, (re, repl)| result.gsub(re, repl) }
    end
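As a rough illustration, the method can be exercised on the RINVOQ description quoted in the constant summary. The module name GalenicSketch is a hypothetical local copy so the snippet runs standalone, and the French description is an invented example.

```ruby
# Local stand-in (hypothetical name) with the documented constant and method.
module GalenicSketch
  GALENIC_NORMALISATIONS = { /\bRetardtabletten\b/ => "Ret Tabl" }.freeze

  def self.normalize_galenic_form(desc)
    return desc if desc.nil? || desc.empty?
    # Apply every rule in turn, threading the result through reduce.
    GALENIC_NORMALISATIONS.reduce(desc) { |result, (re, repl)| result.gsub(re, repl) }
  end
end

de = GalenicSketch.normalize_galenic_form("RINVOQ Retardtabletten 30 mg 28 Stk")
puts de  # RINVOQ Ret Tabl 30 mg 28 Stk

# Invented French description: no German keyword, so nothing changes.
fr = GalenicSketch.normalize_galenic_form("RINVOQ cpr ret 30 mg 28 pce")
puts fr  # RINVOQ cpr ret 30 mg 28 pce
```

Because reduce starts from desc and each rule uses gsub, adding another pattern to the hash would extend the normalisation without touching the method body.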
.single_substance?(swissmedic_substance) ⇒ Boolean
A Swissmedic compositions cell like “mirtazapinum” indicates a mono product; “atovaquonum, proguanili hydrochloridum” or “pertuzumabum, trastuzumabum” indicates a real combination.
    # File 'lib/oddb2xml/refdata_cleanup.rb', line 18
    def self.single_substance?(swissmedic_substance)
      return false if swissmedic_substance.nil?
      str = swissmedic_substance.to_s.strip
      return false if str.empty?
      !str.include?(",")
    end
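A short sketch of the heuristic on the example cells from the description. SubstanceGuardSketch is a hypothetical local copy of the documented method, used here only so the snippet is self-contained.

```ruby
# Local stand-in (hypothetical name) for the documented predicate.
module SubstanceGuardSketch
  def self.single_substance?(swissmedic_substance)
    return false if swissmedic_substance.nil?
    str = swissmedic_substance.to_s.strip
    return false if str.empty?
    # A comma in the compositions cell is read as "more than one substance".
    !str.include?(",")
  end
end

puts SubstanceGuardSketch.single_substance?("mirtazapinum")                            # true
puts SubstanceGuardSketch.single_substance?("atovaquonum, proguanili hydrochloridum")  # false
puts SubstanceGuardSketch.single_substance?("pertuzumabum, trastuzumabum")             # false
puts SubstanceGuardSketch.single_substance?(nil)                                       # false
puts SubstanceGuardSketch.single_substance?("   ")                                     # false
```

Missing or blank input yields false, so any caller that only acts on a true result stays conservative when Swissmedic data is unavailable.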