Module: Oddb2xml::RefdataCleanup
- Defined in: lib/oddb2xml/refdata_cleanup.rb
Overview
Compensates for known data-quality issues in upstream Refdata.Articles.xml before they reach the generated output. Each fix is opt-in and guarded by a heuristic against Swissmedic data so we never alter genuine combination products. See GitHub issue #112 for the catalogue of upstream problems.
Constant Summary
- DOSE_TOKEN =
/\d+(?:[.,]\d+)?\s*(?:mg|µg|mcg|g|ml|UI|U\.I\.|IE|%)/i
- DOUBLE_DOSE_RE =
Matches “<dose> / <same dose> /”, the templating bug where Refdata repeats the strength once. The backreference \1 only matches when the exact same dose string appears twice, so combinations with differing strengths never match. Real combos with identical strengths (e.g. PHESGO 600 mg / 600 mg / 10 ml) do match the pattern and are kept safe by the single_substance? guard; the literal match is only a backstop.
/(#{DOSE_TOKEN})\s*\/\s*\1\s*\/\s*/
- GALENIC_NORMALISATIONS =
Case #13 (issue #112): a handful of products spell the galenic form out in full (“RINVOQ Retardtabletten 30 mg 28 Stk”) while the Refdata house style abbreviates it everywhere else (“Ret Tabl”, 940 other DE names). Normalise the spelled-out form to the abbreviation so the outliers match the convention. The keys are German-only words (FR/IT use “comprimé …” / “compresse …”), so applying this to FR/IT descriptions is a safe no-op.
{ /\bRetardtabletten\b/ => "Ret Tabl" }.freeze
- COMBO_DOSE_IKSNR =
The following three fixes reconstruct dose information that Refdata dropped from <FullName>, sourcing the authoritative values from the Swissmedic “Zugelassene Packungen” composition string (already loaded as pack, keyed by the same SwissmedicNo8). See issue #112 cases #4 (missing strength), #6 (missing 2nd combo component) and #7 (missing injection volume).
Each fix is scoped to an explicit allow-list of Swissmedic registration numbers (IKSNR, the first 5 digits of the no8). A blanket heuristic is NOT safe: a dry run over the full Refdata feed mis-fired on hundreds of legitimate names — combination detection grabbed sodium counter-ion doses (“KEPPRA … / 2.8 mg”), the missing-strength rule fired on strength-less phyto/powder products (“IMPORTAL Pulver”), and the volume rule corrupted concentration names (“CIMZIA 200 mg/ml”). Restricting to the catalogued registrations keeps the Swissmedic-derived value while touching only the known-bad products. Add an IKSNR here once a new case is confirmed.
%w[65280].freeze # #6 ATOVAQUON PLUS Spirig HC
- MISSING_DOSE_IKSNR =
#4 CETIRIZIN Spirig HC
%w[62568].freeze
- MISSING_VOLUME_IKSNR =
#7 MOUNJARO KwikPen
%w[69696].freeze
- METOJECT_IKSNR =
#1 METOJECT Autoinjektor
%w[65672].freeze
- METOJECT_SUFFIX =
Localised “<pen> … <count> <unit>” suffix, selected by the galenic-form token Refdata uses per language. The “<brand> Autoinjektor <dose>/<vol>” prefix is identical across DE/FR/IT, so only the suffix is localised.
{
  /\bInj Lös\b/ => ["Fertpen", "Stk"],          # DE
  /\binj sol\b/ => ["stylo pré", "pce"],        # FR
  /\bsol inj\b/ => ["penna preriempita", "pz"]  # IT
}.freeze
- VERACTIV_VITD3_IKSNR =
#3 VERACTIV Vitamin D3 Wild
%w[57690].freeze
Class Method Summary
- .dose_for_substance(composition, substance) ⇒ Object
Returns the dose token that belongs to a named active substance in the Swissmedic composition, normalised to “<number> <unit>” (e.g. dose_for_substance(comp, "atovaquonum") => "250 mg").
- .dose_regex(dose) ⇒ Object
Builds a whitespace-tolerant matcher for a normalised dose value like “250 mg” so it also matches “250mg” in a description.
- .fix_double_dose(desc, swissmedic_substance) ⇒ Object
Removes the duplicated dose token in mono products.
- .fix_missing_combo_dose(desc, swissmedic_substance, composition, no8) ⇒ Object
Case #6: a real combination product whose Refdata description carries only the first component’s strength (e.g. “ATOVAQUON PLUS … 250 mg …”).
- .fix_missing_dose(desc, swissmedic_substance, composition, no8) ⇒ Object
Case #4: a mono product whose Refdata description carries NO strength at all (e.g. “CETIRIZIN Spirig HC Filmtabl 10 Stk”).
- .fix_missing_volume(desc, composition, no8) ⇒ Object
Case #7: an injectable pen/solution whose Refdata description gives the strength but not the per-pen volume (e.g. “MOUNJARO KwikPen Inj Lös 7.5 mg 1 Stk”).
- .fix_truncated_metoject(desc, no8, size) ⇒ Object
Case #1: every METOJECT Autoinjektor name is truncated at Refdata’s 50-char limit, carrying a redundant strength in the (often cut) tail (“METOJECT Autoinjektor 10 mg/0.2 ml Inj Lös 10 mg 1”).
- .fix_truncated_volume_unit(desc, no8) ⇒ Object
Case #3 (partial): the VERACTIV Vitamin D3 drops are truncated at 50 chars, losing the final “l” of the volume (“… 20’000 U.I. 10m” → “10ml”).
- .iksnr_of(no8) ⇒ Object
Returns the IKSNR (the first 5 digits of the SwissmedicNo8).
- .normalize_galenic_form(desc) ⇒ Object
Normalises spelled-out German galenic forms to the Refdata house-style abbreviation.
- .single_substance?(swissmedic_substance) ⇒ Boolean
A Swissmedic compositions cell like “mirtazapinum” indicates a mono product; “atovaquonum, proguanili hydrochloridum” or “pertuzumabum, trastuzumabum” indicates a real combination.
Class Method Details
.dose_for_substance(composition, substance) ⇒ Object
Returns the dose token that belongs to a named active substance in the Swissmedic composition, normalised to “<number> <unit>” (e.g. dose_for_substance(comp, "atovaquonum") => "250 mg"). Matches within the comma-delimited segment that names the substance so excipient doses are never picked up. Returns nil if absent.
    # File 'lib/oddb2xml/refdata_cleanup.rb', line 89

    def self.dose_for_substance(composition, substance)
      return nil if composition.nil? || substance.nil?
      key = substance.to_s.strip[/\A[A-Za-zÀ-ÿ]+/]
      return nil if key.nil? || key.empty?
      composition.split(",").each do |segment|
        next unless /\b#{Regexp.escape(key)}/i.match?(segment)
        m = segment.match(DOSE_TOKEN)
        next unless m
        parts = m[0].match(/\A([\d.,]+)\s*(.+?)\s*\z/)
        return parts ? "#{parts[1]} #{parts[2]}" : m[0].strip
      end
      nil
    end
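A minimal usage sketch, assuming a made-up composition string shaped like a Swissmedic cell. The method body is copied from the listing into a throwaway module (`CleanupSketch` is a hypothetical name, not part of oddb2xml) so the snippet runs on its own.

```ruby
module CleanupSketch
  DOSE_TOKEN = /\d+(?:[.,]\d+)?\s*(?:mg|µg|mcg|g|ml|UI|U\.I\.|IE|%)/i

  # Same body as Oddb2xml::RefdataCleanup.dose_for_substance.
  def self.dose_for_substance(composition, substance)
    return nil if composition.nil? || substance.nil?
    key = substance.to_s.strip[/\A[A-Za-zÀ-ÿ]+/] # first word of the substance name
    return nil if key.nil? || key.empty?
    composition.split(",").each do |segment|
      next unless /\b#{Regexp.escape(key)}/i.match?(segment)
      m = segment.match(DOSE_TOKEN)
      next unless m
      parts = m[0].match(/\A([\d.,]+)\s*(.+?)\s*\z/)
      return parts ? "#{parts[1]} #{parts[2]}" : m[0].strip
    end
    nil
  end
end

# Invented composition, for illustration only.
comp = "atovaquonum 250mg, proguanili hydrochloridum 100 mg, excipiens pro compresso obducto"

p CleanupSketch.dose_for_substance(comp, "atovaquonum")               # => "250 mg"
p CleanupSketch.dose_for_substance(comp, "proguanili hydrochloridum") # => "100 mg"
p CleanupSketch.dose_for_substance(comp, "excipiens")                 # => nil
```

Only the first word of the substance name is used as the lookup key, and a segment without any dose token yields nil rather than borrowing a dose from a neighbouring segment.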
.dose_regex(dose) ⇒ Object
Builds a whitespace-tolerant matcher for a normalised dose value like “250 mg” so it also matches “250mg” in a description.
    # File 'lib/oddb2xml/refdata_cleanup.rb', line 78

    def self.dose_regex(dose)
      m = dose.to_s.match(/\A([\d.,]+)\s*(.+?)\s*\z/)
      return /#{Regexp.escape(dose.to_s)}/i unless m
      /(?<![\d.,])#{Regexp.escape(m[1])}\s*#{Regexp.escape(m[2])}/i
    end
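The sketch below illustrates what the generated matcher accepts and rejects. The method body is copied from the listing into a hypothetical `CleanupSketch` module; the product descriptions are invented.

```ruby
module CleanupSketch
  # Same body as Oddb2xml::RefdataCleanup.dose_regex.
  def self.dose_regex(dose)
    m = dose.to_s.match(/\A([\d.,]+)\s*(.+?)\s*\z/)
    return /#{Regexp.escape(dose.to_s)}/i unless m
    # The negative lookbehind stops "250 mg" from matching inside "1250 mg" or "0.250 mg".
    /(?<![\d.,])#{Regexp.escape(m[1])}\s*#{Regexp.escape(m[2])}/i
  end
end

re = CleanupSketch.dose_regex("250 mg")

p re.match?("ATOVAQUON PLUS Filmtabl 250mg 12 Stk") # => true  (no space)
p re.match?("ATOVAQUON PLUS Filmtabl 250 MG")       # => true  (case-insensitive)
p re.match?("EXAMPLE Tabl 1250 mg")                 # => false (part of a larger number)
p re.match?("EXAMPLE Tabl 0.250 mg")                # => false (fractional part)
```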
.fix_double_dose(desc, swissmedic_substance) ⇒ Object
Removes the duplicated dose token in mono products. Returns the cleaned description, or the original string if no change applies.
    # File 'lib/oddb2xml/refdata_cleanup.rb', line 27

    def self.fix_double_dose(desc, swissmedic_substance)
      return desc if desc.nil? || desc.empty?
      return desc unless DOUBLE_DOSE_RE.match?(desc)
      return desc unless single_substance?(swissmedic_substance)
      desc.sub(DOUBLE_DOSE_RE, '\1 / ')
    end
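A small sketch of the two guards working together, assuming invented descriptions (the MIRTAZAPIN name is made up for illustration; the PHESGO string is the one quoted in the constant's documentation). The constants and methods are copied from this page into a hypothetical `CleanupSketch` module.

```ruby
module CleanupSketch
  DOSE_TOKEN = /\d+(?:[.,]\d+)?\s*(?:mg|µg|mcg|g|ml|UI|U\.I\.|IE|%)/i
  DOUBLE_DOSE_RE = /(#{DOSE_TOKEN})\s*\/\s*\1\s*\/\s*/

  def self.single_substance?(swissmedic_substance)
    return false if swissmedic_substance.nil?
    str = swissmedic_substance.to_s.strip
    return false if str.empty?
    !str.include?(",")
  end

  def self.fix_double_dose(desc, swissmedic_substance)
    return desc if desc.nil? || desc.empty?
    return desc unless DOUBLE_DOSE_RE.match?(desc)
    return desc unless single_substance?(swissmedic_substance)
    desc.sub(DOUBLE_DOSE_RE, '\1 / ')
  end
end

# Mono product with a repeated strength: the duplicate is dropped.
puts CleanupSketch.fix_double_dose("MIRTAZAPIN Filmtabl 30 mg / 30 mg / 30 Stk", "mirtazapinum")
# => MIRTAZAPIN Filmtabl 30 mg / 30 Stk

# Real combination with identical strengths: the pattern matches,
# but the single_substance? guard leaves the name alone.
puts CleanupSketch.fix_double_dose("PHESGO 600 mg / 600 mg / 10 ml", "pertuzumabum, trastuzumabum")
# => PHESGO 600 mg / 600 mg / 10 ml
```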
.fix_missing_combo_dose(desc, swissmedic_substance, composition, no8) ⇒ Object
Case #6: a real combination product whose Refdata description carries only the first component’s strength (e.g. “ATOVAQUON PLUS … 250 mg …”). Appends the second active’s strength from Swissmedic, producing “… 250 mg / 100 mg …”. No-op for mono products, 3+ component combos, or when the second strength is already present.
    # File 'lib/oddb2xml/refdata_cleanup.rb', line 108

    def self.fix_missing_combo_dose(desc, swissmedic_substance, composition, no8)
      return desc if desc.nil? || desc.empty?
      return desc unless COMBO_DOSE_IKSNR.include?(iksnr_of(no8))
      return desc if single_substance?(swissmedic_substance)
      subs = swissmedic_substance.to_s.split(",").map(&:strip)
      return desc unless subs.size == 2
      d1 = dose_for_substance(composition, subs[0])
      d2 = dose_for_substance(composition, subs[1])
      return desc unless d1 && d2
      return desc unless dose_regex(d1).match?(desc)
      return desc if dose_regex(d2).match?(desc)
      desc.sub(dose_regex(d1)) { |hit| "#{hit} / #{d2}" }
    end
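To illustrate the final append step in isolation, the sketch below pulls the last three lines of the method into a hypothetical helper, `append_second_dose`, which takes the two strengths directly instead of looking them up in Swissmedic data; the allow-list check is skipped as well. The pack details in the description are invented.

```ruby
module CleanupSketch
  # Same body as Oddb2xml::RefdataCleanup.dose_regex.
  def self.dose_regex(dose)
    m = dose.to_s.match(/\A([\d.,]+)\s*(.+?)\s*\z/)
    return /#{Regexp.escape(dose.to_s)}/i unless m
    /(?<![\d.,])#{Regexp.escape(m[1])}\s*#{Regexp.escape(m[2])}/i
  end

  # Hypothetical helper: the last three lines of fix_missing_combo_dose.
  def self.append_second_dose(desc, d1, d2)
    return desc unless dose_regex(d1).match?(desc) # first strength must be present
    return desc if dose_regex(d2).match?(desc)     # second strength already there
    desc.sub(dose_regex(d1)) { |hit| "#{hit} / #{d2}" }
  end
end

name  = "ATOVAQUON PLUS Spirig HC Filmtabl 250 mg 12 Stk"
fixed = CleanupSketch.append_second_dose(name, "250 mg", "100 mg")
puts fixed
# => ATOVAQUON PLUS Spirig HC Filmtabl 250 mg / 100 mg 12 Stk

# Running the step again changes nothing, because the second strength is now found.
puts CleanupSketch.append_second_dose(fixed, "250 mg", "100 mg") == fixed
# => true
```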
.fix_missing_dose(desc, swissmedic_substance, composition, no8) ⇒ Object
Case #4: a mono product whose Refdata description carries NO strength at all (e.g. “CETIRIZIN Spirig HC Filmtabl 10 Stk”). Inserts the single active’s strength from Swissmedic before the trailing “<count> <unit>” group → “CETIRIZIN Spirig HC Filmtabl 10 mg 10 Stk”. No-op when a strength is already present or no trailing pack count exists.
    # File 'lib/oddb2xml/refdata_cleanup.rb', line 127

    def self.fix_missing_dose(desc, swissmedic_substance, composition, no8)
      return desc if desc.nil? || desc.empty?
      return desc unless MISSING_DOSE_IKSNR.include?(iksnr_of(no8))
      return desc unless single_substance?(swissmedic_substance)
      return desc if DOSE_TOKEN.match?(desc)
      dose = dose_for_substance(composition, swissmedic_substance)
      return desc unless dose
      return desc unless /\s\d[\d.,']*\s+\S+\s*\z/.match?(desc)
      desc.sub(/(\s)(\d[\d.,']*\s+\S+\s*)\z/, "\\1#{dose} \\2")
    end
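The insertion before the trailing “<count> <unit>” group can be tried on its own. The sketch below isolates the last two lines of the method in a hypothetical helper, `insert_before_pack_count`, which receives the strength as an argument; the allow-list, mono-product and existing-strength guards are deliberately left out.

```ruby
module CleanupSketch
  # Hypothetical helper: the last two lines of fix_missing_dose.
  def self.insert_before_pack_count(desc, dose)
    # Requires a trailing "<count> <unit>" group such as " 10 Stk".
    return desc unless /\s\d[\d.,']*\s+\S+\s*\z/.match?(desc)
    desc.sub(/(\s)(\d[\d.,']*\s+\S+\s*)\z/, "\\1#{dose} \\2")
  end
end

puts CleanupSketch.insert_before_pack_count("CETIRIZIN Spirig HC Filmtabl 10 Stk", "10 mg")
# => CETIRIZIN Spirig HC Filmtabl 10 mg 10 Stk

# No trailing pack count, so the name is returned unchanged.
puts CleanupSketch.insert_before_pack_count("IMPORTAL Pulver", "10 g")
# => IMPORTAL Pulver
```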
.fix_missing_volume(desc, composition, no8) ⇒ Object
Case #7: an injectable pen/solution whose Refdata description gives the strength but not the per-pen volume (e.g. “MOUNJARO KwikPen Inj Lös 7.5 mg 1 Stk”). Appends “/<vol> ml” taken from the Swissmedic composition (“… pro 0.6 ml …”) → “… 7.5 mg/0.6 ml 1 Stk”. Only fires for injectable forms that have no volume anywhere in the name yet.
    # File 'lib/oddb2xml/refdata_cleanup.rb', line 143

    def self.fix_missing_volume(desc, composition, no8)
      return desc if desc.nil? || desc.empty?
      return desc unless MISSING_VOLUME_IKSNR.include?(iksnr_of(no8))
      return desc unless /\b(?:Inj|Fertpen|Injektor|stylo|sol\b)/i.match?(desc)
      return desc if /\d\s*ml\b/i.match?(desc)
      vol = composition.to_s[/\bpro\s+([\d.,]+)\s*ml\b/i, 1]
      return desc unless vol
      m = desc.match(/\d+(?:[.,]\d+)?\s*mg/i)
      return desc unless m
      desc.sub(m[0], "#{m[0]}/#{vol} ml")
    end
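A usage sketch with the method copied into a hypothetical `CleanupSketch` module. The composition string and the last three digits of the no8 values are invented; only the IKSNR prefix 69696 comes from the allow-list.

```ruby
module CleanupSketch
  MISSING_VOLUME_IKSNR = %w[69696].freeze

  def self.iksnr_of(no8)
    no8.to_s[0, 5]
  end

  # Same body as Oddb2xml::RefdataCleanup.fix_missing_volume.
  def self.fix_missing_volume(desc, composition, no8)
    return desc if desc.nil? || desc.empty?
    return desc unless MISSING_VOLUME_IKSNR.include?(iksnr_of(no8))
    return desc unless /\b(?:Inj|Fertpen|Injektor|stylo|sol\b)/i.match?(desc)
    return desc if /\d\s*ml\b/i.match?(desc) # a volume is already present
    vol = composition.to_s[/\bpro\s+([\d.,]+)\s*ml\b/i, 1]
    return desc unless vol
    m = desc.match(/\d+(?:[.,]\d+)?\s*mg/i)
    return desc unless m
    desc.sub(m[0], "#{m[0]}/#{vol} ml")
  end
end

desc = "MOUNJARO KwikPen Inj Lös 7.5 mg 1 Stk"
comp = "tirzepatidum 7.5 mg, excipiens ad solutionem pro 0.6 ml" # invented

puts CleanupSketch.fix_missing_volume(desc, comp, "69696001")
# => MOUNJARO KwikPen Inj Lös 7.5 mg/0.6 ml 1 Stk

# A registration outside the allow-list is left untouched.
puts CleanupSketch.fix_missing_volume(desc, comp, "12345001")
# => MOUNJARO KwikPen Inj Lös 7.5 mg 1 Stk
```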
.fix_truncated_metoject(desc, no8, size) ⇒ Object
Case #1: every METOJECT Autoinjektor name is truncated at Refdata’s 50-char limit, carrying a redundant strength in the (often cut) tail (“METOJECT Autoinjektor 10 mg/0.2 ml Inj Lös 10 mg 1”). Rebuilds the name from the intact prefix plus the authoritative Swissmedic pack size → “METOJECT Autoinjektor 10 mg/0.2 ml Fertpen 1 Stk” (localised for FR/IT). Scoped to the METOJECT registration; idempotent once Refdata stops truncating (the rebuilt name no longer carries the redundant tail).
    # File 'lib/oddb2xml/refdata_cleanup.rb', line 173

    def self.fix_truncated_metoject(desc, no8, size)
      return desc if desc.nil? || desc.empty?
      return desc unless METOJECT_IKSNR.include?(iksnr_of(no8))
      return desc if size.nil? || size.to_s.empty?
      m = desc.match(%r{\A(METOJECT Autoinjektor \d[\d.]* mg/\d[\d.]* ml)\b})
      return desc unless m
      suffix = METOJECT_SUFFIX.find { |re, _| re.match?(desc) }
      return desc unless suffix
      pen, unit = suffix.last
      "#{m[1]} #{pen} #{size} #{unit}"
    end
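A usage sketch with the constants and method copied into a hypothetical `CleanupSketch` module. The German description is the truncated name quoted above; the French description, the pack size "1" and the last three digits of the no8 are assumptions made for illustration.

```ruby
module CleanupSketch
  METOJECT_IKSNR = %w[65672].freeze
  METOJECT_SUFFIX = {
    /\bInj Lös\b/ => ["Fertpen", "Stk"],          # DE
    /\binj sol\b/ => ["stylo pré", "pce"],        # FR
    /\bsol inj\b/ => ["penna preriempita", "pz"]  # IT
  }.freeze

  def self.iksnr_of(no8)
    no8.to_s[0, 5]
  end

  # Same body as Oddb2xml::RefdataCleanup.fix_truncated_metoject.
  def self.fix_truncated_metoject(desc, no8, size)
    return desc if desc.nil? || desc.empty?
    return desc unless METOJECT_IKSNR.include?(iksnr_of(no8))
    return desc if size.nil? || size.to_s.empty?
    m = desc.match(%r{\A(METOJECT Autoinjektor \d[\d.]* mg/\d[\d.]* ml)\b})
    return desc unless m
    suffix = METOJECT_SUFFIX.find { |re, _| re.match?(desc) }
    return desc unless suffix
    pen, unit = suffix.last
    "#{m[1]} #{pen} #{size} #{unit}"
  end
end

de = "METOJECT Autoinjektor 10 mg/0.2 ml Inj Lös 10 mg 1"
fr = "METOJECT Autoinjektor 10 mg/0.2 ml inj sol 10 mg 1" # invented FR variant

puts CleanupSketch.fix_truncated_metoject(de, "65672001", "1")
# => METOJECT Autoinjektor 10 mg/0.2 ml Fertpen 1 Stk
puts CleanupSketch.fix_truncated_metoject(fr, "65672001", "1")
# => METOJECT Autoinjektor 10 mg/0.2 ml stylo pré 1 pce
```

Feeding the rebuilt German name back in returns it unchanged, since it no longer contains any of the galenic-form tokens that select a suffix.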
.fix_truncated_volume_unit(desc, no8) ⇒ Object
Case #3 (partial): the VERACTIV Vitamin D3 drops are truncated at 50 chars, losing the final “l” of the volume (“… 20’000 U.I. 10m” → “10ml”). Restore it. The French wording (“Huile”, drop-form codes) in the German name is a separate upstream issue and is left untouched. Scoped to the registration; a no-op once the volume already ends in “ml”.
    # File 'lib/oddb2xml/refdata_cleanup.rb', line 192

    def self.fix_truncated_volume_unit(desc, no8)
      return desc if desc.nil? || desc.empty?
      return desc unless VERACTIV_VITD3_IKSNR.include?(iksnr_of(no8))
      desc.sub(/(\d)\s*m\z/, '\1ml')
    end
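A short sketch of the repair, with the method copied into a hypothetical `CleanupSketch` module. The description is a shortened, invented stand-in for the truncated VERACTIV name, and the last three digits of the no8 are made up.

```ruby
module CleanupSketch
  VERACTIV_VITD3_IKSNR = %w[57690].freeze

  def self.iksnr_of(no8)
    no8.to_s[0, 5]
  end

  # Same body as Oddb2xml::RefdataCleanup.fix_truncated_volume_unit.
  def self.fix_truncated_volume_unit(desc, no8)
    return desc if desc.nil? || desc.empty?
    return desc unless VERACTIV_VITD3_IKSNR.include?(iksnr_of(no8))
    desc.sub(/(\d)\s*m\z/, '\1ml') # only a dangling "m" at the very end is completed
  end
end

puts CleanupSketch.fix_truncated_volume_unit("VERACTIV Vitamin D3 20'000 U.I. 10m", "57690001")
# => VERACTIV Vitamin D3 20'000 U.I. 10ml

# Already complete: nothing to restore.
puts CleanupSketch.fix_truncated_volume_unit("VERACTIV Vitamin D3 20'000 U.I. 10ml", "57690001")
# => VERACTIV Vitamin D3 20'000 U.I. 10ml
```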
.iksnr_of(no8) ⇒ Object
Returns the IKSNR, i.e. the first 5 digits of the SwissmedicNo8, as a String.
    # File 'lib/oddb2xml/refdata_cleanup.rb', line 72

    def self.iksnr_of(no8)
      no8.to_s[0, 5]
    end
.normalize_galenic_form(desc) ⇒ Object
Normalises spelled-out German galenic forms to the Refdata house-style abbreviation. Returns the cleaned description, or the original string if no rule applies.
    # File 'lib/oddb2xml/refdata_cleanup.rb', line 47

    def self.normalize_galenic_form(desc)
      return desc if desc.nil? || desc.empty?
      GALENIC_NORMALISATIONS.reduce(desc) { |result, (re, repl)| result.gsub(re, repl) }
    end
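A usage sketch with the constant and method copied into a hypothetical `CleanupSketch` module. The German name is the RINVOQ example from the constant's documentation; the French name is invented to show the no-op case.

```ruby
module CleanupSketch
  GALENIC_NORMALISATIONS = { /\bRetardtabletten\b/ => "Ret Tabl" }.freeze

  # Same body as Oddb2xml::RefdataCleanup.normalize_galenic_form.
  def self.normalize_galenic_form(desc)
    return desc if desc.nil? || desc.empty?
    # Apply every rule in turn, feeding each result into the next substitution.
    GALENIC_NORMALISATIONS.reduce(desc) { |result, (re, repl)| result.gsub(re, repl) }
  end
end

puts CleanupSketch.normalize_galenic_form("RINVOQ Retardtabletten 30 mg 28 Stk")
# => RINVOQ Ret Tabl 30 mg 28 Stk

# No German keyword present, so the description passes through unchanged.
puts CleanupSketch.normalize_galenic_form("RINVOQ cpr ret 30 mg 28 pce")
# => RINVOQ cpr ret 30 mg 28 pce
```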
.single_substance?(swissmedic_substance) ⇒ Boolean
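A Swissmedic compositions cell like “mirtazapinum” indicates a mono product; “atovaquonum, proguanili hydrochloridum” or “pertuzumabum, trastuzumabum” indicates a real combination. Returns true only when the cell is non-blank and contains no comma; nil and blank input return false.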
    # File 'lib/oddb2xml/refdata_cleanup.rb', line 18

    def self.single_substance?(swissmedic_substance)
      return false if swissmedic_substance.nil?
      str = swissmedic_substance.to_s.strip
      return false if str.empty?
      !str.include?(",")
    end