Module: Oddb2xml::RefdataCleanup

Defined in:
lib/oddb2xml/refdata_cleanup.rb

Overview

Compensates for known data-quality issues in upstream Refdata.Articles.xml before they reach the generated output. Each fix is opt-in and guarded by a heuristic against Swissmedic data so we never alter genuine combination products. See GitHub issue #112 for the catalogue of upstream problems.

Constant Summary

DOSE_TOKEN =
/\d+(?:[.,]\d+)?\s*(?:mg|µg|mcg|g|ml|UI|U\.I\.|IE|%)/i
DOUBLE_DOSE_RE =

Matches “<dose> / <same dose> /”, the templating bug where Refdata repeats the strength. The backreference \1 only matches when the exact same dose string appears twice, so combinations with differing strengths never match. Real combos that legitimately repeat a strength (e.g. PHESGO 600 mg / 600 mg / 10 ml) are protected by the single_substance? guard in fix_double_dose; the literal match is only a backstop.

/(#{DOSE_TOKEN})\s*\/\s*\1\s*\/\s*/
GALENIC_NORMALISATIONS =

Case #13 (issue #112): a handful of products spell the galenic form out in full (“RINVOQ Retardtabletten 30 mg 28 Stk”) while the Refdata house style abbreviates it everywhere else (“Ret Tabl”, 940 other DE names). Normalise the spelled-out form to the abbreviation so the outliers match the convention. The keys are German-only words (FR/IT use “comprimé …” / “compresse …”), so applying this to FR/IT descriptions is a safe no-op.

{
  /\bRetardtabletten\b/ => "Ret Tabl"
}.freeze
COMBO_DOSE_IKSNR =

The following three fixes reconstruct dose information that Refdata dropped from <FullName>, sourcing the authoritative values from the Swissmedic “Zugelassene Packungen” composition string (already loaded as pack, keyed by the same SwissmedicNo8). See issue #112 cases #4 (missing strength), #6 (missing 2nd combo component) and #7 (missing injection volume).

Each fix is scoped to an explicit allow-list of Swissmedic registration numbers (IKSNR, the first 5 digits of the no8). A blanket heuristic is NOT safe: a dry run over the full Refdata feed mis-fired on hundreds of legitimate names. Combination detection grabbed sodium counter-ion doses (“KEPPRA … / 2.8 mg”), the missing-strength rule fired on strength-less phyto/powder products (“IMPORTAL Pulver”), and the volume rule corrupted concentration names (“CIMZIA 200 mg/ml”). Restricting to the catalogued registrations keeps the Swissmedic-derived value while touching only the known-bad products. Add an IKSNR here once a new case is confirmed. This list covers case #6 (ATOVAQUON PLUS Spirig HC).

%w[65280].freeze
MISSING_DOSE_IKSNR =

#4 CETIRIZIN Spirig HC

%w[62568].freeze
MISSING_VOLUME_IKSNR =

#7 MOUNJARO KwikPen

%w[69696].freeze
METOJECT_IKSNR =

#1 METOJECT Autoinjektor

%w[65672].freeze
METOJECT_SUFFIX =

Localised “<pen> … <count> <unit>” suffix, selected by the galenic-form token Refdata uses per language. The “<brand> Autoinjektor <dose>/<vol>” prefix is identical across DE/FR/IT, so only the suffix is localised.

{
  /\bInj Lös\b/ => ["Fertpen", "Stk"],         # DE
  /\binj sol\b/ => ["stylo pré", "pce"],       # FR
  /\bsol inj\b/ => ["penna preriempita", "pz"] # IT
}.freeze
VERACTIV_VITD3_IKSNR =

#3 VERACTIV Vitamin D3 Wild

%w[57690].freeze

Class Method Details

.dose_for_substance(composition, substance) ⇒ Object

Returns the dose token that belongs to a named active substance in the Swissmedic composition, normalised to “<number> <unit>” (e.g. dose_for_substance(comp, “atovaquonum”) => “250 mg”). Matches within the comma-delimited segment that names the substance so excipient doses are never picked up. Returns nil if absent.



# File 'lib/oddb2xml/refdata_cleanup.rb', line 89

def self.dose_for_substance(composition, substance)
  return nil if composition.nil? || substance.nil?
  key = substance.to_s.strip[/\A[A-Za-zÀ-ÿ]+/]
  return nil if key.nil? || key.empty?
  composition.split(",").each do |segment|
    next unless /\b#{Regexp.escape(key)}/i.match?(segment)
    m = segment.match(DOSE_TOKEN)
    next unless m
    parts = m[0].match(/\A([\d.,]+)\s*(.+?)\s*\z/)
    return parts ? "#{parts[1]} #{parts[2]}" : m[0].strip
  end
  nil
end
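
The lookup can be tried in isolation with the sketch below. It is a standalone copy of the method under a hypothetical module name (DoseLookupSketch) so that it runs without oddb2xml, and the composition string is made up for illustration rather than taken from Swissmedic data.

```ruby
# Standalone copy of the documented method (module name is hypothetical).
module DoseLookupSketch
  DOSE_TOKEN = /\d+(?:[.,]\d+)?\s*(?:mg|µg|mcg|g|ml|UI|U\.I\.|IE|%)/i

  def self.dose_for_substance(composition, substance)
    return nil if composition.nil? || substance.nil?
    # Only the first word of the substance name is used as the lookup key.
    key = substance.to_s.strip[/\A[A-Za-zÀ-ÿ]+/]
    return nil if key.nil? || key.empty?
    composition.split(",").each do |segment|
      next unless /\b#{Regexp.escape(key)}/i.match?(segment)
      m = segment.match(DOSE_TOKEN)
      next unless m
      parts = m[0].match(/\A([\d.,]+)\s*(.+?)\s*\z/)
      return parts ? "#{parts[1]} #{parts[2]}" : m[0].strip
    end
    nil
  end
end

# Made-up composition string, shaped like the examples on this page.
comp = "atovaquonum 250 mg, proguanili hydrochloridum 100 mg, excipiens pro compresso obducto"

p DoseLookupSketch.dose_for_substance(comp, "atovaquonum")               # "250 mg"
p DoseLookupSketch.dose_for_substance(comp, "proguanili hydrochloridum") # "100 mg"
p DoseLookupSketch.dose_for_substance(comp, "paracetamolum")             # nil
```

Because the composition is split on commas, a dose written with a decimal comma would end up in two segments; the sample above deliberately avoids that form.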

.dose_regex(dose) ⇒ Object

Builds a whitespace-tolerant matcher for a normalised dose value like “250 mg” so it also matches “250mg” in a description.



# File 'lib/oddb2xml/refdata_cleanup.rb', line 78

def self.dose_regex(dose)
  m = dose.to_s.match(/\A([\d.,]+)\s*(.+?)\s*\z/)
  return /#{Regexp.escape(dose.to_s)}/i unless m
  /(?<![\d.,])#{Regexp.escape(m[1])}\s*#{Regexp.escape(m[2])}/i
end
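
A short sketch of how the generated matcher behaves, using a standalone copy of the method under a hypothetical module name (DoseRegexSketch). The descriptions are invented for illustration.

```ruby
# Standalone copy of the documented method (module name is hypothetical).
module DoseRegexSketch
  def self.dose_regex(dose)
    m = dose.to_s.match(/\A([\d.,]+)\s*(.+?)\s*\z/)
    return /#{Regexp.escape(dose.to_s)}/i unless m
    # The lookbehind stops "250 mg" from matching inside "1250 mg".
    /(?<![\d.,])#{Regexp.escape(m[1])}\s*#{Regexp.escape(m[2])}/i
  end
end

re = DoseRegexSketch.dose_regex("250 mg")
p re.match?("EXAMPLE Filmtabl 250mg 12 Stk")   # true
p re.match?("EXAMPLE Filmtabl 250 MG 12 Stk")  # true
p re.match?("EXAMPLE Filmtabl 1250 mg 12 Stk") # false
```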

.fix_double_dose(desc, swissmedic_substance) ⇒ Object

Removes the duplicated dose token in mono products. Returns the cleaned description, or the original string if no change applies.



# File 'lib/oddb2xml/refdata_cleanup.rb', line 27

def self.fix_double_dose(desc, swissmedic_substance)
  return desc if desc.nil? || desc.empty?
  return desc unless DOUBLE_DOSE_RE.match?(desc)
  return desc unless single_substance?(swissmedic_substance)
  desc.sub(DOUBLE_DOSE_RE, '\1 / ')
end
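
The sketch below shows the interaction between the regex and the single_substance? guard. It uses standalone copies under a hypothetical module name (DoubleDoseSketch), a condensed but equivalent single_substance?, and an invented mono description; the PHESGO string is the one quoted in the constant documentation.

```ruby
# Standalone copies of the documented pieces (module name is hypothetical).
module DoubleDoseSketch
  DOSE_TOKEN = /\d+(?:[.,]\d+)?\s*(?:mg|µg|mcg|g|ml|UI|U\.I\.|IE|%)/i
  DOUBLE_DOSE_RE = /(#{DOSE_TOKEN})\s*\/\s*\1\s*\/\s*/

  # Condensed form: nil and blank give false, a comma means combination.
  def self.single_substance?(swissmedic_substance)
    str = swissmedic_substance.to_s.strip
    !str.empty? && !str.include?(",")
  end

  def self.fix_double_dose(desc, swissmedic_substance)
    return desc if desc.nil? || desc.empty?
    return desc unless DOUBLE_DOSE_RE.match?(desc)
    return desc unless single_substance?(swissmedic_substance)
    desc.sub(DOUBLE_DOSE_RE, '\1 / ')
  end
end

# Mono product with the strength repeated: the duplicate is dropped.
puts DoubleDoseSketch.fix_double_dose("EXAMPLE Filmtabl 30 mg / 30 mg / 28 Stk", "mirtazapinum")
# EXAMPLE Filmtabl 30 mg / 28 Stk

# Real combination: the regex matches, but the guard leaves it alone.
puts DoubleDoseSketch.fix_double_dose("PHESGO 600 mg / 600 mg / 10 ml", "pertuzumabum, trastuzumabum")
# PHESGO 600 mg / 600 mg / 10 ml
```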

.fix_missing_combo_dose(desc, swissmedic_substance, composition, no8) ⇒ Object

Case #6: a real combination product whose Refdata description carries only the first component’s strength (e.g. “ATOVAQUON PLUS … 250 mg …”). Appends the second active’s strength from Swissmedic, producing “… 250 mg / 100 mg …”. No-op for mono products, 3+ component combos, or when the second strength is already present.



# File 'lib/oddb2xml/refdata_cleanup.rb', line 108

def self.fix_missing_combo_dose(desc, swissmedic_substance, composition, no8)
  return desc if desc.nil? || desc.empty?
  return desc unless COMBO_DOSE_IKSNR.include?(iksnr_of(no8))
  return desc if single_substance?(swissmedic_substance)
  subs = swissmedic_substance.to_s.split(",").map(&:strip)
  return desc unless subs.size == 2
  d1 = dose_for_substance(composition, subs[0])
  d2 = dose_for_substance(composition, subs[1])
  return desc unless d1 && d2
  return desc unless dose_regex(d1).match?(desc)
  return desc if dose_regex(d2).match?(desc)
  desc.sub(dose_regex(d1)) { |hit| "#{hit} / #{d2}" }
end
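
An end-to-end sketch of this fix, assuming a hypothetical module name (ComboDoseSketch) that bundles standalone copies of the helpers it depends on. The description, composition string and pack number "65280001" are invented; only the IKSNR prefix 65280 comes from the allow-list above.

```ruby
# Standalone copies of the documented helpers (module name is hypothetical).
module ComboDoseSketch
  DOSE_TOKEN = /\d+(?:[.,]\d+)?\s*(?:mg|µg|mcg|g|ml|UI|U\.I\.|IE|%)/i
  COMBO_DOSE_IKSNR = %w[65280].freeze

  def self.iksnr_of(no8)
    no8.to_s[0, 5]
  end

  def self.single_substance?(substance)
    str = substance.to_s.strip
    !str.empty? && !str.include?(",")
  end

  def self.dose_for_substance(composition, substance)
    return nil if composition.nil? || substance.nil?
    key = substance.to_s.strip[/\A[A-Za-zÀ-ÿ]+/]
    return nil if key.nil? || key.empty?
    composition.split(",").each do |segment|
      next unless /\b#{Regexp.escape(key)}/i.match?(segment)
      m = segment.match(DOSE_TOKEN)
      next unless m
      parts = m[0].match(/\A([\d.,]+)\s*(.+?)\s*\z/)
      return parts ? "#{parts[1]} #{parts[2]}" : m[0].strip
    end
    nil
  end

  def self.dose_regex(dose)
    m = dose.to_s.match(/\A([\d.,]+)\s*(.+?)\s*\z/)
    return /#{Regexp.escape(dose.to_s)}/i unless m
    /(?<![\d.,])#{Regexp.escape(m[1])}\s*#{Regexp.escape(m[2])}/i
  end

  def self.fix_missing_combo_dose(desc, swissmedic_substance, composition, no8)
    return desc if desc.nil? || desc.empty?
    return desc unless COMBO_DOSE_IKSNR.include?(iksnr_of(no8))
    return desc if single_substance?(swissmedic_substance)
    subs = swissmedic_substance.to_s.split(",").map(&:strip)
    return desc unless subs.size == 2
    d1 = dose_for_substance(composition, subs[0])
    d2 = dose_for_substance(composition, subs[1])
    return desc unless d1 && d2
    return desc unless dose_regex(d1).match?(desc)
    return desc if dose_regex(d2).match?(desc)
    desc.sub(dose_regex(d1)) { |hit| "#{hit} / #{d2}" }
  end
end

# Invented inputs for illustration.
substances  = "atovaquonum, proguanili hydrochloridum"
composition = "atovaquonum 250 mg, proguanili hydrochloridum 100 mg, excipiens pro compresso obducto"
desc        = "EXAMPLE PLUS Filmtabl 250 mg 12 Stk"

fixed = ComboDoseSketch.fix_missing_combo_dose(desc, substances, composition, "65280001")
puts fixed
# EXAMPLE PLUS Filmtabl 250 mg / 100 mg 12 Stk

# A second pass changes nothing because the second strength is now present.
puts ComboDoseSketch.fix_missing_combo_dose(fixed, substances, composition, "65280001") == fixed
# true

# A registration outside the allow-list is never touched.
puts ComboDoseSketch.fix_missing_combo_dose(desc, substances, composition, "12345001")
# EXAMPLE PLUS Filmtabl 250 mg 12 Stk
```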

.fix_missing_dose(desc, swissmedic_substance, composition, no8) ⇒ Object

Case #4: a mono product whose Refdata description carries NO strength at all (e.g. “CETIRIZIN Spirig HC Filmtabl 10 Stk”). Inserts the single active’s strength from Swissmedic before the trailing “<count> <unit>” group → “CETIRIZIN Spirig HC Filmtabl 10 mg 10 Stk”. No-op when a strength is already present or no trailing pack count exists.



# File 'lib/oddb2xml/refdata_cleanup.rb', line 127

def self.fix_missing_dose(desc, swissmedic_substance, composition, no8)
  return desc if desc.nil? || desc.empty?
  return desc unless MISSING_DOSE_IKSNR.include?(iksnr_of(no8))
  return desc unless single_substance?(swissmedic_substance)
  return desc if DOSE_TOKEN.match?(desc)
  dose = dose_for_substance(composition, swissmedic_substance)
  return desc unless dose
  return desc unless /\s\d[\d.,']*\s+\S+\s*\z/.match?(desc)
  desc.sub(/(\s)(\d[\d.,']*\s+\S+\s*)\z/, "\\1#{dose} \\2")
end
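
The following sketch replays the CETIRIZIN example from the description. It assumes a hypothetical module name (MissingDoseSketch) with standalone copies of the helpers; the substance name, composition string and pack number "62568001" are invented for illustration.

```ruby
# Standalone copies of the documented helpers (module name is hypothetical).
module MissingDoseSketch
  DOSE_TOKEN = /\d+(?:[.,]\d+)?\s*(?:mg|µg|mcg|g|ml|UI|U\.I\.|IE|%)/i
  MISSING_DOSE_IKSNR = %w[62568].freeze

  def self.iksnr_of(no8)
    no8.to_s[0, 5]
  end

  def self.single_substance?(substance)
    str = substance.to_s.strip
    !str.empty? && !str.include?(",")
  end

  def self.dose_for_substance(composition, substance)
    return nil if composition.nil? || substance.nil?
    key = substance.to_s.strip[/\A[A-Za-zÀ-ÿ]+/]
    return nil if key.nil? || key.empty?
    composition.split(",").each do |segment|
      next unless /\b#{Regexp.escape(key)}/i.match?(segment)
      m = segment.match(DOSE_TOKEN)
      next unless m
      parts = m[0].match(/\A([\d.,]+)\s*(.+?)\s*\z/)
      return parts ? "#{parts[1]} #{parts[2]}" : m[0].strip
    end
    nil
  end

  def self.fix_missing_dose(desc, swissmedic_substance, composition, no8)
    return desc if desc.nil? || desc.empty?
    return desc unless MISSING_DOSE_IKSNR.include?(iksnr_of(no8))
    return desc unless single_substance?(swissmedic_substance)
    return desc if DOSE_TOKEN.match?(desc)
    dose = dose_for_substance(composition, swissmedic_substance)
    return desc unless dose
    return desc unless /\s\d[\d.,']*\s+\S+\s*\z/.match?(desc)
    # Ruby reads only one digit after the backslash, so "\\1" followed by
    # a dose such as "10 mg" is still group 1 plus the literal text.
    desc.sub(/(\s)(\d[\d.,']*\s+\S+\s*)\z/, "\\1#{dose} \\2")
  end
end

# Invented substance and composition strings for illustration.
substance   = "cetirizini dihydrochloridum"
composition = "cetirizini dihydrochloridum 10 mg, excipiens pro compresso obducto"

fixed = MissingDoseSketch.fix_missing_dose("CETIRIZIN Spirig HC Filmtabl 10 Stk", substance, composition, "62568001")
puts fixed
# CETIRIZIN Spirig HC Filmtabl 10 mg 10 Stk

# Once a strength is present the fix is a no-op.
puts MissingDoseSketch.fix_missing_dose(fixed, substance, composition, "62568001") == fixed
# true
```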

.fix_missing_volume(desc, composition, no8) ⇒ Object

Case #7: an injectable pen/solution whose Refdata description gives the strength but not the per-pen volume (e.g. “MOUNJARO KwikPen Inj Lös 7.5 mg 1 Stk”). Appends “/<vol> ml” taken from the Swissmedic composition (“… pro 0.6 ml …”) → “… 7.5 mg/0.6 ml 1 Stk”. Only fires for injectable forms that have no volume anywhere in the name yet.



# File 'lib/oddb2xml/refdata_cleanup.rb', line 143

def self.fix_missing_volume(desc, composition, no8)
  return desc if desc.nil? || desc.empty?
  return desc unless MISSING_VOLUME_IKSNR.include?(iksnr_of(no8))
  return desc unless /\b(?:Inj|Fertpen|Injektor|stylo|sol\b)/i.match?(desc)
  return desc if /\d\s*ml\b/i.match?(desc)
  vol = composition.to_s[/\bpro\s+([\d.,]+)\s*ml\b/i, 1]
  return desc unless vol
  m = desc.match(/\d+(?:[.,]\d+)?\s*mg/i)
  return desc unless m
  desc.sub(m[0], "#{m[0]}/#{vol} ml")
end
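
A minimal sketch of the volume fix on the MOUNJARO example from the description. The module name (MissingVolumeSketch), the composition string and the pack number "69696001" are assumptions made for illustration.

```ruby
# Standalone copy of the documented method (module name is hypothetical).
module MissingVolumeSketch
  MISSING_VOLUME_IKSNR = %w[69696].freeze

  def self.iksnr_of(no8)
    no8.to_s[0, 5]
  end

  def self.fix_missing_volume(desc, composition, no8)
    return desc if desc.nil? || desc.empty?
    return desc unless MISSING_VOLUME_IKSNR.include?(iksnr_of(no8))
    return desc unless /\b(?:Inj|Fertpen|Injektor|stylo|sol\b)/i.match?(desc)
    return desc if /\d\s*ml\b/i.match?(desc)
    # The per-pen volume is read from the "pro <n> ml" part of the composition.
    vol = composition.to_s[/\bpro\s+([\d.,]+)\s*ml\b/i, 1]
    return desc unless vol
    m = desc.match(/\d+(?:[.,]\d+)?\s*mg/i)
    return desc unless m
    desc.sub(m[0], "#{m[0]}/#{vol} ml")
  end
end

# Invented composition string containing the "pro 0.6 ml" fragment.
composition = "tirzepatidum 7.5 mg, excipiens ad solutionem pro 0.6 ml"

fixed = MissingVolumeSketch.fix_missing_volume("MOUNJARO KwikPen Inj Lös 7.5 mg 1 Stk", composition, "69696001")
puts fixed
# MOUNJARO KwikPen Inj Lös 7.5 mg/0.6 ml 1 Stk

# With a volume already in the name, nothing changes.
puts MissingVolumeSketch.fix_missing_volume(fixed, composition, "69696001") == fixed
# true
```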

.fix_truncated_metoject(desc, no8, size) ⇒ Object

Case #1: every METOJECT Autoinjektor name is truncated at Refdata’s 50-char limit, carrying a redundant strength in the (often cut) tail (“METOJECT Autoinjektor 10 mg/0.2 ml Inj Lös 10 mg 1”). Rebuild from the intact prefix plus the authoritative Swissmedic pack size → “METOJECT Autoinjektor 10 mg/0.2 ml Fertpen 1 Stk” (localised for FR/IT). Scoped to the METOJECT registration; idempotent once Refdata stops truncating (the rebuilt name no longer carries the redundant tail).



# File 'lib/oddb2xml/refdata_cleanup.rb', line 173

def self.fix_truncated_metoject(desc, no8, size)
  return desc if desc.nil? || desc.empty?
  return desc unless METOJECT_IKSNR.include?(iksnr_of(no8))
  return desc if size.nil? || size.to_s.empty?
  m = desc.match(%r{\A(METOJECT Autoinjektor \d[\d.]* mg/\d[\d.]* ml)\b})
  return desc unless m
  suffix = METOJECT_SUFFIX.find { |re, _| re.match?(desc) }
  return desc unless suffix
  pen, unit = suffix.last
  "#{m[1]} #{pen} #{size} #{unit}"
end
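
The rebuild can be sketched as follows, with standalone copies of the constants and method under a hypothetical module name (MetojectSketch). The German input is the truncated name quoted in the description; the French input and the pack number "65672001" are invented for illustration.

```ruby
# Standalone copy of the documented method (module name is hypothetical).
module MetojectSketch
  METOJECT_IKSNR = %w[65672].freeze
  METOJECT_SUFFIX = {
    /\bInj Lös\b/ => ["Fertpen", "Stk"],         # DE
    /\binj sol\b/ => ["stylo pré", "pce"],       # FR
    /\bsol inj\b/ => ["penna preriempita", "pz"] # IT
  }.freeze

  def self.iksnr_of(no8)
    no8.to_s[0, 5]
  end

  def self.fix_truncated_metoject(desc, no8, size)
    return desc if desc.nil? || desc.empty?
    return desc unless METOJECT_IKSNR.include?(iksnr_of(no8))
    return desc if size.nil? || size.to_s.empty?
    m = desc.match(%r{\A(METOJECT Autoinjektor \d[\d.]* mg/\d[\d.]* ml)\b})
    return desc unless m
    # The galenic-form token in the name selects the localised suffix.
    suffix = METOJECT_SUFFIX.find { |re, _| re.match?(desc) }
    return desc unless suffix
    pen, unit = suffix.last
    "#{m[1]} #{pen} #{size} #{unit}"
  end
end

de = MetojectSketch.fix_truncated_metoject("METOJECT Autoinjektor 10 mg/0.2 ml Inj Lös 10 mg 1", "65672001", 1)
puts de
# METOJECT Autoinjektor 10 mg/0.2 ml Fertpen 1 Stk

# Invented French variant using the FR token from METOJECT_SUFFIX.
puts MetojectSketch.fix_truncated_metoject("METOJECT Autoinjektor 10 mg/0.2 ml inj sol 10 mg 1", "65672001", 1)
# METOJECT Autoinjektor 10 mg/0.2 ml stylo pré 1 pce

# The rebuilt name no longer contains a galenic-form token, so a rerun is a no-op.
puts MetojectSketch.fix_truncated_metoject(de, "65672001", 1) == de
# true
```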

.fix_truncated_volume_unit(desc, no8) ⇒ Object

Case #3 (partial): the VERACTIV Vitamin D3 drops are truncated at 50 chars, losing the final “l” of the volume (“… 20’000 U.I. 10m” → “10ml”). Restore it. The French wording (“Huile”, drop-form codes) in the German name is a separate upstream issue and is left untouched. Scoped to the registration; a no-op once the volume already ends in “ml”.



# File 'lib/oddb2xml/refdata_cleanup.rb', line 192

def self.fix_truncated_volume_unit(desc, no8)
  return desc if desc.nil? || desc.empty?
  return desc unless VERACTIV_VITD3_IKSNR.include?(iksnr_of(no8))
  desc.sub(/(\d)\s*m\z/, '\1ml')
end
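
A small sketch of the unit repair, using a standalone copy under a hypothetical module name (VolumeUnitSketch). The description and pack number "57690001" are invented; only the “20'000 U.I. 10m” tail mirrors the example above.

```ruby
# Standalone copy of the documented method (module name is hypothetical).
module VolumeUnitSketch
  VERACTIV_VITD3_IKSNR = %w[57690].freeze

  def self.iksnr_of(no8)
    no8.to_s[0, 5]
  end

  def self.fix_truncated_volume_unit(desc, no8)
    return desc if desc.nil? || desc.empty?
    return desc unless VERACTIV_VITD3_IKSNR.include?(iksnr_of(no8))
    # Only a trailing "<digit>m" at the very end of the name is repaired.
    desc.sub(/(\d)\s*m\z/, '\1ml')
  end
end

puts VolumeUnitSketch.fix_truncated_volume_unit("EXAMPLE Tropfen 20'000 U.I. 10m", "57690001")
# EXAMPLE Tropfen 20'000 U.I. 10ml

# Already complete: the pattern no longer matches.
puts VolumeUnitSketch.fix_truncated_volume_unit("EXAMPLE Tropfen 20'000 U.I. 10ml", "57690001")
# EXAMPLE Tropfen 20'000 U.I. 10ml
```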

.iksnr_of(no8) ⇒ Object

Returns the IKSNR (Swissmedic registration number), i.e. the first 5 digits of the given no8, as a String.



# File 'lib/oddb2xml/refdata_cleanup.rb', line 72

def self.iksnr_of(no8)
  no8.to_s[0, 5]
end
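
A quick sketch of the prefix extraction, with a standalone copy under a hypothetical module name (IksnrSketch). The pack number "65280001" is invented; only its first five digits come from the allow-lists above.

```ruby
# Standalone copy of the documented helper (module name is hypothetical).
module IksnrSketch
  def self.iksnr_of(no8)
    no8.to_s[0, 5]
  end
end

p IksnrSketch.iksnr_of("65280001") # "65280"
p IksnrSketch.iksnr_of(65280001)   # "65280"
p IksnrSketch.iksnr_of(nil)        # ""
```

A nil no8 yields an empty string, which is in none of the allow-lists, so the scoped fixes fall through unchanged. An Integer with fewer than eight digits would yield a shifted prefix, so passing the zero-padded String form seems the safer choice.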

.normalize_galenic_form(desc) ⇒ Object

Normalises spelled-out German galenic forms to the Refdata house-style abbreviation. Returns the cleaned description, or the original string if no rule applies.



# File 'lib/oddb2xml/refdata_cleanup.rb', line 47

def self.normalize_galenic_form(desc)
  return desc if desc.nil? || desc.empty?
  GALENIC_NORMALISATIONS.reduce(desc) { |result, (re, repl)| result.gsub(re, repl) }
end
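
The normalisation can be exercised with the sketch below, a standalone copy under a hypothetical module name (GalenicSketch). The German name is the RINVOQ example from GALENIC_NORMALISATIONS; the French name is invented to show the no-op case.

```ruby
# Standalone copy of the documented method (module name is hypothetical).
module GalenicSketch
  GALENIC_NORMALISATIONS = {
    /\bRetardtabletten\b/ => "Ret Tabl"
  }.freeze

  def self.normalize_galenic_form(desc)
    return desc if desc.nil? || desc.empty?
    # Each rule is applied in turn to the result of the previous one.
    GALENIC_NORMALISATIONS.reduce(desc) { |result, (re, repl)| result.gsub(re, repl) }
  end
end

puts GalenicSketch.normalize_galenic_form("RINVOQ Retardtabletten 30 mg 28 Stk")
# RINVOQ Ret Tabl 30 mg 28 Stk

# Invented French description: no German key matches, so it is returned as-is.
puts GalenicSketch.normalize_galenic_form("EXAMPLE cpr ret 30 mg 28 pce")
# EXAMPLE cpr ret 30 mg 28 pce
```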

.single_substance?(swissmedic_substance) ⇒ Boolean

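Returns true when the Swissmedic substance cell names exactly one active substance, i.e. contains no comma: “mirtazapinum” indicates a mono product, while “atovaquonum, proguanili hydrochloridum” or “pertuzumabum, trastuzumabum” indicates a real combination. Returns false for nil or blank input.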

Returns:

  • (Boolean)


# File 'lib/oddb2xml/refdata_cleanup.rb', line 18

def self.single_substance?(swissmedic_substance)
  return false if swissmedic_substance.nil?
  str = swissmedic_substance.to_s.strip
  return false if str.empty?
  !str.include?(",")
end
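
The predicate can be checked against the substance strings quoted above with a standalone copy under a hypothetical module name (SubstanceSketch):

```ruby
# Standalone copy of the documented predicate (module name is hypothetical).
module SubstanceSketch
  def self.single_substance?(swissmedic_substance)
    return false if swissmedic_substance.nil?
    str = swissmedic_substance.to_s.strip
    return false if str.empty?
    !str.include?(",")
  end
end

p SubstanceSketch.single_substance?("mirtazapinum")                           # true
p SubstanceSketch.single_substance?("atovaquonum, proguanili hydrochloridum") # false
p SubstanceSketch.single_substance?("   ")                                    # false
p SubstanceSketch.single_substance?(nil)                                      # false
```

Because nil and blank input both return false, the mono-only fixes (fix_double_dose, fix_missing_dose) leave a description untouched when the Swissmedic substance is unknown.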