Module: Scrapetor::Native

Defined in:
lib/scrapetor/native.rb,
lib/scrapetor/native_dom.rb,
ext/scrapetor/native/scrapetor_native.c

Overview

Bridge to the native streaming extraction engine.

If the C extension is loadable, Native.available? is true and Native.compile_descriptor turns a Schema into the flat format the C side consumes. Schemas using features outside the native fast-path subset (combinators, pseudo-classes, nested repeated groups, top-level fields without a repeated context) compile to nil, and the Extractor falls back to the Ruby path.

Defined Under Namespace

Classes: DocumentWrapper, Element

Constant Summary

SYNTHETIC_ROOT =

Compile a Schema into the descriptor format the C side consumes.

desc   = [group, group, ...]
group  = [name_sym, sel, fields_array]
field  = [name_sym, sel, attr_str_or_nil, type_sym, clean_bool,
          normalize_url_bool, multi_bool, context_or_nil,
          combinator_or_nil]
sel    = [tag_or_nil, classes_array, id_or_nil, attrs_array]
attrs_array = [[name_str, op_str_or_nil, val_str_or_nil], ...]

Returns nil if the schema uses features the native path doesn’t support yet.

:__scrapetor_root__
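
As an illustration of the grammar above, here is a hand-written descriptor for a hypothetical schema with one repeated group. The group name, field names, selectors and the :string type are invented for the example. The two trailing nil slots in each field follow what compile_field on this page returns for a simple selector (context and combinator); treat the exact shape as a sketch, not as the format the C side guarantees.

```ruby
# Hand-built descriptor following the documented grammar.
# Every name here (:products, :title, :url, the selectors) is hypothetical.

# sel = [tag_or_nil, classes_array, id_or_nil, attrs_array]
def example_sel(tag: nil, classes: [], id: nil, attrs: [])
  [tag, classes, id, attrs]
end

def example_descriptor
  title_field = [:title, example_sel(tag: "h2"), nil, :string,
                 true, false, false, nil, nil]
  url_field   = [:url, example_sel(tag: "a", classes: ["link"]), "href", :string,
                 false, true, false, nil, nil]

  # desc = [group, ...]; group = [name_sym, sel, fields_array]
  [
    [:products, example_sel(tag: "div", classes: ["product"]), [title_field, url_field]]
  ]
end
```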
HTML_ROOT_SEL =
["html", [], nil, []].freeze
AVAILABLE_DOM =

Wrapper module. `Scrapetor::Native::Document` is a TypedData class defined in C (see ext/scrapetor/native/scrapetor_dom.c). It exposes node-id based accessors. This module adds Ruby-level helpers and the Element wrapper, which `Scrapetor::Node` can wrap and operate on the same way it does a pure-Ruby `Dom::Element`.

defined?(Scrapetor::Native::Document)
PSEUDO_ELEMENT_RE =

Pseudo-element handling at the css() boundary.

/(::(?:text|attr\([^)]+\)|first-letter|first-line|before|after))\s*\z/i.freeze
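
A small sketch of how a trailing pseudo-element can be split off a selector with this pattern. The regex is copied from the constant above; the helper name split_trailing_pseudo_element is hypothetical and is not the module's own peel_pseudo_element, whose implementation is not shown on this page.

```ruby
# Same pattern as PSEUDO_ELEMENT_RE above, copied so the example runs standalone.
TRAILING_PSEUDO_ELEMENT_RE =
  /(::(?:text|attr\([^)]+\)|first-letter|first-line|before|after))\s*\z/i.freeze

# Hypothetical helper: returns [selector_without_pseudo_element, pseudo_element_or_nil].
def split_trailing_pseudo_element(selector)
  m = TRAILING_PSEUDO_ELEMENT_RE.match(selector)
  return [selector.strip, nil] unless m

  [selector[0...m.begin(0)].strip, m[1]]
end
```

For example, split_trailing_pseudo_element("a.title::text") yields the selector "a.title" and the pseudo-element "::text", while a selector with no trailing pseudo-element comes back unchanged with nil.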
NATIVE_PSEUDO_FLAGS =

Mirrors C_PS_* in ext/scrapetor/native/scrapetor_dom.c. Keep in sync.

{
  "first-child"       => 1 << 0,
  "last-child"        => 1 << 1,
  "only-child"        => 1 << 2,
  "first-of-type"     => 1 << 3,
  "last-of-type"      => 1 << 4,
  "only-of-type"      => 1 << 5,
  "empty"             => 1 << 6,
  "root"              => 1 << 7,
  "checked"           => 1 << 8,
  "disabled"          => 1 << 9,
  "enabled"           => 1 << 10,
  "required"          => 1 << 11,
  "optional"          => 1 << 12,
  "read-only"         => 1 << 13,
  "read-write"        => 1 << 14,
  "any-link"          => 1 << 15,
  "link"              => 1 << 15,
  "scope"             => 1 << 23
}.freeze
NATIVE_NTH_BITS =
{
  "nth-child"          => 1 << 16,
  "nth-last-child"     => 1 << 17,
  "nth-of-type"        => 1 << 18,
  "nth-last-of-type"   => 1 << 19
}.freeze
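
The tables above are bitmask lookups: each recognised pseudo-class name contributes one bit to a flags word. A minimal sketch of that folding, using a copied subset of the table; the names EXAMPLE_PSEUDO_FLAGS and pseudo_mask are hypothetical and exist only for this example.

```ruby
# Subset of NATIVE_PSEUDO_FLAGS, copied so the example runs standalone.
EXAMPLE_PSEUDO_FLAGS = {
  "first-child" => 1 << 0,
  "last-child"  => 1 << 1,
  "empty"       => 1 << 6,
  "any-link"    => 1 << 15,
  "link"        => 1 << 15
}.freeze

# ORs the bit of every name together; returns nil as soon as a name is
# not in the table (the point at which a caller would fall back).
def pseudo_mask(names)
  names.reduce(0) do |mask, name|
    bit = EXAMPLE_PSEUDO_FLAGS[name]
    return nil unless bit

    mask | bit
  end
end
```

Because "any-link" and "link" share bit 15, asking for both sets the same single bit.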
NATIVE_PSEUDO_FALLBACK =
:__scrapetor_native_fallback__
HET_PSEUDO_CACHE =
{}
HET_PSEUDO_CACHE_CAP =
1024
IS_AT_BOUNDARY_RE =

`:is(A, B C)` distribution. Finds a `:is(…)` / `:matches(…)` / `:where(…)` token that sits at an atom boundary (i.e. preceded and followed by start/end/combinator/whitespace) and whose alternatives include at least one with a combinator or whitespace inside. Returns one group string per alternative, with the alternative substituted in. Without this rewrite, a selector like `:is(aside, main .x) .y` falls back to the Ruby DOM parser because the native engine can’t represent multi-atom alternatives inside `:is`. Returns `[group_str]` (a single element) when no rewrite applies; the caller treats that as a no-op.

/
  (?:\A|(?<=[\s>+~,]))
  :(?:is|matches|where)\(
/x.freeze
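
To make the boundary condition concrete, the sketch below copies the pattern and reports where (if anywhere) it matches. The helper name is_token_offset is hypothetical. Note that the lookbehind only accepts start-of-string, whitespace, a combinator or a comma before the colon, so a token glued to a tag name or nested inside parentheses does not match.

```ruby
# Same pattern as IS_AT_BOUNDARY_RE above, copied so the example runs standalone.
BOUNDARY_IS_RE = /
  (?:\A|(?<=[\s>+~,]))
  :(?:is|matches|where)\(
/x.freeze

# Hypothetical helper: offset of the first boundary :is/:matches/:where token, or nil.
def is_token_offset(selector)
  m = BOUNDARY_IS_RE.match(selector)
  m && m.begin(0)
end
```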

Class Method Summary

Class Method Details

.available? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/scrapetor/native.rb', line 22

def self.available?
  AVAILABLE
end

.build_descriptor(schema) ⇒ Object



# File 'lib/scrapetor/native.rb', line 79

def self.build_descriptor(schema)
  groups = []

  # Top-level fields become a synthetic group bound to the <html>
  # element. The Document layer unwraps the single result back into
  # the top of the response hash. Fragments without <html> fall back
  # to the Ruby path.
  if schema.fields.any?
    field_descs = schema.fields.map { |f| compile_field(f) }
    return nil if field_descs.any?(&:nil?)
    groups << [SYNTHETIC_ROOT, HTML_ROOT_SEL, field_descs]
  end

  schema.groups.each do |g|
    gd = compile_group(g)
    return nil unless gd
    groups << gd
  end

  return nil if groups.empty?
  groups
end

.compile_descriptor(schema) ⇒ Object

Memoised on the Schema instance — the descriptor Array tree is identical for every call against the same schema, so rebuilding it on each extract was just GC pressure. Both successful descriptors and the “can’t compile” outcome are cached.



# File 'lib/scrapetor/native.rb', line 44

def self.compile_descriptor(schema)
  cached = schema.instance_variable_get(:@__scrapetor_native_desc)
  unless cached.nil?
    return cached == false ? nil : cached
  end

  desc = build_descriptor(schema)
  schema.instance_variable_set(:@__scrapetor_native_desc, desc.nil? ? false : desc)
  desc
end
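
The caching trick above is that nil means "not computed yet", so a failed compile has to be stored as something else; false serves as that sentinel. A self-contained sketch of the same pattern, with a hypothetical FakeSchema class and instance-variable name standing in for the real Schema:

```ruby
# Stand-in for a Schema. `compilable` decides whether the build succeeds;
# `builds` counts how often the expensive step actually ran.
class FakeSchema
  attr_reader :builds

  def initialize(compilable)
    @compilable = compilable
    @builds = 0
  end

  def build
    @builds += 1
    @compilable ? [[:group, ["div", [], nil, []], []]] : nil
  end
end

# nil in the ivar = never computed; false = computed and failed.
def memoised_descriptor(schema)
  cached = schema.instance_variable_get(:@memo_desc)
  unless cached.nil?
    return cached == false ? nil : cached
  end

  desc = schema.build
  schema.instance_variable_set(:@memo_desc, desc.nil? ? false : desc)
  desc
end
```

Calling memoised_descriptor twice on the same object runs build once, whether the outcome was a descriptor or nil.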

.compile_extract_fields(fields, wrapper) ⇒ Object

Compile a `{ key => selector_string }` fields map into the parallel (keys, plans, kinds, args) arrays the C extract_one_native / extract_each_native entry points consume. Returns the 4-tuple on success, nil when any selector can’t be compiled natively (the caller falls back to the slow per-row at_css loop).

The companion check heterogeneous_pseudo_groups? returns true if a comma-separated selector has groups with different pseudo-element shapes (e.g. `.a > ::text, .b`), so callers can split and peel per group instead of using one shared peel.

kinds:

0 = Element  (C side allocates the wrapper)
1 = ::text   (TextNode of subtree text)
2 = ::attr   (TextNode of attribute value)

plan = nil with kind = 2 means a bare `::attr(name)` against the scope element itself: the C side reads the attribute directly without running a plan. The peel and plan-cache lookups here cost a few hundred nanoseconds and are amortised across every iteration of the resulting C-side loop.



# File 'lib/scrapetor/native_dom.rb', line 2167

def self.compile_extract_fields(fields, wrapper)
  keys  = []
  plans = []
  kinds = []
  args  = []
  fields.each_pair do |key, sel|
    keys << key
    sel_str = sel.is_a?(String) ? sel : sel.to_s
    stripped, kind, arg = peel_pseudo_element(sel_str)
    stripped = "*" if stripped.empty? && kind.nil?
    if stripped.empty? && (kind == :attr || kind == :direct_attr)
      plans << nil; kinds << 2; args << arg.to_s
      next
    end
    return nil if stripped.include?(",")
    plan = wrapper.compiled_plan(stripped)
    return nil unless plan
    plans << plan
    case kind
    when :text, :text_approx then kinds << 1; args << ""
    when :attr               then kinds << 2; args << arg.to_s
    when nil                 then kinds << 0; args << ""
    else return nil   # :direct_text / :direct_attr / unsupported
    end
  end
  [keys, plans, kinds, args]
end
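
The parallel-array layout is easier to see on a concrete input. The sketch below is a simplified stand-in, not the method above: it uses a hypothetical classify_selector helper with its own regexes instead of peel_pseudo_element, and it keeps the stripped selector string where the real code stores a compiled plan.

```ruby
KIND_ELEMENT = 0
KIND_TEXT    = 1
KIND_ATTR    = 2

# Hypothetical, simplified peel: returns [stripped_selector, kind, arg].
def classify_selector(selector)
  case selector
  when /\A(.*?)::text\s*\z/            then [$1.strip, KIND_TEXT, ""]
  when /\A(.*?)::attr\(([^)]+)\)\s*\z/ then [$1.strip, KIND_ATTR, $2]
  else                                      [selector.strip, KIND_ELEMENT, ""]
  end
end

# Builds index-aligned arrays; returns nil if any selector is a comma list,
# mirroring the "caller falls back" contract described above.
def parallel_field_arrays(fields)
  keys, sels, kinds, args = [], [], [], []
  fields.each_pair do |key, sel|
    stripped, kind, arg = classify_selector(sel.to_s)
    return nil if stripped.include?(",")

    keys << key
    sels << stripped
    kinds << kind
    args << arg
  end
  [keys, sels, kinds, args]
end
```

Index i in every array describes the same field, which lets a C loop walk four flat arrays instead of a Hash.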

.compile_field(field) ⇒ Object



# File 'lib/scrapetor/native.rb', line 115

def self.compile_field(field)
  # Features the native engine doesn't yet support — fall back to Ruby.
  return nil if field.selector.is_a?(Array)
  return nil if field.transform
  return nil unless field.default.nil?
  return nil if field.required
  return nil if %i[html list json boolean].include?(field.type)

  # Try simple selector first.
  simple = parse_selector(field.selector)
  if simple
    return [field.name, simple, field.attr_str, field.type,
            !!field.clean, !!field.normalize_url, !!field.multi,
            nil, nil]
  end

  # Try combinator selector.
  combo = parse_selector_with_combinator(field.selector)
  if combo
    primary, combinator, context = combo
    return [field.name, primary, field.attr_str, field.type,
            !!field.clean, !!field.normalize_url, !!field.multi,
            context, combinator]
  end

  nil
end

.compile_group(group) ⇒ Object



# File 'lib/scrapetor/native.rb', line 102

def self.compile_group(group)
  sel = parse_selector(group.selector)
  return nil unless sel
  return nil unless group.groups.empty? # nested groups: Ruby fallback
  fields = []
  group.fields.each do |f|
    fd = compile_field(f)
    return nil unless fd
    fields << fd
  end
  [group.name, sel, fields]
end

.compile_selector_chain(selector_str) ⇒ Object



# File 'lib/scrapetor/native_dom.rb', line 1568

def self.compile_selector_chain(selector_str)
  plan = Scrapetor::Selector.compile(selector_str)
  out = []
  plan.each do |atom|
    pseudo_data = nil
    if atom.pseudos && !atom.pseudos.empty?
      pseudo_data = native_pseudo_data(atom.pseudos)
      return nil if pseudo_data == NATIVE_PSEUDO_FALLBACK
    end
    sel = [
      atom.tag ? atom.tag.to_s : nil,
      atom.classes,
      atom.id,
      atom.attrs,
      pseudo_data
    ]
    combo =
      case atom.combinator
      when :descendant then "descendant"
      when :child      then "child"
      when :adj        then "adjacent"
      when :gen        then "sibling"
      else nil
      end
    out << [sel, combo]
  end
  out
rescue ArgumentError
  nil
end

.expand_is_groups(group_str, force: false) ⇒ Object



# File 'lib/scrapetor/native_dom.rb', line 2222

def self.expand_is_groups(group_str, force: false)
  m = IS_AT_BOUNDARY_RE.match(group_str)
  return [group_str] unless m
  paren_start = m.end(0) - 1   # position of '('
  depth = 1
  i = paren_start + 1
  len = group_str.length
  while i < len && depth > 0
    ch = group_str[i]
    if ch == "("
      depth += 1
    elsif ch == ")"
      depth -= 1
    end
    i += 1
  end
  return [group_str] if depth != 0
  paren_end = i - 1  # position of matching ')'
  inner = group_str[(paren_start + 1)...paren_end]
  alts = split_selector_groups(inner)
  return [group_str] if alts.size < 2
  # By default only distribute when an alternative has a combinator
  # (multi-atom) — single-atom alternatives compile natively as
  # is_inner. When called from inside `:has`, force distribution so
  # the inner pool sees plain single atoms rather than `:is(...)`
  # wrappers that don't fit native_inner_simples.
  multi = alts.any? { |a| a =~ /[\s>+~]/ }
  return [group_str] unless multi || force
  prefix = group_str[0...m.begin(0)]
  suffix = group_str[(paren_end + 1)..]
  alts.flat_map do |alt|
    merged = "#{prefix}#{alt}#{suffix}".strip
    expand_is_groups(merged, force: force)
  end
end
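
A standalone sketch of the distribution step, for experimenting outside the module. It follows the same outline as the method above but without the force option, and it substitutes a hypothetical split_top_level_commas helper for split_selector_groups, whose implementation is not shown on this page.

```ruby
IS_TOKEN_RE = /(?:\A|(?<=[\s>+~,])):(?:is|matches|where)\(/.freeze

# Hypothetical stand-in for split_selector_groups: splits on commas
# that are not nested inside parentheses.
def split_top_level_commas(str)
  parts = []
  depth = 0
  buf = +""
  str.each_char do |ch|
    depth += 1 if ch == "("
    depth -= 1 if ch == ")"
    if ch == "," && depth.zero?
      parts << buf.strip
      buf = +""
    else
      buf << ch
    end
  end
  parts << buf.strip
  parts.reject(&:empty?)
end

def distribute_is(group_str)
  m = IS_TOKEN_RE.match(group_str)
  return [group_str] unless m

  open_idx = m.end(0) - 1 # position of "("
  depth = 1
  i = open_idx + 1
  while i < group_str.length && depth > 0
    depth += 1 if group_str[i] == "("
    depth -= 1 if group_str[i] == ")"
    i += 1
  end
  return [group_str] unless depth.zero? # unbalanced parentheses

  close_idx = i - 1
  alts = split_top_level_commas(group_str[(open_idx + 1)...close_idx])
  return [group_str] if alts.size < 2
  return [group_str] unless alts.any? { |a| a.match?(/[\s>+~]/) }

  prefix = group_str[0...m.begin(0)]
  suffix = group_str[(close_idx + 1)..]
  alts.flat_map { |alt| distribute_is("#{prefix}#{alt}#{suffix}".strip) }
end
```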

.extract(html_v, desc_v, base_url_v) ⇒ Object

Entry point.



# File 'ext/scrapetor/native/scrapetor_native.c', line 1085

static VALUE scrapetor_extract(VALUE self, VALUE html_v, VALUE desc_v, VALUE base_url_v) {
    (void)self;
    Check_Type(html_v, T_STRING);

    ctx_t ctx;
    memset(&ctx, 0, sizeof(ctx));
    ctx.html       = RSTRING_PTR(html_v);
    ctx.len        = (size_t)RSTRING_LEN(html_v);
    ctx.pos        = 0;
    ctx.sp         = 0;
    ctx.active_gi  = -1;
    ctx.record     = Qnil;
    ctx.base_url   = NIL_P(base_url_v) ? Qnil : base_url_v;
    if (!NIL_P(ctx.base_url)) {
        Check_Type(ctx.base_url, T_STRING);
        ctx.base_url_p      = RSTRING_PTR(ctx.base_url);
        ctx.base_url_len    = (size_t)RSTRING_LEN(ctx.base_url);
        ctx.base_origin_len = compute_origin_len(ctx.base_url_p, (long)ctx.base_url_len);
    }
    for (int i = 0; i < MAX_FIELDS; i++) fbuf_reset(&ctx.ftext[i]);

    if (!parse_descriptor(desc_v, &ctx)) {
        for (int i = 0; i < MAX_FIELDS; i++) fbuf_free(&ctx.ftext[i]);
        rb_raise(rb_eArgError, "scrapetor_native: invalid schema descriptor");
    }

    scan(&ctx);

    if (ctx.active_gi >= 0) {
        group_t *g = &ctx.groups[ctx.active_gi];
        for (int i = 0; i < g->n_fields; i++) {
            if (!ctx.fdone[i] && ctx.ftext[i].len > 0) finalize_field(&ctx, ctx.active_gi, i);
        }
        rb_ary_push(g->results, ctx.record);
    }

    VALUE result = rb_hash_new();
    for (int gi = 0; gi < ctx.n_groups; gi++) {
        rb_hash_aset(result, ID2SYM(ctx.groups[gi].name), ctx.groups[gi].results);
    }

    for (int i = 0; i < MAX_FIELDS; i++) fbuf_free(&ctx.ftext[i]);

    return result;
}

.has_combinator?(s) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/scrapetor/native.rb', line 197

def self.has_combinator?(s)
  depth = 0
  s.each_char do |ch|
    if ch == "["       then depth += 1
    elsif ch == "]"    then depth -= 1 if depth.positive?
    elsif depth.zero?
      return true if [" ", "\t", "\n", ">", "+", "~"].include?(ch)
    end
  end
  false
end
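
A few inputs show what the bracket-depth counter buys: whitespace inside an attribute selector is ignored, while whitespace or a combinator character at depth zero is reported. The method body below is copied from the listing above (as a top-level method) so the example runs on its own.

```ruby
# Copy of the method above, defined at top level for a standalone run.
def has_combinator?(s)
  depth = 0
  s.each_char do |ch|
    if ch == "["       then depth += 1
    elsif ch == "]"    then depth -= 1 if depth.positive?
    elsif depth.zero?
      return true if [" ", "\t", "\n", ">", "+", "~"].include?(ch)
    end
  end
  false
end

# Example inputs, from a plain compound selector to one with a combinator.
COMBINATOR_EXAMPLES = ["div.card", "a[title='x y']", "ul > li", "h1+p"].freeze
COMBINATOR_RESULTS  = COMBINATOR_EXAMPLES.map { |s| [s, has_combinator?(s)] }.to_h
```

One consequence of the character-level check is that leading or trailing whitespace also counts, so callers would presumably strip the selector first.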

.has_text_child_form?(arg) ⇒ Boolean

`:has(>::text)` / `:has(::text)`: “node has at least one direct text-node child”. The compile would otherwise reject the bare pseudo-element inside :has, forcing the whole selector to the Ruby Dom fallback. This is a cheap shape detector: it just trims whitespace and an optional leading `>`.

Returns:

  • (Boolean)


# File 'lib/scrapetor/native_dom.rb', line 1791

def self.has_text_child_form?(arg)
  return false if arg.nil?
  s = arg.strip
  s = s[1..].lstrip if s.start_with?(">")
  s == "::text"
end
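
The accepted shapes are easiest to see by example. The method below is a copy of the listing above, defined at top level so it can be run standalone; the sample arguments are illustrative.

```ruby
# Copy of the method above, defined at top level for a standalone run.
def has_text_child_form?(arg)
  return false if arg.nil?
  s = arg.strip
  s = s[1..].lstrip if s.start_with?(">")
  s == "::text"
end

# Arguments as they would appear inside :has(...).
TEXT_CHILD_ACCEPTED = ["::text", ">::text", " > ::text "].freeze
TEXT_CHILD_REJECTED = ["> span", "span::text", "", nil].freeze
```

Only the bare pseudo-element, optionally preceded by a single `>`, is accepted; anything with a tag or class in front goes through the other :has branches.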

.heterogeneous_pseudo_groups?(s) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/scrapetor/native_dom.rb', line 2197

def self.heterogeneous_pseudo_groups?(s)
  cached = HET_PSEUDO_CACHE[s]
  return cached unless cached.nil?
  groups = split_selector_groups(s)
  kinds = groups.map { |g| peel_pseudo_element(g)[1] }
  result = kinds.uniq.size > 1
  HET_PSEUDO_CACHE.shift if HET_PSEUDO_CACHE.size >= HET_PSEUDO_CACHE_CAP
  HET_PSEUDO_CACHE[s] = result
  result
end
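
The cache above is a plain Hash with a size cap: Ruby hashes keep insertion order, so Hash#shift evicts the oldest entry. A self-contained sketch of that pattern follows; the BoundedCache class is hypothetical and only mirrors the shape of the code above.

```ruby
# Insertion-ordered Hash used as a FIFO cache with a fixed capacity.
class BoundedCache
  def initialize(cap)
    @cap = cap
    @store = {}
  end

  # Returns the cached value, or computes it with the block and stores it.
  # A cached `false` is a hit; only nil means "absent", so nil results
  # would be recomputed on every call.
  def fetch(key)
    hit = @store[key]
    return hit unless hit.nil?

    value = yield(key)
    @store.shift if @store.size >= @cap # drop the oldest entry
    @store[key] = value
    value
  end

  def keys
    @store.keys
  end
end
```

This is first-in-first-out eviction, not LRU: a hit does not refresh an entry's position.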

.inner_pool_for(arg) ⇒ Object

Compile a `:not(arg)` / `:has(arg)` payload as a list of leaf simple atoms (no further pseudo recursion). Used to fill an inner pool on a c_simple_atom, limited to one level deep.



# File 'lib/scrapetor/native_dom.rb', line 2047

def self.inner_pool_for(arg)
  return nil if arg.nil? || arg.empty?
  groups = Scrapetor::Dom::Selectors.selector_groups(arg)
  out = []
  groups.each do |g|
    plan = Scrapetor::Selector.compile(g)
    return nil if plan.size != 1
    atom = plan.first
    if pure_is_atom?(atom)
      sub = inner_pool_for(atom.pseudos.first[1])
      return nil if sub.nil?
      out.concat(sub)
      next
    end
    leaf_pseudo = nil
    if atom.pseudos && !atom.pseudos.empty?
      leaf_pseudo = native_leaf_pseudo_data(atom.pseudos)
      return nil if leaf_pseudo.nil?
    end
    entry = [atom.tag ? atom.tag.to_s : nil, atom.classes, atom.id, atom.attrs]
    entry << leaf_pseudo if leaf_pseudo
    out << entry
  end
  out
rescue ArgumentError
  nil
end

.native_inner_simple_pseudo(pseudos) ⇒ Object

Build the extended pseudo_data slot for a c_simple_atom that itself carries `:not(simple)` / `:has(simple)` / `:not(:has(simple))` constraints. The C layer reads optional indices 5, 6, 7 as inner_not / inner_has / inner_not_has pools and applies them in matches_simple_atom; the code below also appends an optional index 8 holding inner_has_chain. Returns nil when the shape isn’t supported (sibling combinators inside, recursive pseudos beyond one level, etc.); the caller then falls back to native_leaf_pseudo_data, which rejects the atom entirely if leaves aren’t enough.



# File 'lib/scrapetor/native_dom.rb', line 1985

def self.native_inner_simple_pseudo(pseudos)
  flags = 0
  nth_a = nth_b = 0
  nth_type_a = nth_type_b = 0
  inner_not = []
  inner_has = []
  inner_not_has = []
  inner_has_chain = nil
  pseudos.each do |name, arg, double_colon|
    return nil if double_colon
    if (bit = NATIVE_PSEUDO_FLAGS[name])
      flags |= bit
    elsif (bit = NATIVE_NTH_BITS[name])
      a, b = Scrapetor::Selector.parse_nth(arg)
      return nil unless a
      flags |= bit
      if name == "nth-of-type" || name == "nth-last-of-type"
        nth_type_a, nth_type_b = a, b
      else
        nth_a, nth_b = a, b
      end
    elsif name == "not"
      # `:not(:has(simple))` → inner_not_has
      if (nh = parse_inner_not_has_form(arg))
        inner_not_has.concat(nh)
        next
      end
      sub = inner_pool_for(arg)
      return nil if sub.nil?
      inner_not.concat(sub)
    elsif name == "has"
      # Try simple-atom inner first.
      sub = inner_pool_for(arg)
      if sub
        inner_has.concat(sub)
      elsif (chain = parse_has_chains_form(arg))
        # Multi-atom chain alternatives. Lift into inner_has_chain
        # so the C engine evaluates the chain match natively.
        inner_has_chain = chain
      else
        return nil
      end
    else
      return nil
    end
  end
  out = [flags, nth_a, nth_b, nth_type_a, nth_type_b]
  # Pad with empty arrays as needed so the C layer indexes work.
  need_8 = inner_has_chain && !inner_has_chain.empty?
  need_7 = need_8 || !inner_not_has.empty?
  need_6 = need_7 || !inner_has.empty?
  need_5 = need_6 || !inner_not.empty?
  out << inner_not        if need_5
  out << inner_has        if need_6
  out << inner_not_has    if need_7
  out << inner_has_chain  if need_8
  out
end
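
The need_5 .. need_8 cascade at the end of the method amounts to "emit the optional pools up to the last non-empty one, padding earlier empty pools so positions stay fixed". A compact sketch of that rule, using a hypothetical helper name; it is an equivalent restatement for illustration, not the module's code.

```ruby
# base  : the five leading slots [flags, nth_a, nth_b, nth_type_a, nth_type_b]
# pools : the optional trailing slots in index order (nil allowed for "absent")
def append_optional_pools(base, pools)
  last_used = pools.rindex { |pool| pool && !pool.empty? }
  return base.dup if last_used.nil?

  # Everything before the last used pool is kept (as [] if absent) so the
  # reader can rely on fixed indices.
  base + pools[0..last_used].map { |pool| pool || [] }
end
```

With only the second pool filled, the result has seven elements: the five base slots, an empty placeholder at index 5, and the filled pool at index 6.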

.native_inner_simples(arg, depth = 0) ⇒ Object

Compile an inner-selector argument (e.g. `:not(.x, :empty, .y)`) into an array of simple-atom descriptors the C engine can read. Each inner is `[tag, classes, id, attrs]` or, when pseudo flags are present, `[tag, classes, id, attrs, leaf_pseudo_data]`. Combinators and recursive pseudos (a `:not` inside a `:not`) still force the Ruby fallback, because the C side only flattens one level deep.



# File 'lib/scrapetor/native_dom.rb', line 1927

def self.native_inner_simples(arg, depth = 0)
  return NATIVE_PSEUDO_FALLBACK if arg.nil? || arg.empty?
  return NATIVE_PSEUDO_FALLBACK if depth > 4
  groups = Scrapetor::Dom::Selectors.selector_groups(arg)
  out = []
  groups.each do |g|
    plan = Scrapetor::Selector.compile(g)
    return NATIVE_PSEUDO_FALLBACK if plan.size != 1
    atom = plan.first
    # `:has(:is(X, Y))` / `:not(:is(X, Y))` etc.: unwrap a pure
    # `:is(...)` atom into its alternatives so the inner pool
    # receives the leaf simples without the recursive :is.
    if pure_is_atom?(atom)
      inner_arg = atom.pseudos.first[1]
      sub = native_inner_simples(inner_arg, depth + 1)
      return NATIVE_PSEUDO_FALLBACK if sub == NATIVE_PSEUDO_FALLBACK
      out.concat(sub)
      next
    end
    leaf_pseudo = nil
    if atom.pseudos && !atom.pseudos.empty?
      # Try the nested (one-level-recursive) shape first — accepts
      # `:not(simple)` / `:has(simple)` / `:not(:has(simple))` on the
      # inner atom, lifting them into inner pools on the inner
      # c_simple_atom. Falls back to leaf-only if that doesn't apply.
      leaf_pseudo = native_inner_simple_pseudo(atom.pseudos) ||
                    native_leaf_pseudo_data(atom.pseudos)
      return NATIVE_PSEUDO_FALLBACK if leaf_pseudo.nil?
    end
    entry = [atom.tag ? atom.tag.to_s : nil, atom.classes, atom.id, atom.attrs]
    entry << leaf_pseudo if leaf_pseudo
    out << entry
  end
  out
rescue ArgumentError
  NATIVE_PSEUDO_FALLBACK
end

.native_leaf_pseudo_data(pseudos) ⇒ Object

Like native_pseudo_data, but rejects any pseudo that requires a nested sub-selector (`:not`/`:is`/`:has`). The C `c_simple_atom` only has the leaf pseudo fields; the recursive ones would need their own inner pool, which we don’t allocate.



# File 'lib/scrapetor/native_dom.rb', line 2097

def self.native_leaf_pseudo_data(pseudos)
  flags = 0
  nth_a = nth_b = 0
  nth_type_a = nth_type_b = 0
  pseudos.each do |name, arg, double_colon|
    return nil if double_colon
    if (bit = NATIVE_PSEUDO_FLAGS[name])
      flags |= bit
    elsif (bit = NATIVE_NTH_BITS[name])
      a, b = Scrapetor::Selector.parse_nth(arg)
      return nil unless a
      flags |= bit
      if name == "nth-of-type" || name == "nth-last-of-type"
        nth_type_a, nth_type_b = a, b
      else
        nth_a, nth_b = a, b
      end
    else
      return nil
    end
  end
  [flags, nth_a, nth_b, nth_type_a, nth_type_b]
end
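
The (a, b) pairs stored above are the coefficients of the CSS An+B notation. Scrapetor::Selector.parse_nth is not shown on this page, so the sketch below is an assumed stand-in with hypothetical names (parse_nth_sketch, nth_match?) that follows the standard An+B meaning: an element at 1-based position p matches when p = a·n + b for some integer n ≥ 0. The real parser and the C-side matcher may differ in the inputs they accept.

```ruby
# Hypothetical An+B parser: returns [a, b], or nil if the text is not An+B.
def parse_nth_sketch(arg)
  s = arg.to_s.downcase.gsub(/\s+/, "")
  return [2, 1] if s == "odd"
  return [2, 0] if s == "even"
  return [0, s.to_i] if s.match?(/\A[+-]?\d+\z/)

  m = /\A([+-]?\d*)n([+-]\d+)?\z/.match(s)
  return nil unless m

  a = case m[1]
      when "", "+" then 1
      when "-"     then -1
      else m[1].to_i
      end
  [a, m[2].to_i] # nil.to_i is 0 when the "+b" part is absent
end

# True when `position` (1-based) equals a*n + b for some integer n >= 0.
def nth_match?(a, b, position)
  return position == b if a.zero?

  diff = position - b
  (diff % a).zero? && diff / a >= 0
end
```

Returning nil on bad input lines up with the `return nil unless a` guard in the method above, since destructuring nil leaves a as nil.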

.native_pseudo_data(pseudos) ⇒ Object

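Compile the Atom#pseudos list into the thirteen-element Array the C side reads (the flags word, four nth coefficients, and eight inner-selector lists, as returned at the end of the method). Returns NATIVE_PSEUDO_FALLBACK if any pseudo is outside the native subset (in which case the whole chain falls back to Ruby).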



# File 'lib/scrapetor/native_dom.rb', line 1603

def self.native_pseudo_data(pseudos)
  flags = 0
  nth_a = nth_b = 0
  nth_type_a = nth_type_b = 0
  not_inner = []
  is_inner = []
  has_inner = []
  not_has_inner = []
  has_child_inner = []
  not_has_child_inner = []
  has_chain_inner = []
  not_has_chain_inner = []

  pseudos.each do |name, arg, double_colon|
    return NATIVE_PSEUDO_FALLBACK if double_colon

    if (bit = NATIVE_PSEUDO_FLAGS[name])
      flags |= bit
    elsif (bit = NATIVE_NTH_BITS[name])
      a, b = Scrapetor::Selector.parse_nth(arg)
      return NATIVE_PSEUDO_FALLBACK unless a
      flags |= bit
      if name == "nth-of-type" || name == "nth-last-of-type"
        nth_type_a, nth_type_b = a, b
      else
        nth_a, nth_b = a, b
      end
    elsif name == "not"
      # `:not(:has(X, Y))` — common scrape pattern. Rather than
      # forcing a Ruby Dom fallback (which is ~3-5 ms per call on a
      # 100KB page), recognise the shape at compile time and emit
      # a C_PS_NOT_HAS bit on the outer atom. The C side checks
      # "no descendant matches any of these simple atoms" — same
      # cost as C_PS_HAS, just inverted.
      if (nh = parse_not_has_form(arg))
        not_has_inner.concat(nh)
        flags |= (1 << 24)
        next
      end
      # `:not(:has(> X))` direct-child variant.
      if (nhc = parse_not_has_child_form(arg))
        not_has_child_inner.concat(nhc)
        flags |= (1 << 26)
        next
      end
      # `:not(:has(X Y, A B, ...))` — chain inner with multiple
      # alternatives. Mirrors `:has(X Y, A B)` (1<<27) but with
      # the negated descendant check.
      if (nchains = parse_not_has_chains_form(arg))
        not_has_chain_inner = nchains
        flags |= (1 << 29)
        next
      end
      inner = native_inner_simples(arg)
      return NATIVE_PSEUDO_FALLBACK if inner == NATIVE_PSEUDO_FALLBACK
      not_inner.concat(inner)
      flags |= (1 << 20)
    elsif name == "is" || name == "matches" || name == "where"
      inner = native_inner_simples(arg)
      return NATIVE_PSEUDO_FALLBACK if inner == NATIVE_PSEUDO_FALLBACK
      is_inner.concat(inner)
      flags |= (1 << 21)
    elsif name == "has"
      # `:has(>::text)` / `:has(::text)` — "node has a direct
      # text-node child". Non-standard but appears in production
      # parsers. Maps to a one-bit flag the C side evaluates with
      # a single child walk.
      if has_text_child_form?(arg)
        flags |= (1 << 28)
        next
      end
      # `:has(> X, > Y)` — leading combinator inside :has. The
      # arg's compile output starts with `:scope` (compile()
      # desugars the leading `>`), giving each group two atoms.
      # native_inner_simples requires a single atom, so detect
      # this shape explicitly and lift the *child* atoms into
      # has_child_inner.
      if (hc = parse_has_child_form(arg))
        has_child_inner.concat(hc)
        flags |= (1 << 25)
        next
      end
      # `:has(+ X, + Y)` / `:has(~ X, ~ Y)` — sibling-from-scope
      # variants. Same lifting machinery but the walk is on the
      # outer node's siblings, not its descendants.
      if (hs = parse_has_sib_form(arg, "+"))
        has_inner.concat(hs)
        flags |= (1 << 30)
        next
      end
      if (hs = parse_has_sib_form(arg, "~"))
        has_inner.concat(hs)
        flags |= (1 << 31)
        next
      end
      # `:is(...)` inside :has: distribute alternatives so an inner
      # like `:is(h2, span).a-color-base` becomes
      # `h2.a-color-base, span.a-color-base` before we hand it to
      # native_inner_simples (which needs single-atom groups). Force
      # distribution even for single-atom alternatives — the comma-
      # joined form is exactly the shape native_inner_simples wants.
      arg_expanded = Native.split_selector_groups(arg)
        .flat_map { |g| Native.expand_is_groups(g, force: true) }
        .join(", ")
      inner = native_inner_simples(arg_expanded)
      if inner != NATIVE_PSEUDO_FALLBACK
        has_inner.concat(inner)
        flags |= (1 << 22)
        next
      end
      # `:has(X Y, A B, ...)` — multi-chain. Each comma alternative
      # is its own chain of simple atoms with descendant/child/
      # sibling combinators between them. The native engine matches
      # if ANY chain has a descendant match.
      if (chains = parse_has_chains_form(arg))
        has_chain_inner = chains
        flags |= (1 << 27)
        next
      end
      return NATIVE_PSEUDO_FALLBACK
    else
      return NATIVE_PSEUDO_FALLBACK
    end
  end

  [flags, nth_a, nth_b, nth_type_a, nth_type_b, not_inner, is_inner, has_inner,
   not_has_inner, has_child_inner, not_has_child_inner, has_chain_inner,
   not_has_chain_inner]
end
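
Besides the table-driven bits, the method above sets a number of hard-coded bits (20 to 22 and 24 to 31) for the structural pseudo shapes. The sketch below collects them into one lookup, taken from the `flags |= (1 << n)` lines in the listing, and decodes a flags word back into the shapes it encodes. The constant and method names are hypothetical, and the shape labels are descriptive strings, not identifiers from the C source.

```ruby
# Bit position => shape, read off the listing above.
STRUCTURAL_PSEUDO_BITS = {
  20 => ":not(simple)",
  21 => ":is(simple)",
  22 => ":has(simple)",
  24 => ":not(:has(simple))",
  25 => ":has(> simple)",
  26 => ":not(:has(> simple))",
  27 => ":has(chain, ...)",
  28 => ":has(::text)",
  29 => ":not(:has(chain, ...))",
  30 => ":has(+ simple)",
  31 => ":has(~ simple)"
}.freeze

# Lists the structural shapes whose bit is set in `flags`, lowest bit first.
def structural_shapes(flags)
  STRUCTURAL_PSEUDO_BITS.select { |bit, _| (flags & (1 << bit)) != 0 }.values
end
```

A decoder like this can help when inspecting a pseudo_data array by hand, for example to check which inner list the C side is expected to consult for a given atom.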

.parse_has_chain_form(arg) ⇒ Object

`:has(X Y)`: single chain (no commas, no leading combinator). The arg’s compile output is multiple atoms joined by descendant/child combinators. Returns an Array of [simple_atom_entry, combo_str] pairs (combo_str is “descendant” / “child” / nil). Rejects forms native_inner_simples already handles (single atom) and forms that need recursive pseudos.



# File 'lib/scrapetor/native_dom.rb', line 1804

def self.parse_has_chain_form(arg)
  r = parse_has_chains_form(arg)
  return nil if r.nil? || r.size != 1
  r.first
end

.parse_has_chains_form(arg) ⇒ Object

`:has(X Y, A B, …)`: multi-chain. Returns an Array of chains. Each chain is an Array of [atom_entry, combinator_string] pairs. The first entry’s combinator is nil; subsequent entries carry descendant/child/adjacent/sibling. Returns nil when any group’s shape isn’t a supported chain form (no recursive pseudos beyond leaf, etc.). Single-atom alternatives are also lifted as 1-long chains so the caller doesn’t have to distinguish.



# File 'lib/scrapetor/native_dom.rb', line 1817

def self.parse_has_chains_form(arg)
  return nil if arg.nil? || arg.empty?
  groups = Scrapetor::Dom::Selectors.selector_groups(arg)
  return nil if groups.empty? || groups.size > 8
  chains = []
  groups.each do |g|
    plan = Scrapetor::Selector.compile(g)
    return nil if plan.empty?
    chain = []
    plan.each_with_index do |atom, idx|
      leaf_pseudo = nil
      if atom.pseudos && !atom.pseudos.empty?
        leaf_pseudo = native_inner_simple_pseudo(atom.pseudos) ||
                      native_leaf_pseudo_data(atom.pseudos)
        return nil if leaf_pseudo.nil?
      end
      entry = [atom.tag ? atom.tag.to_s : nil, atom.classes, atom.id, atom.attrs]
      entry << leaf_pseudo if leaf_pseudo
      combo =
        case atom.combinator
        when :descendant then "descendant"
        when :child      then "child"
        when :adj        then "adjacent"
        when :gen        then "sibling"
        when nil         then (idx.zero? ? nil : "descendant")
        else                  nil
        end
      chain << [entry, combo]
    end
    chains << chain
  end
  chains
rescue ArgumentError
  nil
end

.parse_has_child_form(arg) ⇒ Object

`:has(> X, > Y)`: every group of the argument must be of shape `:scope > simple`. Returns the simple atoms (each is the right side of the `>`) if so, nil otherwise.



# File 'lib/scrapetor/native_dom.rb', line 1856

def self.parse_has_child_form(arg)
  return nil if arg.nil? || arg.empty?
  groups = Scrapetor::Dom::Selectors.selector_groups(arg)
  out = []
  groups.each do |g|
    gs = g.strip
    return nil unless gs.start_with?(">")
    inner = gs[1..].lstrip
    plan = Scrapetor::Selector.compile(inner)
    return nil if plan.size != 1
    atom = plan.first
    leaf_pseudo = nil
    if atom.pseudos && !atom.pseudos.empty?
      leaf_pseudo = native_leaf_pseudo_data(atom.pseudos)
      return nil if leaf_pseudo.nil?
    end
    entry = [atom.tag ? atom.tag.to_s : nil, atom.classes, atom.id, atom.attrs]
    entry << leaf_pseudo if leaf_pseudo
    out << entry
  end
  out
rescue ArgumentError
  nil
end

.parse_has_sib_form(arg, combinator_char) ⇒ Object

`:has(+ X, + Y)` / `:has(~ X, ~ Y)`: every group of the argument must start with the given sibling combinator. Returns the list of leaf simple-atom entries (right of the combinator) on success.



# File 'lib/scrapetor/native_dom.rb', line 1761

def self.parse_has_sib_form(arg, combinator_char)
  return nil if arg.nil? || arg.empty?
  groups = Scrapetor::Dom::Selectors.selector_groups(arg)
  out = []
  groups.each do |g|
    gs = g.strip
    return nil unless gs.start_with?(combinator_char)
    inner = gs[1..].lstrip
    plan = Scrapetor::Selector.compile(inner)
    return nil if plan.size != 1
    atom = plan.first
    leaf_pseudo = nil
    if atom.pseudos && !atom.pseudos.empty?
      leaf_pseudo = native_leaf_pseudo_data(atom.pseudos)
      return nil if leaf_pseudo.nil?
    end
    entry = [atom.tag ? atom.tag.to_s : nil, atom.classes, atom.id, atom.attrs]
    entry << leaf_pseudo if leaf_pseudo
    out << entry
  end
  out
rescue ArgumentError
  nil
end

.parse_inner_not_has_form(arg) ⇒ Object

`:not(:has(simple))` payload, used by native_inner_simple_pseudo to lift the nested negation into inner_not_has on the simple atom.



# File 'lib/scrapetor/native_dom.rb', line 2077

def self.parse_inner_not_has_form(arg)
  return nil if arg.nil? || arg.empty?
  groups = Scrapetor::Dom::Selectors.selector_groups(arg)
  return nil if groups.size != 1
  plan = Scrapetor::Selector.compile(groups.first)
  return nil if plan.size != 1
  atom = plan.first
  return nil unless atom.pseudos && atom.pseudos.size == 1
  name, inner_arg, double_colon = atom.pseudos.first
  return nil if double_colon || name != "has"
  return nil if atom.tag || !atom.classes.empty? || atom.id || !atom.attrs.empty?
  inner_pool_for(inner_arg)
rescue ArgumentError
  nil
end

.parse_not_has_chain_form(arg) ⇒ Object

`:not(:has(X Y))`: a :not wrapping a single :has with a multi-atom chain. Returns the chain shape (same as parse_has_chain_form) or nil. The matching is the negated descendant-chain check.



# File 'lib/scrapetor/native_dom.rb', line 1736

def self.parse_not_has_chain_form(arg)
  r = parse_not_has_chains_form(arg)
  return nil if r.nil? || r.size != 1
  r.first
end

.parse_not_has_chains_form(arg) ⇒ Object

Plural variant of parse_not_has_chain_form: accepts a `:not(…)` argument that is a single bare `:has(…)` and returns whatever parse_has_chains_form yields for the :has argument, or nil if the argument has any other shape.



# File 'lib/scrapetor/native_dom.rb', line 1742

def self.parse_not_has_chains_form(arg)
  return nil if arg.nil? || arg.empty?
  groups = Scrapetor::Dom::Selectors.selector_groups(arg)
  return nil if groups.size != 1
  plan = Scrapetor::Selector.compile(groups.first)
  return nil if plan.size != 1
  atom = plan.first
  return nil unless atom.pseudos && atom.pseudos.size == 1
  name, inner_arg, double_colon = atom.pseudos.first
  return nil if double_colon || name != "has"
  return nil if atom.tag || !atom.classes.empty? || atom.id || !atom.attrs.empty?
  parse_has_chains_form(inner_arg)
rescue ArgumentError
  nil
end

.parse_not_has_child_form(arg) ⇒ Object

`:not(:has(> X))`: direct-child negative form.



# File 'lib/scrapetor/native_dom.rb', line 1882

def self.parse_not_has_child_form(arg)
  return nil if arg.nil? || arg.empty?
  groups = Scrapetor::Dom::Selectors.selector_groups(arg)
  return nil if groups.size != 1
  plan = Scrapetor::Selector.compile(groups.first)
  return nil if plan.size != 1
  atom = plan.first
  return nil unless atom.pseudos && atom.pseudos.size == 1
  name, inner_arg, double_colon = atom.pseudos.first
  return nil if double_colon || name != "has"
  return nil if atom.tag || !atom.classes.empty? || atom.id || !atom.attrs.empty?
  parse_has_child_form(inner_arg)
rescue ArgumentError
  nil
end

.parse_not_has_form(arg) ⇒ Object

Inspect a `:not(…)` argument; if the argument compiles to exactly `:has(simple, simple, …)` (no other tag/class/id/attr constraints outside the :has), return the array of inner simple-atom forms so the caller can lift them into the C_PS_NOT_HAS path. Returns nil for anything else.



# File 'lib/scrapetor/native_dom.rb', line 1903

def self.parse_not_has_form(arg)
  return nil if arg.nil? || arg.empty?
  groups = Scrapetor::Dom::Selectors.selector_groups(arg)
  return nil if groups.size != 1
  plan = Scrapetor::Selector.compile(groups.first)
  return nil if plan.size != 1
  atom = plan.first
  return nil unless atom.pseudos && atom.pseudos.size == 1
  name, inner_arg, double_colon = atom.pseudos.first
  return nil if double_colon || name != "has"
  return nil if atom.tag || !atom.classes.empty? || atom.id || !atom.attrs.empty?
  inner = native_inner_simples(inner_arg)
  return nil if inner == NATIVE_PSEUDO_FALLBACK
  inner
rescue ArgumentError
  nil
end

.parse_selector(selector) ⇒ Object

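Parse a simple CSS selector into the [tag, classes, id, attrs] form that the C engine accepts. Returns nil (forcing the Ruby fallback) if the selector uses combinators, pseudo-classes, a selector list (`,`), or the universal selector (`*`), if it repeats `#id`, or if it carries more than 8 classes or more than 8 attribute tests.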

Supported:

tag                    div
.class                 .product-card
tag.class.other        span.price.big
#id                    #main
tag#id                 article#main
[attr]                 [data-sku]
[attr=val]             [data-sku="A1"]
[attr*=val]            [class*=card]
[attr^=val]            [href^=https]
[attr$=val]            [href$=.pdf]
[attr~=val]            [class~=primary]
[attr|=val]            [lang|=en]
... and combinations


# File 'lib/scrapetor/native.rb', line 227

def self.parse_selector(selector)
  return nil unless selector
  s = selector.to_s.strip
  return nil if s.empty?
  # Check for combinators / unsupported syntax outside [...] brackets,
  # since `*` and `~` are valid inside attribute operators.
  outside = s.gsub(/\[[^\]]*\]/, "")
  return nil if outside =~ /[\s>+~,*]/

  tag      = nil
  classes  = []
  id       = nil
  attrs    = []

  i = 0
  if (m = s[i..].match(/\A([a-zA-Z][\w-]*)/))
    tag = m[1].downcase
    i += m[0].length
  end

  while i < s.length
    case s[i]
    when "."
      m = s[i..].match(/\A\.([\w-]+)/)
      return nil unless m
      classes << m[1]
      i += m[0].length
    when "#"
      m = s[i..].match(/\A#([\w-]+)/)
      return nil unless m
      return nil if id # only one id allowed
      id = m[1]
      i += m[0].length
    when "["
      # Mirror Scrapetor::Selector::ATTR_RE — same quote-style-aware
      # value extraction so an attribute like `[class*="L'appareil"]`
      # parses without choking on the embedded apostrophe.
      m = s[i..].match(/
        \A\[
          ([\w:\-\u{0080}-\u{10FFFF}]+)
          (?:
            ([*^$~|]?=)
            (?:
              "((?:[^"\\]|\\.)*)"
            | '((?:[^'\\]|\\.)*)'
            | ([^\]\s]+)
            )
          )?
        \]
      /x)
      return nil unless m
      attrs << [m[1], m[2], (m[3] || m[4] || m[5])]
      i += m[0].length
    else
      return nil
    end
  end

  return nil if tag.nil? && classes.empty? && id.nil? && attrs.empty?
  return nil if classes.size > 8 || attrs.size > 8

  [tag, classes, id, attrs]
end
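
The first guard in parse_selector can be tried in isolation. The sketch below is a standalone restatement of just that check; `fast_path_candidate?` is a hypothetical name, not part of Scrapetor, and it covers only the combinator test. Pseudo-classes and the 8-item limits are rejected later in the real method.

```ruby
# Hypothetical helper (not a Scrapetor API): true when the selector has no
# whitespace, combinator, comma, or universal `*` outside [...] brackets.
def fast_path_candidate?(selector)
  s = selector.to_s.strip
  return false if s.empty?
  # Drop every [...] block first so `*=` and `~=` inside attribute tests
  # are not mistaken for `*` or the `~` sibling combinator.
  outside = s.gsub(/\[[^\]]*\]/, "")
  !outside.match?(/[\s>+~,*]/)
end

puts fast_path_candidate?("span.price.big")    # true
puts fast_path_candidate?("[class~=primary]")  # true
puts fast_path_candidate?("div > p")           # false
puts fast_path_candidate?("h1, h2")            # false
```

Stripping the bracketed parts before scanning keeps the check to a single regex, and it avoids walking the string character by character just to decide whether to bail out.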

.parse_selector_with_combinator(selector) ⇒ Object

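Parse a CSS selector with at most one combinator (`A B` or `A > B`). Returns [primary_sel, combinator_str, context_sel], where primary_sel is the parsed right-hand side (the element being matched), context_sel is the parsed left-hand side, and combinator_str is "descendant" or "child". Returns nil if the input has multiple combinators or other unsupported syntax.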



# File 'lib/scrapetor/native.rb', line 146

def self.parse_selector_with_combinator(selector)
  s = selector.to_s.strip
  return nil if s.empty?

  # Split on first combinator at top level (outside [...] groups).
  split = split_at_combinator(s)
  return nil unless split
  left_str, combinator, right_str = split

  left  = parse_selector(left_str)
  right = parse_selector(right_str)
  return nil unless left && right

  [right, combinator, left]
end

.peel_pseudo_element(selector_str) ⇒ Object

`::text` and `::attr(name)` are Scrapy/Parsel-style pseudo-elements: they reshape the result of a selector into strings rather than affecting matching. Strip them before running the query and apply the transform on the way out.

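Returns [stripped_selector, transform_kind, arg]

transform_kind = nil | :text | :attr | :direct_text | :direct_attr | :text_approx

The :direct_* kinds are returned when the pseudo-element follows a trailing `>` (as in `head > ::text`); :text_approx covers the remaining recognised pseudo-elements (`::first-letter`, `::first-line`, `::before`, `::after`).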

Fast-path skip when the selector has no `::` substring (the common case), which saves a regex match on every css() call.



# File 'lib/scrapetor/native_dom.rb', line 43

def self.peel_pseudo_element(selector_str)
  s = selector_str
  return [s, nil, nil] unless s.include?("::")
  m = s.match(PSEUDO_ELEMENT_RE)
  return [s, nil, nil] unless m
  head = s[0...m.begin(0)].rstrip
  pe = m[1]
  # `head > ::text` and `head > ::attr(x)`: strip the trailing `>`
  # combinator and flip kind into the direct-only variant. The
  # native plan compiles cleanly for `head` and apply_pseudo_element
  # walks only the immediate children when collecting text/attrs.
  direct = false
  if head.end_with?(">")
    head = head[0..-2].rstrip
    direct = true
  end
  if pe.casecmp("::text").zero?
    [head, direct ? :direct_text : :text, nil]
  elsif (a = pe.match(/::attr\(([^)]+)\)/i))
    [head, direct ? :direct_attr : :attr, a[1].strip]
  else
    [head, :text_approx, nil]
  end
end
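
To see which triples come back for typical inputs, the logic above can be restated as a standalone sketch. `peel_demo` is a hypothetical name used only here; the regex is a copy of PSEUDO_ELEMENT_RE as listed in the constant summary, and the real method should be preferred whenever Scrapetor is loaded.

```ruby
# Standalone restatement for experimentation; not a Scrapetor API.
def peel_demo(selector)
  return [selector, nil, nil] unless selector.include?("::")
  # Copy of PSEUDO_ELEMENT_RE from the constant summary.
  re = /(::(?:text|attr\([^)]+\)|first-letter|first-line|before|after))\s*\z/i
  m = selector.match(re)
  return [selector, nil, nil] unless m
  head = selector[0...m.begin(0)].rstrip
  # A trailing `>` selects the direct-only variant of the transform.
  direct = head.end_with?(">")
  head = head[0..-2].rstrip if direct
  pe = m[1]
  if pe.casecmp("::text").zero?
    [head, direct ? :direct_text : :text, nil]
  elsif (a = pe.match(/::attr\(([^)]+)\)/i))
    [head, direct ? :direct_attr : :attr, a[1].strip]
  else
    [head, :text_approx, nil]
  end
end

p peel_demo("a.link::attr(href)")  # => ["a.link", :attr, "href"]
p peel_demo("div > ::text")        # => ["div", :direct_text, nil]
p peel_demo("p::first-line")       # => ["p", :text_approx, nil]
p peel_demo("ul li")               # => ["ul li", nil, nil]
```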

.pure_is_atom?(atom) ⇒ Boolean

An atom that is only `:is(…)` (or `:matches(…)` / `:where(…)`, which the code treats the same way), with no tag/class/id/attrs and no other pseudos, so the `:is` wraps a list of alternatives that can be unwrapped into the surrounding inner pool. Anything else on the atom (e.g. `.x:is(…)`) would change semantics and isn’t eligible for this rewrite.

Returns:

  • (Boolean)


# File 'lib/scrapetor/native_dom.rb', line 1970

def self.pure_is_atom?(atom)
  return false if atom.tag || !atom.classes.empty? || atom.id || !atom.attrs.empty?
  return false unless atom.pseudos && atom.pseudos.size == 1
  name, _arg, double_colon = atom.pseudos.first
  !double_colon && %w[is matches where].include?(name)
end
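
The predicate can be exercised without the selector compiler by using a stand-in atom. In the sketch below, `DemoAtom` is a hypothetical Struct that only mimics the accessors the method reads; the real atom comes from Scrapetor::Selector.compile and may carry more state. Each pseudo entry is assumed to be a `[name, arg, double_colon]` triple, as the destructuring in the method suggests.

```ruby
# Hypothetical stand-in for a compiled atom (not a Scrapetor class).
DemoAtom = Struct.new(:tag, :classes, :id, :attrs, :pseudos)

# Same checks as pure_is_atom?, restated against the stand-in.
def pure_is_atom_demo?(atom)
  return false if atom.tag || !atom.classes.empty? || atom.id || !atom.attrs.empty?
  return false unless atom.pseudos && atom.pseudos.size == 1
  name, _arg, double_colon = atom.pseudos.first
  !double_colon && %w[is matches where].include?(name)
end

bare  = DemoAtom.new(nil, [], nil, [], [["is", "a, b", false]])     # :is(a, b)
mixed = DemoAtom.new(nil, ["x"], nil, [], [["is", "a, b", false]])  # .x:is(a, b)
other = DemoAtom.new(nil, [], nil, [], [["has", "a", false]])       # :has(a)

p pure_is_atom_demo?(bare)   # => true
p pure_is_atom_demo?(mixed)  # => false
p pure_is_atom_demo?(other)  # => false
```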

.split_at_combinator(s) ⇒ Object

Split a selector at its first top-level combinator (outside `[...]`). Returns [left, "child", right] for `>` or [left, "descendant", right] for whitespace, and nil when no split point is found, a side is empty, or either half contains a further combinator.



# File 'lib/scrapetor/native.rb', line 162

def self.split_at_combinator(s)
  depth = 0
  i = 0
  while i < s.length
    ch = s[i]
    if ch == "["
      depth += 1
    elsif ch == "]"
      depth -= 1 if depth.positive?
    elsif depth.zero?
      if ch == ">"
        left = s[0...i].strip
        right = s[(i + 1)..].strip
        return nil if left.empty? || right.empty?
        # Reject if there are further combinators in either half.
        return nil if has_combinator?(left) || has_combinator?(right)
        return [left, "child", right]
      elsif ch == " " || ch == "\t" || ch == "\n"
        left = s[0...i].strip
        rest = s[(i + 1)..].lstrip
        next i += 1 if rest.empty?
        # The next non-whitespace char must not be > / + / ~ — those
        # are picked up on their own iteration.
        if !left.empty? && !"<>+~,".include?(rest[0] || "")
          right = rest
          return nil if has_combinator?(left) || has_combinator?(right)
          return [left, "descendant", right]
        end
      end
    end
    i += 1
  end
  nil
end

.split_descriptor(schema, kind) ⇒ Object

For “mixed” schemas (top-level fields + at least one repeated group) the C engine needs two passes — one for the groups, one for the synthetic root holding the fields. We split the schema here, memoise the result on the original Schema instance so the allocations only happen once, and let callers run the two extractions back-to-back.



# File 'lib/scrapetor/native.rb', line 61

def self.split_descriptor(schema, kind)
  ivar = (kind == :groups ? :@__scrapetor_split_groups : :@__scrapetor_split_fields)
  cached = schema.instance_variable_get(ivar)
  unless cached.nil?
    return cached == false ? nil : cached
  end

  sub = Schema.new
  if kind == :groups
    schema.groups.each { |g| sub.groups << g }
  else
    schema.fields.each { |f| sub.fields << f }
  end
  desc = build_descriptor(sub)
  schema.instance_variable_set(ivar, desc.nil? ? false : desc)
  desc
end
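
The caching above stores `false` when the descriptor is nil, so that "already computed, nothing to use" can be told apart from "never computed" (an unset instance variable also reads back as nil). A minimal sketch of that pattern on its own, with `memoize_nilable` and `@demo_cache` as hypothetical names:

```ruby
# Hypothetical helper: memoise a possibly-nil result on `owner`.
# nil results are stored as `false` so they are not recomputed.
def memoize_nilable(owner, ivar)
  cached = owner.instance_variable_get(ivar)
  unless cached.nil?
    return cached == false ? nil : cached
  end
  value = yield
  owner.instance_variable_set(ivar, value.nil? ? false : value)
  value
end

owner = Object.new
calls = 0
first  = memoize_nilable(owner, :@demo_cache) { calls += 1; nil }
second = memoize_nilable(owner, :@demo_cache) { calls += 1; nil }
p [first, second, calls]  # => [nil, nil, 1]
```

The trade-off is that a legitimate `false` result cannot be cached distinctly; that does not matter here, since a descriptor is either an Array or nil.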

.split_selector_groups(s) ⇒ Object

Split a CSS selector on top-level commas (outside `[...]` and `(...)`).



# File 'lib/scrapetor/native_dom.rb', line 2122

def self.split_selector_groups(s)
  groups = []
  buf = +""
  depth = 0
  paren = 0
  s.each_char do |ch|
    case ch
    when "[" then depth += 1; buf << ch
    when "]" then depth -= 1 if depth.positive?; buf << ch
    when "(" then paren += 1; buf << ch
    when ")" then paren -= 1 if paren.positive?; buf << ch
    when ","
      if depth.zero? && paren.zero?
        groups << buf.strip
        buf = +""
      else
        buf << ch
      end
    else
      buf << ch
    end
  end
  groups << buf.strip
  groups.reject(&:empty?)
end
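
A quick way to see what "top-level" means is to run the same depth-tracking idea on a selector that has commas inside `(...)` and `[...]`. The function below is a compact standalone sketch with a hypothetical name, not the library method itself.

```ruby
# Standalone sketch (hypothetical name): split on commas that are not
# nested inside [...] or (...).
def split_top_level_commas(selector)
  groups = []
  buf = +""
  bracket = 0
  paren = 0
  selector.each_char do |ch|
    case ch
    when "[" then bracket += 1
    when "]" then bracket -= 1 if bracket.positive?
    when "(" then paren += 1
    when ")" then paren -= 1 if paren.positive?
    end
    if ch == "," && bracket.zero? && paren.zero?
      # Top-level comma: close the current group.
      groups << buf.strip
      buf = +""
    else
      buf << ch
    end
  end
  groups << buf.strip
  groups.reject(&:empty?)
end

p split_top_level_commas("a, b:is(c, d), [x='1,2']")
# => ["a", "b:is(c, d)", "[x='1,2']"]
```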

.wrap_text_nodes!(arr) ⇒ Object

Wrap each String entry in TextNode so Node-style `.text` / `.content` accessors and Parsel-style `.get` / `.getall` both work. Skips nil (`bulk_attr` returns nil for missing attributes) and any value that’s already a TextNode. Mutates in place to avoid a second Array allocation on the result-collection hot path.



# File 'lib/scrapetor/native_dom.rb', line 21

def self.wrap_text_nodes!(arr)
  return arr unless arr.is_a?(Array)
  i = 0
  n = arr.length
  while i < n
    v = arr[i]
    arr[i] = Scrapetor::TextNode.new(v) if v.is_a?(String) && !v.is_a?(Scrapetor::TextNode)
    i += 1
  end
  arr
end
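
The in-place behaviour can be checked with a stand-in class. The sketch below assumes TextNode is a String subclass, which the `is_a?(String) && !is_a?(TextNode)` guard hints at but this page does not confirm; `DemoTextNode` and `wrap_strings!` are hypothetical names.

```ruby
# Hypothetical stand-in for Scrapetor::TextNode (assumed String subclass).
class DemoTextNode < String; end

# Same idea as wrap_text_nodes!: wrap plain Strings, skip nil and
# already-wrapped values, and mutate the Array that was passed in.
def wrap_strings!(arr)
  return arr unless arr.is_a?(Array)
  arr.each_with_index do |v, i|
    arr[i] = DemoTextNode.new(v) if v.is_a?(String) && !v.is_a?(DemoTextNode)
  end
  arr
end

values = ["Widget", nil, DemoTextNode.new("kept")]
kept = values[2]
result = wrap_strings!(values)

p result.equal?(values)   # => true (same Array object)
p result.map(&:class)     # => [DemoTextNode, NilClass, DemoTextNode]
p result[2].equal?(kept)  # => true (not re-wrapped)
```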