Module: Pikuri::Tool::Search::DuckDuckGo

Defined in:
lib/pikuri/tool/search/duckduckgo.rb

Overview

Performs a DuckDuckGo search by scraping html.duckduckgo.com and returns the hits as a list of Result rows. Split into a thin HTTP fetch (#search) and a pure parser (#parse) so tests can exercise the parser against fixture HTML without hitting the network. The cascade in Engines.search owns the final Markdown rendering.

Privacy posture

DuckDuckGo’s privacy policy states “We don’t save your IP address or any unique identifiers alongside your searches” and “We have never sold any personal information”, and DDG proxies requests on the user’s behalf so downstream content providers can’t build a per-user search history. That part is real, but DDG is mainly a relay over Bing for web results, so the query content still reaches Microsoft for fulfillment even though DDG strips identifying info on the way out.

Bottom line: DDG is a genuine privacy improvement over hitting Bing directly (your IP isn’t tied to the query, no per-user profile is built on DDG’s side), but query content still lands at Microsoft, which has no comparable no-training pledge. For sensitive queries DDG is better than Exa and worse than Brave; for anything genuinely embarrassing, don’t search the web at all.

Constant Summary

ENDPOINT =

HTML search endpoint.

Returns:

  • (String)

'https://html.duckduckgo.com/html/'
USER_AGENT =

User-Agent sent with each request; DDG often rejects requests with no UA or an obvious bot UA.

Returns:

  • (String)

'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 ' \
'(KHTML, like Gecko) Chrome/120.0 Safari/537.36'
DEFAULT_MAX_RESULTS =

Default number of results returned, matching smolagents.

Returns:

  • (Integer)

10
LIMITER =

Paces calls (DDG bans IPs that hammer the HTML endpoint) and circuit-breaks on Engines::Unavailable, so that after a soft-block response no further requests are attempted for the next 5 minutes.

Returns:

  • (RateLimiter)

RateLimiter.new(min_interval: 2.0, cooldown: 300.0)
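
The pacing and circuit-breaking described for LIMITER can be illustrated with a minimal sketch. RateLimiter's implementation is not shown on this page, so everything below is an assumption: SketchLimiter, its injectable clock and sleeper, and the local Unavailable class are hypothetical names chosen for illustration, not the project's API.

```ruby
# Hypothetical sketch: SketchLimiter and Unavailable are illustrative names,
# not the project's RateLimiter or Engines::Unavailable.
class Unavailable < StandardError; end

class SketchLimiter
  # clock and sleeper are injectable so the behaviour can be exercised
  # without real waiting.
  def initialize(min_interval:, cooldown:,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                 sleeper: ->(seconds) { sleep(seconds) })
    @min_interval = min_interval
    @cooldown = cooldown
    @clock = clock
    @sleeper = sleeper
    @last_call = nil
    @blocked_until = nil
    @mutex = Mutex.new
  end

  def call
    @mutex.synchronize do
      now = @clock.call
      # Circuit open: fail fast instead of running the block.
      raise Unavailable, 'in cooldown' if @blocked_until && now < @blocked_until

      # Pace: wait out the remainder of the minimum interval.
      if @last_call
        wait = @min_interval - (now - @last_call)
        @sleeper.call(wait) if wait.positive?
      end
      @last_call = @clock.call

      begin
        yield
      rescue Unavailable
        # Trip the breaker for the cooldown period, then re-raise.
        @blocked_until = @clock.call + @cooldown
        raise
      end
    end
  end
end

limiter = SketchLimiter.new(min_interval: 2.0, cooldown: 300.0)
limiter.call { :fetched } # the block runs immediately on the first call
```

Injecting the clock and sleeper is a design choice made here for testability; the real class may take a different approach.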

Class Method Details

.extract_url(href) ⇒ String

Decode DuckDuckGo’s //duckduckgo.com/l/?uddg=<encoded> redirect wrapper back to the real target URL.

Parameters:

  • href (String, nil)

    href as found on the search-result page

Returns:

  • (String)

    the decoded target URL, or href unchanged when it is not a recognized DDG redirect or cannot be parsed



# File 'lib/pikuri/tool/search/duckduckgo.rb', line 182

def self.extract_url(href)
  return href if href.nil? || href.empty?

  uri = URI.parse(href.start_with?('//') ? "https:#{href}" : href)
  return href unless uri.host&.end_with?('duckduckgo.com') && uri.path == '/l/'

  params = URI.decode_www_form(uri.query.to_s).to_h
  params['uddg'] || href
rescue URI::InvalidURIError
  href
end
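
To make the three outcomes concrete (decoded redirect, pass-through, unparseable input), here is a stand-alone sketch that copies the logic above into a hypothetical module, RedirectSketch, so it runs with only the standard library. The sample hrefs are invented for illustration.

```ruby
require 'uri'

# Stand-alone copy of the decoding logic shown above, wrapped in a
# hypothetical module name so it runs without the pikuri gem.
module RedirectSketch
  def self.extract_url(href)
    return href if href.nil? || href.empty?

    # Protocol-relative hrefs get a scheme so URI.parse can read the host.
    uri = URI.parse(href.start_with?('//') ? "https:#{href}" : href)
    return href unless uri.host&.end_with?('duckduckgo.com') && uri.path == '/l/'

    # The real target is percent-encoded in the uddg query parameter.
    URI.decode_www_form(uri.query.to_s).to_h['uddg'] || href
  rescue URI::InvalidURIError
    href
  end
end

wrapped = '//duckduckgo.com/l/?uddg=https%3A%2F%2Fexample.com%2Fdocs%3Fa%3D1&rut=abc'
puts RedirectSketch.extract_url(wrapped)                 # https://example.com/docs?a=1
puts RedirectSketch.extract_url('https://example.com/x') # https://example.com/x
puts RedirectSketch.extract_url('not a valid uri')       # not a valid uri
```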

.parse(html, max_results: DEFAULT_MAX_RESULTS) ⇒ Array<Result>

Parse a html.duckduckgo.com result page into a list of Result rows. <b> highlights inside snippets are stripped.

When the page has zero result nodes, two cases are distinguished: a genuine “no results” page (narrow query, DDG’s own “No results found” indicator) returns an empty array instead of raising, so Engines.search can render its standard no-results stub. Anything else (anomaly modal, CAPTCHA, service-unavailable page, unknown layout) raises with the diagnostic text extracted from the body, so an IP soft-block is surfaced rather than silently masquerading as an empty search.

Parameters:

  • html (String)

    HTML document body from html.duckduckgo.com

  • max_results (Integer) (defaults to: DEFAULT_MAX_RESULTS)

    maximum number of result entries

Returns:

  • (Array<Result>)

    hits, possibly empty on a genuine no-results page

Raises:

  • (Engines::Unavailable)

    when the page is the DDG anomaly/CAPTCHA modal (IP soft-block) — a “try again later” the cascade can fall back from.

  • (RuntimeError)

    when the page contains no result nodes and is not recognized as either a genuine no-results page or the anomaly modal (likely a layout change worth surfacing loudly).



# File 'lib/pikuri/tool/search/duckduckgo.rb', line 101

def self.parse(html, max_results: DEFAULT_MAX_RESULTS)
  doc = Nokogiri::HTML(html)
  results = doc.css('div.result.web-result').take(max_results).filter_map do |node|
    title_link = node.at_css('a.result__a')
    next nil if title_link.nil?

    snippet = node.at_css('a.result__snippet')
    Result.new(
      url: extract_url(title_link['href']),
      title: title_link.text.strip,
      body: snippet&.text&.strip.to_s
    )
  end

  if results.empty?
    return [] if genuine_no_results?(doc)

    message = diagnose_empty(doc)
    raise(anomaly_modal?(doc) ? Engines::Unavailable : RuntimeError, message)
  end

  results
end

.search(query, max_results: DEFAULT_MAX_RESULTS) ⇒ Array<Result>

Fetch results for query and return them as an Array<Result>. Calls are throttled to one every 2s and circuit-broken for 5 minutes after a soft-block; see LIMITER. The caller (typically Engines.search) is expected to have already normalized the query and to wrap this in a result cache.

Parameters:

  • query (String)

    search query (already normalized)

  • max_results (Integer) (defaults to: DEFAULT_MAX_RESULTS)

    maximum number of result entries

Returns:

  • (Array<Result>)

    hits, possibly empty when DDG ran the query and matched nothing

Raises:

  • (Engines::Unavailable)

    when DDG soft-blocks the IP (anomaly/CAPTCHA page) or returns HTTP 429/5xx — i.e. “try again later” responses the cascade in Engines.search can fall back from. Also raised immediately if LIMITER is in cooldown.

  • (RuntimeError)

    when DDG returns any other non-success HTTP status, or when the empty-results page is in an unrecognized layout. A genuine empty-results page is not an error; see parse.



# File 'lib/pikuri/tool/search/duckduckgo.rb', line 64

def self.search(query, max_results: DEFAULT_MAX_RESULTS)
  LIMITER.call do
    response = Faraday.get(ENDPOINT, { q: query }, { 'User-Agent' => USER_AGENT })
    unless response.success?
      if response.status == 429 || response.status >= 500
        raise Engines::Unavailable, "HTTP #{response.status}"
      end

      raise "DuckDuckGo request failed: #{response.status} #{response.body}"
    end

    parse(response.body, max_results: max_results)
  end
end
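
From the caller's side, the contract above separates three outcomes: a result array (possibly empty), an Engines::Unavailable that invites a fallback, and a RuntimeError that should surface. The sketch below shows one way a cascade might act on that split. It is an assumption, not the implementation of Engines.search, which this page does not show; search_with_fallback, the local Unavailable class, and the lambda engines are hypothetical stand-ins.

```ruby
# Hypothetical stand-ins: Unavailable mirrors the role of
# Engines::Unavailable, and the engine lambdas replace real engine modules.
class Unavailable < StandardError; end

# Try each engine in order; skip the ones that signal "try again later".
# An empty array counts as a valid answer, and any other error propagates.
def search_with_fallback(engines, query)
  engines.each do |name, engine|
    begin
      return { engine: name, results: engine.call(query) }
    rescue Unavailable => e
      warn "#{name} unavailable (#{e.message}), trying next engine"
    end
  end
  raise Unavailable, 'all engines unavailable'
end

engines = {
  ddg: ->(_query) { raise Unavailable, 'HTTP 429' }, # simulated soft-block
  backup: ->(query) { ["result for #{query}"] }      # simulated healthy engine
}

outcome = search_with_fallback(engines, 'ruby nokogiri')
puts outcome[:engine] # backup
```

Rescuing only Unavailable keeps layout-change errors loud: a RuntimeError from the parser is not swallowed by the fallback loop.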