Module: Pikuri::Tool::Search::DuckDuckGo
- Defined in:
- lib/pikuri/tool/search/duckduckgo.rb
Overview
Performs a DuckDuckGo search by scraping html.duckduckgo.com and returns the hits as a list of Result rows. Split into a thin HTTP fetch (#search) and a pure parser (#parse) so tests can exercise the parser against fixture HTML without hitting the network. The cascade in Engines.search owns the final Markdown rendering.
Privacy posture
DuckDuckGo’s privacy policy states “We don’t save your IP address or any unique identifiers alongside your searches” and “We have never sold any personal information”, and it proxies requests on the user’s behalf so downstream content providers can’t build a per-user search history. That part is real, but DDG is mainly a relay over Bing for web results, so the *query content* still reaches Microsoft for fulfillment even though DDG strips identifying info on the way out.
Bottom line: DDG is a genuine privacy improvement over hitting Bing directly (your IP isn’t tied to the query, and no per-user profile is built on DDG’s side), but query content still lands at Microsoft, which has made no comparable no-training pledge. Better than Exa for sensitive queries, worse than Brave; for anything genuinely embarrassing, don’t search the web at all.
Constant Summary collapse
- ENDPOINT =
Returns HTML search endpoint.
'https://html.duckduckgo.com/html/'
- USER_AGENT =
Returns User-Agent sent with each request; DDG often rejects requests with no UA or an obvious bot UA.
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 ' \
'(KHTML, like Gecko) Chrome/120.0 Safari/537.36'
- DEFAULT_MAX_RESULTS =
Returns default number of results returned, matching smolagents.
10
- LIMITER =
Returns paces calls (DDG bans IPs that hammer the HTML endpoint) and circuit-breaks on Engines::Unavailable so a soft-block response doesn’t get retried for the next 5 minutes.
RateLimiter.new(min_interval: 2.0, cooldown: 300.0)
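The RateLimiter class itself is internal to Pikuri and not shown on this page. As an illustrative sketch only (an assumed re-implementation, not the real class), the following demonstrates the two behaviours LIMITER is described as providing: a minimum interval between calls, and a cooldown window during which calls are refused after a failure. Note the real limiter circuit-breaks specifically on Engines::Unavailable; this sketch trips on any StandardError for simplicity.

```ruby
# Hypothetical sketch of a limiter with LIMITER's shape: pace successive
# calls by min_interval, and refuse calls for `cooldown` seconds after a
# failure (a simple circuit breaker).
class SketchRateLimiter
  class CoolingDown < StandardError; end

  def initialize(min_interval:, cooldown:)
    @min_interval  = min_interval # seconds to wait between successive calls
    @cooldown      = cooldown     # seconds to refuse calls after a failure
    @last_call     = nil
    @blocked_until = nil
  end

  def call
    now = Time.now
    raise CoolingDown, 'circuit open' if @blocked_until && now < @blocked_until

    elapsed = @last_call ? now - @last_call : @min_interval
    sleep(@min_interval - elapsed) if elapsed < @min_interval
    @last_call = Time.now
    yield
  rescue CoolingDown
    raise
  rescue StandardError
    @blocked_until = Time.now + @cooldown # trip the breaker
    raise
  end
end

limiter = SketchRateLimiter.new(min_interval: 0.01, cooldown: 60.0)
limiter.call { puts 'paced call ran' }
begin
  limiter.call { raise 'simulated soft block' }
rescue StandardError
end
begin
  limiter.call { puts 'not reached' }
rescue SketchRateLimiter::CoolingDown
  puts 'circuit open: call refused for the cooldown window'
end
```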
Class Method Summary collapse
-
.extract_url(href) ⇒ String
Decode DuckDuckGo’s //duckduckgo.com/l/?uddg=<encoded> redirect wrapper back to the real target URL.
-
.parse(html, max_results: DEFAULT_MAX_RESULTS) ⇒ Array<Result>
Parse a html.duckduckgo.com result page into a list of Result rows.
-
.search(query, max_results: DEFAULT_MAX_RESULTS) ⇒ Array<Result>
Fetch results for query and return them as an Array<Result>.
Class Method Details
.extract_url(href) ⇒ String
Decode DuckDuckGo’s //duckduckgo.com/l/?uddg=<encoded> redirect wrapper back to the real target URL.
# File 'lib/pikuri/tool/search/duckduckgo.rb', line 182

def self.extract_url(href)
  return href if href.nil? || href.empty?

  uri = URI.parse(href.start_with?('//') ? "https:#{href}" : href)
  return href unless uri.host&.end_with?('duckduckgo.com') && uri.path == '/l/'

  params = URI.decode_www_form(uri.query.to_s).to_h
  params['uddg'] || href
rescue URI::InvalidURIError
  href
end
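As a standalone illustration (pure stdlib, no Pikuri required), here is the uddg round trip that extract_url undoes; the target URL is a made-up example:

```ruby
require 'uri'

# DDG links each result as //duckduckgo.com/l/?uddg=<percent-encoded target>;
# extract_url reverses this wrapping. Stdlib-only sketch of the round trip:
target  = 'https://example.com/page?a=1'
wrapped = '//duckduckgo.com/l/?uddg=' + URI.encode_www_form_component(target)

uri     = URI.parse("https:#{wrapped}") # scheme-relative -> absolute
decoded = URI.decode_www_form(uri.query.to_s).to_h['uddg']
puts decoded # => https://example.com/page?a=1
```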
.parse(html, max_results: DEFAULT_MAX_RESULTS) ⇒ Array<Result>
Parse a html.duckduckgo.com result page into a list of Result rows. <b> highlights inside snippets are stripped.
When the page has zero result nodes, two cases are distinguished: a genuine “no results” page (narrow query, DDG’s own “No results found” indicator) returns an empty array instead of raising, so Engines.search can render its standard no-results stub. Anything else (anomaly modal, CAPTCHA, service-unavailable page, unknown layout) raises with the diagnostic text extracted from the body, so an IP soft-block is surfaced rather than silently masquerading as an empty search.
# File 'lib/pikuri/tool/search/duckduckgo.rb', line 101

def self.parse(html, max_results: DEFAULT_MAX_RESULTS)
  doc = Nokogiri::HTML(html)
  results = doc.css('div.result.web-result').take(max_results).filter_map do |node|
    title_link = node.at_css('a.result__a')
    next nil if title_link.nil?

    snippet = node.at_css('a.result__snippet')
    Result.new(
      url: extract_url(title_link['href']),
      title: title_link.text.strip,
      body: snippet&.text&.strip.to_s
    )
  end

  if results.empty?
    return [] if genuine_no_results?(doc)

    message = diagnose_empty(doc)
    raise(anomaly_modal?(doc) ? Engines::Unavailable : RuntimeError, message)
  end

  results
end
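The empty-page triage above can be sketched as a tiny classifier. The marker strings below are illustrative assumptions, not DDG’s actual markup (the real checks live in the private helpers genuine_no_results?, anomaly_modal?, and diagnose_empty):

```ruby
# Hypothetical triage for a zero-result page, mirroring the logic described
# above: a genuine "no results" page -> safe empty array; an anomaly/CAPTCHA
# page -> Engines::Unavailable (trips the circuit breaker); anything else ->
# plain error with diagnostic text.
def classify_empty_page(html)
  return :no_results  if html.include?('No results found')
  return :unavailable if html.include?('anomaly-modal')

  :unknown_layout
end

puts classify_empty_page('<div class="no-results">No results found.</div>')       # no_results
puts classify_empty_page('<div class="anomaly-modal">Verify you are human</div>') # unavailable
puts classify_empty_page('<html><body></body></html>')                            # unknown_layout
```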
.search(query, max_results: DEFAULT_MAX_RESULTS) ⇒ Array<Result>
Fetch results for query and return them as an Array<Result>. Calls are throttled to one every 2s and circuit-broken for 5 minutes after a soft-block; see LIMITER. The caller (typically Engines.search) is expected to have already normalized the query and to wrap this in a result cache.
# File 'lib/pikuri/tool/search/duckduckgo.rb', line 64

def self.search(query, max_results: DEFAULT_MAX_RESULTS)
  LIMITER.call do
    response = Faraday.get(ENDPOINT, { q: query }, { 'User-Agent' => USER_AGENT })
    unless response.success?
      if response.status == 429 || response.status >= 500
        raise Engines::Unavailable, "HTTP #{response.status}"
      end

      raise "DuckDuckGo request failed: #{response.status} #{response.body}"
    end

    parse(response.body, max_results: max_results)
  end
end
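The status handling in #search reduces to a small standalone triage, sketched below without the HTTP client; Unavailable here is a local stand-in for Engines::Unavailable:

```ruby
# Sketch of #search's error branching: 429 and 5xx are transient (raise the
# circuit-breaking Unavailable so LIMITER backs off); any other non-2xx
# status is a hard failure.
class Unavailable < StandardError; end

def triage_status(status)
  return :ok if (200..299).cover?(status)

  raise Unavailable, "HTTP #{status}" if status == 429 || status >= 500

  raise "DuckDuckGo request failed: #{status}"
end

puts triage_status(200) # => ok
begin
  triage_status(429)
rescue Unavailable => e
  puts "transient: #{e.message}" # transient: HTTP 429
end
begin
  triage_status(403)
rescue Unavailable
  puts 'not reached'
rescue StandardError => e
  puts "hard failure: #{e.message}"
end
```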