Class: ContextDev::Models::WebWebCrawlMdParams
- Inherits:
- Internal::Type::BaseModel
- Inheritance chain: Object, Internal::Type::BaseModel, ContextDev::Models::WebWebCrawlMdParams
- Extended by:
- Internal::Type::RequestParameters::Converter
- Includes:
- Internal::Type::RequestParameters
- Defined in:
- lib/context_dev/models/web_web_crawl_md_params.rb
Instance Attribute Summary
- #country ⇒ Symbol, ...
  Two-letter ISO 3166-1 alpha-2 country code identifying a supported Context.dev residential proxy exit location.
- #exclude_selectors ⇒ Array<String>?
  CSS selectors to remove before each crawled page is converted to Markdown.
- #follow_subdomains ⇒ Boolean?
  When true, follow links on subdomains of the starting URL’s domain (e.g. docs.example.com when starting from example.com).
- #include_frames ⇒ Boolean?
  When true, the contents of iframes are rendered to Markdown for each crawled page.
- #include_images ⇒ Boolean?
  Include image references in the Markdown output.
- #include_links ⇒ Boolean?
  Preserve hyperlinks in the Markdown output.
- #include_selectors ⇒ Array<String>?
  CSS selectors for the HTML subtrees to keep before each crawled page is converted to Markdown.
- #max_age_ms ⇒ Integer?
  Return a cached result if a prior scrape for the same parameters exists and is younger than this many milliseconds.
- #max_depth ⇒ Integer?
  Maximum link depth from the starting URL (0 = only the starting page).
- #max_pages ⇒ Integer?
  Maximum number of pages to crawl.
- #pdf ⇒ ContextDev::Models::WebWebCrawlMdParams::Pdf?
  PDF parsing controls.
- #shorten_base64_images ⇒ Boolean?
  Truncate base64-encoded image data in the Markdown output.
- #stop_after_ms ⇒ Integer?
  Soft time budget for the crawl in milliseconds.
- #timeout_ms ⇒ Integer?
  Optional timeout in milliseconds for the request.
- #url ⇒ String
  The starting URL for the crawl (must include http:// or https:// protocol).
- #url_regex ⇒ String?
  Regex pattern; only URLs matching it are followed and scraped.
- #use_main_content_only ⇒ Boolean?
  Extract only the main content, stripping headers, footers, sidebars, and navigation.
- #wait_for_ms ⇒ Integer?
  Optional browser wait time in milliseconds after initial page load for each crawled page.
Attributes included from Internal::Type::RequestParameters
Class Method Summary
Instance Method Summary
- #initialize(url:, country: nil, exclude_selectors: nil, follow_subdomains: nil, include_frames: nil, include_images: nil, include_links: nil, include_selectors: nil, max_age_ms: nil, max_depth: nil, max_pages: nil, pdf: nil, shorten_base64_images: nil, stop_after_ms: nil, timeout_ms: nil, url_regex: nil, use_main_content_only: nil, wait_for_ms: nil, request_options: {}) ⇒ Object (constructor)
  Some parameter documentation has been truncated; see WebWebCrawlMdParams for more details.
Methods included from Internal::Type::RequestParameters::Converter
Methods included from Internal::Type::RequestParameters
Methods inherited from Internal::Type::BaseModel
==, #==, #[], coerce, #deconstruct_keys, #deep_to_h, dump, fields, hash, #hash, inherited, inspect, #inspect, known_fields, optional, recursively_to_h, required, #to_h, #to_json, #to_s, to_sorbet_type, #to_yaml
Methods included from Internal::Type::Converter
#coerce, coerce, #dump, dump, #inspect, inspect, meta_info, new_coerce_state, type_info
Methods included from Internal::Util::SorbetRuntimeSupport
#const_missing, #define_sorbet_constant!, #sorbet_constant_defined?, #to_sorbet_type, to_sorbet_type
Constructor Details
#initialize(url:, country: nil, exclude_selectors: nil, follow_subdomains: nil, include_frames: nil, include_images: nil, include_links: nil, include_selectors: nil, max_age_ms: nil, max_depth: nil, max_pages: nil, pdf: nil, shorten_base64_images: nil, stop_after_ms: nil, timeout_ms: nil, url_regex: nil, use_main_content_only: nil, wait_for_ms: nil, request_options: {}) ⇒ Object
Some parameter documentation has been truncated; see ContextDev::Models::WebWebCrawlMdParams for more details.
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 138
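The attribute declarations on this page pair most snake_case Ruby attributes with a camelCase api_name. As a rough illustration of that renaming, the sketch below copies those names into a lookup table. WIRE_NAMES and to_wire are hypothetical names introduced here, not part of the SDK, and the gem performs its own serialization internally.

```ruby
# Hypothetical lookup table. The values are copied from the `api_name:` options
# in the attribute declarations on this page. Attributes declared without an
# api_name (url, country, pdf) keep their Ruby name.
WIRE_NAMES = {
  exclude_selectors: :excludeSelectors,
  follow_subdomains: :followSubdomains,
  include_frames: :includeFrames,
  include_images: :includeImages,
  include_links: :includeLinks,
  include_selectors: :includeSelectors,
  max_age_ms: :maxAgeMs,
  max_depth: :maxDepth,
  max_pages: :maxPages,
  shorten_base64_images: :shortenBase64Images,
  stop_after_ms: :stopAfterMs,
  timeout_ms: :timeoutMS, # capital "MS", exactly as declared in the source
  url_regex: :urlRegex,
  use_main_content_only: :useMainContentOnly,
  wait_for_ms: :waitForMs
}.freeze

# Hypothetical helper: drops nil values and renames keys to their wire form.
def to_wire(params)
  params.compact.to_h { |key, value| [WIRE_NAMES.fetch(key, key), value] }
end
```

Calling to_wire(url: "https://example.com", max_depth: 1, country: nil) under these assumptions yields a hash with the keys :url and :maxDepth only.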
Instance Attribute Details
#country ⇒ Symbol, ...
Two-letter ISO 3166-1 alpha-2 country code identifying a supported Context.dev residential proxy exit location. Must be one of Context.dev’s supported countries. When provided, Context.dev fetches the target page from that country.
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 22
optional :country, enum: -> { ContextDev::WebWebCrawlMdParams::Country }
#exclude_selectors ⇒ Array<String>?
CSS selectors to remove before each crawled page is converted to Markdown. Applied after includeSelectors. Exclusion takes precedence: an element matching both is removed. Examples: “nav”, “footer”, “.ad-banner”, “[aria-hidden=true]”.
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 30
optional :exclude_selectors, ContextDev::Internal::Type::ArrayOf[String], api_name: :excludeSelectors
#follow_subdomains ⇒ Boolean?
When true, follow links on subdomains of the starting URL’s domain (e.g. docs.example.com when starting from example.com). www and apex are always treated as equivalent.
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 38
optional :follow_subdomains, ContextDev::Internal::Type::Boolean, api_name: :followSubdomains
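The host-matching rule described for follow_subdomains can be approximated in a few lines. This is only a sketch: normalize_host and in_scope? are hypothetical helpers, the service's real matching logic is not shown on this page, and the sketch assumes the starting URL's host (minus any leading "www.") is the domain whose subdomains are followed.

```ruby
require "uri"

# Strips a leading "www." so that www and apex hosts compare equal.
def normalize_host(host)
  host.downcase.sub(/\Awww\./, "")
end

# Hypothetical helper: decides whether `link` is in scope for a crawl that
# started at `start_url`.
def in_scope?(start_url, link, follow_subdomains: false)
  start_host = normalize_host(URI.parse(start_url).host)
  link_host = normalize_host(URI.parse(link).host)
  return true if link_host == start_host

  # Subdomains only count when the flag is set.
  follow_subdomains && link_host.end_with?(".#{start_host}")
end
```

With these assumptions, https://www.example.com/a is always in scope for a crawl started at https://example.com, while https://docs.example.com/a is in scope only when follow_subdomains is true.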
#include_frames ⇒ Boolean?
When true, the contents of iframes are rendered to Markdown for each crawled page.
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 45
optional :include_frames, ContextDev::Internal::Type::Boolean, api_name: :includeFrames
#include_images ⇒ Boolean?
Include image references in the Markdown output.
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 51
optional :include_images, ContextDev::Internal::Type::Boolean, api_name: :includeImages
#include_links ⇒ Boolean?
Preserve hyperlinks in the Markdown output.
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 57
optional :include_links, ContextDev::Internal::Type::Boolean, api_name: :includeLinks
#include_selectors ⇒ Array<String>?
CSS selectors. When provided, only matching HTML subtrees (and their descendants) are kept before each crawled page is converted to Markdown. When omitted, the entire document is kept. Examples: “article.main”, “#content”, “[role=main]”.
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 66
optional :include_selectors, ContextDev::Internal::Type::ArrayOf[String], api_name: :includeSelectors
#max_age_ms ⇒ Integer?
Return a cached result if a prior scrape for the same parameters exists and is younger than this many milliseconds. Defaults to 1 day (86400000 ms) when omitted. Max is 30 days (2592000000 ms). Set to 0 to always scrape fresh.
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 74
optional :max_age_ms, Integer, api_name: :maxAgeMs
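The millisecond figures quoted for max_age_ms follow from simple arithmetic, and the freshness rule can be sketched as a comparison. The constants and cache_usable? below are hypothetical names used for illustration; how the service actually keys and evaluates its cache is not described here.

```ruby
MS_PER_DAY = 24 * 60 * 60 * 1000     # 86_400_000
DEFAULT_MAX_AGE_MS = MS_PER_DAY      # documented default: 1 day
MAX_MAX_AGE_MS = 30 * MS_PER_DAY     # documented maximum: 2_592_000_000

# Hypothetical helper: true when a cached result of the given age may be reused.
def cache_usable?(age_ms, max_age_ms: nil)
  limit = max_age_ms.nil? ? DEFAULT_MAX_AGE_MS : max_age_ms
  unless (0..MAX_MAX_AGE_MS).cover?(limit)
    raise ArgumentError, "max_age_ms must be between 0 and #{MAX_MAX_AGE_MS}"
  end

  # "Younger than" is a strict comparison, so a limit of 0 never matches
  # and always forces a fresh scrape.
  age_ms < limit
end
```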
#max_depth ⇒ Integer?
Maximum link depth from the starting URL (0 = only the starting page).
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 80
optional :max_depth, Integer, api_name: :maxDepth
#max_pages ⇒ Integer?
Maximum number of pages to crawl. Hard cap: 500.
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 86
optional :max_pages, Integer, api_name: :maxPages
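To make the interaction between max_depth and max_pages concrete, here is a small traversal over an in-memory link graph. It assumes a breadth-first visiting order, which this page does not confirm, and crawl_order is a hypothetical function rather than anything the SDK exposes.

```ruby
# Hypothetical breadth-first traversal. `links` maps each URL to the URLs it
# links to; the return value is the list of pages that would be visited.
def crawl_order(links, start, max_depth:, max_pages:)
  max_pages = [max_pages, 500].min # documented hard cap
  visited = [start]
  queue = [[start, 0]]

  until queue.empty?
    url, depth = queue.shift
    next if depth >= max_depth # do not follow links beyond the depth limit

    links.fetch(url, []).each do |child|
      return visited if visited.size >= max_pages
      next if visited.include?(child)

      visited << child
      queue << [child, depth + 1]
    end
  end

  visited
end
```

With max_depth: 0 the function returns only the starting page, matching the "0 = only the starting page" note above.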
#pdf ⇒ ContextDev::Models::WebWebCrawlMdParams::Pdf?
PDF parsing controls. Use start/end to limit text extraction and OCR to an inclusive 1-based page range.
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 93
optional :pdf, -> { ContextDev::WebWebCrawlMdParams::Pdf }
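An inclusive 1-based page range maps onto zero-based array indices by subtracting one from each bound. The helper below illustrates that arithmetic only; select_pages and its keyword names are hypothetical, and the fields of the Pdf model itself are not listed on this page.

```ruby
# Hypothetical helper: picks pages start..finish (1-based, inclusive) from an
# array of per-page texts.
def select_pages(pages, start: 1, finish: pages.length)
  raise ArgumentError, "start must be >= 1" if start < 1
  raise ArgumentError, "finish must be >= start" if finish < start

  # Shift both bounds down by one to get zero-based indices.
  pages[(start - 1)..(finish - 1)] || []
end
```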
#shorten_base64_images ⇒ Boolean?
Truncate base64-encoded image data in the Markdown output.
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 99
optional :shorten_base64_images, ContextDev::Internal::Type::Boolean, api_name: :shortenBase64Images
#stop_after_ms ⇒ Integer?
Soft time budget for the crawl in milliseconds. After each scrape, the crawler checks the elapsed time and, if exceeded, returns the pages collected so far instead of continuing. Min: 10000 (10s). Max: 110000 (110s). Default: 80000 (80s).
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 108
optional :stop_after_ms, Integer, api_name: :stopAfterMs
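The soft-budget behaviour (check the elapsed time after each scrape, then return what has been collected) can be sketched as a loop. Everything here is illustrative: crawl_with_budget, scrape and clock are hypothetical names, and rejecting out-of-range budgets with ArgumentError is an assumption about how a client might enforce the documented bounds.

```ruby
# Hypothetical crawl loop with a soft time budget. `scrape` is any callable
# that returns a page for a URL; `clock` returns the current time in
# milliseconds and is injectable so the loop can be tested without sleeping.
def crawl_with_budget(urls, stop_after_ms: 80_000, scrape:, clock:)
  unless (10_000..110_000).cover?(stop_after_ms)
    raise ArgumentError, "stop_after_ms must be between 10000 and 110000"
  end

  started = clock.call
  pages = []
  urls.each do |url|
    pages << scrape.call(url)
    # The budget is only checked after a scrape finishes, so a scrape that is
    # already in flight is never interrupted.
    break if clock.call - started > stop_after_ms
  end
  pages
end
```

In real use the clock could be something like `-> { Process.clock_gettime(Process::CLOCK_MONOTONIC, :millisecond) }`. Because the check runs between scrapes, the total run time can exceed the budget by up to one scrape.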
#timeout_ms ⇒ Integer?
Optional timeout in milliseconds for the request. If the request takes longer than this value, it is aborted with a 408 status code. The maximum allowed value is 300000 ms (5 minutes).
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 116
optional :timeout_ms, Integer, api_name: :timeoutMS
#url ⇒ String
The starting URL for the crawl (must include http:// or https:// protocol).
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 14
required :url, String
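Since the URL must carry an explicit http:// or https:// scheme, a caller may want to check it before sending a request. The helper below is a hypothetical pre-flight check built on Ruby's standard URI library; the service's own validation rules may differ.

```ruby
require "uri"

# Hypothetical pre-flight check: true when the string parses as an http or
# https URL with a non-empty host. URI::HTTPS is a subclass of URI::HTTP, so
# one is_a? test covers both schemes.
def crawlable_url?(url)
  uri = URI.parse(url)
  uri.is_a?(URI::HTTP) && !uri.host.nil? && !uri.host.empty?
rescue URI::InvalidURIError
  false
end
```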
#url_regex ⇒ String?
Regex pattern. Only URLs matching this pattern will be followed and scraped.
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 122
optional :url_regex, String, api_name: :urlRegex
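The effect of url_regex can be previewed locally by filtering a list of candidate URLs. This sketch uses Ruby's Regexp as a stand-in; which regex dialect the service applies is not stated on this page, and filter_urls is a hypothetical helper.

```ruby
# Hypothetical filter: keeps only the URLs that match the pattern.
def filter_urls(urls, url_regex)
  pattern = Regexp.new(url_regex)
  urls.select { |url| pattern.match?(url) }
end
```

For example, the pattern "^https://example\\.com/docs/" would keep https://example.com/docs/a and drop https://example.com/blog/b.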
#use_main_content_only ⇒ Boolean?
Extract only the main content, stripping headers, footers, sidebars, and navigation.
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 129
optional :use_main_content_only, ContextDev::Internal::Type::Boolean, api_name: :useMainContentOnly
#wait_for_ms ⇒ Integer?
Optional browser wait time in milliseconds after initial page load for each crawled page. Min: 0. Max: 30000 (30 seconds).
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 136
optional :wait_for_ms, Integer, api_name: :waitForMs
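Several attributes above state numeric bounds. A client-side check that collects them in one place might look like the sketch below. LIMITS and limit_violations are hypothetical names; the bounds are copied from the attribute descriptions on this page, and the server remains the authority on what it accepts.

```ruby
# Documented bounds as [min, max]; nil means no bound is stated on this page.
LIMITS = {
  max_age_ms: [nil, 2_592_000_000],
  max_pages: [nil, 500],
  stop_after_ms: [10_000, 110_000],
  timeout_ms: [nil, 300_000],
  wait_for_ms: [0, 30_000]
}.freeze

# Hypothetical client-side check: returns a list of human-readable problems,
# which is empty when every supplied value is within its documented bounds.
def limit_violations(params)
  LIMITS.each_with_object([]) do |(name, (min, max)), problems|
    value = params[name]
    next if value.nil? # omitted parameters fall back to server defaults

    problems << "#{name} must be >= #{min}" if min && value < min
    problems << "#{name} must be <= #{max}" if max && value > max
  end
end
```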
Class Method Details
.values ⇒ Array<Symbol>
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 391