Class: ContextDev::Models::WebWebCrawlMdParams
- Inherits: Internal::Type::BaseModel
- Extended by: Internal::Type::RequestParameters::Converter
- Includes: Internal::Type::RequestParameters
- Defined in: lib/context_dev/models/web_web_crawl_md_params.rb
Instance Attribute Summary
- #follow_subdomains ⇒ Boolean?
  When true, follow links on subdomains of the starting URL’s domain (e.g. docs.example.com when starting from example.com).
- #include_frames ⇒ Boolean?
  When true, the contents of iframes are rendered to Markdown for each crawled page.
- #include_images ⇒ Boolean?
  Include image references in the Markdown output.
- #include_links ⇒ Boolean?
  Preserve hyperlinks in the Markdown output.
- #max_age_ms ⇒ Integer?
  Return a cached result if a prior scrape for the same parameters exists and is younger than this many milliseconds.
- #max_depth ⇒ Integer?
  Maximum link depth from the starting URL (0 = only the starting page).
- #max_pages ⇒ Integer?
  Maximum number of pages to crawl.
- #parse_pdf ⇒ Boolean?
  When true (default), PDF pages are fetched and their text layer is extracted and converted to Markdown alongside HTML pages.
- #shorten_base64_images ⇒ Boolean?
  Truncate base64-encoded image data in the Markdown output.
- #url ⇒ String
  The starting URL for the crawl (must include the http:// or https:// protocol).
- #url_regex ⇒ String?
  Regex pattern; only URLs matching it are followed and scraped.
- #use_main_content_only ⇒ Boolean?
  Extract only the main content, stripping headers, footers, sidebars, and navigation.
Attributes included from Internal::Type::RequestParameters
Instance Method Summary
- #initialize(url:, follow_subdomains: nil, include_frames: nil, include_images: nil, include_links: nil, max_age_ms: nil, max_depth: nil, max_pages: nil, parse_pdf: nil, shorten_base64_images: nil, url_regex: nil, use_main_content_only: nil, request_options: {}) ⇒ Object (constructor)
  Some parameter documentation has been truncated; see WebWebCrawlMdParams for more details.
Methods included from Internal::Type::RequestParameters::Converter
Methods included from Internal::Type::RequestParameters
Methods inherited from Internal::Type::BaseModel
==, #==, #[], coerce, #deconstruct_keys, #deep_to_h, dump, fields, hash, #hash, inherited, inspect, #inspect, known_fields, optional, recursively_to_h, required, #to_h, #to_json, #to_s, to_sorbet_type, #to_yaml
Methods included from Internal::Type::Converter
#coerce, coerce, #dump, dump, #inspect, inspect, meta_info, new_coerce_state, type_info
Methods included from Internal::Util::SorbetRuntimeSupport
#const_missing, #define_sorbet_constant!, #sorbet_constant_defined?, #to_sorbet_type, to_sorbet_type
Constructor Details
#initialize(url:, follow_subdomains: nil, include_frames: nil, include_images: nil, include_links: nil, max_age_ms: nil, max_depth: nil, max_pages: nil, parse_pdf: nil, shorten_base64_images: nil, url_regex: nil, use_main_content_only: nil, request_options: {}) ⇒ Object
Some parameter documentation has been truncated; see ContextDev::Models::WebWebCrawlMdParams for more details.
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 90
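The constructor signature above maps onto a plain hash of keyword arguments, of which only url: is required. Below is a minimal sketch in plain Ruby; the client.web.crawl_md call in the trailing comment is an assumed method name for illustration only and is not part of this class's documented API.

```ruby
# A plain-Ruby sketch of arguments accepted by the constructor above.
# Only :url is required; every other parameter defaults to nil.
params = {
  url: "https://example.com/docs",  # must include http:// or https://
  max_depth: 2,                     # 0 would crawl only the starting page
  max_pages: 50,                    # documented hard cap is 500
  follow_subdomains: true,          # also follow docs.example.com, etc.
  use_main_content_only: true,      # strip headers, footers, navigation
  max_age_ms: 0                     # 0 = always scrape fresh, skip the cache
}

# Hypothetical usage (client method name assumed, not documented here):
# client.web.crawl_md(**params)
```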
Instance Attribute Details
#follow_subdomains ⇒ Boolean?
When true, follow links on subdomains of the starting URL’s domain (e.g. docs.example.com when starting from example.com). www and apex are always treated as equivalent.
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 22
optional :follow_subdomains, ContextDev::Internal::Type::Boolean, api_name: :followSubdomains
#include_frames ⇒ Boolean?
When true, the contents of iframes are rendered to Markdown for each crawled page.
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 29
optional :include_frames, ContextDev::Internal::Type::Boolean, api_name: :includeFrames
#include_images ⇒ Boolean?
Include image references in the Markdown output.

# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 35
optional :include_images, ContextDev::Internal::Type::Boolean, api_name: :includeImages
#include_links ⇒ Boolean?
Preserve hyperlinks in the Markdown output.

# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 41
optional :include_links, ContextDev::Internal::Type::Boolean, api_name: :includeLinks
#max_age_ms ⇒ Integer?
Return a cached result if a prior scrape for the same parameters exists and is younger than this many milliseconds. Defaults to 1 day (86400000 ms) when omitted. Max is 30 days (2592000000 ms). Set to 0 to always scrape fresh.
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 49
optional :max_age_ms, Integer, api_name: :maxAgeMs
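The documented cache windows can be sanity-checked with quick arithmetic:

```ruby
# Converting the documented max_age_ms cache windows to milliseconds.
one_day_ms     = 24 * 60 * 60 * 1000  # default when max_age_ms is omitted
thirty_days_ms = 30 * one_day_ms      # documented maximum
# one_day_ms     == 86_400_000
# thirty_days_ms == 2_592_000_000
```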
#max_depth ⇒ Integer?
Maximum link depth from the starting URL (0 = only the starting page).

# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 55
optional :max_depth, Integer, api_name: :maxDepth
#max_pages ⇒ Integer?
Maximum number of pages to crawl. Hard cap: 500.
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 61
optional :max_pages, Integer, api_name: :maxPages
#parse_pdf ⇒ Boolean?
When true (default), PDF pages are fetched and their text layer is extracted and converted to Markdown alongside HTML pages. When false, PDF pages are skipped entirely (not included in results and not counted as failures).
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 69
optional :parse_pdf, ContextDev::Internal::Type::Boolean, api_name: :parsePDF
#shorten_base64_images ⇒ Boolean?
Truncate base64-encoded image data in the Markdown output.

# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 75
optional :shorten_base64_images, ContextDev::Internal::Type::Boolean, api_name: :shortenBase64Images
#url ⇒ String
The starting URL for the crawl (must include the http:// or https:// protocol).

# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 14
required :url, String
#url_regex ⇒ String?
Regex pattern. Only URLs matching this pattern will be followed and scraped.
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 81
optional :url_regex, String, api_name: :urlRegex
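To illustrate the kind of pattern url_regex accepts, the sketch below filters candidate URLs locally with Ruby's own Regexp. The exact server-side regex dialect and matching semantics are not specified here, so treat this as an approximation; the pattern and URLs are invented for illustration.

```ruby
# Restrict a crawl to blog pages only (illustrative pattern).
pattern = %r{^https://example\.com/blog/}

followed = [
  "https://example.com/blog/intro",
  "https://example.com/pricing",
  "https://example.com/blog/2024/review"
].select { |u| u.match?(pattern) }
# followed keeps only the two /blog/ URLs
```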
#use_main_content_only ⇒ Boolean?
Extract only the main content, stripping headers, footers, sidebars, and navigation.

# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 88
optional :use_main_content_only, ContextDev::Internal::Type::Boolean, api_name: :useMainContentOnly