Class: ContextDev::Models::WebWebCrawlMdParams
- Inherits:
- Internal::Type::BaseModel
- Extended by:
- Internal::Type::RequestParameters::Converter
- Includes:
- Internal::Type::RequestParameters
- Defined in:
- lib/context_dev/models/web_web_crawl_md_params.rb
Overview
Instance Attribute Summary
-
#follow_subdomains ⇒ Boolean?
When true, follow links on subdomains of the starting URL’s domain (e.g. docs.example.com when starting from example.com).
-
#include_images ⇒ Boolean?
Include image references in the Markdown output.
-
#include_links ⇒ Boolean?
Preserve hyperlinks in the Markdown output.
-
#max_age_ms ⇒ Integer?
Return a cached result if a prior scrape for the same parameters exists and is younger than this many milliseconds.
-
#max_depth ⇒ Integer?
Maximum link depth from the starting URL (0 = only the starting page).
-
#max_pages ⇒ Integer?
Maximum number of pages to crawl.
-
#parse_pdf ⇒ Boolean?
When true (default), PDF pages are fetched and their text layer is extracted and converted to Markdown alongside HTML pages.
-
#shorten_base64_images ⇒ Boolean?
Truncate base64-encoded image data in the Markdown output.
-
#url ⇒ String
The starting URL for the crawl (must include http:// or https:// protocol).
-
#url_regex ⇒ String?
Regex pattern.
-
#use_main_content_only ⇒ Boolean?
Extract only the main content, stripping headers, footers, sidebars, and navigation.
Attributes included from Internal::Type::RequestParameters
Instance Method Summary
-
#initialize(url:, follow_subdomains: nil, include_images: nil, include_links: nil, max_age_ms: nil, max_depth: nil, max_pages: nil, parse_pdf: nil, shorten_base64_images: nil, url_regex: nil, use_main_content_only: nil, request_options: {}) ⇒ Object
constructor
Some parameter documentation has been truncated; see WebWebCrawlMdParams for more details.
Methods included from Internal::Type::RequestParameters::Converter
Methods included from Internal::Type::RequestParameters
Methods inherited from Internal::Type::BaseModel
==, #==, #[], coerce, #deconstruct_keys, #deep_to_h, dump, fields, hash, #hash, inherited, inspect, #inspect, known_fields, optional, recursively_to_h, required, #to_h, #to_json, #to_s, to_sorbet_type, #to_yaml
Methods included from Internal::Type::Converter
#coerce, coerce, #dump, dump, #inspect, inspect, meta_info, new_coerce_state, type_info
Methods included from Internal::Util::SorbetRuntimeSupport
#const_missing, #define_sorbet_constant!, #sorbet_constant_defined?, #to_sorbet_type, to_sorbet_type
Constructor Details
#initialize(url:, follow_subdomains: nil, include_images: nil, include_links: nil, max_age_ms: nil, max_depth: nil, max_pages: nil, parse_pdf: nil, shorten_base64_images: nil, url_regex: nil, use_main_content_only: nil, request_options: {}) ⇒ Object
Some parameter documentation has been truncated; see ContextDev::Models::WebWebCrawlMdParams for more details.
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 83
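To make the snake_case-to-camelCase mapping above concrete, here is a minimal plain-Ruby sketch (not part of the generated model) of how these attribute names translate to the wire names declared via `api_name:`, with omitted (`nil`) parameters dropped:

```ruby
# Wire names for each attribute, as declared by `api_name:` in this model.
API_NAMES = {
  url: :url,                                   # required
  follow_subdomains: :followSubdomains,
  include_images: :includeImages,
  include_links: :includeLinks,
  max_age_ms: :maxAgeMs,
  max_depth: :maxDepth,
  max_pages: :maxPages,
  parse_pdf: :parsePDF,
  shorten_base64_images: :shortenBase64Images,
  url_regex: :urlRegex,
  use_main_content_only: :useMainContentOnly
}.freeze

# Build a camelCase request body, dropping parameters left as nil.
# (Illustration only; the real serialization lives in Internal::Type::BaseModel.)
def crawl_body(params)
  params.reject { |_, v| v.nil? }.to_h { |k, v| [API_NAMES.fetch(k), v] }
end

body = crawl_body(url: "https://example.com", max_depth: 1, parse_pdf: nil)
# body => { url: "https://example.com", maxDepth: 1 }
```

The actual model performs this translation internally; callers always use the snake_case keyword arguments shown in the constructor signature.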
Instance Attribute Details
#follow_subdomains ⇒ Boolean?
When true, follow links on subdomains of the starting URL’s domain (e.g. docs.example.com when starting from example.com). www and apex are always treated as equivalent.
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 22
optional :follow_subdomains, ContextDev::Internal::Type::Boolean, api_name: :followSubdomains
#include_images ⇒ Boolean?
Include image references in the Markdown output.
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 28
optional :include_images, ContextDev::Internal::Type::Boolean, api_name: :includeImages
#include_links ⇒ Boolean?
Preserve hyperlinks in the Markdown output.
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 34
optional :include_links, ContextDev::Internal::Type::Boolean, api_name: :includeLinks
#max_age_ms ⇒ Integer?
Return a cached result if a prior scrape for the same parameters exists and is younger than this many milliseconds. Defaults to 1 day (86400000 ms) when omitted. Max is 30 days (2592000000 ms). Set to 0 to always scrape fresh.
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 42
optional :max_age_ms, Integer, api_name: :maxAgeMs
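A quick sketch to make the documented units and bounds concrete. The clamping helper is purely illustrative (the server enforces the 30-day maximum; client-side clamping is an assumption, not part of this model):

```ruby
ONE_DAY_MS     = 24 * 60 * 60 * 1000  # 86_400_000 ms, the documented default
THIRTY_DAYS_MS = 30 * ONE_DAY_MS      # 2_592_000_000 ms, the documented maximum

# Keep a caller-supplied max_age_ms inside the documented range.
# 0 always forces a fresh scrape.
def clamped_max_age(ms)
  ms.clamp(0, THIRTY_DAYS_MS)
end
```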
#max_depth ⇒ Integer?
Maximum link depth from the starting URL (0 = only the starting page).
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 48
optional :max_depth, Integer, api_name: :maxDepth
#max_pages ⇒ Integer?
Maximum number of pages to crawl. Hard cap: 500.
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 54
optional :max_pages, Integer, api_name: :maxPages
#parse_pdf ⇒ Boolean?
When true (default), PDF pages are fetched and their text layer is extracted and converted to Markdown alongside HTML pages. When false, PDF pages are skipped entirely (not included in results and not counted as failures).
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 62
optional :parse_pdf, ContextDev::Internal::Type::Boolean, api_name: :parsePDF
#shorten_base64_images ⇒ Boolean?
Truncate base64-encoded image data in the Markdown output.
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 68
optional :shorten_base64_images, ContextDev::Internal::Type::Boolean, api_name: :shortenBase64Images
#url ⇒ String
The starting URL for the crawl (must include http:// or https:// protocol).
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 14
required :url, String
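Since the scheme requirement is easy to trip over (bare hostnames are rejected), a client-side pre-check can catch it before the request is sent. This helper is an illustration using Ruby's standard `uri` library, not part of this model:

```ruby
require "uri"

# Returns true only when `url` carries an explicit http:// or https:// scheme,
# matching the documented requirement for the `url` parameter.
def crawlable_url?(url)
  uri = URI.parse(url)
  %w[http https].include?(uri.scheme)
rescue URI::InvalidURIError
  false
end
```

For example, `crawlable_url?("example.com")` is false because the scheme is missing, while `crawlable_url?("https://example.com")` is true.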
#url_regex ⇒ String?
Regex pattern. Only URLs matching this pattern will be followed and scraped.
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 74
optional :url_regex, String, api_name: :urlRegex
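The pattern is passed as a String. A sketch of the kind of value you might supply, checked locally against Ruby's `Regexp` (whether the server's regex dialect matches Ruby's exactly is an assumption; keep patterns simple):

```ruby
# A url_regex value restricting the crawl to pages under /docs/.
url_regex = "^https://example\\.com/docs/"

# Locally preview which discovered URLs the pattern would admit.
pattern  = Regexp.new(url_regex)
urls     = [
  "https://example.com/docs/getting-started",
  "https://example.com/blog/announcement"
]
followed = urls.grep(pattern)
# followed => ["https://example.com/docs/getting-started"]
```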
#use_main_content_only ⇒ Boolean?
Extract only the main content, stripping headers, footers, sidebars, and navigation.
# File 'lib/context_dev/models/web_web_crawl_md_params.rb', line 81
optional :use_main_content_only, ContextDev::Internal::Type::Boolean, api_name: :useMainContentOnly