Class: Archaeo::AssetExtractor
- Inherits: Object (Object → Archaeo::AssetExtractor)
- Defined in: lib/archaeo/asset_extractor.rb
Overview
Extracts resource URLs from archived HTML content using Nokogiri.
Parses the HTML DOM to find CSS, JavaScript, images, fonts, and media resources referenced by the page. Optionally resolves relative URLs against a base URL.
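The optional URL resolution mentioned above can be illustrated with Ruby's standard `URI.join`. This is a minimal sketch only: `resolve_url` is a hypothetical helper, not a method of `Archaeo::AssetExtractor`, and the fallback behaviour (returning the reference unchanged) is an assumption rather than the library's documented behaviour.

```ruby
require "uri"

# Hypothetical helper (not part of Archaeo): resolve +ref+ against +base_url+.
# Assumption: with no base URL, or when parsing fails, the reference is
# returned unchanged.
def resolve_url(ref, base_url = nil)
  return ref if base_url.nil?

  URI.join(base_url, ref).to_s
rescue URI::Error
  ref
end

base = "https://example.com/blog/post.html"
resolve_url("css/site.css", base)             # => "https://example.com/blog/css/site.css"
resolve_url("/img/logo.png", base)            # => "https://example.com/img/logo.png"
resolve_url("//cdn.example.com/app.js", base) # => "https://cdn.example.com/app.js"
resolve_url("css/site.css")                   # => "css/site.css"
```

Path-relative, root-relative, and scheme-relative references all resolve against the same base, which is why a single `base_url:` argument is enough for a whole page.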
Constant Summary
- FONT_CDN_PATTERNS =
    %w[
      fonts.googleapis.com
      fonts.gstatic.com
      use.typekit.net
      fast.fonts.net
      cloud.typography.com
    ].freeze
- CSS_URL_PATTERN =
    /url\(\s*['"]?([^'")\s]+)['"]?\s*\)/
- CSS_IMAGE_PROPS =
    Regexp.new(
      "(?:background-image|background|list-style-image|content|cursor)" \
      "\\s*:[^;]*#{CSS_URL_PATTERN.source}",
    )
- PRELOAD_TYPE_MAP =
    {
      "style" => :css,
      "script" => :js,
      "image" => :image,
    }.freeze
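To see what the constants above match, the sketch below copies their definitions and applies them to a small stylesheet. How the class uses them internally is not shown on this page, so `font_cdn?` is a hypothetical helper and the substring check it performs is an assumption.

```ruby
# Constants copied from the summary above.
FONT_CDN_PATTERNS = %w[
  fonts.googleapis.com fonts.gstatic.com use.typekit.net
  fast.fonts.net cloud.typography.com
].freeze
CSS_URL_PATTERN = /url\(\s*['"]?([^'")\s]+)['"]?\s*\)/
CSS_IMAGE_PROPS = Regexp.new(
  "(?:background-image|background|list-style-image|content|cursor)" \
  "\\s*:[^;]*#{CSS_URL_PATTERN.source}",
)
PRELOAD_TYPE_MAP = { "style" => :css, "script" => :js, "image" => :image }.freeze

# Hypothetical helper: treat a URL as a font CDN resource when it contains
# one of the known hosts (assumed usage, not confirmed by the source).
def font_cdn?(url)
  FONT_CDN_PATTERNS.any? { |host| url.include?(host) }
end

css = <<~'CSS'
  @font-face { src: url("fonts/body.woff2") format("woff2"); }
  .hero { background: #fff url( 'img/hero.png' ) no-repeat; }
  li { list-style-image: url(bullet.svg); }
CSS

# Every url(...) reference, quoted or not.
css.scan(CSS_URL_PATTERN).flatten
# => ["fonts/body.woff2", "img/hero.png", "bullet.svg"]

# Only url(...) values attached to image-bearing properties.
css.scan(CSS_IMAGE_PROPS).flatten
# => ["img/hero.png", "bullet.svg"]

font_cdn?("https://fonts.gstatic.com/s/inter/v12/a.woff2") # => true
PRELOAD_TYPE_MAP["style"]                                  # => :css
```

Note that `CSS_IMAGE_PROPS` skips the `@font-face` source because `src` is not in its property list, while `CSS_URL_PATTERN` on its own picks up every `url(...)`.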
Instance Method Summary
- #extract ⇒ Object
- #initialize(html, base_url: nil) ⇒ AssetExtractor (constructor): A new instance of AssetExtractor.
Constructor Details
#initialize(html, base_url: nil) ⇒ AssetExtractor
Returns a new instance of AssetExtractor.
    # File 'lib/archaeo/asset_extractor.rb', line 33
    def initialize(html, base_url: nil)
      @doc = Nokogiri::HTML(html.to_s)
      @base_url = base_url
    end
Instance Method Details
#extract ⇒ Object
    # File 'lib/archaeo/asset_extractor.rb', line 38
    def extract
      list = AssetList.new
      extract_css(list)
      extract_js(list)
      extract_images(list)
      extract_fonts(list)
      extract_media(list)
      extract_inline_css(list)
      extract_inline_styles(list)
      extract_preloads(list)
      list
    end
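The shape of `#extract` (create one list, let each private step append to it, return the list) can be sketched without the gem. Everything below is a stand-in: `SketchAssetList` and `SketchExtractor` are hypothetical names, the real `AssetList` interface is not shown on this page, and plain regular expressions replace the Nokogiri DOM queries purely to keep the sketch dependency-free.

```ruby
# Stand-in for Archaeo's AssetList (its real interface is not documented here).
SketchAssetList = Struct.new(:css, :js) do
  def initialize
    super([], [])
  end
end

# Hypothetical, simplified extractor mirroring the structure of #extract.
class SketchExtractor
  def initialize(html)
    @html = html.to_s # nil becomes "", as with html.to_s in the real constructor
  end

  # Build one list, pass it through each step, and return it.
  def extract
    list = SketchAssetList.new
    extract_css(list)
    extract_js(list)
    list
  end

  private

  # Regex-based stand-in for a DOM query over <link href="...css">.
  def extract_css(list)
    list.css.concat(@html.scan(/<link[^>]+href=["']([^"']+\.css)["']/i).flatten)
  end

  # Regex-based stand-in for a DOM query over <script src="...">.
  def extract_js(list)
    list.js.concat(@html.scan(/<script[^>]+src=["']([^"']+)["']/i).flatten)
  end
end

html = '<link rel="stylesheet" href="a.css"><script src="b.js"></script>'
assets = SketchExtractor.new(html).extract
assets.css # => ["a.css"]
assets.js  # => ["b.js"]
```

Passing a single accumulator through independent steps keeps each step small and lets new asset kinds be added as one more line in `extract`.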