Class: Archaeo::AssetExtractor

Inherits:
Object
Defined in:
lib/archaeo/asset_extractor.rb

Overview

Extracts resource URLs from archived HTML content using Nokogiri.

Parses the HTML DOM to find CSS, JavaScript, images, fonts, and media resources referenced by the page. Optionally resolves relative URLs against a base URL.

Constant Summary

FONT_CDN_PATTERNS =
%w[
  fonts.googleapis.com
  fonts.gstatic.com
  use.typekit.net
  fast.fonts.net
  cloud.typography.com
].freeze
CSS_URL_PATTERN =
/url\(\s*['"]?([^'")\s]+)['"]?\s*\)/
CSS_IMAGE_PROPS =
Regexp.new(
  "(?:background-image|background|list-style-image|content|cursor)" \
  "\\s*:[^;]*#{CSS_URL_PATTERN.source}",
)
PRELOAD_TYPE_MAP =
{
  "style" => :css,
  "script" => :js,
  "image" => :image,
}.freeze
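
The following is a minimal sketch of how these constants behave when applied to sample input. The constant definitions are copied from the summary above; the sample stylesheet, the `font_cdn?` helper, and the substring-based host check are assumptions made for illustration, since this page does not show how the extractor actually consumes `FONT_CDN_PATTERNS`.

```ruby
require "uri"

# Constants copied verbatim from the summary above.
FONT_CDN_PATTERNS = %w[
  fonts.googleapis.com
  fonts.gstatic.com
  use.typekit.net
  fast.fonts.net
  cloud.typography.com
].freeze
CSS_URL_PATTERN = /url\(\s*['"]?([^'")\s]+)['"]?\s*\)/
CSS_IMAGE_PROPS = Regexp.new(
  "(?:background-image|background|list-style-image|content|cursor)" \
  "\\s*:[^;]*#{CSS_URL_PATTERN.source}",
)
PRELOAD_TYPE_MAP = {
  "style" => :css,
  "script" => :js,
  "image" => :image,
}.freeze

# Hypothetical sample stylesheet (not from the library).
css = <<~CSS
  @font-face { src: url("fonts/a.woff2"); }
  .hero { background: #fff url('img/hero.png') no-repeat; }
CSS

# CSS_URL_PATTERN captures every url(...) target, quoted or not.
all_urls = css.scan(CSS_URL_PATTERN).flatten

# CSS_IMAGE_PROPS only matches url(...) inside image-bearing properties.
image_urls = css.scan(CSS_IMAGE_PROPS).flatten

# Hypothetical helper: treats a URL as a font CDN resource when its host
# contains one of the listed names. The real matching rule is an assumption.
def font_cdn?(url)
  host = URI.parse(url).host.to_s
  FONT_CDN_PATTERNS.any? { |pattern| host.include?(pattern) }
end

# Preload "as" values outside the map have no entry and return nil.
style_type = PRELOAD_TYPE_MAP["style"]
unknown_type = PRELOAD_TYPE_MAP["font"]
```

Note that `CSS_IMAGE_PROPS` skips the `@font-face` source because `src` is not in its property list, while `CSS_URL_PATTERN` alone picks up both URLs.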

Instance Method Summary

#extract ⇒ Object
#initialize(html, base_url: nil) ⇒ AssetExtractor (constructor)

Constructor Details

#initialize(html, base_url: nil) ⇒ AssetExtractor

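Returns a new instance of AssetExtractor. The html argument is coerced to a string with to_s and parsed with Nokogiri::HTML; base_url is optional and, when given, is stored for resolving relative URLs.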



# File 'lib/archaeo/asset_extractor.rb', line 33

def initialize(html, base_url: nil)
  @doc = Nokogiri::HTML(html.to_s)
  @base_url = base_url
end
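
As a rough illustration of what resolving relative URLs against `base_url` can look like, the sketch below uses `URI.join` from Ruby's standard library. The `resolve_url` helper and the example URLs are hypothetical; this page does not show how the extractor itself performs resolution.

```ruby
require "uri"

# Hypothetical helper (not part of Archaeo): resolves href against base_url
# and returns href unchanged when no base is given or resolution fails.
def resolve_url(href, base_url)
  return href if base_url.nil?

  URI.join(base_url, href).to_s
rescue URI::Error
  # Malformed input or a relative base: fall back to the original reference.
  href
end

base = "https://example.com/blog/post.html"

relative = resolve_url("../css/site.css", base)
absolute = resolve_url("https://cdn.example.net/app.js", base)
scheme_relative = resolve_url("//cdn.example.net/app.js", base)
without_base = resolve_url("css/site.css", nil)
```

Falling back to the original reference keeps the asset in the result even when it cannot be made absolute, which is one reasonable choice for archived pages with messy markup.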

Instance Method Details

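#extract ⇒ Object

Runs each extraction pass in order (CSS, JS, images, fonts, media, inline CSS, inline styles, preloads), collecting the results into a new AssetList, and returns that list.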



# File 'lib/archaeo/asset_extractor.rb', line 38

def extract
  list = AssetList.new
  extract_css(list)
  extract_js(list)
  extract_images(list)
  extract_fonts(list)
  extract_media(list)
  extract_inline_css(list)
  extract_inline_styles(list)
  extract_preloads(list)
  list
end
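
The method above follows a simple accumulator pattern: one list object is created, handed to each pass in turn, and returned at the end. The sketch below reproduces that shape with standard-library Ruby only. `SketchAssetList`, `SketchExtractor`, the `add(type, url)` method, and the regex-based passes are all hypothetical stand-ins; the real class parses with Nokogiri, and the `AssetList` interface is not documented on this page.

```ruby
# Hypothetical stand-in for AssetList; its real interface is not shown here.
class SketchAssetList
  attr_reader :entries

  def initialize
    @entries = []
  end

  # Records one asset as a [type, url] pair.
  def add(type, url)
    @entries << [type, url]
    self
  end
end

# Hypothetical extractor that uses regexes in place of a DOM parser.
class SketchExtractor
  def initialize(html)
    @html = html.to_s
  end

  # Same shape as #extract above: build a list, run each pass, return it.
  def extract
    list = SketchAssetList.new
    extract_css(list)
    extract_js(list)
    list
  end

  private

  def extract_css(list)
    @html.scan(/<link[^>]+href=["']([^"']+\.css)["']/i) { |(href)| list.add(:css, href) }
  end

  def extract_js(list)
    @html.scan(/<script[^>]+src=["']([^"']+)["']/i) { |(src)| list.add(:js, src) }
  end
end

html = '<link rel="stylesheet" href="a.css"><script src="b.js"></script>'
result = SketchExtractor.new(html).extract.entries
```

Because every pass writes into the same list, adding a new asset category only requires one more private method and one more call in `extract`.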