Class: Archaeo::AssetExtractor

Inherits:
Object
Defined in:
lib/archaeo/asset_extractor.rb

Overview

Extracts resource URLs from archived HTML content using Nokogiri.

Parses the HTML DOM to find CSS, JavaScript, images, fonts, and media resources referenced by the page. Optionally resolves relative URLs against a base URL.
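The optional base-URL resolution mentioned above can be illustrated with Ruby's standard `URI.join`. This is only a sketch: `resolve_url` is a hypothetical helper written for this example, not a method of `Archaeo::AssetExtractor`, and the class may resolve URLs differently.

```ruby
require "uri"

# Hypothetical helper (not part of Archaeo): resolve a possibly relative
# reference against an optional base URL.
def resolve_url(ref, base_url)
  return ref if base_url.nil? # no base given: leave the reference as-is

  URI.join(base_url, ref).to_s
rescue URI::Error
  ref # unparseable input: return the reference untouched
end

base = "https://example.com/blog/post.html"

resolved_relative     = resolve_url("../css/site.css", base)
resolved_absolute     = resolve_url("https://cdn.example.net/app.js", base)
resolved_without_base = resolve_url("img/logo.png", nil)

puts resolved_relative     # https://example.com/css/site.css
puts resolved_absolute     # https://cdn.example.net/app.js
puts resolved_without_base # img/logo.png
```

Absolute references pass through unchanged because `URI.join` only applies the base to relative references.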

Constant Summary

FONT_CDN_PATTERNS =
%w[
  fonts.googleapis.com
  fonts.gstatic.com
  use.typekit.net
  fast.fonts.net
  cloud.typography.com
].freeze
CSS_URL_PATTERN =
/url\(\s*['"]?([^'")\s]+)['"]?\s*\)/
CSS_IMAGE_PROPS =
Regexp.new(
  "(?:background-image|background|list-style-image|content|cursor)" \
  "\\s*:[^;]*#{CSS_URL_PATTERN.source}",
)
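To see what these constants match, the sketch below copies them verbatim and runs them against a small stylesheet. The sample CSS and the `font_cdn?` helper are assumptions made for illustration; how the class itself applies `FONT_CDN_PATTERNS` is not shown in this page.

```ruby
require "uri"

# Constants copied from the summary above.
FONT_CDN_PATTERNS = %w[
  fonts.googleapis.com
  fonts.gstatic.com
  use.typekit.net
  fast.fonts.net
  cloud.typography.com
].freeze
CSS_URL_PATTERN = /url\(\s*['"]?([^'")\s]+)['"]?\s*\)/
CSS_IMAGE_PROPS = Regexp.new(
  "(?:background-image|background|list-style-image|content|cursor)" \
  "\\s*:[^;]*#{CSS_URL_PATTERN.source}",
)

# Sample stylesheet (made up for this example).
css = <<~CSS
  ul { list-style-image: url('dot.svg'); }
  @font-face { src: url(font.woff2); }
CSS

# CSS_URL_PATTERN captures every url(...) target, quoted or not.
all_urls = css.scan(CSS_URL_PATTERN).flatten

# CSS_IMAGE_PROPS only captures url(...) targets that follow an
# image-bearing property within the same declaration.
image_urls = css.scan(CSS_IMAGE_PROPS).flatten

# Assumed usage: a URL counts as a font CDN resource when its host
# contains one of the listed names.
def font_cdn?(url)
  host = URI.parse(url).host.to_s
  FONT_CDN_PATTERNS.any? { |pattern| host.include?(pattern) }
end

p all_urls   # ["dot.svg", "font.woff2"]
p image_urls # ["dot.svg"]
```

Because `[^;]*` stops at the first semicolon, `CSS_IMAGE_PROPS` does not reach across declarations, which is why the `src:` URL is excluded from `image_urls`.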

Instance Method Summary

Constructor Details

#initialize(html, base_url: nil) ⇒ AssetExtractor

Returns a new instance of AssetExtractor.



# File 'lib/archaeo/asset_extractor.rb', line 27

def initialize(html, base_url: nil)
  @doc = Nokogiri::HTML(html.to_s)
  @base_url = base_url
end

Instance Method Details

#extract ⇒ Object

Runs each extraction step in turn against a new AssetList and returns the populated list.



# File 'lib/archaeo/asset_extractor.rb', line 32

def extract
  list = AssetList.new
  extract_css(list)
  extract_js(list)
  extract_images(list)
  extract_fonts(list)
  extract_media(list)
  extract_inline_css(list)
  extract_inline_styles(list)
  list
end
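The structure of `#extract`, a single collector passed through a fixed sequence of extraction steps, can be sketched without the gem. Everything below is a hypothetical stand-in: `SketchAssetList` and `SketchExtractor` are invented for this example, the real `AssetList` API is not shown on this page, and naive regexes replace Nokogiri so the sketch needs only the standard library.

```ruby
# Hypothetical stand-in for AssetList; the real class's API is not shown here.
class SketchAssetList
  attr_reader :entries

  def initialize
    @entries = []
  end

  # Record a URL under a category, skipping exact duplicates.
  def add(kind, url)
    entry = [kind, url]
    @entries << entry unless @entries.include?(entry)
  end
end

# Hypothetical extractor mirroring the collector pattern of #extract.
# It uses simple regexes rather than a DOM parser, so it is far less
# robust than a Nokogiri-based implementation.
class SketchExtractor
  def initialize(html)
    @html = html.to_s # nil becomes "", as in the real constructor
  end

  # Each step appends to the same list; the list is returned at the end.
  def extract
    list = SketchAssetList.new
    extract_css(list)
    extract_js(list)
    list
  end

  private

  def extract_css(list)
    @html.scan(/<link[^>]+href=["']([^"']+\.css)["']/i) { |(href)| list.add(:css, href) }
  end

  def extract_js(list)
    @html.scan(/<script[^>]+src=["']([^"']+)["']/i) { |(src)| list.add(:js, src) }
  end
end

html = '<link rel="stylesheet" href="site.css"><script src="app.js"></script>'
assets = SketchExtractor.new(html).extract
p assets.entries # [[:css, "site.css"], [:js, "app.js"]]
```

Keeping each resource type in its own step means a new type can be supported by adding one method and one call in `extract`.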