Class: Archaeo::AssetExtractor

Inherits:
Object
Defined in:
lib/archaeo/asset_extractor.rb

Overview

Extracts resource URLs from archived HTML content using Nokogiri.

Parses the HTML DOM to find CSS, JavaScript, images, fonts, and media resources referenced by the page. Optionally resolves relative URLs against a base URL.
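The optional base-URL resolution can be illustrated with Ruby's standard library alone. This is a minimal sketch, not the gem's implementation: `resolve_asset_url` is a hypothetical helper, the example URLs are invented, and it assumes that a missing base URL leaves the reference untouched.

```ruby
require "uri"

# Hypothetical helper (not part of Archaeo): resolve an asset reference
# against an optional base URL, in the spirit of the base_url: option.
def resolve_asset_url(ref, base_url = nil)
  return ref if base_url.nil? # assumption: no base URL means no resolution

  # URI.join handles relative, root-relative, and already-absolute references.
  URI.join(base_url, ref).to_s
rescue URI::Error
  ref # assumption: references that cannot be parsed are kept as-is
end

base = "https://example.com/blog/post.html"
puts resolve_asset_url("css/site.css", base)                  # relative to /blog/
puts resolve_asset_url("/js/app.js", base)                    # relative to the host root
puts resolve_asset_url("https://cdn.example.net/a.png", base) # already absolute
puts resolve_asset_url("img/logo.png")                        # no base given
```

Resolving at extraction time means callers receive URLs they can fetch directly, at the cost of losing the reference exactly as it was written in the archived page.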

Instance Method Summary

Constructor Details

#initialize(html, base_url: nil) ⇒ AssetExtractor

Returns a new instance of AssetExtractor.



# File 'lib/archaeo/asset_extractor.rb', line 13

def initialize(html, base_url: nil)
  @doc = Nokogiri::HTML(html.to_s)
  @base_url = base_url
end

Instance Method Details

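#extract ⇒ Object

Builds a new AssetList, runs each extraction pass over the parsed document (CSS, JavaScript, images, fonts, media, then inline CSS), and returns the populated list.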



# File 'lib/archaeo/asset_extractor.rb', line 18

def extract
  list = AssetList.new
  extract_css(list)
  extract_js(list)
  extract_images(list)
  extract_fonts(list)
  extract_media(list)
  extract_inline_css(list)
  list
end
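The method body follows a collector pattern: one list is created, each extraction pass appends to it, and the list is returned. The stand-alone sketch below shows that pattern only. `SketchAssetList` and `sketch_extract` are hypothetical stand-ins, not the gem's `AssetList` or its private `extract_*` methods, whose behavior is not shown on this page, and the input is a pre-tagged array rather than parsed HTML.

```ruby
# Hypothetical stand-in for AssetList: groups URLs by type, skipping duplicates.
class SketchAssetList
  def initialize
    @assets = Hash.new { |hash, type| hash[type] = [] }
  end

  # Record a URL under the given type and return self so calls can be chained.
  def add(type, url)
    @assets[type] << url unless @assets[type].include?(url)
    self
  end

  # Return a plain copy of the collected assets.
  def to_h
    @assets.transform_values(&:dup)
  end
end

# Each pass walks the input and appends matches to the shared list.
# The "document" here is an array of [kind, url] pairs, an assumption
# made to keep the sketch free of HTML parsing.
def sketch_extract(references)
  list = SketchAssetList.new
  %i[css js image].each do |kind|
    references.each { |ref_kind, url| list.add(kind, url) if ref_kind == kind }
  end
  list
end

refs = [[:css, "a.css"], [:js, "app.js"], [:css, "a.css"], [:image, "logo.png"]]
p sketch_extract(refs).to_h
```

Passing one list through every pass keeps each pass small and lets the list, rather than the passes, decide how to handle duplicates.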