Module: DurableHuggingfaceHub::FileDownload
- Defined in:
- lib/durable_huggingface_hub/file_download.rb
Overview
File download functionality with caching and ETag support.
This module provides utilities for downloading files from the HuggingFace Hub with intelligent caching, resume support, and validation using ETags.
Constant Summary
- DEFAULT_CACHE_DIR =
Default cache directory location
Pathname.new(Dir.home).join(".cache", "huggingface", "hub")
- METADATA_FILENAME =
Metadata file name for cache entries
".metadata.json"
- LOCK_SUFFIX =
Lock file suffix for atomic operations
".lock"
Class Method Summary
-
.copy_snapshot_to_local_dir(snapshot_folder, local_dir_path) ⇒ Object
Copies snapshot directory to local directory.
-
.download_file(repo_id:, filename:, repo_type:, revision:, storage_folder:, force_download:, token:, headers:, progress:) ⇒ Pathname
Downloads a file and stores it in the cache.
-
.download_files_parallel(repo_id:, files:, repo_type:, revision:, cache_dir:, force_download:, token:, max_workers:, progress:) ⇒ Object
Downloads multiple files in parallel using threads.
-
.download_to_blob(client, url_path, blob_path, metadata, progress) ⇒ Object
Downloads a file to blob storage.
-
.ensure_snapshot_link(blob_path, snapshot_path) ⇒ Object
Ensures a symlink exists from snapshot to blob.
-
.extract_etag(etag) ⇒ String?
Extracts clean ETag from header value.
-
.filter_repo_files(files, allow_patterns: nil, ignore_patterns: nil) ⇒ Array<String>
Filters files based on glob patterns.
-
.find_cached_file(storage_folder, filename, revision) ⇒ Pathname?
Finds a cached file for a specific revision.
-
.get_file_metadata(client, url_path) ⇒ Hash
Gets metadata about a file from the Hub.
-
.get_storage_folder(repo_id, repo_type: "model", cache_dir: nil) ⇒ Pathname
Gets the cache directory for a repository.
-
.hf_hub_download(repo_id:, filename:, repo_type: "model", revision: nil, cache_dir: nil, force_download: false, token: nil, local_files_only: false, headers: nil, progress: nil) ⇒ Pathname
Downloads a file from the HuggingFace Hub with caching.
-
.hf_hub_url(repo_id:, filename:, repo_type: "model", revision: "main", endpoint: nil) ⇒ String
Generate the HuggingFace Hub URL for a file in a repository.
-
.resolve_cache_dir(cache_dir) ⇒ Pathname
Resolves the cache directory to use.
-
.snapshot_download(repo_id:, repo_type: "model", revision: nil, cache_dir: nil, local_dir: nil, force_download: false, token: nil, local_files_only: false, allow_patterns: nil, ignore_patterns: nil, max_workers: 8, progress: nil) ⇒ Pathname
Downloads an entire repository snapshot from the HuggingFace Hub with caching.
-
.try_to_load_from_cache(repo_id:, filename:, repo_type: "model", revision: "main", cache_dir: nil) ⇒ Pathname?
Try to load a file from cache without downloading.
-
.update_refs(storage_folder, revision, commit_hash) ⇒ Object
Updates refs to point to the latest commit hash.
-
.verify_blob(blob_path, etag) ⇒ Boolean
Verifies a blob file matches the expected ETag.
-
.write_blob_metadata(blob_path, metadata) ⇒ Object
Writes metadata for a blob file.
Class Method Details
.copy_snapshot_to_local_dir(snapshot_folder, local_dir_path) ⇒ Object
Copies snapshot directory to local directory.
# File 'lib/durable_huggingface_hub/file_download.rb', line 728

def self.copy_snapshot_to_local_dir(snapshot_folder, local_dir_path)
  return unless snapshot_folder.exist?

  FileUtils.mkdir_p(local_dir_path)

  # Copy all files and directories from snapshot to local_dir
  snapshot_folder.children.each do |entry|
    dest = local_dir_path.join(entry.basename)

    if entry.symlink?
      # For symlinks, copy the actual file content
      target = entry.readlink
      target = entry.dirname.join(target) unless target.absolute?
      FileUtils.cp(target, dest) if target.file?
    elsif entry.directory?
      FileUtils.cp_r(entry, dest)
    elsif entry.file?
      FileUtils.cp(entry, dest)
    end
  end
end
.download_file(repo_id:, filename:, repo_type:, revision:, storage_folder:, force_download:, token:, headers:, progress:) ⇒ Pathname
Downloads a file and stores it in the cache.
# File 'lib/durable_huggingface_hub/file_download.rb', line 427

def self.download_file(
  repo_id:,
  filename:,
  repo_type:,
  revision:,
  storage_folder:,
  force_download:,
  token:,
  headers:,
  progress:
)
  # Create HTTP client
  client = Utils::HttpClient.new(token: token, headers: headers)

  # Build URL for file
  url_path = "/#{repo_type}s/#{repo_id}/resolve/#{revision}/#{filename}"

  # Get metadata about the file (including ETag and commit hash)
  metadata = get_file_metadata(client, url_path)

  # Determine final storage location
  commit_hash = metadata[:commit_hash] || revision
  blob_path = storage_folder.join("blobs", metadata[:etag])
  snapshot_path = storage_folder.join("snapshots", commit_hash, filename)

  # Check if we already have this file (by ETag or snapshot file)
  unless force_download
    if blob_path.exist? && verify_blob(blob_path, metadata[:etag])
      # File exists in blob storage, create symlink if needed
      ensure_snapshot_link(blob_path, snapshot_path)
      update_refs(storage_folder, revision, commit_hash)
      return snapshot_path
    elsif snapshot_path.exist?
      # File exists in snapshot, assume it's valid
      update_refs(storage_folder, revision, commit_hash)
      return snapshot_path
    end
  end

  # Use the redirect-resolved URL so the streaming GET never sees a 3xx.
  download_url = metadata[:resolved_url] || url_path

  # Download the file to blob storage
  download_to_blob(client, download_url, blob_path, metadata, progress)

  # Create snapshot symlink
  ensure_snapshot_link(blob_path, snapshot_path)

  # Update refs
  update_refs(storage_folder, revision, commit_hash)

  snapshot_path
end
.download_files_parallel(repo_id:, files:, repo_type:, revision:, cache_dir:, force_download:, token:, max_workers:, progress:) ⇒ Object
Downloads multiple files in parallel using threads.
# File 'lib/durable_huggingface_hub/file_download.rb', line 663

def self.download_files_parallel(
  repo_id:,
  files:,
  repo_type:,
  revision:,
  cache_dir:,
  force_download:,
  token:,
  max_workers:,
  progress:
)
  require "thread"

  # Create a queue of files to download
  queue = Queue.new
  files.each { |file| queue << file }

  # Track completed downloads
  completed = 0
  total = files.length
  mutex = Mutex.new

  # Create worker threads
  threads = Array.new([max_workers, files.length].min) do
    Thread.new do
      loop do
        file = begin
          queue.pop(true)
        rescue ThreadError
          break # Queue is empty
        end

        begin
          hf_hub_download(
            repo_id: repo_id,
            filename: file,
            repo_type: repo_type,
            revision: revision,
            cache_dir: cache_dir,
            force_download: force_download,
            token: token,
            local_files_only: false,
            progress: nil # Individual file progress not supported in parallel mode
          )

          mutex.synchronize do
            completed += 1
            progress&.call(completed, total, (completed.to_f / total * 100).round(2))
          end
        rescue => e
          warn "Failed to download #{file}: #{e.message}"
          # Continue with other files
        end
      end
    end
  end

  # Wait for all threads to complete
  threads.each(&:join)
end
.download_to_blob(client, url_path, blob_path, metadata, progress) ⇒ Object
Downloads a file to blob storage.
# File 'lib/durable_huggingface_hub/file_download.rb', line 538

def self.download_to_blob(client, url_path, blob_path, metadata, progress)
  # Ensure blobs directory exists
  blob_path.dirname.mkpath

  # Download to temporary file first (atomic operation)
  temp_path = Pathname.new("#{blob_path}.tmp.#{Process.pid}")

  # Create progress tracker
  progress_tracker =
    if progress
      Utils::Progress.new(total: metadata[:size], callback: progress)
    else
      Utils::NullProgress.new
    end

  begin
    File.open(temp_path, "wb") do |f|
      client.request(:get, url_path) do |req|
        req.options.on_data = proc do |chunk, _overall_received_bytes, _env|
          f.write(chunk)
          progress_tracker.update(chunk.bytesize)
        end
      end
    end

    # Verify download
    unless temp_path.exist? && temp_path.size.positive?
      raise DurableHuggingfaceHubError, "Download failed: file is empty or missing"
    end

    # Move to final location atomically
    FileUtils.mv(temp_path, blob_path)

    # Mark progress as finished
    progress_tracker.finish

    # Write metadata
    write_blob_metadata(blob_path, metadata)
  ensure
    # Clean up temp file if it still exists
    temp_path.unlink if temp_path.exist?
  end
end
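The write-to-temp-then-rename pattern above is what keeps the cache consistent if a download is interrupted. A minimal standalone sketch of that pattern (the `atomic_write` name and the `chunks` stand-in for the HTTP stream are illustrative, not the gem's API):

```ruby
require "fileutils"
require "pathname"
require "tmpdir"

# Stream chunks into a PID-suffixed temp file, verify it is non-empty,
# then rename into place. rename is atomic on the same filesystem, so
# readers never observe a partially written blob.
def atomic_write(blob_path, chunks)
  blob_path = Pathname.new(blob_path)
  blob_path.dirname.mkpath
  temp_path = Pathname.new("#{blob_path}.tmp.#{Process.pid}")
  begin
    File.open(temp_path, "wb") { |f| chunks.each { |c| f.write(c) } }
    raise "empty download" unless temp_path.size.positive?
    FileUtils.mv(temp_path, blob_path)
  ensure
    # Only present if the write or verification failed
    temp_path.unlink if temp_path.exist?
  end
  blob_path
end
```

A failed run leaves no temp files behind thanks to the `ensure` clause, which mirrors the cleanup in `download_to_blob`.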
.ensure_snapshot_link(blob_path, snapshot_path) ⇒ Object
Ensures a symlink exists from snapshot to blob.
# File 'lib/durable_huggingface_hub/file_download.rb', line 594

def self.ensure_snapshot_link(blob_path, snapshot_path)
  # Create snapshot directory if needed
  snapshot_path.dirname.mkpath

  # Remove existing file/link if present
  snapshot_path.unlink if snapshot_path.exist? || snapshot_path.symlink?

  # Create relative symlink
  relative_blob_path = blob_path.relative_path_from(snapshot_path.dirname)
  snapshot_path.make_symlink(relative_blob_path)
rescue NotImplementedError
  # System doesn't support symlinks, copy instead
  FileUtils.cp(blob_path, snapshot_path)
end
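The symlink is deliberately relative, so the whole cache directory can be moved or mounted elsewhere without breaking snapshot entries. A self-contained sketch of the same logic (the `link_snapshot` name is illustrative):

```ruby
require "pathname"
require "fileutils"
require "tmpdir"

# Link snapshots/<rev>/<file> to blobs/<etag> via a relative path,
# falling back to a plain copy on filesystems without symlink support.
def link_snapshot(blob_path, snapshot_path)
  snapshot_path.dirname.mkpath
  snapshot_path.unlink if snapshot_path.exist? || snapshot_path.symlink?
  relative = blob_path.relative_path_from(snapshot_path.dirname)
  snapshot_path.make_symlink(relative)
rescue NotImplementedError
  FileUtils.cp(blob_path, snapshot_path)
end
```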
.extract_etag(etag) ⇒ String?
Extracts clean ETag from header value.
# File 'lib/durable_huggingface_hub/file_download.rb', line 506

def self.extract_etag(etag)
  return nil unless etag

  # Remove quotes and W/ prefix
  etag = etag.gsub(/^W\//, "").gsub(/^"/, "").gsub(/"$/, "")
  etag.empty? ? nil : etag
end
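Servers may send weak validators like `W/"abc"` or quoted strong ETags like `"abc"`; the normalization above reduces both to the bare hex value used as the blob filename. An equivalent standalone version (assuming, as the original does, that Hub ETags contain no inner quotes):

```ruby
# Strip the weak-validator prefix and surrounding quotes from an
# HTTP ETag header value; return nil for missing or empty values.
def clean_etag(raw)
  return nil unless raw
  etag = raw.sub(/\AW\//, "").delete_prefix('"').delete_suffix('"')
  etag.empty? ? nil : etag
end
```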
.filter_repo_files(files, allow_patterns: nil, ignore_patterns: nil) ⇒ Array<String>
Filters files based on glob patterns.
# File 'lib/durable_huggingface_hub/file_download.rb', line 630

def self.filter_repo_files(files, allow_patterns: nil, ignore_patterns: nil)
  filtered = files

  # Apply allow_patterns if specified
  if allow_patterns
    patterns = Array(allow_patterns)
    filtered = filtered.select do |filename|
      patterns.any? { |pattern| File.fnmatch(pattern, filename, File::FNM_PATHNAME) }
    end
  end

  # Apply ignore_patterns if specified
  if ignore_patterns
    patterns = Array(ignore_patterns)
    filtered = filtered.reject do |filename|
      patterns.any? { |pattern| File.fnmatch(pattern, filename, File::FNM_PATHNAME) }
    end
  end

  filtered
end
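Because the matching uses `File::FNM_PATHNAME`, `*` does not cross `/` separators, so `*.json` matches only top-level files while `**/*.json` matches nested ones. A small demonstration (the pattern list is an example, not part of the gem):

```ruby
# With FNM_PATHNAME, "*" stops at "/" while "**/" spans directories.
ALLOW = ["*.json", "**/*.safetensors"].freeze

def allowed?(filename)
  ALLOW.any? { |pat| File.fnmatch(pat, filename, File::FNM_PATHNAME) }
end
```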
.find_cached_file(storage_folder, filename, revision) ⇒ Pathname?
Finds a cached file for a specific revision.
# File 'lib/durable_huggingface_hub/file_download.rb', line 386

def self.find_cached_file(storage_folder, filename, revision)
  # Look for snapshot folder for this revision
  snapshots_folder = storage_folder.join("snapshots")
  return nil unless snapshots_folder.exist?

  # Try to find by revision folder
  revision_folder = snapshots_folder.join(revision)
  if revision_folder.exist?
    file_path = revision_folder.join(filename)
    return file_path if file_path.exist?
  end

  # Try to find in refs folder
  refs_folder = storage_folder.join("refs")
  if refs_folder.exist?
    ref_file = refs_folder.join(revision)
    if ref_file.exist?
      commit_hash = ref_file.read.strip
      commit_folder = snapshots_folder.join(commit_hash)
      if commit_folder.exist?
        file_path = commit_folder.join(filename)
        return file_path if file_path.exist?
      end
    end
  end

  nil
end
.get_file_metadata(client, url_path) ⇒ Hash
Gets metadata about a file from the Hub.
# File 'lib/durable_huggingface_hub/file_download.rb', line 486

def self.get_file_metadata(client, url_path)
  response = client.head(url_path)
  headers = response.headers

  # After following redirects, env[:url] holds the final resolved URL.
  # We store it so the subsequent streaming GET can target it directly,
  # bypassing the redirect entirely (on_data fires below middleware).
  resolved_url = response.env[:url].to_s

  {
    etag: extract_etag(headers["etag"] || headers["x-linked-etag"]),
    size: headers["x-linked-size"]&.to_i,
    commit_hash: headers["x-repo-commit"],
    resolved_url: resolved_url
  }
end
.get_storage_folder(repo_id, repo_type: "model", cache_dir: nil) ⇒ Pathname
Gets the cache directory for a repository.
# File 'lib/durable_huggingface_hub/file_download.rb', line 349

def self.get_storage_folder(repo_id, repo_type: "model", cache_dir: nil)
  cache_dir = resolve_cache_dir(cache_dir)

  # Create a unique folder name based on repo_id and type
  # Format: models--namespace--name or models--name
  repo_id_parts = repo_id.split("/")
  folder_name =
    if repo_id_parts.length == 2
      "#{repo_type}s--#{repo_id_parts[0]}--#{repo_id_parts[1]}"
    else
      "#{repo_type}s--#{repo_id}"
    end

  cache_dir.join(folder_name)
end
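The naming scheme flattens `namespace/name` repo IDs into a single filesystem-safe folder name with `--` separators. Since both branches of the `if` above amount to joining the slash-separated parts, the rule can be captured in one line (this helper is illustrative, not the gem's API):

```ruby
# "models--google--flan-t5-small" for "google/flan-t5-small",
# "models--gpt2" for an un-namespaced repo.
def storage_folder_name(repo_id, repo_type = "model")
  "#{repo_type}s--#{repo_id.split("/").join("--")}"
end
```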
.hf_hub_download(repo_id:, filename:, repo_type: "model", revision: nil, cache_dir: nil, force_download: false, token: nil, local_files_only: false, headers: nil, progress: nil) ⇒ Pathname
Downloads a file from the HuggingFace Hub with caching.
This method downloads a file from a HuggingFace Hub repository and caches it locally. It uses ETags to avoid re-downloading unchanged files and supports atomic operations to prevent cache corruption.
# File 'lib/durable_huggingface_hub/file_download.rb', line 83

def self.hf_hub_download(
  repo_id:,
  filename:,
  repo_type: "model",
  revision: nil,
  cache_dir: nil,
  force_download: false,
  token: nil,
  local_files_only: false,
  headers: nil,
  progress: nil
)
  # Validate inputs
  repo_id = Utils::Validators.validate_repo_id(repo_id)
  filename = Utils::Validators.validate_filename(filename)
  repo_type = Utils::Validators.validate_repo_type(repo_type)
  revision = Utils::Validators.validate_revision(revision) if revision

  # Get cache directory
  cache_dir = resolve_cache_dir(cache_dir)

  # Build storage paths
  storage_folder = get_storage_folder(repo_id, repo_type: repo_type, cache_dir: cache_dir)
  revision ||= "main"

  # Check if we can use local files only
  if local_files_only
    cached_path = find_cached_file(storage_folder, filename, revision)
    if cached_path
      return cached_path
    else
      raise LocalEntryNotFoundError.new(
        "File #{filename} not found in local cache for #{repo_id}@#{revision}. " \
        "Cannot download because local_files_only=true"
      )
    end
  end

  # Get token for authentication
  token = Utils::Auth.get_token(token: token)

  # Download or retrieve from cache
  download_file(
    repo_id: repo_id,
    filename: filename,
    repo_type: repo_type,
    revision: revision,
    storage_folder: storage_folder,
    force_download: force_download,
    token: token,
    headers: headers,
    progress: progress
  )
end
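The on-disk layout this method maintains has three parts: `blobs/<etag>` holds content, `snapshots/<commit>/<file>` symlinks into it, and `refs/<branch>` maps a branch name to a commit hash. A self-contained sketch of that layout and the lookup through it (both helper names are illustrative, and `seed_cache_entry` fabricates an ETag for the demo):

```ruby
require "pathname"
require "tmpdir"

# Build one cache entry: blob, snapshot symlink, and ref file.
def seed_cache_entry(store, revision:, commit:, filename:, content:)
  blob = store.join("blobs", "etag-#{content.hash.abs}")
  blob.dirname.mkpath
  blob.write(content)
  snap = store.join("snapshots", commit, filename)
  snap.dirname.mkpath
  snap.make_symlink(blob.relative_path_from(snap.dirname))
  ref = store.join("refs", revision)
  ref.dirname.mkpath
  ref.write(commit)
end

# Resolve a branch name through refs/ to a snapshot file, simplified
# from the refs lookup in find_cached_file.
def resolve_cached(store, revision, filename)
  ref = store.join("refs", revision)
  commit = ref.exist? ? ref.read.strip : revision
  path = store.join("snapshots", commit, filename)
  path.exist? ? path : nil
end
```

Reading a file for `"main"` thus goes ref file, then commit hash, then snapshot symlink, then blob content, which is why a cache hit needs no network access at all.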
.hf_hub_url(repo_id:, filename:, repo_type: "model", revision: "main", endpoint: nil) ⇒ String
Generate the HuggingFace Hub URL for a file in a repository.
# File 'lib/durable_huggingface_hub/file_download.rb', line 821

def self.hf_hub_url(
  repo_id:,
  filename:,
  repo_type: "model",
  revision: "main",
  endpoint: nil
)
  repo_id = Utils::Validators.validate_repo_id(repo_id)
  filename = Utils::Validators.validate_filename(filename)
  repo_type = Utils::Validators.validate_repo_type(repo_type)
  revision = Utils::Validators.validate_revision(revision)

  endpoint ||= DurableHuggingfaceHub.configuration.endpoint
  endpoint = endpoint.chomp("/")

  "#{endpoint}/#{repo_type}s/#{repo_id}/resolve/#{revision}/#{filename}"
end
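Stripped of validation, the URL construction is plain string interpolation over the same components. A standalone sketch of the shape this method produces (assuming the default endpoint; `resolve_url` is an illustrative name, not the gem's API):

```ruby
# Mirror the "<endpoint>/<type>s/<repo>/resolve/<rev>/<file>" shape
# built by hf_hub_url, with the trailing slash on endpoint trimmed.
def resolve_url(repo_id, filename, repo_type: "model", revision: "main",
                endpoint: "https://huggingface.co")
  "#{endpoint.chomp("/")}/#{repo_type}s/#{repo_id}/resolve/#{revision}/#{filename}"
end
```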
.resolve_cache_dir(cache_dir) ⇒ Pathname
Resolves the cache directory to use.
# File 'lib/durable_huggingface_hub/file_download.rb', line 368

def self.resolve_cache_dir(cache_dir)
  if cache_dir
    Utils::Paths.expand(cache_dir)
  elsif ENV["HF_HOME"]
    Pathname.new(ENV["HF_HOME"]).join("hub")
  elsif ENV["HUGGINGFACE_HUB_CACHE"]
    Pathname.new(ENV["HUGGINGFACE_HUB_CACHE"])
  else
    DEFAULT_CACHE_DIR
  end
end
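The precedence is: an explicit `cache_dir` argument wins, then `HF_HOME` (with `hub` appended), then `HUGGINGFACE_HUB_CACHE`, then the built-in default. A testable standalone sketch that takes the environment as a plain hash (`pick_cache_dir` is an illustrative name):

```ruby
require "pathname"

# Same four-way precedence as resolve_cache_dir, parameterized over
# the environment so it can be exercised without mutating ENV.
def pick_cache_dir(cache_dir, env, default)
  if cache_dir
    Pathname.new(cache_dir)
  elsif env["HF_HOME"]
    Pathname.new(env["HF_HOME"]).join("hub")
  elsif env["HUGGINGFACE_HUB_CACHE"]
    Pathname.new(env["HUGGINGFACE_HUB_CACHE"])
  else
    default
  end
end
```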
.snapshot_download(repo_id:, repo_type: "model", revision: nil, cache_dir: nil, local_dir: nil, force_download: false, token: nil, local_files_only: false, allow_patterns: nil, ignore_patterns: nil, max_workers: 8, progress: nil) ⇒ Pathname
Downloads an entire repository snapshot from the HuggingFace Hub with caching.
This method downloads all files from a HuggingFace Hub repository for a given revision and stores them in a local cache directory. It leverages `hf_hub_download` for individual file downloads and supports filtering by patterns.
The method implements robust offline fallback: if the Hub is unavailable or network is down, it will try to use locally cached files. It properly handles commit hash resolution for branches and tags.
# File 'lib/durable_huggingface_hub/file_download.rb', line 187

def self.snapshot_download(
  repo_id:,
  repo_type: "model",
  revision: nil,
  cache_dir: nil,
  local_dir: nil,
  force_download: false,
  token: nil,
  local_files_only: false,
  allow_patterns: nil,
  ignore_patterns: nil,
  max_workers: 8,
  progress: nil
)
  # Validate inputs
  repo_id = Utils::Validators.validate_repo_id(repo_id)
  repo_type = Utils::Validators.validate_repo_type(repo_type)
  revision = Utils::Validators.validate_revision(revision) if revision
  revision ||= "main"

  # Get cache directory and storage folder
  cache_dir = resolve_cache_dir(cache_dir)
  storage_folder = get_storage_folder(repo_id, repo_type: repo_type, cache_dir: cache_dir)

  # Get token for authentication
  token = Utils::Auth.get_token(token: token)

  # Try to fetch repository info from Hub
  repo_info = nil
  api_call_error = nil

  unless local_files_only
    begin
      # Initialize HfApi client
      api = HfApi.new(token: token)
      repo_info = api.repo_info(repo_id, repo_type: repo_type, revision: revision)
    rescue StandardError => e
      # Store error but continue - we might be able to use cached files
      api_call_error = e
    end
  end

  # If we couldn't get repo_info, try to use cached files
  if repo_info.nil?
    # Try to resolve commit hash from revision
    commit_hash = nil

    # Check if revision is already a commit hash
    if revision.match?(/^[0-9a-f]{40}$/)
      commit_hash = revision
    else
      # Try to read commit hash from refs
      ref_file = storage_folder.join("refs", revision)
      commit_hash = ref_file.read.strip if ref_file.exist?
    end

    # Try to locate snapshot folder for this commit hash
    if commit_hash && local_dir.nil?
      snapshot_folder = storage_folder.join("snapshots", commit_hash)
      # Snapshot folder exists => return it
      return snapshot_folder if snapshot_folder.exist? && snapshot_folder.directory?
    end

    # If local_dir is specified and exists, return it
    if local_dir
      local_dir_path = Utils::Paths.expand(local_dir)
      if local_dir_path.exist? && local_dir_path.directory? && !local_dir_path.children.empty?
        warn "Returning existing local_dir #{local_dir_path} as remote repo cannot be accessed"
        return local_dir_path
      end
    end

    # Could not find cached files - raise appropriate error
    if local_files_only
      raise LocalEntryNotFoundError.new(
        "Cannot find an appropriate cached snapshot folder for #{repo_id}@#{revision}. " \
        "To enable downloads, set local_files_only=false"
      )
    elsif api_call_error.is_a?(RepositoryNotFoundError) || api_call_error.is_a?(RevisionNotFoundError)
      raise api_call_error
    else
      raise LocalEntryNotFoundError.new(
        "An error occurred while trying to locate files on the Hub, and we cannot find " \
        "the appropriate snapshot folder for #{repo_id}@#{revision} in the local cache. " \
        "Please check your internet connection and try again. Error: #{api_call_error&.message}"
      )
    end
  end

  # At this point, we have repo_info with a valid commit hash
  commit_hash = repo_info.sha
  raise DurableHuggingfaceHubError, "Repo info must have a commit SHA" unless commit_hash

  # Determine snapshot folder
  snapshot_folder = storage_folder.join("snapshots", commit_hash)

  # Store ref if revision is not a commit hash
  update_refs(storage_folder, revision, commit_hash) if revision != commit_hash

  # Get list of files from repo_info
  all_files =
    if repo_info.respond_to?(:siblings) && repo_info.siblings
      repo_info.siblings.map { |sibling| sibling[:rfilename] || sibling["rfilename"] }.compact
    else
      # Fallback to API call if siblings not available
      api.list_repo_files(repo_id: repo_id, repo_type: repo_type, revision: commit_hash)
    end

  # Filter files based on allow_patterns and ignore_patterns
  filtered_files = Utils::Paths.filter_repo_objects(
    all_files,
    allow_patterns: allow_patterns,
    ignore_patterns: ignore_patterns
  )

  # Download files (with parallelization if max_workers > 1)
  if max_workers > 1
    download_files_parallel(
      repo_id: repo_id,
      files: filtered_files,
      repo_type: repo_type,
      revision: commit_hash,
      cache_dir: cache_dir,
      force_download: force_download,
      token: token,
      max_workers: max_workers,
      progress: progress
    )
  else
    # Sequential download
    filtered_files.each do |filename|
      hf_hub_download(
        repo_id: repo_id,
        filename: filename,
        repo_type: repo_type,
        revision: commit_hash,
        cache_dir: cache_dir,
        force_download: force_download,
        token: token,
        local_files_only: false,
        progress: progress
      )
    end
  end

  # If local_dir is specified, copy the snapshot there
  if local_dir
    local_dir_path = Utils::Paths.expand(local_dir)
    copy_snapshot_to_local_dir(snapshot_folder, local_dir_path)
    return local_dir_path.realpath
  end

  snapshot_folder
end
.try_to_load_from_cache(repo_id:, filename:, repo_type: "model", revision: "main", cache_dir: nil) ⇒ Pathname?
Try to load a file from cache without downloading.
This utility function checks if a file is available in the local cache and returns its path if found. Unlike `hf_hub_download` with `local_files_only=true`, this method returns `nil` instead of raising an error when the file is not cached.
# File 'lib/durable_huggingface_hub/file_download.rb', line 777

def self.try_to_load_from_cache(
  repo_id:,
  filename:,
  repo_type: "model",
  revision: "main",
  cache_dir: nil
)
  hf_hub_download(
    repo_id: repo_id,
    filename: filename,
    repo_type: repo_type,
    revision: revision,
    cache_dir: cache_dir,
    local_files_only: true
  )
rescue LocalEntryNotFoundError
  nil
end
.update_refs(storage_folder, revision, commit_hash) ⇒ Object
Updates refs to point to the latest commit hash.
# File 'lib/durable_huggingface_hub/file_download.rb', line 614

def self.update_refs(storage_folder, revision, commit_hash)
  return if revision == commit_hash # Don't create ref for commit hashes

  refs_folder = storage_folder.join("refs")
  refs_folder.mkpath

  ref_file = refs_folder.join(revision)
  ref_file.write(commit_hash)
end
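The guard at the top matters: a ref file is written only for symbolic revisions such as branch or tag names, never when the revision is already a commit hash, since `snapshots/<hash>` needs no indirection. A standalone sketch (`write_ref` is an illustrative name):

```ruby
require "pathname"
require "tmpdir"

# Record "refs/<revision> -> commit_hash", skipping the write entirely
# when the revision already is the commit hash.
def write_ref(storage_folder, revision, commit_hash)
  return if revision == commit_hash
  refs = storage_folder.join("refs")
  refs.mkpath
  refs.join(revision).write(commit_hash)
end
```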
.verify_blob(blob_path, etag) ⇒ Boolean
Verifies a blob file matches the expected ETag.
# File 'lib/durable_huggingface_hub/file_download.rb', line 519

def self.verify_blob(blob_path, etag)
  return false unless blob_path.exist?

  # First check filename matches ETag (fast check)
  return false unless blob_path.basename.to_s == etag

  # For more robust verification, we could compute the actual ETag
  # from file content, but for now we trust the filename-based approach
  # used by HuggingFace Hub
  true
end
.write_blob_metadata(blob_path, metadata) ⇒ Object
Writes metadata for a blob file.
# File 'lib/durable_huggingface_hub/file_download.rb', line 585

def self.write_blob_metadata(blob_path, metadata)
  metadata_path = Pathname.new("#{blob_path}#{METADATA_FILENAME}")
  metadata_path.write(JSON.pretty_generate(metadata))
end
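Since `METADATA_FILENAME` is `".metadata.json"`, each blob `blobs/<etag>` gets a sidecar file `blobs/<etag>.metadata.json` describing it. A standalone sketch of the sidecar write (`write_sidecar` is an illustrative name):

```ruby
require "json"
require "pathname"
require "tmpdir"

# Write pretty-printed JSON metadata next to the blob, suffixing the
# blob path with ".metadata.json" as write_blob_metadata does.
def write_sidecar(blob_path, metadata)
  Pathname.new("#{blob_path}.metadata.json").write(JSON.pretty_generate(metadata))
end
```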