warc-swift

A Swift package to convert a page or site into a WARC web archive
jaredhowland/warc-swift

A production-ready Swift 6.2 library for creating WARC 1.1 web archives. Crawling is powered by wget-at (ArchiveTeam's wget-lua).

Features

  • Full WARC 1.1 compliance: all 8 record types, all named fields, correct CRLF framing
  • Single-page capture: archive(url:options:) fetches a page and all its inline assets
  • Recursive crawling: crawl(url:options:) with configurable depth, domain filters, and rate limits
  • wget-at integration: robots.txt, deduplication, Lua scripting, battle-tested crawling
  • Per-record GZIP: .warc.gz output following WARC 1.1 Annex D best practice
  • Swift WARC I/O: read and write WARC files independently with WARCReader / WARCWriter
  • Async/await: Swift 6.2 strict concurrency; WARCArchiver is a concurrency-safe actor

Requirements

  • Swift 6.2+
  • macOS 13+ or Linux
  • wget or wget-at: see Installation
  • zlib: brew install zlib (macOS) / apt-get install zlib1g-dev (Linux)

Installation

Swift Package Manager

.package(url: "https://github.com/yourorg/warc-swift", from: "1.0.0")

Add "warc-swift" to your target dependencies.
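For context, a complete manifest might look like the sketch below. The package and target name MyArchiver is hypothetical, the URL is the placeholder from the snippet above, and the product name "warc-swift" is taken from the instruction above; adjust all three to match your project.

```swift
// swift-tools-version: 6.2
import PackageDescription

let package = Package(
    name: "MyArchiver",                 // hypothetical package name
    platforms: [.macOS(.v13)],          // matches the macOS 13+ requirement
    dependencies: [
        // Placeholder URL copied from the snippet above
        .package(url: "https://github.com/yourorg/warc-swift", from: "1.0.0")
    ],
    targets: [
        .executableTarget(
            name: "MyArchiver",         // hypothetical target name
            dependencies: [
                // Product name assumed to be "warc-swift", as stated above
                .product(name: "warc-swift", package: "warc-swift")
            ]
        )
    ]
)
```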

wget Installation

warc-swift delegates all HTTP crawling to wget (or the enhanced wget-at from ArchiveTeam).

macOS (standard wget via Homebrew):

brew install wget

Ubuntu/Debian:

apt-get install wget

Build ArchiveTeam's wget-lua (adds Lua scripting, advanced WARC options):

chmod +x Scripts/build-wget-at.sh
./Scripts/build-wget-at.sh

This builds GNU Wget 1.21.3-at with +ssl/openssl +lua/luajit +psl and installs it into Sources/warc-swift/Resources/Binaries/wget-at-darwin-arm64 (macOS arm64).

Note: The bundled wget-at binary links against Homebrew dylibs (openssl@3, luajit, libpsl). Install them with: brew install openssl@3 luajit libpsl

Bundle into the package (fallback; copies the system wget):

chmod +x Scripts/fetch-binaries.sh
./Scripts/fetch-binaries.sh

Quick Start

import warc_swift

let archiver = WARCArchiver()

// Archive a single page (fetches page + all inline assets: images, CSS, JS)
let warcURL = try await archiver.archive(
    url: URL(string: "https://example.com")!,
    options: ArchiveOptions(outputPath: URL(fileURLWithPath: "output/"))
)
print("Saved WARC to: \(warcURL.path)")

// Recursive site crawl
var opts = ArchiveOptions(outputPath: URL(fileURLWithPath: "output/"))
opts.maxDepth = 2
opts.allowedDomains = ["example.com"]
opts.rateLimit = "500k"           // 500 KB/s max
opts.userAgent = "MyArchive/1.0"
opts.retries = 3
let warcURL2 = try await archiver.crawl(
    url: URL(string: "https://example.com")!,
    options: opts
)

Configuration Reference

| Option | Type | Default | Description |
|---|---|---|---|
| outputPath | URL | required | Output directory (created if absent) |
| compress | Bool | true | Per-record GZIP → .warc.gz |
| maxDepth | Int? | nil | Max crawl recursion depth |
| allowedDomains | [String] | [] | Restrict crawl to these domains |
| rateLimit | String? | nil | e.g. "100k", "1m" |
| timeout | TimeInterval | 30 | Per-connection timeout (seconds) |
| retries | Int | 3 | Retry count on failure |
| userAgent | String? | nil | Custom User-Agent header |
| additionalHeaders | [String: String] | [:] | Extra HTTP headers |
| urlFilter | String? | nil | Reject pattern, e.g. "*.mp4,*.zip" |
| luaScripts | [URL] | [] | wget-lua Lua script paths |
| username / password | String? | nil | HTTP Basic Auth |
| wgetAtPath | URL? | nil | Override wget-at binary path |
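To make the options more concrete, here is a rough sketch of how settings like these could translate into a wget command line. This is not the library's implementation: SketchOptions, warcFilePrefix, and wgetArguments are hypothetical names invented for illustration, and the one-to-one mapping of retries to --tries is an assumption. Only standard GNU wget flags are used.

```swift
import Foundation

/// Hypothetical stand-in for a subset of the options above.
/// This is NOT the library's ArchiveOptions type.
struct SketchOptions {
    var warcFilePrefix = "archive"      // hypothetical; not an option listed above
    var compress = true
    var maxDepth: Int? = nil
    var allowedDomains: [String] = []
    var rateLimit: String? = nil
    var timeout: TimeInterval = 30
    var retries = 3
    var userAgent: String? = nil
    var additionalHeaders: [String: String] = [:]
    var urlFilter: String? = nil
}

/// Builds a wget argument list from the sketch options.
func wgetArguments(for url: URL, options: SketchOptions, recursive: Bool) -> [String] {
    var args = ["--warc-file=\(options.warcFilePrefix)"]
    if !options.compress { args.append("--no-warc-compression") }
    if recursive {
        args.append("--recursive")
        if let depth = options.maxDepth { args.append("--level=\(depth)") }
    } else {
        // Single page: fetch the page plus its inline assets
        args.append("--page-requisites")
    }
    if !options.allowedDomains.isEmpty {
        args.append("--domains=\(options.allowedDomains.joined(separator: ","))")
    }
    if let rate = options.rateLimit { args.append("--limit-rate=\(rate)") }
    args.append("--timeout=\(Int(options.timeout))")
    // Assumption: retries is passed straight through as wget's total try count
    args.append("--tries=\(options.retries)")
    if let agent = options.userAgent { args.append("--user-agent=\(agent)") }
    // Sort headers so the argument order is deterministic
    for (name, value) in options.additionalHeaders.sorted(by: { $0.key < $1.key }) {
        args.append("--header=\(name): \(value)")
    }
    if let filter = options.urlFilter { args.append("--reject=\(filter)") }
    args.append(url.absoluteString)
    return args
}

var sketch = SketchOptions()
sketch.maxDepth = 2
sketch.allowedDomains = ["example.com"]
sketch.rateLimit = "500k"
let sketchArgs = wgetArguments(for: URL(string: "https://example.com")!,
                               options: sketch, recursive: true)
print(sketchArgs.joined(separator: " "))
```

Each array element is one argument, so the result can be handed to Process.arguments without shell quoting.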

Reading WARC Files

let reader = try WARCReader(path: URL(fileURLWithPath: "archive.warc.gz"))
for try await record in reader {
    print("\(record.type.rawValue): \(record.targetURI?.absoluteString ?? "-")")
    print("  Size: \(record.contentLength) bytes, Date: \(WARCDate.string(from: record.date))")
}
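For readers curious where values such as the record type and target URI come from, the sketch below parses the header block of a single uncompressed record by hand. It does not use WARCReader; parseWARCHeader is a hypothetical helper, and it deliberately ignores repeated fields (a later WARC-Concurrent-To would overwrite an earlier one) and folded header lines.

```swift
import Foundation

/// Parses a WARC header block (version line plus named fields).
/// Returns nil when the text does not start with a "WARC/" version line.
func parseWARCHeader(_ header: String) -> (version: String, fields: [String: String])? {
    // Header lines are CRLF-terminated; drop the empty trailing pieces
    let lines = header.components(separatedBy: "\r\n").filter { !$0.isEmpty }
    guard let version = lines.first, version.hasPrefix("WARC/") else { return nil }

    var fields: [String: String] = [:]
    for line in lines.dropFirst() {
        // Split at the first colon only, since values such as URIs contain colons
        guard let colon = line.firstIndex(of: ":") else { continue }
        let name = String(line[..<colon])
        let value = line[line.index(after: colon)...].trimmingCharacters(in: .whitespaces)
        fields[name] = value
    }
    return (version, fields)
}

let sampleHeader = "WARC/1.1\r\nWARC-Type: response\r\n"
    + "WARC-Target-URI: https://example.com/\r\nContent-Length: 42\r\n\r\n"
if let parsed = parseWARCHeader(sampleHeader) {
    print(parsed.version, parsed.fields["WARC-Type"] ?? "-")
}
```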

Writing WARC Files

let writer = try WARCWriter(path: URL(fileURLWithPath: "output.warc.gz"), compress: true)

// Always start with a warcinfo record (recommended by WARC 1.1)
try writer.write(.warcinfo(block: Data("software: my-app/1.0\r\noperator: me\r\n".utf8)))

// Write a response record
let url = URL(string: "https://example.com/")!
let httpResponse = Data("HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>…</html>".utf8)
var r = WARCRecord.response(targetURI: url, block: httpResponse)
r.blockDigest = WARCDigest.sha256(httpResponse)
try writer.write(r)

try writer.close()
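As background on the CRLF framing that the writer handles for you, the following sketch lays out one uncompressed record by hand: a version line, named fields, a blank line, the block, and two trailing CRLFs. serializeRecord is a hypothetical helper, not part of WARCWriter, and the record ID below is a placeholder. A real record also needs the mandatory WARC-Record-ID, WARC-Date, and WARC-Type fields, which this sketch leaves to the caller.

```swift
import Foundation

/// Serializes one uncompressed WARC record:
/// version CRLF, fields CRLF..., CRLF, block, CRLF CRLF.
func serializeRecord(version: String = "WARC/1.1",
                     fields: [(name: String, value: String)],
                     block: Data) -> Data {
    var header = version + "\r\n"
    for field in fields {
        header += "\(field.name): \(field.value)\r\n"
    }
    // Content-Length counts the bytes of the block only
    header += "Content-Length: \(block.count)\r\n"
    header += "\r\n"                        // blank line ends the header

    var output = Data(header.utf8)
    output.append(block)
    output.append(Data("\r\n\r\n".utf8))    // two CRLFs end the record
    return output
}

let recordBytes = serializeRecord(
    fields: [
        (name: "WARC-Type", value: "resource"),
        // Placeholder ID for illustration only
        (name: "WARC-Record-ID", value: "<urn:uuid:00000000-0000-4000-8000-000000000000>"),
        (name: "WARC-Date", value: "2026-01-01T00:00:00Z"),
    ],
    block: Data("hello".utf8)
)
print(String(decoding: recordBytes, as: UTF8.self))
```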

Error Handling

do {
    let warcURL = try await archiver.archive(url: url, options: opts)
    print("Saved WARC to: \(warcURL.path)")
} catch WARCArchiverError.binaryNotFound {
    print("wget not found; install via 'brew install wget'")
} catch WARCArchiverError.crawlFailed(let code, let stderr) {
    print("wget exited \(code): \(stderr)")
} catch WARCArchiverError.invalidURL(let url) {
    print("Invalid URL: \(url)")
} catch {
    // Any other error, for example a file I/O failure
    print("Unexpected error: \(error)")
}

Platform Notes

  • macOS arm64 / x86_64: Fully supported. Install wget via Homebrew.
  • Linux x86_64: Fully supported. Install wget via apt/yum.
  • GZIP: Per-record compression uses zlib (available on all platforms); link zlib1g-dev on Linux.
  • Swift concurrency: WARCArchiver is an actor; WARCWriter and WARCReader are safe for single-task use.

WARC Implementation Notes

  • Record IDs use the <urn:uuid:UUID4> scheme (WARC spec section 5.1).
  • Dates are ISO 8601 UTC (YYYY-MM-DDThh:mm:ssZ), fractional seconds supported.
  • Digests use sha256:BASE32 (RFC 4648, no padding) via CryptoKit.
  • WARC-Concurrent-To is modelled as [String] — the only field that may repeat.
  • Per-record GZIP means each GZIP member can be decompressed independently (Annex D).
  • Recommended WARC file size limit is 1 GB (Annex C).
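The record ID, date, and digest encodings listed above can be illustrated with Foundation alone. The helpers below (makeRecordID, warcDateString, base32NoPadding) are hypothetical names for this sketch, not the library's API. The SHA-256 step is omitted so the example stays dependency-free; base32NoPadding takes already-computed bytes, and the date helper emits whole seconds only.

```swift
import Foundation

/// Record ID in the <urn:uuid:...> form, using a random UUID.
func makeRecordID() -> String {
    "<urn:uuid:\(UUID().uuidString.lowercased())>"
}

/// ISO 8601 UTC timestamp at second precision, e.g. 1970-01-01T00:00:00Z.
func warcDateString(from date: Date) -> String {
    let formatter = ISO8601DateFormatter()
    formatter.formatOptions = [.withInternetDateTime]
    formatter.timeZone = TimeZone(secondsFromGMT: 0)!
    return formatter.string(from: date)
}

/// RFC 4648 Base32 encoding without '=' padding.
func base32NoPadding(_ data: Data) -> String {
    let alphabet = Array("ABCDEFGHIJKLMNOPQRSTUVWXYZ234567")
    var output = ""
    var buffer: UInt32 = 0      // pending bits not yet emitted
    var bitCount = 0            // number of valid bits in buffer
    for byte in data {
        buffer = (buffer << 8) | UInt32(byte)
        bitCount += 8
        while bitCount >= 5 {
            // Emit the top five pending bits as one symbol
            let index = Int((buffer >> UInt32(bitCount - 5)) & 0x1F)
            output.append(alphabet[index])
            bitCount -= 5
        }
        buffer &= (1 << UInt32(bitCount)) - 1   // keep only the leftover bits
    }
    if bitCount > 0 {
        // Left-align the final partial group with zero bits
        let index = Int((buffer << UInt32(5 - bitCount)) & 0x1F)
        output.append(alphabet[index])
    }
    return output
}

print(makeRecordID())
print(warcDateString(from: Date()))
// A digest field value would then be "sha256:" followed by the Base32 of the hash bytes
print("sha256:" + base32NoPadding(Data(repeating: 0, count: 32)))
```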

Examples

The Examples/ directory contains seven runnable examples, from minimal to advanced:

| # | Target | What it demonstrates |
|---|---|---|
| 01 | SinglePageArchive | Archive one URL with default options (the minimal use case) |
| 02 | RecursiveCrawl | Recursive site crawl with depth, domain allow-list, rate limit, timeout, retries, user-agent, and URL filter |
| 03 | AuthAndHeaders | HTTP Basic Auth + arbitrary custom request headers + browser-spoof user-agent |
| 04 | CustomBinaryAndLua | Explicit wget-at binary path and Lua script injection (ArchiveTeam's wget-lua features) |
| 05 | ReadWARC | Read any .warc / .warc.gz, print per-record summaries, extract fields |
| 06 | WriteWARCManually | Construct a WARC from scratch using WARCWriter; covers all 8 record types, digests, concurrent-to linking, revisit dedup, and segmentation |
| 07 | ErrorHandling | Exhaustive handling of every WARCArchiverError case plus I/O edge cases |

Run any example with:

swift run <TargetName> [arguments]

# Examples:
swift run SinglePageArchive https://example.com ./output
swift run RecursiveCrawl   https://example.com ./output
swift run AuthAndHeaders   https://private.example.com ./output myuser mypass
swift run CustomBinaryAndLua https://example.com ./output /usr/local/bin/wget-at
swift run ReadWARC         ./output/archive.warc.gz
swift run WriteWARCManually ./output.warc.gz
swift run ErrorHandling

License

MIT

Description

  • Swift Tools 6.2.0

Dependencies

  • None
Last updated: Wed Mar 11 2026 12:11:32 GMT-0900 (Hawaii-Aleutian Daylight Time)