A production-ready Swift 6.2 library for creating WARC 1.1 web archives. Powered by wget-at (ArchiveTeam's wget-lua) for crawling.
- Full WARC 1.1 compliance: all 8 record types, all named fields, correct CRLF framing
- Single-page capture: `archive(url:options:)` fetches a page and all its inline assets
- Recursive crawling: `crawl(url:options:)` with configurable depth, domain filters, rate limits
- wget-at integration: robots.txt, deduplication, Lua scripting, battle-tested crawling
- Per-record GZIP: `.warc.gz` output following WARC 1.1 Annex D best practice
- Swift WARC I/O: read and write WARC files independently with `WARCReader` / `WARCWriter`
- Async/await: Swift 6.2 strict concurrency, `WARCArchiver` is a safe `actor`
- Swift 6.2+
- macOS 13+ or Linux
- wget or wget-at: see Installation
- zlib: `brew install zlib` (macOS) / `apt-get install zlib1g-dev` (Linux)
.package(url: "https://github.com/yourorg/warc-swift", from: "1.0.0")
Add "warc-swift" to your target dependencies.
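For orientation, a complete manifest using this dependency might look like the sketch below. The package and target name `MyArchiver` is hypothetical, and the product name `warc-swift` is assumed from the instruction above; check the library's own `Package.swift` if the product is named differently.

```swift
// swift-tools-version: 6.2
import PackageDescription

let package = Package(
    name: "MyArchiver",                  // hypothetical package name
    platforms: [.macOS(.v13)],           // matches the macOS 13+ requirement
    dependencies: [
        .package(url: "https://github.com/yourorg/warc-swift", from: "1.0.0")
    ],
    targets: [
        .executableTarget(
            name: "MyArchiver",          // hypothetical target name
            dependencies: [
                // Product name assumed to be "warc-swift", as stated above.
                .product(name: "warc-swift", package: "warc-swift")
            ]
        )
    ]
)
```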
warc-swift delegates all HTTP crawling to wget (or the enhanced wget-at from ArchiveTeam).
macOS (standard wget via Homebrew):
brew install wget
Ubuntu/Debian:
apt-get install wget
Build ArchiveTeam's wget-lua (adds Lua scripting, advanced WARC options):
chmod +x Scripts/build-wget-at.sh
./Scripts/build-wget-at.sh
This builds GNU Wget 1.21.3-at with +ssl/openssl +lua/luajit +psl and installs it into
Sources/warc-swift/Resources/Binaries/wget-at-darwin-arm64 (macOS arm64).
Note: The bundled wget-at binary links against Homebrew dylibs (`openssl@3`, `luajit`, `libpsl`). Install them with: `brew install openssl@3 luajit libpsl`
Bundle into the package (fallback, copies the system wget):
chmod +x Scripts/fetch-binaries.sh
./Scripts/fetch-binaries.sh
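To confirm that a wget binary is actually reachable after installation, a small PATH search can help. The sketch below is not the library's own lookup logic, which is not documented here; `findExecutable(named:inPATH:)` is a hypothetical helper that only uses Foundation.

```swift
import Foundation

/// Hypothetical helper: returns the first executable called `name`
/// found in the colon-separated directory list `path`.
func findExecutable(named name: String, inPATH path: String) -> URL? {
    for dir in path.split(separator: ":") {
        let candidate = URL(fileURLWithPath: String(dir)).appendingPathComponent(name)
        if FileManager.default.isExecutableFile(atPath: candidate.path) {
            return candidate
        }
    }
    return nil
}

// Fall back to a few common directories if PATH is not set.
let searchPath = ProcessInfo.processInfo.environment["PATH"] ?? "/usr/local/bin:/usr/bin:/bin"

// Report wget-at first, then plain wget.
for name in ["wget-at", "wget"] {
    if let url = findExecutable(named: name, inPATH: searchPath) {
        print("\(name): \(url.path)")
    } else {
        print("\(name): not found")
    }
}
```

If neither binary is found, the `wgetAtPath` option can point the archiver at an explicit location.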
import warc_swift
let archiver = WARCArchiver()
// Archive a single page (fetches page + all inline assets: images, CSS, JS)
let warcURL = try await archiver.archive(
url: URL(string: "https://example.com")!,
options: ArchiveOptions(outputPath: URL(fileURLWithPath: "output/"))
)
print("Saved WARC to: \(warcURL.path)")
// Recursive site crawl
var opts = ArchiveOptions(outputPath: URL(fileURLWithPath: "output/"))
opts.maxDepth = 2
opts.allowedDomains = ["example.com"]
opts.rateLimit = "500k" // 500 KB/s max
opts.userAgent = "MyArchive/1.0"
opts.retries = 3
let warcURL2 = try await archiver.crawl(
url: URL(string: "https://example.com")!,
options: opts
)
| Option | Type | Default | Description |
|---|---|---|---|
| `outputPath` | `URL` | required | Output directory (created if absent) |
| `compress` | `Bool` | `true` | Per-record GZIP → `.warc.gz` |
| `maxDepth` | `Int?` | `nil` | Max crawl recursion depth |
| `allowedDomains` | `[String]` | `[]` | Restrict crawl to these domains |
| `rateLimit` | `String?` | `nil` | e.g. `"100k"`, `"1m"` |
| `timeout` | `TimeInterval` | `30` | Per-connection timeout (seconds) |
| `retries` | `Int` | `3` | Retry count on failure |
| `userAgent` | `String?` | `nil` | Custom User-Agent header |
| `additionalHeaders` | `[String: String]` | `[:]` | Extra HTTP headers |
| `urlFilter` | `String?` | `nil` | Reject pattern, e.g. `"*.mp4,*.zip"` |
| `luaScripts` | `[URL]` | `[]` | wget-lua Lua script paths |
| `username` / `password` | `String?` | `nil` | HTTP Basic Auth |
| `wgetAtPath` | `URL?` | `nil` | Override wget-at binary path |
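Since crawling is delegated to wget, most of these options presumably end up as command-line flags. The sketch below shows one plausible mapping using standard GNU Wget flags; it is an illustration, not the library's actual implementation. `CrawlSettings` and `wgetArguments(for:url:)` are hypothetical names that mirror only a subset of the options above.

```swift
import Foundation

/// Hypothetical stand-in for a subset of ArchiveOptions (not the library's type).
struct CrawlSettings {
    var warcBaseName = "archive"   // assumed base name; wget appends .warc(.gz)
    var compress = true
    var maxDepth: Int? = nil
    var allowedDomains: [String] = []
    var rateLimit: String? = nil
    var timeout: TimeInterval = 30
    var retries = 3
    var userAgent: String? = nil
    var additionalHeaders: [String: String] = [:]
    var urlFilter: String? = nil
}

/// Builds a GNU Wget argument list from the settings (illustrative mapping only).
func wgetArguments(for s: CrawlSettings, url: URL) -> [String] {
    var args = [
        "--warc-file=\(s.warcBaseName)",
        "--page-requisites",               // also fetch inline assets
        "--timeout=\(Int(s.timeout))",
        "--tries=\(s.retries)",
    ]
    if !s.compress { args.append("--no-warc-compression") }
    if let depth = s.maxDepth { args += ["--recursive", "--level=\(depth)"] }
    if !s.allowedDomains.isEmpty {
        args.append("--domains=" + s.allowedDomains.joined(separator: ","))
    }
    if let rate = s.rateLimit { args.append("--limit-rate=\(rate)") }
    if let agent = s.userAgent { args.append("--user-agent=\(agent)") }
    // Sort headers so the argument order is deterministic.
    for (name, value) in s.additionalHeaders.sorted(by: { $0.key < $1.key }) {
        args.append("--header=\(name): \(value)")
    }
    if let reject = s.urlFilter { args.append("--reject=\(reject)") }
    args.append(url.absoluteString)
    return args
}

var settings = CrawlSettings()
settings.maxDepth = 2
settings.allowedDomains = ["example.com"]
settings.rateLimit = "500k"
let args = wgetArguments(for: settings, url: URL(string: "https://example.com")!)
print(args.joined(separator: " "))
```

Printing the argument list this way is a convenient way to reproduce a crawl by hand when debugging.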
let reader = try WARCReader(path: URL(fileURLWithPath: "archive.warc.gz"))
for try await record in reader {
print("\(record.type.rawValue): \(record.targetURI?.absoluteString ?? "-")")
print(" Size: \(record.contentLength) bytes, Date: \(WARCDate.string(from: record.date))")
}
let writer = try WARCWriter(path: URL(fileURLWithPath: "output.warc.gz"), compress: true)
// Always start with a warcinfo record (recommended by WARC 1.1)
try writer.write(.warcinfo(block: Data("software: my-app/1.0\r\noperator: me\r\n".utf8)))
// Write a response record
let httpResponse = Data("HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>…</html>".utf8)
let url = URL(string: "https://example.com/")!
var r = WARCRecord.response(targetURI: url, block: httpResponse)
r.blockDigest = WARCDigest.sha256(httpResponse)
try writer.write(r)
try writer.close()
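To make the CRLF framing concrete, the sketch below serializes one uncompressed record by hand: a version line, named fields, a blank line, the block, and two trailing CRLFs. It is independent of `WARCWriter` and only illustrates the on-disk layout; `serializeRecord` is a hypothetical helper, and the date is a fixed example value.

```swift
import Foundation

/// Hypothetical helper: lays out a single WARC 1.1 record as raw bytes.
func serializeRecord(type: String, fields: [(String, String)], block: Data) -> Data {
    var header = "WARC/1.1\r\n"                       // version line
    header += "WARC-Type: \(type)\r\n"
    for (name, value) in fields {
        header += "\(name): \(value)\r\n"             // one named field per line
    }
    header += "Content-Length: \(block.count)\r\n"    // block length in bytes
    header += "\r\n"                                  // blank line ends the header
    var out = Data(header.utf8)
    out.append(block)
    out.append(Data("\r\n\r\n".utf8))                 // two CRLFs end the record
    return out
}

let block = Data("software: my-app/1.0\r\n".utf8)
let record = serializeRecord(
    type: "warcinfo",
    fields: [
        ("WARC-Record-ID", "<urn:uuid:\(UUID().uuidString.lowercased())>"),
        ("WARC-Date", "2024-01-01T00:00:00Z"),        // fixed example date
        ("Content-Type", "application/warc-fields"),
    ],
    block: block
)
print(String(decoding: record, as: UTF8.self))
```

With per-record GZIP, each such byte sequence would be compressed as its own GZIP member before being appended to the file.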
do {
let warcURL = try await archiver.archive(url: url, options: opts)
} catch WARCArchiverError.binaryNotFound {
print("wget not found — install via 'brew install wget'")
} catch WARCArchiverError.crawlFailed(let code, let stderr) {
print("wget exited \(code): \(stderr)")
} catch WARCArchiverError.invalidURL(let url) {
print("Invalid URL: \(url)")
}
- macOS arm64 / x86_64: Fully supported. Install wget via Homebrew.
- Linux x86_64: Fully supported. Install wget via apt/yum.
- GZIP: Per-record compression uses zlib (available on all platforms); install `zlib1g-dev` on Linux to link against it.
- Swift concurrency: `WARCArchiver` is an `actor`; `WARCWriter` and `WARCReader` are safe for single-task use.
- Record IDs use the `<urn:uuid:UUID4>` scheme (WARC spec section 5.1).
- Dates are ISO 8601 UTC (`YYYY-MM-DDThh:mm:ssZ`), fractional seconds supported.
- Digests use `sha256:BASE32` (RFC 4648, no padding) via CryptoKit.
- `WARC-Concurrent-To` is modelled as `[String]`, since it is the only field that may repeat.
- Per-record GZIP means each GZIP member is independently decompressible (Annex D).
- Recommended WARC file size limit is 1 GB (Annex C).
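The formats listed above can be reproduced with Foundation alone. The sketch below is an illustration rather than the library's code: `makeRecordID`, `warcDateString`, and `base32NoPadding` are hypothetical helpers, and the SHA-256 step is left out so the example has no CryptoKit dependency.

```swift
import Foundation

/// Record ID in the <urn:uuid:...> form; UUID() produces a random UUID.
func makeRecordID() -> String {
    "<urn:uuid:\(UUID().uuidString.lowercased())>"
}

/// ISO 8601 UTC timestamp without fractional seconds.
func warcDateString(_ date: Date) -> String {
    let formatter = ISO8601DateFormatter()   // defaults to GMT
    formatter.formatOptions = [.withInternetDateTime]
    return formatter.string(from: date)
}

/// RFC 4648 Base32 encoding with the trailing "=" padding omitted.
func base32NoPadding(_ data: Data) -> String {
    let alphabet = Array("ABCDEFGHIJKLMNOPQRSTUVWXYZ234567")
    var output = ""
    var buffer: UInt32 = 0   // bits read but not yet emitted
    var bitCount = 0         // number of valid low-order bits in `buffer`
    for byte in data {
        buffer = (buffer << 8) | UInt32(byte)
        bitCount += 8
        while bitCount >= 5 {
            bitCount -= 5
            output.append(alphabet[Int((buffer >> UInt32(bitCount)) & 0x1F)])
        }
        buffer &= (1 << UInt32(bitCount)) - 1   // keep only the leftover bits
    }
    if bitCount > 0 {
        // Left-align the leftover bits into a final 5-bit group.
        output.append(alphabet[Int((buffer << UInt32(5 - bitCount)) & 0x1F)])
    }
    return output
}

print(makeRecordID())
print(warcDateString(Date(timeIntervalSince1970: 0)))   // 1970-01-01T00:00:00Z
print(base32NoPadding(Data("foobar".utf8)))             // MZXW6YTBOI
```

A digest field value would then be `"sha256:"` followed by the Base32 encoding of the 32-byte hash.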
The Examples/ directory contains seven runnable examples, from minimal to advanced:
| # | Target | What it demonstrates |
|---|---|---|
| 01 | `SinglePageArchive` | Archive one URL with default options (the minimal use case) |
| 02 | `RecursiveCrawl` | Recursive site crawl with depth, domain allow-list, rate limit, timeout, retries, user-agent, and URL filter |
| 03 | `AuthAndHeaders` | HTTP Basic Auth + arbitrary custom request headers + browser-spoof user-agent |
| 04 | `CustomBinaryAndLua` | Explicit wget-at binary path and Lua script injection (ArchiveTeam's wget-lua features) |
| 05 | `ReadWARC` | Read any `.warc` / `.warc.gz`, print per-record summaries, extract fields |
| 06 | `WriteWARCManually` | Construct a WARC from scratch using `WARCWriter`: covers all 8 record types, digests, concurrent-to linking, revisit dedup, and segmentation |
| 07 | `ErrorHandling` | Exhaustive handling of every `WARCArchiverError` case plus I/O edge cases |
Run any example with:
swift run <TargetName> [arguments]
# Examples:
swift run SinglePageArchive https://example.com ./output
swift run RecursiveCrawl https://example.com ./output
swift run AuthAndHeaders https://private.example.com ./output myuser mypass
swift run CustomBinaryAndLua https://example.com ./output /usr/local/bin/wget-at
swift run ReadWARC ./output/archive.warc.gz
swift run WriteWARCManually ./output.warc.gz
swift run ErrorHandling