SwiftScraper

0.6.0

Web scraping library for Swift
Nef10/SwiftScraper

What's New

0.6.0 Download Step

2023-03-12T08:06:35Z

Changes

🚀 Features

🧰 Maintenance

  • Bump release-drafter from 5.22.0 to 5.23.0 @file-sync-app (#71)
  • Bump swift-actions/setup-swift to 1.22.0 @file-sync-app (#69)
  • Migrate deprecated set-output to environment files @file-sync-app (#67)
  • Bump sticky-pull-request-comment to 2.5.0 @file-sync-app (#66)
  • Bump sticky-pull-request-comment to 2.4.0 @file-sync-app (#64)
  • Bump tibdex/github-app-token from 1.7.0 to 1.8.0 @file-sync-app (#63)
  • Bump actions/github-script from 6.3.3 to 6.4.0 @file-sync-app (#62)
  • Bump peaceiris/actions-gh-pages to 3.9.2 @file-sync-app (#59)
  • Bump peaceiris/actions-gh-pages to 3.9.1 @file-sync-app (#57)
  • Bump release-drafter to v5.22.0 @file-sync-app (#56)
  • Bump setup-swift to 1.21.0 @file-sync-app (#53)
  • Update SwiftLint config for 0.50.3 @file-sync-app (#52)
  • Bump sticky-pull-request-comment to 2.3.1 @file-sync-app (#51)
  • Bump peaceiris/actions-gh-pages to 3.9.0 @file-sync-app (#50)
  • Bump very_good_coverage to 2.1.0 @file-sync-app (#49)
  • Pin SwiftLint version to 0.50.0 @file-sync-app (#48)
  • Bump sticky-pull-request-comment and setup-swift dependencies @file-sync-app (#40)
  • Fix swiftlint warnings @Nef10 (#47)
  • Bump release-drafter from 5.21.0 to 5.21.1 @file-sync-app (#39)
  • Bump tibdex/github-app-token from 1.6.0 to 1.7.0 @file-sync-app (#38)
  • Update Nef10/lcov-reporter-action to 0.4.0 @file-sync-app (#35)
  • Bump actions/github-script from 6.3.1 to 6.3.3 @file-sync-app (#34)
  • Bump sticky-pull-request-comment to 2.2.1 @file-sync-app (#33)
  • Bump very_good_coverage from 1.2.1 to 2.0.0 @file-sync-app (#30)
  • Bump actions/github-script from 6.2.0 to 6.3.1 @file-sync-app (#29)
  • Bump Nef10/lcov-reporter-action to 0.3.2 @file-sync-app (#28)
  • Bump swift-actions/setup-swift to 1.18.0 @file-sync-app (#27)
  • Bump release-drafter from 5.20.1 to 5.21.0 @file-sync-app (#26)
  • Update SwiftLint config for 0.49.1 @file-sync-app (#23)
  • Update SwiftLint config for 0.49.0 @file-sync-app (#22)
  • Bump actions/github-script from 6.1.1 to 6.2.0 @file-sync-app (#21)
  • Create .jazzy.yaml @Nef10 (#18)
  • Bump release-drafter from 5.20.0 to 5.20.1 @file-sync-app (#19)

SwiftScraper

License: MIT. Platforms supported: macOS | iOS. SPM compatible.

Web scraping library for Swift. This is a fork of cweatureapps/SwiftScraper that offers the library as a Swift package.

Overview

This framework provides a simple way to declaratively define a series of steps in Swift that represent how to scrape a web site, allowing the app to read this web page data.

Features

  • Declarative API - clearly define the steps to run and avoid the spaghetti 🍝 code that comes with using the WebView and the delegate pattern
  • Custom JavaScript integration - Simple integration with custom JavaScript to perform complicated scraping, using the language of the web to process the web page
  • Perform custom processing at each step
  • Passing data between steps
  • Control flow to determine which step to run next, allowing basic conditionals and loops

Tutorial

In this tutorial, we'll cover the basic usage of this framework by performing a Google search.

Add package via Swift Package Manager

Add this dependency to your Package.swift file:

.package(url: "https://github.com/Nef10/SwiftScraper.git", .exact("X.Y.Z")),

Note: under semantic versioning, any version change below 1.0.0 can be breaking, so please use .exact for now.

JavaScript setup

By convention, all the steps will use the functions exposed in a single module which is defined in a single JavaScript file.

For this exercise, create a new file called GoogleSearch.js.

Start by creating the blank JavaScript module structure, making sure the module name is the same as the file name:

var GoogleSearch = (function() {
    return {
    };
})()

Loading a web page

Create a new view controller.

Import the framework:

import SwiftScraper

In the view controller, we'll create a step and run it:

var stepRunner: StepRunner!

override func viewDidLoad() {
    super.viewDidLoad()
    let step1 = OpenPageStep(path: "https://www.google.com")
    stepRunner = StepRunner(moduleName: "GoogleSearch", steps: [step1])
    stepRunner.insertWebViewIntoView(parent: view)
    stepRunner.run()
}

When you run this, you will see a web view opening the Google home page.

The web view typically needs to have a visible frame size, because web sites often use responsive breakpoints and sometimes even change the HTML structure based on the dimensions of the page.

The insertWebViewIntoView method helps you to easily insert the web view into any NSView / UIView that you have, while automatically constraining it to the same size. It is up to you to set up the dimensions of the parent view, or you can even hide it where the user cannot see it.

Check that the page loaded

We can add an assertion that runs some JavaScript code when the page loads, to make sure the loaded page is the expected one. We do this by referencing a JavaScript function which is exposed by the module.

In the GoogleSearch.js file, add the following function, which checks that the title of the page is correct.

var GoogleSearch = (function() {
    function assertGoogleTitle() {
        return document.title == "Google";
    }
    return {
        assertGoogleTitle: assertGoogleTitle
    };
})()

In the view controller where you created the step, include the name of the assertion function:

let step1 = OpenPageStep(path: "https://www.google.com", assertionName: "assertGoogleTitle")

The assertion function runs immediately when the page loads. Sometimes what you are asserting is not yet ready at that point, because the website may modify the page asynchronously after loading. In that case, see the Wait for Condition Step under Advanced Usage, which waits until a condition becomes true.

Observe progress of the run

You can observe the progress of the execution by adding a stateObserver to the stepRunner.

    stepRunner.stateObservers.append { newValue in
        print("-----", newValue, "-----")
        switch newValue {
        case .inProgress(let index):
            print("About to run step at index", index)
        case .failure(let error):
            print("Failed: \(error.localizedDescription)")
        case .success:
            print("Finished successfully")
        default:
            break
        }
    }
    stepRunner.run()

Run script that loads page

Let's now run some custom JavaScript to submit a Google search. This uses the PageChangeStep, which runs a JavaScript function that is expected to trigger a new page load. Once the new page has loaded, the runner proceeds to the next step.

First, in the GoogleSearch.js file, add the following two functions, which perform the search and check the title of the results page, and expose them in the module:

var GoogleSearch = (function() {

    // ...

    function performSearch(searchText) {
        document.querySelector('input[type="text"], input[type="Search"]').value = searchText;
        document.forms[0].submit();
    }
    function assertSearchResultTitle() {
        return document.title == "SwiftScraper iOS - Google Search";
    }
    return {
        assertGoogleTitle: assertGoogleTitle,
        performSearch: performSearch,
        assertSearchResultTitle: assertSearchResultTitle
    };
})()

In the view controller, add step 2 which is the PageChangeStep, referencing the JavaScript functions you just implemented:

let step2 = PageChangeStep(functionName: "performSearch", params: "SwiftScraper iOS", assertionName: "assertSearchResultTitle")

Notice the params parameter in the initializer, which allows you to pass data to the JavaScript function.

Make sure to include this in the array of steps when you create the StepRunner:

stepRunner = StepRunner(moduleName: "GoogleSearch", steps: [step1, step2])

Run script and process

We're at the last step - we can run a script to scrape the contents of the page. Add the following JavaScript function which will get the search results, and return an Array of JSON objects with the text and href of each link.

var GoogleSearch = (function() {

    // ...

    function getSearchResults() {
        var headings = document.querySelectorAll('h3.r');
        return Array.prototype.slice.call(headings).map(function (h3) {
            return { 'text': h3.innerText, 'href': h3.childNodes[0].href };
        });
    }

    return {
        assertGoogleTitle: assertGoogleTitle,
        performSearch: performSearch,
        assertSearchResultTitle: assertSearchResultTitle,
        getSearchResults: getSearchResults
    };
})()
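
The getSearchResults function above throws a TypeError if a matched heading has no child nodes, because it reads childNodes[0].href unconditionally. One possible hardening, sketched here, moves the mapping into a pure helper that skips such headings. The name toLinkRecords is hypothetical and not part of SwiftScraper; the h3.r selector is kept from the example above and may need adjusting to the page's actual markup.

```javascript
// Hypothetical helper (not part of SwiftScraper): converts heading-like nodes
// into { text, href } records and drops headings whose first child has no href.
function toLinkRecords(headings) {
    return Array.prototype.slice.call(headings)
        .map(function (h3) {
            var first = h3.childNodes && h3.childNodes[0];
            return { text: h3.innerText, href: first ? first.href : undefined };
        })
        .filter(function (record) {
            return typeof record.href === 'string' && record.href.length > 0;
        });
}

var GoogleSearch = (function () {

    // ... other functions from the tutorial ...

    function getSearchResults() {
        // Selector taken from the tutorial; adjust it to the page being scraped.
        return toLinkRecords(document.querySelectorAll('h3.r'));
    }

    return {
        // ... other exports from the tutorial ...
        getSearchResults: getSearchResults
    };
})();
```

Because the helper only touches the objects it is given, it can be exercised outside a web view with plain stub objects.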

In the Swift code, add the third step, a ScriptStep, which runs a JavaScript function and passes that function's return value to the handler.

let step3 = ScriptStep(functionName: "getSearchResults") { response, _ in
    if let responseArray = response as? [JSON] {
        responseArray.forEach { json in
            if let text = json["text"], let href = json["href"] {
                print(text, "(", href, ")")
            }
        }
    }
    return .proceed
}

And make sure to include this in the array of steps when you create the StepRunner:

stepRunner = StepRunner(moduleName: "GoogleSearch", steps: [step1, step2, step3])

Run this. You should see the steps complete successfully, and print the search results to the console.

Congratulations! You've finished the tutorial on the basic usage of this library! 🎉

Advanced Usage

Run script that returns data async

It is possible to run some JavaScript that does not return immediately, and wait for it to asynchronously call back the Swift code after some time has passed. For example, you may need to do something on the web page, poll for the operation to complete, and then pass the data back to Swift.

To pass data back to Swift world, call SwiftScraper.postMessage(), passing a single object that can be serialized back to a Swift object.

In this example, we'll do a Google image search, and then scroll down to the bottom. The infinite scroll pattern employed here will load more images when we do this, and we'll do a count of the images before and after the scroll.

var GoogleSearch = (function() {

    // ...

    function scrollAndCountImages() {
        var firstCount = document.querySelectorAll('img').length;
        window.scrollTo(0, document.body.scrollHeight);
        setTimeout(function () {
            var secondCount = document.querySelectorAll('img').length;
            SwiftScraper.postMessage({'first': firstCount, 'second': secondCount});
        }, 2000);
    }

    return {
        assertGoogleTitle: assertGoogleTitle,
        performSearch: performSearch,
        assertSearchResultTitle: assertSearchResultTitle,
        getSearchResults: getSearchResults,
        scrollAndCountImages: scrollAndCountImages
    };
})()

For those familiar with WKWebView, the SwiftScraper.postMessage() function is just an alias for webkit.messageHandlers.swiftScraperResponseHandler.postMessage().
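
The example above waits a fixed 2000 ms before counting. For the polling case mentioned at the start of this section, a small helper can re-check a condition at intervals and report once it holds or the attempts run out. This is a sketch under assumptions: pollUntil and waitForImages are hypothetical names, and the 250 ms interval and limit of 20 attempts are arbitrary. Only SwiftScraper.postMessage comes from the library.

```javascript
// Hypothetical helper: calls check() until it returns a non-null value or
// maxAttempts is reached, then reports the outcome once through done().
function pollUntil(check, intervalMs, maxAttempts, done) {
    var attempts = 0;
    function tick() {
        attempts += 1;
        var value = check();
        if (value !== null && value !== undefined) {
            done({ ok: true, value: value, attempts: attempts });
        } else if (attempts >= maxAttempts) {
            done({ ok: false, attempts: attempts });
        } else {
            setTimeout(tick, intervalMs);
        }
    }
    tick();
}

var GoogleSearch = (function () {

    // ... other functions from the tutorial ...

    // Scrolls to the bottom, then polls until more images have appeared.
    function waitForImages() {
        var before = document.querySelectorAll('img').length;
        window.scrollTo(0, document.body.scrollHeight);
        pollUntil(function () {
            var count = document.querySelectorAll('img').length;
            return count > before ? count : null;
        }, 250, 20, function (result) {
            // A plain object of numbers and booleans, passed back to Swift.
            SwiftScraper.postMessage({ 'before': before, 'result': result });
        });
    }

    return {
        // ... other exports from the tutorial ...
        waitForImages: waitForImages
    };
})();
```

The handler on the Swift side would then receive an object whose result.ok flag tells it whether new images appeared before the attempts ran out.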

In Swift, use the AsyncScriptStep, which is used in the same way as ScriptStep, with the difference being the handler is not called until SwiftScraper.postMessage is called. It is expected that the JavaScript function itself does not return anything.

let step1 = OpenPageStep(path: "https://www.google.com.au/search?tbm=isch")

let step2 = PageChangeStep(
    functionName: "performSearch",
    params: "ankylosaurus")

let step3 = AsyncScriptStep(functionName: "scrollAndCountImages") { response, _ in
    if let json = response as? JSON {
        if let first = json["first"], let second = json["second"] {
            print("first: ", first, "second: ", second)
        }
    }
    return .proceed
}

Process Step

Use the ProcessStep when you need a step that requires some custom action to be performed.

let processStep = ProcessStep { model in
    // perform some custom action here
    return .proceed
}

Two main concepts to note here are:

  • The model parameter, used for passing model data between steps
  • The return value, which can be used for control flow

These concepts apply to the ProcessStep, ScriptStep and AsyncScriptStep. We'll explore them in the next two sections.

If you need to execute asynchronous actions, use the AsyncProcessStep instead:

let processStep = AsyncProcessStep { model, completion in
    // perform some custom action here
    completion(model, .proceed)
}

Passing model data

The ProcessStep, ScriptStep and AsyncScriptStep all have a handler closure to perform processing, and these handlers all have a model parameter of type inout JSON. Modify this JSON dictionary to save data during one step, and then read it in another step.

Let's modify the AsyncScriptStep from the previous section to save the before and after counts to the dictionary.

let step3 = AsyncScriptStep(functionName: "scrollAndCountImages") { response, model in  // notice the model param
    if let json = response as? JSON {
        if let first = json["first"], let second = json["second"] {
            print("first: ", first, "second: ", second)

            // Save the data to the model dictionary
            model["first"] = first
            model["second"] = second
        }
    }
    return .proceed
}

Control Flow

The return value is an enum which can be used for rudimentary control flow. We've seen .proceed, which means go to the next step. Returning .jumpToStep(n), where n is a zero-based index, jumps to another step, either before or after the current one. This allows you to define loops (by jumping back) as well as conditionals (by jumping forward).

Let's continue the infinite scrolling image search example, and add a ProcessStep that will keep looping back to step3 until the before count and after count are the same, meaning there are no more images on the page to load.

Add this step as the last step to run. When you run this, you should see the screen keep scrolling down until no more images can be found.

let conditionStep = ProcessStep { model in
    if let first = model["first"] as? Int,
        let second = model["second"] as? Int,
        first == second {
        return .proceed
    } else {
        return .jumpToStep(2) // This is a zero-based index, i.e. step3
    }
}

This technique is most useful for repeating a sequence of steps. While it can also be used to model IF-THEN style conditionals, it is essentially a GOTO construct and can easily lead to unmaintainable spaghetti 🍝 steps.

You can also have an early exit from the steps. The return value of .finish will stop execution as a success, while .failure(Error) will stop execution with a failure.
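
To make the effect of these return values concrete, here is a toy model of a step loop. It is not SwiftScraper's implementation, and runSteps, the action names, and the iteration limit are all assumptions made for illustration. It only shows how proceed, jump, finish and failure results could drive a zero-based step index, and why a loop built from jumps benefits from a guard against running forever.

```javascript
// Toy model only (not the library's code). Each step is a function(model)
// returning { action: "proceed" }, { action: "jumpToStep", index: n },
// { action: "finish" } or { action: "failure", error: "..." }.
function runSteps(steps, model, maxIterations) {
    var index = 0;
    var visited = [];
    for (var i = 0; i < maxIterations && index < steps.length; i++) {
        visited.push(index);
        var result = steps[index](model);
        if (result.action === "proceed") {
            index += 1;
        } else if (result.action === "jumpToStep") {
            index = result.index;
        } else if (result.action === "finish") {
            return { state: "success", visited: visited };
        } else {
            return { state: "failure", error: result.error, visited: visited };
        }
    }
    return index >= steps.length
        ? { state: "success", visited: visited }
        : { state: "failure", error: "iteration limit reached", visited: visited };
}

// Simulated image counts seen after each scroll; the last two are equal.
var counts = [10, 20, 20];
var steps = [
    function (model) { model.first = 0; model.second = 0; return { action: "proceed" }; },
    function (model) {
        model.first = model.second;      // remember the previous count
        model.second = counts.shift();   // "scroll" and read the new count
        return { action: "proceed" };
    },
    function (model) {
        // Loop back to index 1 until the count stops changing.
        return model.first === model.second
            ? { action: "proceed" }
            : { action: "jumpToStep", index: 1 };
    }
];
var outcome = runSteps(steps, {}, 50);
console.log(outcome.state, outcome.visited.join(","));
```

In this model the scroll and condition steps repeat until two consecutive counts match, which mirrors the loop built with conditionStep above.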

Download Step

A step that downloads the content of a given URL and returns it as a string.

let downloadCSV = DownloadStep(url: myURL) { response, model in
    model["fileContent"] = response as? String // For example, save the content into the model
    return .proceed // you can use any control flow logic you like (see above)
}

This step is helpful if you want to get content from a non-HTML page, such as a CSV document. On HTML documents you can use a normal ScriptStep to read the document contents; JavaScript, however, cannot run on a CSV document.

Wait Step

A step that waits for a set period of time.

let waitStep = WaitStep(waitTimeInSeconds: 0.5)

Wait for Condition Step

This is a step that waits for a condition to become true before proceeding, or it will fail if the condition is still false when the timeout occurs.

In this example, the iOS code will repeatedly call the JavaScript function testThatStuffIsReady, proceeding as soon as it returns true, or failing with a timeout if it doesn't return true within 2 seconds.

let waitForConditionStep = WaitForConditionStep(
    assertionName: "testThatStuffIsReady",
    timeoutInSeconds: 2)
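
On the JavaScript side, the function named in assertionName only needs to return true once the page is ready. A minimal sketch follows, assuming readiness means that at least one element matches a selector. The hasAtLeast helper and the "#search a" selector are hypothetical placeholders; only the name testThatStuffIsReady comes from the step above.

```javascript
// Hypothetical pure helper: true once `root` contains at least `minCount`
// elements matching `selector`. Works with any object offering querySelectorAll.
function hasAtLeast(root, selector, minCount) {
    return root.querySelectorAll(selector).length >= minCount;
}

var GoogleSearch = (function () {

    // ... other functions from the tutorial ...

    // Called repeatedly by the WaitForConditionStep until it returns true.
    // The selector is a placeholder; replace it with one for your page.
    function testThatStuffIsReady() {
        return hasAtLeast(document, "#search a", 1);
    }

    return {
        // ... other exports from the tutorial ...
        testThatStuffIsReady: testThatStuffIsReady
    };
})();
```

Keeping the check free of side effects matters here, since the function may be called many times before it succeeds.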

Using parameters from the model

For the steps accepting params, you can alternatively use paramsKeys and pass in an array of keys for the model. The JavaScript function will then receive parameters corresponding to the current values of these keys in the model dictionary. You can use this if the values of your parameters are not yet known when you create the step. For example, use an AsyncProcessStep to ask your user for a value and then save it in the model.

List of Steps

Here is the full list of steps discussed above:

  • OpenPageStep - Loads a page and optionally executes a JavaScript assertion function to see that the page loaded correctly
  • PageChangeStep - Calls a JavaScript function which is expected to navigate to a different page. Optionally executes a JavaScript assertion function to see that the new page loaded correctly
  • ScriptStep - Runs a JavaScript function and returns the return value
  • AsyncScriptStep - Runs an asynchronous JavaScript function and returns the value it passes to SwiftScraper.postMessage
  • ProcessStep - Runs Swift code, which allows you to execute actions outside the scraper or modify the control flow
  • AsyncProcessStep - Runs Swift code, which allows you to execute asynchronous actions outside the scraper or modify the control flow
  • DownloadStep - Downloads content from a URL and returns it as string
  • WaitStep - Waits a fixed number of seconds
  • WaitForConditionStep - Repeatedly calls a JavaScript function till it returns true (or times out)

Examples

If you want to read more example code:

FAQ

I'm getting the error: "An SSL error has occurred and a secure connection to the server cannot be made."

App Transport Security (ATS) rules apply to web views as well. If the website you are loading is not HTTPS, or uses outdated security protocols, macOS / iOS will refuse to load it.

The quick workaround is to disable ATS by putting the following setting in your Info.plist:

<key>NSAppTransportSecurity</key>
<dict>
    <key>NSAllowsArbitraryLoads</key>
    <true/>
</dict>

However, at some point in the future, Apple may require that all apps submitted to the App Store support ATS.

Description

  • Swift Tools 5.5.0

Dependencies

  • None
Last updated: Tue Dec 17 2024 15:33:58 GMT-1000 (Hawaii-Aleutian Standard Time)