QuranRecognitionKit

Native Swift SDK for offline Quran verse recognition on iOS.

The package is a Swift implementation that follows the same pipeline as offline-tarteel:

Capture or provide 16 kHz mono audio.
Compute 80-bin NeMo-compatible mel spectrogram features.
Run the ONNX FastConformer CTC model with ONNX Runtime.
Greedy-decode the CTC output and fuzzy-match the transcript against all 6,236 Quran verses.
Track recitation progress across verses and fall back to discovery when the user starts another surah.
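
The greedy CTC decode step above can be illustrated with a minimal, self-contained sketch. It is not the SDK's internal decoder: the function name, the tiny four-token vocabulary, and the blank index are made up for illustration, and the real blank position depends on the model.

```swift
/// Greedy CTC decoding: take the arg-max token of each frame, collapse
/// consecutive repeats, then drop the blank token.
func greedyCTCDecode(logits: [[Float]], blankID: Int) -> [Int] {
    var tokens: [Int] = []
    var previous: Int? = nil
    for frame in logits {
        // Index of the highest-scoring token in this frame.
        guard let best = frame.indices.max(by: { frame[$0] < frame[$1] }) else { continue }
        if best != previous && best != blankID {
            tokens.append(best)
        }
        previous = best
    }
    return tokens
}

// Hypothetical 4-token vocabulary; index 3 plays the role of the blank.
let frames: [[Float]] = [
    [0.9, 0.0, 0.0, 0.1], // token 0
    [0.8, 0.1, 0.0, 0.1], // token 0 again (repeat, collapsed)
    [0.0, 0.1, 0.0, 0.9], // blank
    [0.7, 0.1, 0.1, 0.1], // token 0 (kept, because a blank separated it)
    [0.1, 0.2, 0.6, 0.1], // token 2
]
let decoded = greedyCTCDecode(logits: frames, blankID: 3)
print(decoded) // [0, 0, 2]
```

The token IDs would then be mapped through the vocabulary to produce the transcript that the matcher compares against verse text.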

The SDK bundles the zipped ONNX model, vocab.json, and quran.json through Bundle.module.

Requirements

Swift 6.2 or newer.
iOS 17 or newer.
Xcode with a Swift 6.2 toolchain or newer for app integration.
Microphone permission if you use live recognition through startListening.

QuranRecognitionKit depends on Microsoft's onnxruntime-swift-package-manager package. Swift Package Manager resolves this dependency automatically.

Installation

Local Package

In Xcode:

Open the app project.
Choose File > Add Package Dependencies.
Choose Add Local.
Select this package directory.
Add the QuranRecognitionKit product to the app target.

Then import it:

import QuranRecognitionKit

GitHub Package

In Xcode:

Choose File > Add Package Dependencies.
Enter https://github.com/akhandafm17/QuranRecognitionKit.git.
Select version 0.1.8 or newer.
Add the QuranRecognitionKit product to the app target.

In a Swift package manifest:

.package(url: "https://github.com/akhandafm17/QuranRecognitionKit.git", from: "0.1.8")

The package requires iOS 17 or newer.
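
For context, a complete manifest that consumes the package might look like the sketch below. The package and target name MyRecitationApp is hypothetical; the URL, version, product name, and platform requirement are taken from this README.

```swift
// swift-tools-version: 6.2
import PackageDescription

let package = Package(
    name: "MyRecitationApp", // hypothetical package name
    platforms: [.iOS(.v17)], // QuranRecognitionKit requires iOS 17 or newer
    dependencies: [
        .package(url: "https://github.com/akhandafm17/QuranRecognitionKit.git", from: "0.1.8")
    ],
    targets: [
        .target(
            name: "MyRecitationApp", // hypothetical target name
            dependencies: [
                .product(name: "QuranRecognitionKit", package: "QuranRecognitionKit")
            ]
        )
    ]
)
```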

App Permissions

Live recognition uses the device microphone. Add a microphone usage description to the host app's Info.plist:

<key>NSMicrophoneUsageDescription</key>
<string>Microphone access is used to recognize Quran recitation on device.</string>

The SDK configures an AVAudioSession for recording when startListening is called. If your app also plays audio, apply your own audio session configuration around recognition start and stop so the two setups do not conflict.

ModelDownloader uses URLSession. If your model URL is not HTTPS, configure App Transport Security in the host app.

Bundled Model

QuranRecognitionKit includes a bundled zipped ONNX model. For the common path, create a ready-to-use recognizer from the bundled model:

let recognizer = try await QuranRecognizer.bundled()
try await recognizer.prepare()

On first use, the SDK extracts the bundled model into the app cache directory and verifies its checksum. Later calls reuse the extracted model if the checksum still matches.

The bundled archive is about 96 MB, and the extracted ONNX model is about 126 MB. This makes the Swift package larger, but keeps integration ready-to-go and fully offline after installation.

The expected audio input for direct recognition is 16 kHz mono Float PCM samples. Live recognition captures microphone input and converts it internally before inference.
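
If your source audio is interleaved 16-bit PCM, it has to be converted to mono Float samples first. The helper below is a minimal sketch, not part of the SDK; its name is hypothetical, and it assumes the recognizer expects samples normalized to roughly -1...1, which this README does not state explicitly.

```swift
/// Converts interleaved 16-bit PCM to mono Float samples in [-1, 1).
/// Multiple channels are averaged to downmix to mono.
func monoFloatSamples(fromInterleavedInt16 pcm: [Int16], channelCount: Int) -> [Float] {
    precondition(channelCount > 0, "channelCount must be positive")
    let frameCount = pcm.count / channelCount
    var output = [Float](repeating: 0, count: frameCount)
    let scale: Float = 1.0 / 32768.0
    for frame in 0..<frameCount {
        var sum: Float = 0
        for channel in 0..<channelCount {
            sum += Float(pcm[frame * channelCount + channel]) * scale
        }
        output[frame] = sum / Float(channelCount)
    }
    return output
}

// Three stereo frames: opposite channels cancel, full scale stays full scale.
let stereo: [Int16] = [16384, -16384, 32767, 32767, -32768, -32768]
let mono = monoFloatSamples(fromInterleavedInt16: stereo, channelCount: 2)
print(mono.count) // 3
```

This only handles sample format and channel count; the sample rate still has to be 16 kHz.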

Custom Model

You can still pass your own compatible ONNX model URL:

let recognizer = QuranRecognizer(modelURL: modelURL)
try await recognizer.prepare()

The model must produce CTC logits for the same vocabulary bundled in vocab.json. If the model vocabulary size does not match, prepare() throws RecognitionError.vocabModelMismatch.

ModelDownloader requires an expected SHA-256 checksum for every download:

let downloader = ModelDownloader()
let localModelURL = try await downloader.download(
    from: modelArchiveURL,
    expectedSHA256: expectedChecksum,
    destinationURL: destinationURL
) { progress in
    print("Download progress: \(progress)")
}

ModelDownloader verifies and stores bytes. It does not unzip archives; extract compressed model files in the host app before passing the ONNX URL to QuranRecognizer.

Usage

Live Recognition

import QuranRecognitionKit

let configuration = QuranRecognizer.Configuration(
    processingInterval: 0.20,
    discoveryWindowSeconds: 3.5,
    trackingWindowSeconds: 2.25,
    minimumDiscoveryWindowSeconds: 1.75,
    minimumTrackingWindowSeconds: 0.90,
    discoveryFreshAudioSeconds: 0.30,
    trackingFreshAudioSeconds: 0.20,
    maximumBufferedSeconds: 6.0,
    intraOpThreadCount: 1,
    minimumSpeechRMS: 0.0015,
    minimumSpeechPeak: 0.006,
    minimumSpeechFrameRatio: 0.03,
    suppressLowInformationTranscriptions: true,
    debugLogging: false
)

let recognizer = try await QuranRecognizer.bundled(configuration: configuration)
try await recognizer.prepare()

let session = try recognizer.startListening(surahHint: 1)

for await event in session.events {
    switch event {
    case .audioInput(let quality):
        if !quality.isSpeechLikely {
            print("Waiting for clear recitation: \(quality.status)")
        }
    case .transcription(let text):
        // Intended for live UI. Low-information fragments are suppressed by default.
        print(text)
    case .verseDetected(let verse):
        print("Detected \(verse.surahNumber):\(verse.verseNumber)")
    case .stateChanged(let state):
        print(state)
    case .error(let error):
        print(error)
    }
}

Stop safely:

session.stop()

Pass the current surah number as surahHint when recognition starts from a Quran reader. Discovery will prefer that surah first, which improves startup speed and reduces false jumps for the common case where the user recites from the displayed surah.

One-Shot Recognition

Use recognize(samples:surahHint:) when you already have 16 kHz mono Float samples:

let recognizer = QuranRecognizer(modelURL: modelURL)
try await recognizer.prepare()

if let verse = try await recognizer.recognize(samples: samples, surahHint: 1) {
    print("\(verse.surahNumber):\(verse.verseNumber)")
}
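
When the samples you have were recorded at another rate, they must be resampled to 16 kHz before one-shot recognition. The function below is a rough sketch using linear interpolation; it is not part of the SDK, its name is hypothetical, and it applies no anti-aliasing filter, so a production app would more likely use a proper converter such as AVAudioConverter.

```swift
/// Linear-interpolation resampler. Adequate for a quick check only:
/// it does not low-pass filter before downsampling.
func resampleLinear(_ input: [Float], from sourceRate: Double, to targetRate: Double = 16_000) -> [Float] {
    precondition(sourceRate > 0 && targetRate > 0, "sample rates must be positive")
    guard input.count > 1, sourceRate != targetRate else { return input }
    let ratio = sourceRate / targetRate
    let outputCount = Int((Double(input.count) / ratio).rounded(.down))
    var output = [Float](repeating: 0, count: outputCount)
    for index in 0..<outputCount {
        // Position of this output sample on the input timeline.
        let position = Double(index) * ratio
        let lower = Int(position)
        let upper = min(lower + 1, input.count - 1)
        let fraction = Float(position - Double(lower))
        output[index] = input[lower] * (1 - fraction) + input[upper] * fraction
    }
    return output
}

let oneSecondAt48k = [Float](repeating: 0, count: 48_000)
let samples16k = resampleLinear(oneSecondAt48k, from: 48_000)
print(samples16k.count) // 16000
```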

Manual Harness

You can also run a non-streaming manual check with:

swift run QuranRecognitionManualHarness /path/to/FastConformerQuranCTC.onnx /path/to/audio.wav 1 1

Public API

Recognizer

public final class QuranRecognizer: @unchecked Sendable {
    public init(modelURL: URL, configuration: Configuration = Configuration())
    public static func bundled(configuration: Configuration = Configuration()) async throws -> QuranRecognizer

    public func prepare() async throws
    public func startListening(surahHint: Int? = nil) throws -> QuranRecognitionSession
    public func recognize(samples: [Float], surahHint: Int? = nil) async throws -> RecognizedVerse?
}

Call prepare() once before recognition. It loads bundled resources, creates the ONNX Runtime session, and validates the model against the bundled vocabulary.

Bundled Model

public enum BundledQuranModel {
    public static let archiveFileName: String
    public static let modelFileName: String

    public static func modelURL() async throws -> URL
    public static func removeExtractedModel() throws
}

modelURL() extracts and verifies the bundled model if needed, then returns the local extracted ONNX file URL. removeExtractedModel() removes the cached extracted model so it can be recreated from the bundled archive.

Session

public final class QuranRecognitionSession: @unchecked Sendable {
    public let events: AsyncStream<RecognitionEvent>
    public func stop()
}

events yields microphone quality updates, decoded transcripts, verse detections, state changes, and errors. Call stop() when the user leaves the recognition flow or disables listening.

Events And Results

public enum RecognitionState: Sendable, Equatable {
    case idle
    case preparing
    case listening
    case processing
    case stopped
}

public struct RecognizedVerse: Sendable, Equatable {
    public let surahNumber: Int
    public let verseNumber: Int
    public let ayahEnd: Int?
    public let confidence: Double
    public let arabicText: String
}

public enum AudioInputStatus: Sendable, Equatable {
    case silence
    case tooLittleSpeech
    case speech
    case clipped
}

public struct AudioInputQuality: Sendable, Equatable {
    public let rms: Float
    public let peak: Float
    public let rmsDecibels: Float
    public let speechFrameRatio: Double
    public let windowSeconds: Double
    public let status: AudioInputStatus
    public let isSpeechLikely: Bool
}

public enum RecognitionEvent: Sendable, Equatable {
    case audioInput(AudioInputQuality)
    case transcription(String)
    case verseDetected(RecognizedVerse)
    case stateChanged(RecognitionState)
    case error(RecognitionError)
}

ayahEnd is set when the matcher identifies a span that covers multiple ayahs. For single-ayah detections it is nil.
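
A host app might turn a detection into a display reference as sketched below. VerseSpan is a local stand-in that mirrors the documented fields of RecognizedVerse so the snippet compiles without the package, and referenceString is a hypothetical helper. It assumes ayahEnd is the inclusive last ayah number within the same surah, which this README implies but does not spell out.

```swift
/// Local stand-in mirroring three documented fields of RecognizedVerse.
struct VerseSpan {
    let surahNumber: Int
    let verseNumber: Int
    let ayahEnd: Int?
}

/// Returns "2:255" for a single ayah and "2:1-5" for a multi-ayah span.
func referenceString(for verse: VerseSpan) -> String {
    if let end = verse.ayahEnd, end > verse.verseNumber {
        return "\(verse.surahNumber):\(verse.verseNumber)-\(end)"
    }
    return "\(verse.surahNumber):\(verse.verseNumber)"
}

print(referenceString(for: VerseSpan(surahNumber: 2, verseNumber: 255, ayahEnd: nil))) // 2:255
print(referenceString(for: VerseSpan(surahNumber: 2, verseNumber: 1, ayahEnd: 5)))     // 2:1-5
```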

Model Downloader

public actor ModelDownloader {
    public init()

    public func download(
        from sourceURL: URL,
        expectedSHA256: String,
        destinationURL: URL,
        progress: (@Sendable (Double) -> Void)? = nil
    ) async throws -> URL
}

ModelDownloader streams a remote file to disk and verifies SHA-256 before replacing the destination file.

Errors

public enum RecognitionError: Error, Sendable, Equatable {
    case resourceMissing(String)
    case resourceCorrupt(String)
    case modelMissing(String)
    case modelCorrupt(String)
    case vocabModelMismatch(expected: Int, actual: Int)
    case microphonePermissionDenied
    case microphoneUnavailable(String)
    case unsupportedPlatform
    case invalidAudioSampleRate(expected: Double, actual: Double)
    case inferenceFailed(String)
    case downloadFailed(String)
    case downloadChecksumMismatch(expected: String, actual: String)
    case notPrepared
    case alreadyStopped
}

Configuration

extension QuranRecognizer {
    public struct Configuration: Sendable, Equatable {
        public var processingInterval: TimeInterval
        public var discoveryWindowSeconds: Double
        public var trackingWindowSeconds: Double
        public var minimumDiscoveryWindowSeconds: Double
        public var minimumTrackingWindowSeconds: Double
        public var discoveryFreshAudioSeconds: Double
        public var trackingFreshAudioSeconds: Double
        public var maximumBufferedSeconds: Double
        public var intraOpThreadCount: Int
        public var minimumSpeechRMS: Float
        public var minimumSpeechPeak: Float
        public var minimumSpeechFrameRatio: Double
        public var suppressLowInformationTranscriptions: Bool
        public var debugLogging: Bool
    }
}

The default streaming setup is tuned for mobile responsiveness while keeping rolling windows for recognition stability:

Processing cadence: 200 ms. Inference is serialized, so slow devices skip overlapping cycles instead of piling up work.
Discovery window: 3.5 seconds.
Tracking window: 2.25 seconds.
First inference gate: 1.75 seconds in discovery, 0.9 seconds in tracking.
Fresh audio gate after the first inference: 300 ms in discovery and 200 ms in tracking.
ONNX thread count: 1 by default to avoid CPU contention and UI stutter on phones.
Audio quality gate: skips silence, very weak speech, and clipped windows before ONNX inference.
Transcript quality gate: suppresses short fragments such as single letters from the public .transcription event by default.
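
To make the two timing gates above concrete, here is one way such a check could be expressed. This is an illustrative sketch, not the SDK's scheduler: the StreamingGate type and its method are hypothetical, and only the numeric defaults come from this README.

```swift
/// Hypothetical model of the first-inference gate and the fresh-audio gate.
struct StreamingGate {
    var minimumWindowSeconds: Double // e.g. 1.75 in discovery, 0.9 in tracking
    var freshAudioSeconds: Double    // e.g. 0.30 in discovery, 0.20 in tracking

    /// Returns true when a new inference pass is worth running.
    /// Pass nil for `freshSecondsSinceLastInference` before the first inference.
    func shouldRunInference(bufferedSeconds: Double, freshSecondsSinceLastInference: Double?) -> Bool {
        guard let fresh = freshSecondsSinceLastInference else {
            // No inference yet: wait until enough audio is buffered.
            return bufferedSeconds >= minimumWindowSeconds
        }
        // After the first pass: wait until enough new audio has arrived.
        return fresh >= freshAudioSeconds
    }
}

let discovery = StreamingGate(minimumWindowSeconds: 1.75, freshAudioSeconds: 0.30)
print(discovery.shouldRunInference(bufferedSeconds: 1.0, freshSecondsSinceLastInference: nil))  // false
print(discovery.shouldRunInference(bufferedSeconds: 3.5, freshSecondsSinceLastInference: 0.35)) // true
```

Gating on fresh audio rather than on wall-clock time alone avoids re-running the model on a window that has barely changed since the previous pass.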

Performance Notes

The SDK avoids main-thread inference and audio processing:

prepare() loads resources and ONNX Runtime on a background queue.
Streaming audio capture appends bounded buffers from the audio callback.
Inference runs on a serial background queue and reuses the ONNX session.
Mel computation reuses FFT setup, Hann window, and mel filterbank.
The audio buffer is capped by maximumBufferedSeconds.
Low-speech audio windows are skipped before ONNX inference.
A capture watchdog detects a silent input tap (no buffers while the engine reports running) and automatically restarts the audio engine, logging the active input route for diagnosis.
Verse matching uses an evidence index and bounded span search instead of scanning every possible span.
Tracking mode searches locally around the current verse before returning to global discovery.
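
The buffer cap mentioned above can be pictured with a small sketch. BoundedSampleBuffer is a hypothetical type, not the SDK's implementation; it ignores thread safety and uses a plain array, whereas a real audio path would likely use a ring buffer and synchronize access from the audio callback.

```swift
/// Keeps at most `maximumSeconds` of audio, dropping the oldest samples first.
struct BoundedSampleBuffer {
    private(set) var samples: [Float] = []
    let capacity: Int

    init(maximumSeconds: Double, sampleRate: Double = 16_000) {
        capacity = max(0, Int(maximumSeconds * sampleRate))
    }

    mutating func append(_ chunk: [Float]) {
        samples.append(contentsOf: chunk)
        let overflow = samples.count - capacity
        if overflow > 0 {
            samples.removeFirst(overflow) // discard the oldest audio
        }
    }

    /// The most recent `seconds` of audio, or everything if less is buffered.
    func latest(seconds: Double, sampleRate: Double = 16_000) -> ArraySlice<Float> {
        samples.suffix(max(0, Int(seconds * sampleRate)))
    }
}

var buffer = BoundedSampleBuffer(maximumSeconds: 6.0)
buffer.append([Float](repeating: 0, count: 100_000))
print(buffer.samples.count) // 96000
```

With the default maximumBufferedSeconds of 6.0 at 16 kHz, memory for raw samples stays bounded at 96,000 Floats regardless of session length.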

Current validation:

swift test passes on macOS arm64 with 87 tests.
The test suite covers hinted discovery, same-surah tracking, low-information noise, near-end recovery, post-completion surah switching, ambiguous candidate rejection, and audio-window quality analysis.
A real-recitation replay test (RealRecitationReplayTests) feeds 12 minutes of an actual Surah Al-Baqarah recitation (2:1-2:59) through the real tracker. It uses per-window decodes produced by the bundled ONNX model with the session's exact streaming window policy, and asserts sequential no-skip/no-regression following with bounded tracking losses.
Scenario tests (RecitationScenarioTests) simulate full recitation sessions across structurally different surahs (Al-Fatihah, Al-Kahf, Al-Mulk, Al-Ikhlas, An-Nas) and recitation styles (clean per-ayah windows, rolling boundary-spanning windows, short fragments, noisy clipped decodes), asserting sequential no-skip/no-regression tracking and per-window latency bounds.
Generic iOS package builds pass with xcodebuild -scheme QuranRecognitionKit-Package -destination 'generic/platform=iOS' build.
App-side iOS generic builds passed with this package integrated.

For release profiling and maintainer checks, see CONTRIBUTING.md.

Troubleshooting

`RecognitionError.notPrepared`

Call try await recognizer.prepare() before startListening or recognize(samples:surahHint:).

`RecognitionError.modelMissing` Or `modelCorrupt`

For the bundled model path, call BundledQuranModel.removeExtractedModel() and try again. For custom models, verify the local model URL points to an existing, non-empty .onnx file.

`RecognitionError.vocabModelMismatch`

The ONNX model output vocabulary does not match the bundled vocab.json. Use a model exported for the same tokenizer/vocabulary as this SDK.

Microphone Permission Errors

Make sure the host app includes NSMicrophoneUsageDescription. On device, also check iOS Settings if the user previously denied microphone access.

Listening Never Produces Transcriptions

If debug logs show waiting for audio ... bufferSamples=0 repeating after audio engine started, the input tap is not delivering buffers even though the engine is running. This is a system audio state issue, not a recognition issue: the microphone may be held by another app or call, the audio route (often Bluetooth) may be broken, or CoreAudio is in a stale state. The SDK restarts the audio engine automatically up to three times and logs the active input route. The session also emits .audioInput silence events so the host app can show feedback. If it does not recover: disconnect Bluetooth audio devices, close apps that use the microphone, or restart the device.

No Verse Is Detected

Check these first:

The model URL is correct and prepare() succeeded.
The device microphone is receiving clear speech.
The app is not feeding silence, clipped audio, or the wrong sample rate to one-shot recognition.
surahHint matches the current reader context when the user is likely reciting the displayed surah.
debugLogging: true is enabled while diagnosing recognition quality.
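
Before calling one-shot recognition, a quick sanity check on the samples can rule out silence or clipping. The sketch below is not the SDK's audio quality gate: SampleStats, sampleStats, and the 0.999 clip threshold are assumptions for illustration. The 0.0015 and 0.006 values it is compared against are the minimumSpeechRMS and minimumSpeechPeak defaults shown in the configuration example.

```swift
/// Basic level statistics for a block of Float PCM samples.
struct SampleStats {
    let rms: Float
    let peak: Float
    let clippedRatio: Float
}

func sampleStats(_ samples: [Float], clipThreshold: Float = 0.999) -> SampleStats {
    guard !samples.isEmpty else { return SampleStats(rms: 0, peak: 0, clippedRatio: 0) }
    var sumSquares: Float = 0
    var peak: Float = 0
    var clipped = 0
    for sample in samples {
        let magnitude = abs(sample)
        sumSquares += sample * sample
        peak = max(peak, magnitude)
        if magnitude >= clipThreshold { clipped += 1 } // assumed clipping criterion
    }
    let count = Float(samples.count)
    return SampleStats(
        rms: (sumSquares / count).squareRoot(),
        peak: peak,
        clippedRatio: Float(clipped) / count
    )
}

let silence = sampleStats([Float](repeating: 0, count: 16_000))
// Both values fall below the default speech thresholds.
print(silence.rms < 0.0015, silence.peak < 0.006) // true true
```

A high clippedRatio suggests the input gain is too hot, while near-zero RMS and peak suggest the app is feeding silence.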

Xcode Cannot Find The Package Product

In the host app, remove and re-add the package dependency, then choose File > Packages > Reset Package Caches. Confirm the app target links the QuranRecognitionKit product.

Swift Package Index Still Shows Old Compatibility

Swift Package Index and shields.io cache build and badge results. After pushing a fix or tag, wait for SPI to rescan the package and rebuild the compatibility matrix.

Tests

Run:

swift test

Current tests cover:

Audio quality gating and low-information transcript suppression.
Arabic normalization.
CTC decoding.
Levenshtein distance and word alignment.
Verse matching.
Recitation tracking, surah hints, post-completion discovery, and recovery.
Regression cases replayed from real on-device recitation logs (premature next-ayah advances, garbled rolling-window decodes, ending-word stem collisions, multi-ayah span resolution).
Resource loading.
Model path validation.
Recognition session start/stop lifecycle with a mock capture source.

The manual harness is the integration path for real model + audio validation.
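
As an illustration of one primitive named in the list above, the standard Levenshtein distance can be written as below. This is a textbook two-row implementation, not the SDK's code; the function names and the similarity normalization are assumptions made for the example.

```swift
/// Classic edit distance between two strings, compared character by character.
func levenshteinDistance(_ lhs: String, _ rhs: String) -> Int {
    let a = Array(lhs), b = Array(rhs)
    if a.isEmpty { return b.count }
    if b.isEmpty { return a.count }
    var previous = Array(0...b.count)
    var current = [Int](repeating: 0, count: b.count + 1)
    for i in 1...a.count {
        current[0] = i
        for j in 1...b.count {
            let substitution = previous[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1)
            current[j] = min(substitution, previous[j] + 1, current[j - 1] + 1)
        }
        swap(&previous, &current)
    }
    return previous[b.count]
}

/// Similarity in 0...1, where 1 means the strings are identical.
func similarity(_ lhs: String, _ rhs: String) -> Double {
    let longest = max(lhs.count, rhs.count)
    guard longest > 0 else { return 1 }
    return 1 - Double(levenshteinDistance(lhs, rhs)) / Double(longest)
}

print(levenshteinDistance("kitten", "sitting")) // 3
```

A fuzzy matcher can rank candidate verses by a score like similarity after normalizing both the transcript and the verse text.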

License

QuranRecognitionKit source code is available under the MIT license. See LICENSE.

The bundled model has separate upstream attribution and license terms. See MODEL_NOTICE.md.