whisperkit

0.6.0

Swift native on-device speech recognition with Whisper for Apple Silicon
argmaxinc/WhisperKit

What's New

v0.6.0

2024-04-18T06:22:52Z

Highlights

  • Async batch transcription is here 🎉 contributed by @jkrukowski
    • With this release, you can transcribe multiple audio files concurrently, fully utilizing the new async prediction APIs released with iOS 17/macOS 14 (see the accompanying WWDC video).
    • New interface with audioPaths input:

          let audioPaths = [
              "/path/to/file1.wav",
              "/path/to/file2.wav"
          ]
          let whisperKit = try await WhisperKit()
          let transcriptionResults: [[TranscriptionResult]?] = await whisperKit.transcribe(audioPaths: audioPaths)
    • You can also use it from the CLI via the new argument --audio-folder "path/to/folder/"
    • Future work includes chunking single files to significantly speed up long-form transcription
    • Note that this entails breaking changes and deprecations; see below for the full upgrade guide.
  • Several bug fixes, accuracy improvements, and quality of life upgrades by @hewigovens @shawiz and @jkrukowski
    • Every issue raised and PR merged from the community helps make WhisperKit better every release, thank you and keep them coming! 🙏
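The new CLI batch mode can be invoked as follows (a sketch run from a clone of the repo; the model path is illustrative and should point at wherever your model was downloaded):

```
# Transcribe every audio file in a folder concurrently
swift run whisperkit-cli transcribe \
  --model-path "Models/whisperkit-coreml/openai_whisper-base" \
  --audio-folder "path/to/folder/"
```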

⚠️ Upgrade Guide

We aim to minimize breaking changes, so this update adds deprecation flags for changed interfaces; the deprecated methods will be removed in a later release, but for now they remain usable and will not cause build errors. There are, however, breaking changes to some lower-level and newer methods, so if you do see build errors, expand the dropdown below for the full guide.

Full Upgrade Guide

API changes

Deprecations

WhisperKit

Deprecated

public func transcribe(
    audioPath: String,
    decodeOptions: DecodingOptions? = nil,
    callback: TranscriptionCallback = nil
) async throws -> TranscriptionResult?

use instead

public func transcribe(
    audioPath: String,
    decodeOptions: DecodingOptions? = nil,
    callback: TranscriptionCallback = nil
) async throws -> [TranscriptionResult]

Deprecated

public func transcribe(
    audioArray: [Float],
    decodeOptions: DecodingOptions? = nil,
    callback: TranscriptionCallback = nil
) async throws -> TranscriptionResult?

use instead

public func transcribe(
    audioArray: [Float],
    decodeOptions: DecodingOptions? = nil,
    callback: TranscriptionCallback = nil
) async throws -> [TranscriptionResult]
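A minimal migration sketch for the deprecated single-file overload, showing how a call site changes from an optional single result to an array of results (the audio path is illustrative):

```swift
import WhisperKit

let whisperKit = try await WhisperKit()

// Before (deprecated): returned TranscriptionResult?
// let text = try await whisperKit.transcribe(audioPath: "audio.wav")?.text

// After: returns [TranscriptionResult]; join the per-result text
let results: [TranscriptionResult] = try await whisperKit.transcribe(audioPath: "audio.wav")
let text = results.map(\.text).joined(separator: " ")
print(text)
```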

TextDecoding

Deprecated

func decodeText(
    from encoderOutput: MLMultiArray,
    using decoderInputs: DecodingInputs,
    sampler tokenSampler: TokenSampling,
    options decoderOptions: DecodingOptions,
    callback: ((TranscriptionProgress) -> Bool?)?
) async throws -> [DecodingResult]

use instead

func decodeText(
    from encoderOutput: MLMultiArray,
    using decoderInputs: DecodingInputs,
    sampler tokenSampler: TokenSampling,
    options decoderOptions: DecodingOptions,
    callback: ((TranscriptionProgress) -> Bool?)?
) async throws -> DecodingResult

Deprecated

func detectLanguage(
    from encoderOutput: MLMultiArray,
    using decoderInputs: DecodingInputs,
    sampler tokenSampler: TokenSampling,
    options: DecodingOptions,
    temperature: FloatType
) async throws -> [DecodingResult]

use instead

func detectLanguage(
    from encoderOutput: MLMultiArray,
    using decoderInputs: DecodingInputs,
    sampler tokenSampler: TokenSampling,
    options: DecodingOptions,
    temperature: FloatType
) async throws -> DecodingResult

Breaking changes

  • Removed the Transcriber protocol

AudioProcessing

static func loadAudio(fromPath audioFilePath: String) -> AVAudioPCMBuffer?

becomes

static func loadAudio(fromPath audioFilePath: String) throws -> AVAudioPCMBuffer
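Since loadAudio now throws instead of returning an optional, call sites change from optional binding to do/catch, which also surfaces the underlying failure reason (a sketch; the path is illustrative):

```swift
import AVFoundation
import WhisperKit

// Before: guard let buffer = AudioProcessor.loadAudio(fromPath: "audio.wav") else { return }

// After: the throwing API reports why loading failed
do {
    let buffer: AVAudioPCMBuffer = try AudioProcessor.loadAudio(fromPath: "audio.wav")
    print("Loaded \(buffer.frameLength) frames")
} catch {
    print("Failed to load audio: \(error)")
}
```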

AudioStreamTranscriber

public init(
    audioProcessor: any AudioProcessing, 
    transcriber: any Transcriber, 
    decodingOptions: DecodingOptions, 
    requiredSegmentsForConfirmation: Int = 2, 
    silenceThreshold: Float = 0.3, 
    compressionCheckWindow: Int = 20, 
    useVAD: Bool = true, 
    stateChangeCallback: AudioStreamTranscriberCallback?
)

becomes

public init(
    audioEncoder: any AudioEncoding,
    featureExtractor: any FeatureExtracting,
    segmentSeeker: any SegmentSeeking,
    textDecoder: any TextDecoding,
    tokenizer: any WhisperTokenizer,
    audioProcessor: any AudioProcessing,
    decodingOptions: DecodingOptions,
    requiredSegmentsForConfirmation: Int = 2,
    silenceThreshold: Float = 0.3,
    compressionCheckWindow: Int = 20,
    useVAD: Bool = true,
    stateChangeCallback: AudioStreamTranscriberCallback?
)
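One way to satisfy the new initializer is to wire it up from the components of an existing WhisperKit instance. This is a hypothetical sketch: the property names and the startStreamTranscription() call are assumptions, so check them against your WhisperKit version.

```swift
import WhisperKit

let whisperKit = try await WhisperKit()
// Assumption: the tokenizer is optional until models are loaded
guard let tokenizer = whisperKit.tokenizer else { fatalError("Tokenizer not loaded") }

// Assumption: WhisperKit exposes these components as public properties
let streamTranscriber = AudioStreamTranscriber(
    audioEncoder: whisperKit.audioEncoder,
    featureExtractor: whisperKit.featureExtractor,
    segmentSeeker: whisperKit.segmentSeeker,
    textDecoder: whisperKit.textDecoder,
    tokenizer: tokenizer,
    audioProcessor: whisperKit.audioProcessor,
    decodingOptions: DecodingOptions(),
    stateChangeCallback: { _, newState in
        print("Transcription so far: \(newState.currentText)")
    }
)
try await streamTranscriber.startStreamTranscription()
```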

TextDecoding

func prepareDecoderInputs(withPrompt initialPrompt: [Int]) -> DecodingInputs?

becomes

func prepareDecoderInputs(withPrompt initialPrompt: [Int]) throws -> DecodingInputs

What's Changed

New Contributors

Full Changelog: v0.5.0...v0.6.0

WhisperKit

WhisperKit is a Swift package that integrates OpenAI's popular Whisper speech recognition model with Apple's CoreML framework for efficient, local inference on Apple devices.

Check out the demo app on TestFlight.

[Blog Post] [Python Tools Repo]

Installation

Swift Package Manager

WhisperKit can be integrated into your Swift project using the Swift Package Manager.

Prerequisites

  • macOS 14.0 or later.
  • Xcode 15.0 or later.

Steps

  1. Open your Swift project in Xcode.
  2. Navigate to File > Add Package Dependencies....
  3. Enter the package repository URL: https://github.com/argmaxinc/whisperkit.
  4. Choose the version range or specific version.
  5. Click Finish to add WhisperKit to your project.

Homebrew

You can install the WhisperKit command line app using Homebrew by running the following command:

brew install whisperkit-cli

Getting Started

To get started with WhisperKit, you need to initialize it in your project.

Quick Example

This example demonstrates how to transcribe a local audio file:

import WhisperKit

// Initialize WhisperKit with default settings
Task {
    let pipe = try? await WhisperKit()
    let transcription = try? await pipe?.transcribe(audioPath: "path/to/your/audio.{wav,mp3,m4a,flac}")?.text
    print(transcription)
}

Model Selection

WhisperKit automatically downloads the recommended model for the device if not specified. You can also select a specific model by passing in the model name:

let pipe = try? await WhisperKit(model: "large-v3")

This method also supports glob search, so you can use wildcards to select a model:

let pipe = try? await WhisperKit(model: "distil*large-v3")

Note that the model search must return a single model from the source repo, otherwise an error will be thrown.

For a list of available models, see our HuggingFace repo.

Generating Models

WhisperKit also comes with the supporting repo whisperkittools, which lets you create your own fine-tuned versions of Whisper in CoreML format and deploy them to HuggingFace. Once generated, they can be loaded by simply changing the repo name to the one used to upload the model:

let pipe = try? await WhisperKit(model: "large-v3", modelRepo: "username/your-model-repo")

Swift CLI

The Swift CLI allows for quick testing and debugging outside of an Xcode project. To install it, run the following:

git clone https://github.com/argmaxinc/whisperkit.git
cd whisperkit

Then, set up the environment and download your desired model:

make setup
make download-model MODEL=large-v3

Note:

  1. This will download only the model specified by MODEL (see what's available in our HuggingFace repo, where we use the prefix openai_whisper-{MODEL})
  2. Before running download-model, make sure git-lfs is installed

If you would like to download all available models to your local folder, use this command instead:

make download-models

You can then run them via the CLI with:

swift run whisperkit-cli transcribe --model-path "Models/whisperkit-coreml/openai_whisper-large-v3" --audio-path "path/to/your/audio.{wav,mp3,m4a,flac}" 

This should print a transcription of the audio file. If you would like to stream the audio directly from a microphone, use:

swift run whisperkit-cli transcribe --model-path "Models/whisperkit-coreml/openai_whisper-large-v3" --stream

Contributing & Roadmap

Our goal is to make WhisperKit better and better over time, and we'd love your help! Search the code for "TODO" to find a variety of features that are yet to be built. Please refer to our contribution guidelines for submitting issues, pull requests, and coding standards; they also include a public roadmap of features we are looking forward to building in the future.

License

WhisperKit is released under the MIT License. See LICENSE for more details.

Citation

If you use WhisperKit for something cool or just find it useful, please drop us a note at info@takeargmax.com!

If you use WhisperKit for academic work, here is the BibTeX:

@misc{whisperkit-argmax,
   title = {WhisperKit},
   author = {Argmax, Inc.},
   year = {2024},
   URL = {https://github.com/argmaxinc/WhisperKit}
}

Description

  • Swift Tools 5.9.0

Last updated: Sun May 05 2024 13:04:54 GMT-0900 (Hawaii-Aleutian Daylight Time)