- Swift Package Manager Support: This package has been adapted to support SPM for integration with the Speak iOS application.
- Package Structure:
  - Configured as a static library with two targets: `VoiceActivityDetector` and `libfvad`
  - The C-based `libfvad` component uses `publicHeadersPath: "include"` to expose the necessary header files
  - Minimum iOS version set to iOS 15
This library is a dependency for the Speak iOS application. It's used to detect voice activity in audio streams, which is essential for the app's speech recognition functionality.
- This package is integrated via Swift Package Manager in the main Speak iOS app
- The voice activity detector is performance-critical for the app's speech recognition pipeline
- Test thoroughly when updating this dependency as it may affect speech detection sensitivity
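For reference, wiring this package into an app target via SPM looks roughly like the following manifest sketch. The repository URL, version, and target name are placeholders, not the ones actually pinned by the Speak project:

```swift
// swift-tools-version:5.5
// Tools version is an assumption consistent with the iOS 15 minimum above.
import PackageDescription

let package = Package(
    name: "SpeakApp",                      // placeholder app target name
    platforms: [.iOS(.v15)],
    dependencies: [
        // Placeholder URL and version; use the fork and pin the Speak project specifies.
        .package(url: "https://example.com/VoiceActivityDetector.git", from: "1.0.0"),
    ],
    targets: [
        .target(
            name: "SpeakApp",
            dependencies: ["VoiceActivityDetector"]),
    ]
)
```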
This is a Swift/Objective-C interface to the WebRTC Voice Activity Detector (VAD).
A VAD classifies a piece of audio data as being voiced or unvoiced. It can be useful for telephony and speech recognition.
The VAD that Google developed for the WebRTC project is reportedly one of the best available, being fast, modern and free.
The VAD engine works only with signed 16-bit, single-channel PCM.
Supported sample rates are:
- 8000Hz
- 16000Hz
- 32000Hz
- 48000Hz
Note that internally all processing is done at 8000 Hz; input data at higher sample rates is simply downsampled first.
```swift
import VoiceActivityDetector

let voiceActivityDetector = VoiceActivityDetector(sampleRate: 8000,
                                                  agressiveness: .veryAggressive)!

func didReceiveSampleBuffer(_ sampleBuffer: CMSampleBuffer) {
  // activities: [VoiceActivityDetector.VoiceActivityInfo]?
  let activities = voiceActivityDetector.detect(sampleBuffer: sampleBuffer, byEachMilliSec: 10)!
  // ...
}
```

For usage with a microphone, see Example. For usage against an audio file, see the test code.
```swift
init?()
convenience init?(sampleRate: Int = 8000, agressiveness: DetectionAgressiveness = .quality)
convenience init?(agressiveness: DetectionAgressiveness = .quality)
```

Instantiates a VoiceActivityDetector.
```swift
var agressiveness: DetectionAgressiveness
```

VAD operating "aggressiveness" mode.

- `.quality`: The default value; normal voice detection mode. Suitable for high-bitrate, low-noise data. May classify noise as voice, too.
- `.lowBitRate`: Detection mode optimised for low-bitrate audio.
- `.aggressive`: Detection mode best suited for somewhat noisy, lower-quality audio.
- `.veryAggressive`: Detection mode with the lowest miss-rate. Works well for most inputs.
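As a small sketch of how a caller might use these modes, the tuning policy below (start in the default mode, then raise aggressiveness when noise keeps being classified as voice) is illustrative and not part of the library:

```swift
import VoiceActivityDetector

// Start with the default mode, which may let some noise through as voice.
guard let vad = VoiceActivityDetector(sampleRate: 16000,
                                      agressiveness: .quality) else { fatalError() }

// Later, for noisier input, switch to the lowest miss-rate mode.
vad.agressiveness = .veryAggressive
```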
```swift
var sampleRate: Int
```

Sample rate in Hz for VAD operations. Valid values are 8000, 16000, 32000 and 48000. The default is 8000.
Note that internally all processing is done at 8000 Hz; input data at higher sample rates is simply downsampled first.
```swift
func reset()
```

Reinitializes a VAD instance, clearing all state and resetting mode and sample rate to defaults.
```swift
func detect(frames: UnsafePointer<Int16>, count: Int) -> VoiceActivity
```

Calculates a VAD decision for an audio duration.

`frames` is an array of signed 16-bit samples.
`count` specifies the number of frames. Since the internal processor supports only durations of 10, 20 or 30 ms, at 8 kHz for example `count` must be either 80, 160 or 240.

Returns a VAD decision.

Under the hood, the VAD engine calculates signal energy in six frequency bands between 80 Hz and 4 kHz and estimates the likelihood of voice activity in the input duration. Its decisions are therefore more accurate when detection is run sequentially over a stream than when applied one-shot or periodically.
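The frame-based call can be sketched as follows; the zero-filled 160-sample buffer here is a stand-in for 20 ms of real microphone audio at 8 kHz:

```swift
import VoiceActivityDetector

// 20 ms at 8 kHz: 8000 samples/s * 0.020 s = 160 frames.
// Real code would fill this buffer from a microphone or audio file.
let samples = [Int16](repeating: 0, count: 160)

guard let vad = VoiceActivityDetector(sampleRate: 8000,
                                      agressiveness: .aggressive) else { fatalError() }

samples.withUnsafeBufferPointer { buffer in
    // Feed one 20 ms chunk; valid counts at 8 kHz are 80, 160 or 240.
    let decision = vad.detect(frames: buffer.baseAddress!, count: buffer.count)
    print(decision)  // a VoiceActivity decision
}
```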
```swift
func detect(frames: UnsafePointer<Int16>, lengthInMilliSec ms: Int) -> VoiceActivity
```

`ms` specifies the processing duration in milliseconds. It should be either 10, 20 or 30.
```swift
public func detect(sampleBuffer: CMSampleBuffer,
                   byEachMilliSec ms: Int,
                   offset: Int = 0,
                   duration: Int? = nil) -> [VoiceActivityInfo]?
```

Calculates VAD decisions over a sample buffer.

`sampleBuffer` is an audio buffer to be inspected.
`ms` specifies the processing duration in milliseconds.
`offset` controls the offset time in milliseconds from where to start VAD.
`duration` controls the total VAD duration in milliseconds.

Returns an array of VAD decision information (`VoiceActivityInfo`):

- `timestamp: Int`: Elapsed time from the beginning of the sample buffer, in milliseconds.
- `presentationTimestamp: CMTime`: This is `CMSampleBuffer.presentationTime + timestamp`, which may represent a timestamp within the entire recording session.
- `voiceActivity: VoiceActivity`: A VAD decision.
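One way to consume these fields, sketched as a small helper (the function name and print format are illustrative, only the documented properties are assumed):

```swift
import AVFoundation
import VoiceActivityDetector

// Hypothetical helper: log a decision every 10 ms across a buffer.
func inspect(_ sampleBuffer: CMSampleBuffer, with vad: VoiceActivityDetector) {
    guard let activities = vad.detect(sampleBuffer: sampleBuffer,
                                      byEachMilliSec: 10) else { return }
    for info in activities {
        // timestamp: ms from the start of this buffer;
        // presentationTimestamp: position within the whole recording session.
        print("\(info.timestamp) ms (\(info.presentationTimestamp.seconds) s): \(info.voiceActivity)")
    }
}
```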
To run the example project, clone the repo, and run pod install from the Example directory first.
VoiceActivityDetector is available through CocoaPods. To install it, simply add the following line to your Podfile:
```ruby
pod 'VoiceActivityDetector'
```

Author: reedom, tohru@reedom.com
VoiceActivityDetector is available under the MIT license. See the LICENSE file for more info.