Working Paper · 2024

Intent detection from dog vocalizations

Peter C. Bermant, Praful Mathur
Sarama Inc, San Francisco, CA 94107
v1 · 19 July 2024

Abstract

Historically, animal communication has been out of reach because the constituent parts of animal vocal signals could not be identified. Recent efforts dissecting codas in sperm whales, applying human speech models to dog vocalizations, and studying dolphins, parrots, and other species are rapidly identifying clear signal markers in animal bioacoustics. We contribute a similar pipeline that extracts dog vocalizations, segments each extraction into individual signals, and extracts repertoire elements from each signal.

Introduction

Repertoire detection in animal vocalizations is a new and active field that splits audio into speech-like units and then clusters those units by similarity. Most effort in this space has focused on detecting phonemes in human speech. Early attempts to determine units in animal vocalizations include automated bark classification and work on contextual and combinatorial structure in sperm whale vocalisations.

Our approach to dog repertoire detection uses a custom framework to process dog barks and vocalizations. Detection begins with YAMNet, a pretrained audio event classifier that operates on a sliding-window basis. Successive windows that contain positive dog bark detections are concatenated to yield bark sequence segments containing one or more dog sounds.
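The sketch below shows a minimal version of this detection step using the public YAMNet release on TensorFlow Hub. The 0.5 score threshold and the particular set of dog-related class names are illustrative assumptions, not the values used in our pipeline.

    import csv
    import librosa
    import numpy as np
    import tensorflow_hub as hub

    # Load the public YAMNet model; it scores 521 AudioSet classes per
    # 0.96 s window, hopped every 0.48 s.
    model = hub.load("https://tfhub.dev/google/yamnet/1")

    # YAMNet ships a class-map CSV; collect indices of dog-related classes.
    # This particular class set is an assumption for illustration.
    with open(model.class_map_path().numpy()) as f:
        names = [row["display_name"] for row in csv.DictReader(f)]
    DOG = {"Dog", "Bark", "Bow-wow", "Growling", "Whimper (dog)"}
    dog_idx = [i for i, n in enumerate(names) if n in DOG]

    def detect_bark_segments(path, threshold=0.5):
        """Return (start, end) times of runs of consecutive dog-positive windows."""
        waveform, _ = librosa.load(path, sr=16000, mono=True)  # YAMNet expects 16 kHz mono
        scores, _, _ = model(waveform)                         # (num_windows, 521)
        positive = scores.numpy()[:, dog_idx].max(axis=1) > threshold

        segments, start = [], None
        for i, hit in enumerate(positive):
            if hit and start is None:
                start = i * 0.48                               # window start time
            elif not hit and start is not None:
                segments.append((start, (i - 1) * 0.48 + 0.96))  # end of last positive window
                start = None
        if start is not None:
            segments.append((start, (len(positive) - 1) * 0.48 + 0.96))
        return segments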

The segments are converted to Mel-Frequency Cepstral Coefficient (MFCC)-based feature vectors, which are used as input to ML models such as KNNs and SVMs that output predictions for dog identity and bark context: playful, agitation, or lonely. The same feature vectors enable fingerprinting by finding the nearest labeled neighbor in the user database.
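As one concrete instance, a context classifier over these feature vectors could look like the following sketch. The embedding files, the k = 5 neighbor count, and the train/test split are hypothetical stand-ins rather than details of our deployed system.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # Hypothetical inputs: X holds the 80-dim MFCC statistics embeddings
    # described in Methods; y holds one context label per bark.
    X = np.load("bark_embeddings.npy")   # shape (n_barks, 80); file name is illustrative
    y = np.load("bark_contexts.npy")     # "playful" / "agitation" / "lonely"

    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    clf = KNeighborsClassifier(n_neighbors=5, metric="manhattan").fit(X_tr, y_tr)
    print("context accuracy:", clf.score(X_te, y_te))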

Additionally, we build a threshold-based energy detector to obtain a fine-grained segment for each bark in the sequence segment, which lets us count a user's barks. We also rely on an unsupervised KMeans-based approach to determine dog repertoire cardinality: the number of unique call types for a given dog.
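A minimal sketch of such an energy detector follows, thresholding RMS energy relative to the segment's peak. The 25 ms frame, 10 ms hop, and 10% relative threshold are illustrative assumptions.

    import librosa
    import numpy as np

    def split_barks(y, sr, frame_ms=25, hop_ms=10, rel_threshold=0.10):
        """Return (start, end) sample indices of high-energy regions (individual barks)."""
        frame = int(sr * frame_ms / 1000)
        hop = int(sr * hop_ms / 1000)
        rms = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)[0]
        active = rms > rel_threshold * rms.max()   # threshold relative to peak energy

        segments, start = [], None
        for i, hit in enumerate(active):
            if hit and start is None:
                start = i * hop
            elif not hit and start is not None:
                segments.append((start, i * hop + frame))
                start = None
        if start is not None:
            segments.append((start, len(y)))
        return segments

    # The per-segment bark count is then simply len(split_barks(y, sr)).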

Ground truth is established through iterative feedback from users. By observing routines in a specific dog's behavior across repeated app usage, and by requesting user feedback on audio vocalizations paired with video, we generate labels from a mix of video LLMs and acoustic models that the system can train against over time.

Communication Pipelines

The system blends video and audio models to detect and interpret aspects of dog meaning. Visual context, daily routines, and user feedback are modeled alongside acoustic signals.

Vocalization Categories

We created pipelines that model emotional context, bark repertoire structure, individual identity from acoustic signatures, associations with actions, and other variables that play into intent.

We have also derived several categories of vocalizations:

  1. Health-related signals, such as "ouch" or "in distress."
  2. Attention-seeking signals, such as "look over here" or "I am near what I want."
  3. Reactive vocalizations, such as responses to a doorbell or skateboard.
  4. Intent-driven vocalizations, such as hunting-dog barks that differ by encountered species.

Methods

Dog Bark Embeddings

Segmented barks are converted to power mel-spectrograms with a 50 ms FFT size and 10 ms hop length. We obtain 20 MFCC features from the power mel-spectrograms and pool them over time using mean, standard deviation, minimum, and maximum, resulting in a handcrafted 80-dimensional feature embedding.
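Under the parameters just stated, the embedding computation can be sketched with librosa as follows; the helper name bark_embedding is ours for illustration.

    import librosa
    import numpy as np

    def bark_embedding(y, sr):
        """80-dim embedding: 20 MFCCs x (mean, std, min, max) pooled over time."""
        n_fft = int(0.050 * sr)    # 50 ms FFT size
        hop = int(0.010 * sr)      # 10 ms hop length
        mel = librosa.feature.melspectrogram(
            y=y, sr=sr, n_fft=n_fft, hop_length=hop, power=2.0)
        mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=20)  # (20, frames)
        return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1),
                               mfcc.min(axis=1), mfcc.max(axis=1)])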

Dog Bark Similarity

The current approach uses nearest-neighbor selection based on a similarity metric computed over the MFCC features, achieving 96% identification accuracy with L1 distance and 93% with cosine similarity. For a given dog, new bark segments are converted to MFCC feature vectors; the similarity metric is computed across the existing MFCC database; and the nearest neighbor is selected to fingerprint the new bark and extract context information.
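A minimal sketch of this fingerprinting lookup under L1 distance follows. The database layout, an array of stored embeddings plus parallel metadata records, is an assumption for illustration.

    import numpy as np

    def fingerprint(query_emb, db_embs, db_meta):
        """Return metadata of the closest stored bark under L1 distance."""
        dists = np.abs(db_embs - query_emb).sum(axis=1)  # L1 distance to every row
        nearest = int(np.argmin(dists))
        return db_meta[nearest], float(dists[nearest])

    # db_meta[i] might hold, e.g., {"dog_id": ..., "context": "playful"}; the
    # new bark inherits identity and context from its nearest labeled neighbor.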

The previous approach used nearest-neighbor selection based on cosine similarity with PANNs/YAMNet embeddings.

Results

A two-dimensional projection of the MFCC-based bark embeddings reveals well-separated clusters per dog, with sub-clusters that correspond to distinct call types within an individual's repertoire. This separation supports both the fingerprinting (per-dog identification) and repertoire cardinality estimates produced by the unsupervised KMeans pass.
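One way to implement the unsupervised cardinality estimate is to sweep KMeans over candidate cluster counts and keep the best-scoring k. The silhouette criterion and the 2 to 10 candidate range below are our illustrative assumptions, since the text does not specify how k is chosen.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def repertoire_cardinality(embs, k_max=10):
        """Estimate the number of unique call types from one dog's embeddings."""
        best_k, best_score = 2, -1.0
        for k in range(2, min(k_max, len(embs) - 1) + 1):
            labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embs)
            score = silhouette_score(embs, labels)   # cohesion vs. separation
            if score > best_score:
                best_k, best_score = k, score
        return best_k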

Results are organized around the continuous feedback loop: users provide feedback to the system, and the machine-learning models improve against it over time.

Discussion

This paper articulates an approach that allows a continuous feedback loop from users into the model pipeline. The system depends on preprocessing steps that separate bark detection, segmentation, feature extraction, similarity search, and context estimation.

We also believe the Dr Dolittle competition frames this line of work inaccurately. Assuming that a new language between animals and people will emerge spontaneously runs counter to language research. Even when new languages are added to LLMs, the current state of the art primes the system with a dictionary and a few hundred sample translations (see arXiv:2402.18025). We expect a similar pattern for expanding animal cognition through language development, a field that has been spearheaded largely outside of academic labs and in public. Without this adjustment, the competition risks missing interesting forms of communication and setting itself up for failure.

Data Availability

Our experiments use the McCowan and Yin bark dataset.

References

  1. Contextual and combinatorial structure in sperm whale vocalisations.
  2. Hire a Linguist!: Learning Endangered Languages with In-Context Linguistic Descriptions. arXiv:2402.18025.
  3. ORCA-SPY enables killer whale sound source simulation, detection, classification, and localization using an integrated deep-learning-based segmentation.

Author Information

Sarama Inc, San Francisco, CA 94107.

Peter C. Bermant and Praful Mathur.

All aspects of the work, including task setting, data processing, machine learning, article writing, and figure making, were performed by P.C.B. or P.M.

Correspondence to Peter C. Bermant.