ESP32 Audio Streaming: Real-Time AI Gunshot Detection System

A deterministic, edge-computing security device capable of detecting ballistic audio signatures with sub-second latency using the ESP32-S3 and W5500 Ethernet.

PROJECT DESCRIPTION

Project Overview

This project is a real-time ballistic audio detection system built on the ESP32-S3, W5500 Ethernet, and an Edge Impulse AI model.

Traditional audio-security systems suffer from cloud-processing latency, privacy concerns, and Wi-Fi jitter and packet loss, all of which make sub-second detection unreliable. This project addresses these limitations by running Edge AI inference directly on the ESP32-S3 and sending data over wired Ethernet through the W5500, which gives deterministic, low-jitter transmission.

The device streams 16 kHz PCM audio to a central server while a quantized 1D CNN model runs continuously in the background. When a gunshot is detected, the device injects a high-priority “Magic Packet” into the stream, which triggers immediate visual and audible alerts on the receiver.

Through this architecture, the system delivers highly reliable, real-time gunshot detection with consistent low latency, independent of cloud availability or wireless network conditions.
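
The stream-plus-alert idea described above can be illustrated with a small host-side sketch in Python. The project does not document its packet layout, so everything here is an assumption made for illustration: the 4-byte marker value, the 16-bit sequence prefix, and the function names are all hypothetical.

```python
import struct

# Hypothetical 4-byte marker; the project's real "Magic Packet" layout is not documented.
MAGIC = b"\xDE\xAD\xBE\xEF"
ALERT_LEN = len(MAGIC) + 2  # marker + 16-bit sequence number


def make_audio_packet(seq: int, pcm: bytes) -> bytes:
    """Prefix a PCM payload with a 16-bit big-endian sequence number."""
    return struct.pack(">H", seq & 0xFFFF) + pcm


def make_alert_packet(seq: int) -> bytes:
    """Build an alert packet: marker followed by the sequence number at detection time."""
    return MAGIC + struct.pack(">H", seq & 0xFFFF)


def classify_packet(packet: bytes):
    """Return ("alert", seq) for a marker packet, else ("audio", seq, pcm).

    Assumes audio packets always carry a full frame and are therefore longer
    than ALERT_LEN, so the length check rules out accidental marker matches.
    """
    if len(packet) == ALERT_LEN and packet.startswith(MAGIC):
        (seq,) = struct.unpack(">H", packet[len(MAGIC):])
        return ("alert", seq)
    (seq,) = struct.unpack(">H", packet[:2])
    return ("audio", seq, packet[2:])
```

A receiver loop would call classify_packet on each UDP datagram and raise its alarm on the "alert" result, while appending "audio" payloads to the playback or recording buffer. A production format would more likely carry an explicit packet-type field than rely on length.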

Key Features

On-device Edge AI for real-time gunshot detection
Fully offline processing without cloud dependency
Deterministic, low-jitter wired Ethernet via the W5500
Dual-core separation of audio capture and AI inference
C# receiver app for visualization, recording, and alerts

Hardware Architecture

The firmware follows a producer-consumer model on FreeRTOS, with audio capture and AI inference separated across the ESP32-S3's two cores.
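
The producer-consumer pattern can be sketched on a desktop machine with Python threads and a bounded queue. This is not the firmware: the frame size, queue depth, and task names are assumptions, and the placeholder "inference" only records the frame length.

```python
import queue
import threading

FRAME_SAMPLES = 320   # assumed: 20 ms of audio at 16 kHz
NUM_FRAMES = 5        # short run for the demonstration


def capture_task(frames: queue.Queue) -> None:
    """Producer: stands in for the audio capture task."""
    for i in range(NUM_FRAMES):
        frame = [i] * FRAME_SAMPLES   # placeholder samples instead of real audio
        frames.put(frame)             # blocks when the queue is full (back-pressure)
    frames.put(None)                  # sentinel: no more frames


def inference_task(frames: queue.Queue, results: list) -> None:
    """Consumer: stands in for the AI inference task."""
    while True:
        frame = frames.get()
        if frame is None:
            break
        results.append(len(frame))    # placeholder for running the classifier


def run_pipeline() -> list:
    """Run one producer and one consumer to completion and return the results."""
    frames = queue.Queue(maxsize=4)   # bounded, like a fixed-length RTOS queue
    results = []
    producer = threading.Thread(target=capture_task, args=(frames,))
    consumer = threading.Thread(target=inference_task, args=(frames, results))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return results
```

On the ESP32-S3 the same shape would typically be expressed with FreeRTOS queues and tasks pinned to separate cores. The bounded queue is the important part: it decouples capture timing from inference time while capping memory use.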

Component	Model	Purpose	Connection
MCU	ESP32-S3	Main processing unit (240 MHz, dual-core)	N/A
Microphone	INMP441	High-fidelity audio capture (MEMS, 16 kHz)	I2S interface
Ethernet	W5500	Network transmission (UDP/RTP)	SPI interface
User Interface	LED & Button	Status indication and recording control	GPIO
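
The table lists UDP/RTP as the transport. As a rough sketch of what RTP framing involves, the Python below builds the 12-byte fixed RTP header (version 2) in front of each PCM frame. The 20 ms frame duration, 16-bit sample width, payload type 96, and SSRC value are assumptions rather than values from the project.

```python
import struct

SAMPLE_RATE_HZ = 16_000
FRAME_MS = 20                 # assumed packet duration
PAYLOAD_TYPE = 96             # assumed: a value from RTP's dynamic range (96-127)
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_MS // 1000   # 320 samples


def rtp_header(seq: int, timestamp: int, ssrc: int, marker: bool = False) -> bytes:
    """Pack the 12-byte fixed RTP header (no padding, extension, or CSRCs)."""
    byte0 = 2 << 6                               # version 2 in the top two bits
    byte1 = (int(marker) << 7) | PAYLOAD_TYPE    # marker bit + 7-bit payload type
    return struct.pack(">BBHII", byte0, byte1,
                       seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)


def packetize(pcm_frames, ssrc: int = 0x12345678):
    """Yield RTP packets; the timestamp advances by the samples in each frame.

    Simplification: sequence number and timestamp start at 0 here, whereas
    RTP recommends random initial values.
    """
    for seq, payload in enumerate(pcm_frames):
        yield rtp_header(seq, seq * SAMPLES_PER_FRAME, ssrc) + payload
```

With 16-bit mono samples, each 20 ms frame carries 640 payload bytes, so a packet is 652 bytes before UDP and IP headers, well under a typical Ethernet MTU.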

AI Pipeline

The AI model was trained with Edge Impulse on a custom dataset comprising three distinct classes.

Dataset Strategy

Gunshot: Acoustic proxies (balloon pops, heavy book slams, ruler snaps) used to simulate high-pressure release and ballistic shockwaves.
Background: Ambient noise from quiet rooms, empty hallways, and busy corridors.
Noise/Interference: Adversarial examples such as shouting, clapping, and door slams, added to reduce false positives.
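
With three classes, the classifier yields one score per class for each audio window, and the device has to turn those scores into an alert decision. The Python sketch below shows one simple rule. The label strings and the 0.80 threshold are assumptions, not values taken from the trained model.

```python
def is_gunshot(scores: dict, threshold: float = 0.80) -> bool:
    """Return True only if "gunshot" is the top class and clears the threshold.

    `scores` maps class label to probability, e.g.
    {"gunshot": 0.93, "background": 0.02, "noise": 0.05}.
    The labels and the default threshold are illustrative assumptions.
    """
    best = max(scores, key=scores.get)
    return best == "gunshot" and scores["gunshot"] >= threshold
```

Requiring both conditions means a window where the gunshot class narrowly beats noise (say 0.55 against 0.40) does not raise an alert, which is the kind of case the Noise/Interference class is meant to catch.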

Live Demo

Watch the system filter out false positives (speech, clapping) and successfully detect simulated gunshots (balloon pops).

Summary

Cloud-based audio analysis often introduces unacceptable latency for safety-critical systems. This project solves that problem by implementing a standalone, edge-based security device capable of detecting gunshot acoustic signatures in real-time.

Using the ESP32-S3 and FreeRTOS, the system captures stereo audio via I2S, processes it locally with a 1D convolutional neural network (Edge Impulse), and alerts a central server over Ethernet with sub-second latency.
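
Between I2S capture and the model there is usually a format conversion step. The Python sketch below shows one plausible version, under these assumptions: the I2S peripheral delivers interleaved 32-bit little-endian slots (left, right, left, right), the microphone data sits in the upper bits of each slot, and only the left channel is used. The function name and these layout details are illustrative, not taken from the project's firmware.

```python
import struct


def left_channel_to_pcm16(raw: bytes) -> bytes:
    """Convert interleaved 32-bit stereo slots to left-channel 16-bit PCM.

    Assumes little-endian signed 32-bit slots with the left channel in the
    even positions; keeps the top 16 bits of each sample.
    """
    count = len(raw) // 4
    words = struct.unpack(f"<{count}i", raw[:count * 4])
    left = words[0::2]                           # even slots = left channel (assumed)
    return struct.pack(f"<{len(left)}h", *(w >> 16 for w in left))
```

The arithmetic right shift preserves the sign, so negative samples stay negative after truncation to 16 bits.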
