Building a keystroke audio classifier

2025/06/10

Introduction

This short project explores machine learning methods for classifying keystrokes based on their audio signal. The goal is to determine which key the user is pressing purely by analyzing the sound it makes. First, we will look at the data collection method, which captures keyboard sounds with precise cuts. Then, we will test three ML approaches: MFCC features with traditional classifiers, a 1D CNN on raw audio, and a feedforward neural network on FFT magnitude spectra.

Data collection

The keystroke audio collection tool is built with a dual-threaded architecture for precision:

- one thread continuously records microphone audio into a rolling buffer, while
- a second thread listens for key press events and records a precise timestamp for each.

Using the timestamped key press event, a ~400 ms window of audio (100 ms before and 300 ms after the keypress) is extracted from the buffer and saved. Each sample is labeled with the corresponding key.

This architecture ensures that every keystroke is paired with a clean, time-aligned audio snippet.
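To make the mechanism concrete, here is a minimal sketch of this dual-threaded capture loop, assuming the sounddevice and pynput libraries; the 44.1 kHz sample rate, buffer length, and file naming are illustrative choices rather than the tool's actual settings:

```python
import collections
import threading
import time

import numpy as np
import sounddevice as sd
from pynput import keyboard

RATE = 44100                               # assumed sample rate (Hz)
PRE, POST = 0.100, 0.300                   # window: 100 ms before, 300 ms after
ring = collections.deque(maxlen=2 * RATE)  # ~2 s rolling audio buffer
lock = threading.Lock()

def audio_callback(indata, frames, t, status):
    """Audio thread: keep appending mono samples to the rolling buffer."""
    with lock:
        ring.extend(indata[:, 0])

def on_press(key):
    """Keyboard thread: timestamp the press, then cut the window shortly after."""
    label = getattr(key, "char", None) or str(key)

    def save():
        time.sleep(POST)                   # let 300 ms of post-press audio arrive
        with lock:
            buf = np.array(ring)
        window = buf[-int((PRE + POST) * RATE):]   # ~400 ms snippet
        np.save(f"keystroke_{label}_{time.time():.3f}.npy", window)

    threading.Thread(target=save, daemon=True).start()

with sd.InputStream(channels=1, samplerate=RATE, callback=audio_callback):
    with keyboard.Listener(on_press=on_press) as listener:
        listener.join()
```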

You can find the full source code for the keystroke audio capture tool here.

A look at the raw data

[Figure: time-domain and frequency-domain views of a keystroke recording]

In the time domain, the signal shows a sharp initial spike marking the moment of key contact, followed by a smaller secondary peak that likely corresponds to the mechanical rebound.

The frequency domain shows dominant low-frequency content below 3 kHz, with a primary peak near DC and a secondary peak around 1 kHz.

MFCC feature extraction with traditional classifiers

The first approach uses Mel-Frequency Cepstral Coefficients (MFCCs) as feature vectors, implementing the standard mathematical pipeline from scratch, as sketched below:

- pre-emphasis to amplify high frequencies,
- framing and windowing of the signal,
- FFT power spectrum of each frame,
- mel-scale triangular filterbank and log compression,
- discrete cosine transform (DCT), keeping the first 13 coefficients.
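The following condensed sketch implements that pipeline with NumPy and SciPy. The 25 ms frame length, 10 ms hop, and 26 mel filters are common defaults assumed here, not values taken from the post, and averaging the coefficients over frames is one plausible way to reduce each keystroke to a single 13-dimensional vector:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, rate=22050, n_fft=1024, n_mels=26, n_coeffs=13):
    # Pre-emphasis to boost high frequencies
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Frame into 25 ms windows with a 10 ms hop, apply a Hamming window
    flen, hop = int(0.025 * rate), int(0.010 * rate)
    n_frames = 1 + (len(sig) - flen) // hop
    frames = np.stack([sig[i*hop : i*hop + flen] for i in range(n_frames)])
    frames = frames * np.hamming(flen)
    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel-scale filterbank
    mel_pts = np.linspace(0, 2595 * np.log10(1 + rate / 2 / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log filterbank energies, then DCT; keep the first 13 coefficients
    feat = np.log(power @ fbank.T + 1e-10)
    coeffs = dct(feat, type=2, axis=1, norm='ortho')[:, :n_coeffs]
    return coeffs.mean(axis=0)    # average over frames -> 13-dim vector
```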

These 13-dimensional feature vectors are then fed to traditional classifiers: K-Nearest Neighbors (KNN), a Support Vector Machine (SVM), a Decision Tree, and a Random Forest. Evaluation is performed through 5-fold stratified cross-validation with a randomized hyperparameter search.
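A minimal sketch of that evaluation protocol with scikit-learn, shown for the SVM only; the search ranges and the feature/label file names are placeholders, not the project's actual values:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Hypothetical feature/label files: X is (n_samples, 13), y holds key labels
X, y = np.load("mfcc_features.npy"), np.load("labels.npy")

search = RandomizedSearchCV(
    SVC(),
    param_distributions={"C": np.logspace(-2, 3, 20),
                         "gamma": np.logspace(-4, 1, 20)},
    n_iter=25,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    return_train_score=True,      # reports train and test accuracy per fold
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```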

Cross-validation results:

| Model         | Train Accuracy  | Test Accuracy   |
|---------------|-----------------|-----------------|
| KNN           | 1.0000 ± 0.0000 | 0.8084 ± 0.0051 |
| SVM           | 0.9510 ± 0.0032 | 0.8588 ± 0.0103 |
| Decision Tree | 0.6098 ± 0.0082 | 0.5255 ± 0.0080 |
| Random Forest | 0.9980 ± 0.0001 | 0.7913 ± 0.0097 |

SVM achieves the best test accuracy (~86%) with the smallest train-test gap, indicating the best generalization; KNN and Random Forest fit the training set almost perfectly (train accuracy ≈ 1.0) but generalize noticeably worse.

You can find the full source code for this approach here.

One-dimensional convolutional neural network on raw audio

The second approach uses a 1D Convolutional Neural Network (CNN) to learn features directly from the raw audio waveform.

The architecture consists of two convolutional blocks followed by a fully connected classifier. The first convolutional layer uses a single filter with kernel size 128 to capture broad temporal patterns in the 300 ms audio segments, followed by max pooling with stride 32. The second layer applies 8 filters of size 64 to extract more complex hierarchical features, again followed by max pooling. The resulting feature maps are flattened and fed to a dense output layer with 27 neurons (one per key class).
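Here is a PyTorch sketch matching this description. The input length (6,615 samples, i.e. 300 ms at an assumed 22.05 kHz) and the second pooling size (11) are not stated above; they are chosen so the parameter count lands exactly on the 3,268 quoted below:

```python
import torch
import torch.nn as nn

class KeystrokeCNN(nn.Module):
    def __init__(self, n_classes=27):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 1, kernel_size=128),        # one broad temporal filter
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=32, stride=32), # max pooling with stride 32
            nn.Conv1d(1, 8, kernel_size=64),         # 8 higher-level filters
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=11),            # assumed pooling size
            nn.Flatten(),
            nn.Linear(8 * 12, n_classes),            # 96 features -> 27 keys
        )

    def forward(self, x):                  # x: (batch, 1, 6615) raw samples
        return self.net(x)

model = KeystrokeCNN()
print(sum(p.numel() for p in model.parameters()))   # -> 3268
print(model(torch.randn(4, 1, 6615)).shape)         # -> torch.Size([4, 27])
```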

Despite its simplicity (only 3,268 parameters), this lightweight architecture achieves 95% validation accuracy across all 27 keystroke classes.

You can find the full source code here.

Feedforward neural network on frequency spectrum

In this approach, raw audio signals are transformed into the frequency domain using the Fast Fourier Transform (FFT), converting each 300 ms keystroke sample into a 3,307-dimensional vector representing the spectral magnitude.
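A sketch of this preprocessing step; the 6,612-sample clip length (about 300 ms at roughly 22 kHz) is inferred here as the length for which np.fft.rfft yields exactly 3,307 positive-frequency bins:

```python
import numpy as np

def to_spectrum(clip: np.ndarray) -> np.ndarray:
    """Map a 300 ms waveform to its magnitude spectrum (positive bins only)."""
    return np.abs(np.fft.rfft(clip))

x = np.random.randn(6612)        # stand-in for one keystroke clip
print(to_spectrum(x).shape)      # -> (3307,)
```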

These high-dimensional vectors are passed through a fully connected neural network consisting of:

- an input layer taking the 3,307-bin magnitude spectrum,
- a single hidden layer of 32 neurons (the size implied by the parameter count below),
- an output layer with 27 neurons, one per key class.

Despite its shallow structure and modest parameter count (106,747 trainable weights), this model achieves 96.5% validation accuracy.
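A PyTorch sketch of this classifier; the 32-unit hidden layer is inferred from the parameter count (3,307 × 32 + 32 + 32 × 27 + 27 = 106,747), while the ReLU activation is an assumption:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(3307, 32),   # magnitude spectrum -> hidden layer
    nn.ReLU(),             # assumed activation
    nn.Linear(32, 27),     # hidden layer -> one logit per key class
)
print(sum(p.numel() for p in model.parameters()))  # -> 106747
```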

You can find the full source code here.

Conclusion

This project demonstrates that keystroke classification from audio is highly feasible with a variety of machine learning techniques, from traditional classifiers on MFCC features (~86% test accuracy) to lightweight neural networks (95–96.5% validation accuracy). These results highlight the security and privacy implications of acoustic side-channel attacks: the sound of typing alone can leak what is being typed.