The hypothesis
A small 2D CNN over Mel-spectrogram features should beat a classical SVM/RandomForest stack on UrbanSound8K, but I expect the gap to be smaller than papers imply once preprocessing parity is controlled for. The point of the comparison is to see whether DSP preprocessing (filter banks, framing, windowing) is a load-bearing component or just a habit.
Stack & architecture
- Python + librosa for audio I/O and spectrogram extraction.
- scikit-learn for the classical SVM + RandomForest baselines.
- PyTorch for the CNN-2D head.
- Two parallel pipelines: Pipeline A (raw audio fed directly to the model) and Pipeline B (framing, windowing, and a mel filter bank before the model).
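To make the "load-bearing or habit" question concrete, here is a minimal numpy sketch of Pipeline B's DSP front end. The real project would use librosa's equivalents; the frame length, hop, FFT size, and filter count below are illustrative defaults, not values from the project.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Slice a 1-D signal into overlapping frames (the framing step)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_mels=26, n_fft=512, sr=16000):
    """Triangular filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising edge of the triangle
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling edge of the triangle
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(x, sr=16000, n_fft=512, frame_len=400, hop=160, n_mels=26):
    frames = frame_signal(x, frame_len, hop) * np.hanning(frame_len)  # windowing
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2                 # per-frame power spectrum
    mel = power @ mel_filter_bank(n_mels, n_fft, sr).T                # filter-bank pooling
    return np.log(mel + 1e-10)

# One second of noise at 16 kHz -> (n_frames, n_mels) feature matrix for the CNN.
feats = log_mel_spectrogram(np.random.randn(16000))
print(feats.shape)  # → (98, 26)
```

Pipeline A would skip all of this and hand the raw waveform straight to the model; the experiment then asks whether the model recovers anything like this transform on its own.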
What I learned
Concept stage. Final-project bookkeeping (DSP-501): keeping the writeup honest about what the comparison actually measures is harder than training the model.