Read our new preprint on bioRxiv on enhancing sound recognition with semantics. We found that integrating NLP embeddings into CNNs enhances the network's sound recognition performance and aligns it with human auditory perception: Bridging Auditory Perception and Natural Language Processing with Semantically Informed Deep Neural Networks, by Michele Esposito, Giancarlo Valente, Yenisel Plasencia-Calaña, Michel Dumontier, Bruno L. Giordano, and Elia Formisano.
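As a rough illustration of the general idea (not the preprint's actual architecture or training setup), the sketch below shows a small CNN over log-mel spectrograms whose output is projected into a semantic embedding space and pulled toward precomputed NLP embeddings of the sound labels with a cosine loss; all layer sizes, the embedding dimension, and the loss choice are assumptions for the example.

```python
# Hypothetical sketch: a CNN whose audio features are projected into a
# semantic (NLP) embedding space and trained to match label embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticallyInformedCNN(nn.Module):
    def __init__(self, emb_dim: int = 300):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        # Project pooled audio features into the semantic embedding space.
        self.to_semantic = nn.Linear(64, emb_dim)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, n_mels, time) log-mel spectrogram
        h = self.features(spec).flatten(1)
        return self.to_semantic(h)

# Toy training step with random data; in practice, label_embeddings would be
# precomputed NLP embeddings (e.g., word or sentence vectors) of the sound
# class names or descriptions.
model = SemanticallyInformedCNN()
spec = torch.randn(8, 1, 64, 128)
label_embeddings = torch.randn(8, 300)
pred = model(spec)
loss = 1.0 - F.cosine_similarity(pred, label_embeddings, dim=-1).mean()
loss.backward()
```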
Read our new preprint on arXiv on a novel metric for evaluating and comparing audio captions generated by humans or automated models: ACES: Evaluating Automated Audio Captioning Models on the Semantics of Sounds, by Gijs Wijngaard, Elia Formisano, Bruno L. Giordano, and Michel Dumontier (2024), arXiv:2403.18572, https://arxiv.org/abs/2403.18572. You may also want to check Gijs's Hugging Face model associated with this article, which classifies tokens of sound captions with auditory semantic tags (Who/What/How/Where): here
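A minimal sketch of how one might query such a token-classification model with the Hugging Face `transformers` pipeline is shown below; the model identifier is a placeholder (use the model linked above), and the exact label names returned depend on that model's configuration.

```python
# Illustrative only: tagging a sound caption with a token-classification
# pipeline. The model id below is a placeholder, not the real repository.
from transformers import pipeline

tagger = pipeline(
    "token-classification",
    model="<gijs-huggingface-model-id>",  # placeholder; substitute the linked model
    aggregation_strategy="simple",        # merge sub-word tokens into spans
)

caption = "A dog barks loudly in the park"
for span in tagger(caption):
    # Each span carries the predicted tag (e.g., Who/What/How/Where) and a score.
    print(span["word"], span["entity_group"], round(span["score"], 3))
```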