Artistic creation has continuously benefited from technological advances in various scientific domains. In particular, progress in fields such as signal processing and computer science has provided artists with diverse tools to shape, transform and manipulate materials in multiple ways. In this vein, deep neural networks have advanced considerably over the last decade, leading to sudden leaps in performance on many complex tasks. One of the most representative examples of this trend can be found in generative models, which are now able to synthesize new data mimicking the intricate patterns of complex, real-world information. In the audio domain, models such as RAVE or AFTER have attracted considerable interest thanks to their faithful, real-time synthesis and their expressive latent spaces, which allow users to intervene dynamically in the generation and truly play with the model. Such applications open the door to an entirely novel family of musical instruments that exploit the possibilities offered by neural audio generation. A significant step in this direction would be to embed these models into dedicated synthesizers, which would facilitate their integration into existing musical setups and provide composers with more playful and exploratory ways to interact with them. Unfortunately, improvements in deep learning have come at the price of exploding computational costs, illustrated by the dramatic increase in both the number of parameters and the number of operations in modern architectures. This is a significant obstacle to exploiting powerful neural networks in on-device applications, and it prevents their integration on constrained hardware. At the same time, much evidence suggests that modern neural networks are overparameterized, meaning that most of their parameters contribute only marginally to their predictions. 
Consequently, this suggests that these models could achieve similar performance while being more compute-efficient, by removing the excess parameters and keeping only the most important ones. In this thesis, we aim to develop a general method for identifying units within neural networks that can be removed without altering the generalization performance of the model. Our goal is to obtain lightweight neural networks for audio applications whose computational cost makes them embeddable on devices with limited resources. To this end, we hypothesize that the two major effects of overparameterization lie in the high level of redundancy within intermediate representations and in the over-specialization of certain units. We propose a method to quantify these concepts by measuring the similarity between feature extractors and analyzing the variance of their activations. We use these metrics to analyze the layers of large audio foundation models, and find that these models contain a high proportion of redundant and highly specific units. This led us to develop a learning-based trimming strategy, which makes it possible to extract small sub-networks from such models when they are used for specific downstream tasks and types of data. We then adapt this method to audio generative models, and show that the resulting smaller models are just as capable of high-quality generation as the larger ones, while being suited to real-time synthesis on embedded hardware. We validate this last point by prototyping a neural audio synthesizer that takes advantage of both the creative possibilities offered by deep generative models and the performative benefits of electronic instruments.
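As an illustration of the two quantities mentioned above (similarity between feature extractors and variance of their activations), the following is a minimal sketch and not the method of the thesis itself. It assumes that each unit is summarized by its activations over a small batch of inputs, that redundancy can be approximated by the absolute cosine similarity between mean-centered activation vectors, and that a fixed threshold of 0.95 marks a pair as redundant. The function names, the threshold and the toy data are all hypothetical; how the thesis turns variance into a measure of specialization is not stated here.

```python
import math


def unit_means(acts):
    """Mean activation of each unit. `acts` is a list of samples, each a list of unit activations."""
    n_samples = len(acts)
    n_units = len(acts[0])
    return [sum(row[j] for row in acts) / n_samples for j in range(n_units)]


def unit_variances(acts):
    """Population variance of each unit's activation across the samples."""
    n_samples = len(acts)
    means = unit_means(acts)
    return [
        sum((row[j] - means[j]) ** 2 for row in acts) / n_samples
        for j in range(len(means))
    ]


def cosine_similarity(u, v):
    """Cosine similarity of two vectors; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def redundant_pairs(acts, threshold=0.95):
    """Pairs of unit indices whose centered activations are nearly collinear.

    The threshold is an assumed value chosen for illustration only.
    """
    means = unit_means(acts)
    # One centered activation vector per unit (column of `acts` minus its mean).
    columns = [[row[j] - means[j] for row in acts] for j in range(len(means))]
    pairs = []
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            if abs(cosine_similarity(columns[i], columns[j])) >= threshold:
                pairs.append((i, j))
    return pairs


# Toy activations: 3 input samples, 4 units (hypothetical values).
# Unit 1 is exactly twice unit 0; unit 2 is constant across inputs.
toy_acts = [
    [1.0, 2.0, 0.5, 3.0],
    [2.0, 4.0, 0.5, 1.0],
    [3.0, 6.0, 0.5, 2.0],
]
variances = unit_variances(toy_acts)
pairs = redundant_pairs(toy_acts)
```

On this toy batch, units 0 and 1 carry the same information up to a scale factor and are flagged as a redundant pair, while unit 2 never varies and therefore has zero variance. A real analysis would use activations collected from the layers of a trained model over a representative dataset.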
David Genova (Mon,) studied this question.
David Genova