The widespread uptake of Artificial Intelligence (AI)-powered tools has opened a new era of creative possibilities, offering advantages to both prominent tech corporations and ordinary individuals, largely thanks to their enhanced accessibility. These tools enable the generation of synthetic media with minimal effort: textual inputs can be transformed to mimic someone's voice or to produce digital content that closely resembles authentic media. While these technologies have numerous positive applications, such as dubbing and medical uses, they also pose risks to public information and privacy. Deepfakes, a form of synthetic content created through Deep Learning (DL) techniques, are particularly concerning: they appropriate the physical characteristics of a real person to generate new, often misleading content. This aggravates the issues surrounding fake news, as deepfakes can be used to manipulate public perception and credibility, especially when attributed to public figures such as politicians. The ease of creating deepfakes has led to a surge in fake-news problems, worsened by the rapid consumption of information in today's society. These new public issues are addressed by the MultiMedia Forensics (MMF) community, which directs research towards detecting deepfakes for public and private security purposes. Although the results and methods developed for spoofing detection have improved the state of the field, several difficulties remain, such as the unpredictability of detection performance on unseen data. This is the motivation behind the main topic of this thesis: Deepfake Audio Detection.

Early architectures for spoofing detection were based on signal-processing techniques. Modern detectors are instead based on DL, ranging from simple architectures composed solely of Convolutional Neural Networks (CNNs) to more complex systems that take the raw waveform as input, such as RawNet. DL-based architectures are considerably more accurate in a detection scenario because of their ability to capture patterns in the input and recognize them as traces of deepfakes. Although effective, these architectures are far from perfect, given the constant introduction of new synthetic speech generators. Because such generators were never seen by the network during training, the deepfakes they produce contain unseen artifacts capable of fooling it. With new generators rapidly becoming available, each producing distinct artifacts in the audio, maintaining consistency and robustness against unfamiliar models requires experimenting with varied solutions to address this unpredictability.
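To make the idea of a DL-based detector concrete, the following is a minimal sketch in PyTorch. All layer shapes, kernel sizes, and the 64-dimensional embedding are illustrative choices, not the actual RawNet design: the sketch only shows the general structure of a CNN that consumes a raw waveform, produces a real/fake score, and exposes a penultimate-layer embedding of the kind used later in this thesis.

```python
import torch
import torch.nn as nn

class RawWaveformDetector(nn.Module):
    """Toy raw-waveform deepfake detector (illustrative; not the actual RawNet).

    Takes a mono waveform of shape (batch, samples) and returns one logit
    per track: higher values suggest "fake", lower values suggest "real".
    """

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            # Strided 1-D convolutions act as a learned filterbank on the raw signal.
            nn.Conv1d(1, 32, kernel_size=128, stride=16), nn.BatchNorm1d(32), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, stride=2), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, stride=2), nn.BatchNorm1d(128), nn.ReLU(),
        )
        # Global average pooling makes the embedding independent of track length.
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.embedding = nn.Linear(128, 64)   # penultimate layer: the "latent space"
        self.classifier = nn.Linear(64, 1)    # final real/fake logit

    def embed(self, waveform: torch.Tensor) -> torch.Tensor:
        x = self.encoder(waveform.unsqueeze(1))  # (batch, 1, samples) -> feature maps
        x = self.pool(x).squeeze(-1)             # (batch, 128)
        return self.embedding(x)                 # (batch, 64) latent embedding

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        return self.classifier(torch.relu(self.embed(waveform))).squeeze(-1)

# Example: score a batch of two 1-second tracks sampled at 16 kHz.
model = RawWaveformDetector()
tracks = torch.randn(2, 16000)
print(model(tracks).shape)  # torch.Size([2])
```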
To address this problem, the system proposed in this thesis, named DeepMetric, aims to make a detector reliable and consistent by exploiting speech deepfakes synthesized with new generative models, including several Text-to-Speech (TTS) and Voice Conversion (VC) models never seen during training. The main idea is that, instead of relying on the detector's output alone, we generate support tracks with the same textual and semantic content as the original track and compute their distance in the latent space, which is then used as a discriminant to perform detection and improve performance. Distances between feature vectors in the latent space are highly descriptive for finding similarities between the inputs of the model. By comparing the embedding derived from the original input track with those of the deepfakes, we can discern whether the input has been artificially generated. All the deepfake copies contain the same textual content as the original track, and the embeddings used to compute the distances are extracted, for each track, from the penultimate layer of the neural network, representing high-level, abstract features of its input. This method is designed not only to counter unseen architectures, by directly comparing their outputs to the original audio in the latent space, but also to emphasize a modular setup that allows the incorporation of future architectures. Importantly, the approach operates entirely at test time, ensuring compatibility with any DL-based detector and any TTS or VC model, without being bound to a specific architecture.
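As an illustration of this test-time procedure, the sketch below scores a track by comparing its penultimate-layer embedding against those of its synthesized copies. The cosine distance, the mean aggregation, the threshold value, and the random tensors standing in for TTS/VC outputs are all assumptions made for the example, not the exact implementation of DeepMetric; `model` can be any detector exposing an `embed` method that returns penultimate-layer features, such as the toy detector sketched earlier.

```python
import torch
import torch.nn.functional as F

def deepmetric_score(model, track: torch.Tensor, support_tracks: list) -> float:
    """DeepMetric-style test-time scoring (illustrative sketch).

    `track` is the mono waveform under analysis, shape (samples,);
    `support_tracks` are deepfake re-syntheses of it produced by TTS/VC
    generators from the same textual content. The distance metric and
    aggregation below are assumptions, not the thesis's exact choices.
    """
    model.eval()
    with torch.no_grad():
        ref = model.embed(track.unsqueeze(0))           # (1, d) embedding of the input
        sup = model.embed(torch.stack(support_tracks))  # (n, d) embeddings of the copies
        # Cosine distance in the latent space between the input and each copy.
        dists = 1.0 - F.cosine_similarity(ref, sup, dim=-1)
    # A small average distance means the input lies close to its own deepfake
    # copies in the latent space, suggesting it was artificially generated.
    return dists.mean().item()

# Hypothetical usage with the toy detector above; random tensors stand in
# for tracks synthesized by TTS/VC models (generator calls omitted).
model = RawWaveformDetector()
track = torch.randn(16000)
support = [torch.randn(16000) for _ in range(3)]
score = deepmetric_score(model, track, support)
is_fake = score < 0.5  # threshold chosen purely for illustration
```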