歌声风格转换论文分享:2509.15629v1
前言:本文分享一篇探究歌声风格转换的论文,全文五页,只提供第一页的英文及其翻译,详情请移步,链接:https://arxiv.org/abs/2509.15629v1
The Singing Voice Conversion Challenge 2025: From Singer Identity Conversion to Singing Style Conversion
2025年歌声转换挑战赛:从歌手身份转换到演唱风格转换
Abstract
摘要
We present the findings of the latest iteration of the Singing Voice Conversion Challenge, a scientific event aiming to compare and understand different voice conversion systems in a controlled environment. Compared to previous iterations which solely focused on converting the singer identity, this year we also focused on converting the singing style of the singer. To create a controlled environment and thorough evaluations, we developed a new challenge database, introduced two tasks, open-sourced baselines, and conducted large-scale crowd-sourced listening tests and objective evaluations. The challenge was ran for two months and in total we evaluated 26 different systems. The results of the large-scale crowd-sourced listening test showed that top systems had comparable singer identity scores to ground truth samples. However, modeling the singing style and consequently achieving high naturalness still remains a challenge in this task, primarily due to the difficulty in modeling dynamic information in breathy, glissando, and vibrato singing styles.
我们展示了最新一届歌声转换挑战赛的研究结果,该科学活动旨在受控环境中比较和理解不同的语音转换系统。与往届仅聚焦于转换歌手身份不同,本届赛事还关注歌手演唱风格的转换。为构建受控环境并开展全面评估,我们开发了全新的挑战数据库,设立了两项任务,开源了基线系统,并开展了大规模众包听感测试与客观评估。本次挑战赛历时两个月,共评估了26套不同系统。大规模众包听感测试结果表明,顶尖系统的歌手身份得分已与真实样本相当。然而,对演唱风格进行建模并由此实现高自然度,仍是该任务中的难题,其主要原因在于气声、滑音和颤音等演唱风格中的动态信息难以建模。
标注词汇解释:
- Findings: (n.) 研究发现,结果 六级/雅思
- Iteration: (n.) 迭代,版本 六级/托福
- Scientific: (adj.) 科学的 四级
- Controlled environment: (n.) 受控环境 四级
- Solely: (adv.) 仅仅,唯一地 六级/托福
- Identity: (n.) 身份,特性 四级
- Thorough: (adj.) 彻底的,全面的 四级
- Evaluations: (n.) 评估 四级
- Database: (n.) 数据库 四级
- Introduced: (v.) 引入,介绍 四级
- Open-sourced: (adj.) 开源的 专有名词/计算机领域高频
- Baselines: (n.) 基线,基准 六级/托福 (在计算机领域很常见)
- Conducted: (v.) 进行,实施 四级
- Large-scale: (adj.) 大规模的 四级
- Crowd-sourced: (adj.) 众包的 专有名词/互联网领域高频
- Objective evaluations: (n.) 客观评估 四级
- Ran: (v.) 举办,进行 (run的过去式;原文 "was ran" 按语法应为 "was run") 四级
- Evaluated: (v.) 评估 四级
- Comparable: (adj.) 可比的,相当的 六级/托福
- Ground truth: (n.) 真实值,基准事实 专有名词/科研领域高频
- Modeling: (v./n.) 建模,模拟 六级/托福
- Consequently: (adv.) 因此,所以 四级
- Naturalness: (n.) 自然度 四级 (由natural衍生)
- Primarily: (adv.) 主要地 四级
- Difficulty: (n.) 困难 四级
- Dynamic: (adj.) 动态的 四级
- Breathy: (adj.) 带呼吸声的,气声的 专有名词/音乐领域
- Glissando: (n.) 滑音 专有名词/音乐领域
- Vibrato: (n.) 颤音 专有名词/音乐领域
Lester Phillip Violeta¹, Xueyao Zhang², Jiatong Shi³, Yusuke Yasuda⁴, Wen-Chin Huang¹, Zhizheng Wu², Tomoki Toda¹
¹Nagoya University, Japan, ²The Chinese University of Hong Kong, Shenzhen, China, ³Carnegie Mellon University, USA, ⁴National Institute of Informatics, Japan
莱斯特·菲利普·维奥莱塔¹, 张雪瑶², 史佳彤³, 安田祐介⁴, 黄文锦¹, 吴志政², 户田智基¹
¹名古屋大学, 日本, ²香港中文大学(深圳), 中国, ³卡内基梅隆大学, 美国, ⁴国立情报学研究所, 日本
singing voice conversion challenge, singing style conversion, voice conversion
歌声转换挑战赛,演唱风格转换,语音转换
1 Introduction
1 引言
Voice conversion (VC) [1], the task known as to convert one kind of speech to another without changing the linguistic contents, has undergone rapid changes recently. This technology has several use cases in the form of entertainment and medical solutions. These use cases range from real-time voice conversion, singing voice conversion, accent conversion, to pathological speech enhancement. Regardless of the wide variety of use cases, the underlying technology of state-of-the-art architectures are almost shared with each other, which either use a parallel sequence-to-sequence [2] or recognition-synthesis [3] framework. With this in mind, a benchmarking solution was in need to find the most effective solutions among several methods [1, 2, 3].
语音转换(VC)[1] 是一项在不改变语言内容的情况下,将一种语音转换为另一种语音的任务,近年来经历了快速的发展。该技术在娱乐和医疗解决方案方面有多种应用场景,包括实时语音转换、歌声转换、口音转换以及病理语音增强等。尽管应用场景多种多样,但最先进架构的核心技术几乎是共享的,它们要么使用并行序列到序列[2],要么使用识别-合成[3]框架。考虑到这一点,需要一个基准测试方案来在多种方法[1, 2, 3]中找到最有效的解决方案。
标注词汇解释:
- Linguistic: (adj.) 语言的,语言学的 六级/托福
- Undergone: (v.) 经历,经受 (undergo的过去分词) 四级
- Medical: (adj.) 医疗的 四级
- Accent: (n.) 口音 四级
- Pathological: (adj.) 病理学的 GRE/专业词汇
- Enhancement: (n.) 增强,提高 六级/托福
- Regardless: (adv.) 不顾,不管 四级
- Variety: (n.) 种类,多样化 四级
- Underlying: (adj.) 根本的,基础的 六级/托福
- State-of-the-art: (adj.) 最先进的,顶尖水平的 六级/托福/雅思 (常用在科研论文中)
- Architectures: (n.) 架构,体系结构 六级/托福 (计算机领域)
- Parallel: (adj.) 平行的,并行的 四级
- Sequence-to-sequence: (n.) 序列到序列 专有名词/计算机领域
- Recognition-synthesis: (n.) 识别-合成 专有名词/计算机领域
- Framework: (n.) 框架,结构 四级
- Benchmarking: (n.) 基准测试 六级/托福 (计算机领域)
- Effective: (adj.) 有效的 四级
The Voice Conversion Challenge (VCC) series was launched in 2016 [4] with the goal of comparing several VC approaches in a controlled task setting, with a dedicated database and large-scale evaluation. This was then expanded to more difficult settings such as non-parallel conversion and cross-lingual VC in VCC 2018 [5] and VCC 2020 [6], respectively. Finally, VCC 2020 showed that top systems could already synthesize at a naturalness and similarity score similar to ground-truth recordings.
语音转换挑战赛(VCC)系列始于2016年[4],其目标是在一个受控的任务设置中,通过专用的数据库和大规模评估,来比较几种语音转换方法。随后,该挑战赛扩展到更困难的设置,例如分别在VCC 2018[5]和VCC 2020[6]中引入了非并行转换和跨语言语音转换。最终,VCC 2020的结果表明,顶尖系统合成的语音在自然度和相似度得分上已经接近真实录音。
标注词汇解释:
- Launched: (v.) 发起,启动 四级
- Approaches: (n.) 方法,途径 四级
- Controlled: (adj.) 受控的 四级
- Dedicated: (adj.) 专用的,专门的 四级
- Large-scale: (adj.) 大规模的 四级
- Expanded: (v.) 扩展,扩大 四级
- Difficult: (adj.) 困难的 四级
- Non-parallel: (adj.) 非并行的 专有名词/计算机领域
- Cross-lingual: (adj.) 跨语言的 专有名词/计算机领域
- Respectively: (adv.) 分别地 六级/托福
- Synthesize: (v.) 合成 六级/托福
- Naturalness: (n.) 自然度 四级 (由natural衍生)
- Similarity: (n.) 相似度 四级
- Similar: (adj.) 相似的 四级
- Ground-truth: (n.) 真实值,基准事实 专有名词/科研领域高频
With the rapid advancements of speech technology, VCC shifted its focus to singing voices, and Singing Voice Conversion Challenge (SVCC) 2023 [7] was held. Two new tasks were introduced, which were in-domain SVC and cross-domain SVC. To evaluate the naturalness and similarity, large-scale listening tests were conducted. For naturalness, the 5-scale mean opinion score (MOS) test was performed, and for similarity, a 4-scale speaker-likeness test was performed. The similarity scores were then calculated by the percentage of a system’s converted samples that were judged to be the same as the target speakers. Similar to VCC 2020, the large-scale listening tests show human-level naturalness was achieved by the top system, showing the huge advancements in the field of generative speech modeling. In particular, systems employing the VAE-GAN [8, 9] architecture and DSPGAN-based vocoders [10] performed very well in naturalness. However, unlike VCC 2020, no team was able to obtain a similarity score as high as the target speakers, showing that singing voices may have some intrinsic features that were not considered in the previous challenges.
随着语音技术的快速进步,VCC将其重点转向了歌声,并举办了2023年歌声转换挑战赛(SVCC)[7]。该届挑战赛引入了两项新任务:域内歌声转换和跨域歌声转换。为了评估自然度和相似度,进行了大规模听感测试。对于自然度,采用了5分制平均意见得分(MOS)测试;对于相似度,采用了4分制说话人相似度测试。相似度得分按系统转换的样本中被判定与目标说话人相同的百分比来计算。与VCC 2020类似,大规模听感测试表明顶级系统达到了人类水平的自然度,显示了生成式语音建模领域的巨大进步。特别是,采用VAE-GAN[8, 9]架构和基于DSPGAN的声码器[10]的系统在自然度方面表现非常好。然而,与VCC 2020不同,没有团队能够获得与目标说话人一样高的相似度得分,这表明歌声可能具有一些先前挑战中未考虑到的内在特征。
标注词汇解释:
- Advancements: (n.) 进步,进展 六级/托福
- Shifted: (v.) 转变,转移 四级
- Introduced: (v.) 引入,介绍 四级
- In-domain: (adj.) 域内的 专有名词/计算机领域
- Cross-domain: (adj.) 跨域的 专有名词/计算机领域
- Evaluate: (v.) 评估 四级
- Naturalness: (n.) 自然度 四级 (由natural衍生)
- Similarity: (n.) 相似度 四级
- Large-scale: (adj.) 大规模的 四级
- Conducted: (v.) 进行,实施 四级
- Performed: (v.) 执行,进行 四级
- Speaker-likeness: (n.) 说话人相似度 专有名词/语音领域
- Calculated: (v.) 计算 四级
- Percentage: (n.) 百分比 四级
- Converted: (v.) 转换 四级
- Judged: (v.) 判断,评定 四级
- Similar: (adj.) 相似的 四级
- Human-level: (adj.) 人类水平的 专有名词/AI领域
- Achieved: (v.) 达到,获得 四级
- Generative: (adj.) 生成的 六级/托福/AI领域
- Modeling: (v./n.) 建模,模拟 六级/托福
- Particular: (adj.) 特别的,特定的 四级
- Employing: (v.) 采用,使用 四级
- Architecture: (n.) 架构 六级/托福 (计算机领域)
- Vocoders: (n.) 声码器 专有名词/语音领域
- Performed: (v.) 表现,执行 四级
- Unlike: (prep.) 与…不同 四级
- Obtain: (v.) 获得 四级
- Intrinsic: (adj.) 固有的,内在的 六级/托福/GRE
- Considered: (v.) 考虑 四级
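上文提到,自然度用5分制MOS衡量,相似度得分则是被判定与目标说话人相同的转换样本所占的百分比。下面用Python给出一个最小示意,仅用于说明这两个数字的算法,并非挑战赛官方的评分脚本。函数名 `mos`、`similarity_percentage`,以及"4分制中评分不低于3即视为相同"的阈值 `same_threshold`,都是本示例的假设,原文并未给出具体的评分编码。

```python
from statistics import mean


def mos(ratings):
    """5分制平均意见得分:对 1~5 的听感评分取算术平均。"""
    if not ratings:
        raise ValueError("ratings must not be empty")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("MOS ratings must lie in 1..5")
    return mean(ratings)


def similarity_percentage(ratings, same_threshold=3):
    """4分制相似度得分:被判定为"相同"的样本所占百分比。

    假设:评分取 1~4,且评分 >= same_threshold 表示听者认为
    转换样本与目标说话人相同。该编码方式是示例假设,并非原文规定。
    """
    if not ratings:
        raise ValueError("ratings must not be empty")
    if any(r < 1 or r > 4 for r in ratings):
        raise ValueError("similarity ratings must lie in 1..4")
    same = sum(1 for r in ratings if r >= same_threshold)
    return 100.0 * same / len(ratings)


if __name__ == "__main__":
    # 虚构的评分,仅用于演示调用方式
    print(mos([5, 4, 4, 3]))
    print(similarity_percentage([4, 3, 2, 1]))
```

按这种算法,相似度得分只统计"相同/不同"的二元判断比例,4分制中的把握程度不参与计算。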
As singing voices contain more sophisticated and specific features than singer identity, controlling the singing style has more practical applications and needs to be investigated to advance the field of voice synthesis. With this in mind, we launch SVCC 2025. Compared to previous iterations which solely focused on converting the voice identity, this year’s iteration also focuses on singing style conversion (SSC). Compared to SVC which is limited to converting the global timbre and static features, SSC is a harder task as both the static and dynamic features of singing voices also need to be controlled. We introduce two new tasks: in-domain SSC and zero-shot SSC and develop a dedicated challenge dataset. Moreover, we pro
由于歌声包含比歌手身份更复杂、更具体的特征,控制演唱风格具有更多实际应用,需要加以研究以推动语音合成领域的发展。考虑到这一点,我们发起了SVCC 2025。与往届只专注于转换声音身份不同,本届挑战赛还关注演唱风格转换(SSC)。与仅限于转换全局音色和静态特征的SVC相比,SSC是一项更困难的任务,因为歌声的静态和动态特征都需要被控制。我们引入了两项新任务:域内SSC和零样本SSC,并开发了专用的挑战数据集。
标注词汇解释:
- Contain: (v.) 包含 四级
- Sophisticated: (adj.) 复杂的,精密的 六级/托福
- Specific: (adj.) 特定的,具体的 四级
- Identity: (n.) 身份,特性 四级
- Controlling: (v.) 控制 四级
- Practical: (adj.) 实际的,实用的 四级
- Applications: (n.) 应用,应用程序 四级
- Investigated: (v.) 调查,研究 四级
- Advance: (v.) 推进,促进 四级
- Synthesis: (n.) 合成 六级/托福
- Launch: (v.) 发起,启动 四级
- Compared: (v.) 比较 四级
- Iterations: (n.) 迭代,版本 六级/托福
- Solely: (adv.) 仅仅,唯一地 六级/托福
- Converting: (v.) 转换 四级
- Limited: (adj.) 有限的 四级
- Global: (adj.) 全局的,全球的 四级
- Timbre: (n.) 音色 专有名词/音乐领域
- Static: (adj.) 静态的 四级
- Dynamic: (adj.) 动态的 四级
- Controlled: (v.) 控制 四级
- Introduce: (v.) 引入,介绍 四级
- In-domain: (adj.) 域内的 专有名词/计算机领域
- Zero-shot: (adj.) 零样本的 专有名词/AI领域
- Dedicated: (adj.) 专用的,专门的 四级
- Dataset: (n.) 数据集 四级/计算机领域
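上文把音色等"静态特征"与颤音、滑音这类"动态信息"区分开来。为了直观感受这种区别,下面用Python画一条极简的基频(F0)曲线:平直的音高不随时间变化,颤音是围绕中心音高的周期性起伏,滑音是音高的连续滑动。这只是一个帮助理解的玩具示例,并非论文或基线系统的方法;函数名 `f0_contour` 及其全部参数(帧率、颤音频率与深度、滑音幅度)都是本示例的假设。

```python
import math


def f0_contour(base_hz, duration_s, frame_rate=100,
               vibrato_rate_hz=0.0, vibrato_depth_cents=0.0,
               glide_cents=0.0):
    """生成逐帧的基频序列(单位 Hz)。

    音高偏移以音分(cent)表示,1200 音分为一个八度。
    所有默认值均为示例假设。
    """
    n_frames = int(duration_s * frame_rate)
    contour = []
    for i in range(n_frames):
        t = i / frame_rate
        # 滑音:音高偏移随时间线性增长,从 0 逐渐接近 glide_cents
        glide = glide_cents * (t / duration_s)
        # 颤音:围绕中心音高的正弦式周期偏移
        vibrato = vibrato_depth_cents * math.sin(
            2 * math.pi * vibrato_rate_hz * t)
        contour.append(base_hz * 2 ** ((glide + vibrato) / 1200))
    return contour


if __name__ == "__main__":
    flat = f0_contour(220.0, 1.0)
    vib = f0_contour(220.0, 1.0, vibrato_rate_hz=6.0,
                     vibrato_depth_cents=50.0)
    glide = f0_contour(220.0, 1.0, glide_cents=1200.0)
    print(min(flat), max(flat))    # 平直音高:最小值与最大值相同
    print(min(vib), max(vib))      # 颤音:在中心音高上下摆动
    print(glide[0], glide[-1])     # 滑音:从起始音高持续上行
```

在这个示例里,平直音高只需一个数值即可描述,而颤音和滑音必须描述音高随时间的变化轨迹,这与原文"动态信息更难建模"的说法相对应。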
本次分享到此结束。