Compare commits


6 Commits

| Author | SHA1 | Message | Date |
|---|---|---|---|
| Geraint-Dou | dda0d28c89 | Update | 2023-03-11 17:00:36 +08:00 |
| 红血球AE3803 | df561ea1d4 | Update | 2023-03-11 14:25:47 +09:00 |
| 红血球AE3803 | 10763fe6a0 | Update Some Links | 2023-03-11 14:23:24 +09:00 |
| 红血球AE3803 | d696fc16ad | Update Some Links | 2023-03-11 14:19:53 +09:00 |
| 红血球AE3803 | c201080973 | Update to English | 2023-03-11 14:07:16 +09:00 |
| 红血球AE3803 | 4400fa45f2 | Create | 2023-03-11 13:11:14 +09:00 |
2 changed files with 336 additions and 157 deletions


@@ -1,157 +1,179 @@
# SoftVC VITS Singing Voice Conversion
## 使用规约
1. 本项目是基于学术交流目的建立,仅供交流与学习使用,并非为生产环境准备,请自行解决数据集的授权问题,任何由于使用非授权数据集进行训练造成的问题,需自行承担全部责任和一切后果!
2. 任何发布到视频平台的基于sovits制作的视频都必须要在简介明确指明用于变声器转换的输入源歌声、音频例如使用他人发布的视频/音频,通过分离的人声作为输入源进行转换的,必须要给出明确的原视频、音乐链接;若使用是自己的人声,或是使用其他歌声合成引擎合成的声音作为输入源进行转换的,也必须在简介加以说明。
3. 由输入源造成的侵权问题需自行承担全部责任和一切后果。使用其他商用歌声合成软件作为输入源时,请确保遵守该软件的使用条例,注意,许多歌声合成引擎使用条例中明确指明不可用于输入源进行转换!
4. 继续使用视为已同意本仓库README所述相关条例本仓库README已进行劝导义务不对后续可能存在问题负责。
5. 如将本仓库代码二次分发,或将由此项目产出的任何结果公开发表 (包括但不限于视频网站投稿),请注明原作者及代码来源 (此仓库)。
6. 如果将此项目用于任何其他企划,请提前联系并告知本仓库作者,十分感谢。
## update
> 更新了4.0-v2模型全部流程同4.0相比4.0在部分场景下有一定提升,但也有些情况有退步,在[4.0-v2分支]( 这是sovits最后一次更新
## 模型简介
歌声音色转换模型通过SoftVC内容编码器提取源音频语音特征与F0同时输入VITS替换原本的文本输入达到歌声转换的效果。同时更换声码器为 [NSF HiFiGAN]( 解决断音问题
### 4.0版本更新内容
+ 特征输入更换为 [Content Vec](
+ 采样率统一使用44100hz
+ 由于更改了hop size等参数以及精简了部分模型结构推理所需显存占用**大幅降低**4.0版本44khz显存占用甚至小于3.0版本的32khz
+ 调整了部分代码结构
+ 数据集制作、训练过程和3.0保持一致,但模型完全不通用,数据集也需要全部重新预处理
+ 增加了可选项 1vc模式自动预测音高f0,即转换语音时不需要手动输入变调key男女声的调能自动转换但仅限语音转换该模式转换歌声会跑调
+ 增加了可选项 2通过kmeans聚类方案减小音色泄漏即使得音色更加像目标音色
## 预先下载的模型文件
+ contentvec [](
+ 放在`hubert`目录下
+ 预训练底模文件: [G_0.pth]( 与 [D_0.pth](
+ 放在`logs/44k`目录下
# 一键下载
# contentvec
wget -P hubert/
# 也可手动下载放在hubert目录
# G与D预训练模型:
wget -P logs/44k/
wget -P logs/44k/
## 数据集准备
│ ├───xxx1-xxx1.wav
│ ├───...
│ └───Lxx-0xx8.wav
## 数据预处理
1. 重采样至 44100hz
2. 自动划分训练集 验证集 测试集 以及自动生成配置文件
3. 生成hubert与f0
执行完以上步骤后 dataset 目录便是预处理完成的数据可以删除dataset_raw文件夹了
## 训练
python -c configs/config.json -m 44k
训练时会自动清除老的模型只保留最新3个模型如果想防止过拟合需要自己手动备份模型记录点,或修改配置文件keep_ckpts 0为永不清除
## 推理
使用 [](
# 例
python -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "君の知らない物語-src.wav" -t 0 -s "nen"
+ -m, --model_path模型路径。
+ -c, --config_path配置文件路径。
+ -n, --clean_nameswav 文件名列表,放在 raw 文件夹下。
+ -t, --trans音高调整支持正负半音
+ -s, --spk_list合成目标说话人名称。
+ -a, --auto_predict_f0语音转换自动预测音高转换歌声时不要打开这个会严重跑调。
+ -cm, --cluster_model_path聚类模型路径如果没有训练聚类则随便填。
+ -cr, --cluster_infer_ratio聚类方案占比范围 0-1若没有训练聚类模型则填 0 即可。
## 可选项
### 自动f0预测
+ 在inference_main中设置auto_predict_f0为true即可
### 聚类音色泄漏控制
可以线性控制聚类方案与非聚类方案的占比,也就是可以手动在"像目标音色" 和 "咬字清晰" 之间调整比例,找到合适的折中点。
+ 训练过程:
+ 使用cpu性能较好的机器训练据我的经验在腾讯云6核cpu训练每个speaker需要约4分钟即可完成训练
+ 执行python cluster/ ,模型的输出会在 logs/44k/
+ 推理过程:
+ inference_main中指定cluster_model_path
+ inference_main中指定cluster_infer_ratio0为完全不使用聚类1为只使用聚类通常设置0.5即可
## Onnx导出
使用 [](
+ 新建文件夹:`checkpoints` 并打开
+ 在`checkpoints`文件夹中新建一个文件夹作为项目文件夹,文件夹名为你的项目名称,比如`aziplayer`
+ 将你的模型更名为`model.pth`,配置文件更名为`config.json`,并放置到刚才创建的`aziplayer`文件夹下
+ 将 []( 中`path = "NyaruTaffy"` 的 `"NyaruTaffy"` 修改为你的项目名称,`path = "aziplayer"`
+ 运行 [](
+ 等待执行完毕,在你的项目文件夹下会生成一个`model.onnx`,即为导出的模型
### Onnx模型支持的UI
+ [MoeSS](
+ 注意Hubert Onnx模型请使用MoeSS提供的模型目前无法自行导出fairseq中Hubert有不少onnx不支持的算子和涉及到常量的东西在导出时会报错或者导出的模型输入输出shape和结果都有问题
## 一些法律条文参考
#### 《民法典》
##### 第一千零一十九条
##### 第一千零二十四条
##### 第一千零二十七条
# SoftVC VITS Singing Voice Conversion
## Terms of Use
1. This project was created for academic exchange and is intended for communication and learning only; it is not meant for production environments. You must resolve dataset licensing yourself, and you bear sole responsibility for any problems, and all consequences, arising from training on unauthorized datasets.
2. Any video based on sovits that is published on a video platform must clearly state in its description the input source (singing voice or audio) used for the voice conversion. For example, if you convert vocals separated from a video or audio track published by someone else, you must link to the original video or music. If the input source is your own voice, or a voice synthesized by another singing voice synthesis engine, you must also say so in the description.
3. You bear sole responsibility for any infringement caused by the input source. When using other commercial singing voice synthesis software as the input source, make sure you comply with that software's terms of use. Note that many singing voice synthesis engines explicitly state in their terms of use that their output may not be used as an input source for conversion.
4. Continued use of this project is deemed acceptance of the terms stated in this repository's README. This README has fulfilled its duty to advise and accepts no responsibility for any problems that may arise afterwards.
5. If you redistribute this repository's code, or publicly publish any results produced by this project (including but not limited to uploads to video sharing platforms), please credit the original author and the code source (this repository).
6. If you use this project in any other project, please contact and inform the author of this repository in advance. Thank you very much.
## Update
> The 4.0-v2 model has been released; the entire workflow is the same as 4.0. Compared with 4.0, it improves in some scenarios but regresses in others. Please refer to the [4.0-v2 branch]( for more information.
## Model Introduction
The singing voice conversion model uses the SoftVC content encoder to extract speech features from the source audio and feeds them, together with F0, into VITS in place of the original text input to achieve singing voice conversion. The vocoder is also replaced with [NSF HiFiGAN]( to solve the problem of sound interruption.
### 4.0 Version Update Content
- Feature input is changed to [Content Vec](
- The sampling rate is unified to 44100 Hz
- Because the hop size and other parameters were changed and parts of the model structure were streamlined, the GPU memory required for inference is **significantly reduced**. Version 4.0 at 44 kHz uses even less GPU memory than version 3.0 at 32 kHz.
- Some of the code structure has been adjusted
- Dataset creation and the training process are the same as in version 3.0, but models are not interchangeable between versions, and the dataset must be fully preprocessed again.
- Added option 1: automatic pitch (f0) prediction in vc mode, so you do not need to enter a pitch shift key manually when converting speech, and pitch is converted automatically between male and female voices. This applies to speech conversion only; converting singing in this mode will go off-key.
- Added option 2: reduce timbre leakage through a k-means clustering scheme, making the timbre closer to the target timbre.
## Pre-trained Model Files
- ContentVec: [](
- Place it under the `hubert` directory
- Pre-trained model files: [G_0.pth]( and [D_0.pth](
- Place them under the `logs/44k` directory
# One-click download
# contentvec
wget -P hubert/
# Alternatively, you can manually download and place it in the hubert directory
# Pre-trained G and D models:
wget -P logs/44k/
wget -P logs/44k/
## Dataset Preparation
Simply place the dataset in the `dataset_raw` directory with the following file structure.
│ ├───xxx1-xxx1.wav
│ ├───...
│ └───Lxx-0xx8.wav
## Preprocessing
1. Resample to 44100 Hz
2. Automatically split the dataset into training, validation, and test sets, and generate configuration files
3. Generate hubert and f0
After completing the above steps, the dataset directory will contain the preprocessed data, and the dataset_raw folder can be deleted.
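Step 2 above can be pictured with a minimal sketch like the following. It only illustrates a shuffled train/validation/test split over a list of wav files; the function names, the split sizes and the seed are assumptions made for this example and are not taken from the project's preprocessing scripts.

```python
import random
from pathlib import Path


def collect_wavs(root):
    """Return every .wav file found anywhere below `root` (hypothetical helper)."""
    return sorted(Path(root).rglob("*.wav"))


def split_file_list(wav_paths, n_val=2, n_test=2, seed=1234):
    """Shuffle wav paths and split them into train / val / test lists.

    n_val, n_test and seed are illustrative values, not the project's defaults.
    """
    paths = sorted(str(p) for p in wav_paths)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(paths)
    val = paths[:n_val]
    test = paths[n_val:n_val + n_test]
    train = paths[n_val + n_test:]
    return train, val, test
```

A typical call would be `split_file_list(collect_wavs("dataset_raw"))`; the three returned lists never overlap and together cover every input file.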
## Training
python -c configs/config.json -m 44k
Note: During training, old checkpoints are removed automatically and only the latest three are kept. If you want to guard against overfitting, back up checkpoints manually, or set `keep_ckpts` to 0 in the configuration file so that checkpoints are never removed.
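The retention behaviour described in the note can be sketched roughly as below. This is an assumption-laden illustration, not the project's own code: the function name `prune_checkpoints` and the `<prefix><step>.pth` naming rule are inferred from file names such as `G_0.pth` and `G_30400.pth` that appear in this README.

```python
import re
from pathlib import Path


def prune_checkpoints(log_dir, prefix="G_", keep_ckpts=3):
    """Delete all but the newest `keep_ckpts` files named <prefix><step>.pth.

    keep_ckpts == 0 means "never clear", matching the note above.
    This sketch does not special-case the pre-trained G_0.pth / D_0.pth files.
    Returns the list of paths that were removed.
    """
    if keep_ckpts == 0:
        return []
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)\.pth$")
    found = []
    for path in Path(log_dir).iterdir():
        match = pattern.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    found.sort(key=lambda item: item[0])  # oldest (smallest step) first
    to_remove = [path for _, path in found[:-keep_ckpts]]
    for path in to_remove:
        path.unlink()
    return to_remove
```

Sorting by the numeric step rather than by file name avoids treating `G_9000.pth` as newer than `G_10000.pth`.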
## Inference
Use [](
Up to this point, the usage of version 4.0 (training and inference) is exactly the same as version 3.0, with no changes (inference now has command line support).
# Example
python -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "君の知らない物語-src.wav" -t 0 -s "nen"
Required parameters:
- -m, --model_path: path to the model.
- -c, --config_path: path to the configuration file.
- -n, --clean_names: a list of wav file names located in the raw folder.
- -t, --trans: pitch adjustment, supports positive and negative (semitone) values.
- -s, --spk_list: target speaker name for synthesis.
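For intuition about the semitone values accepted by `--trans`: in equal temperament, one semitone corresponds to a frequency ratio of 2^(1/12). The sketch below applies that ratio to a pitch value. Whether the project applies the shift to F0 in exactly this way is an assumption, and `shift_f0` is a hypothetical name.

```python
def shift_f0(f0_hz, semitones):
    """Shift a frequency in Hz by a number of equal-tempered semitones.

    Positive values raise the pitch, negative values lower it.
    """
    return f0_hz * 2.0 ** (semitones / 12.0)
```

Under this convention, a value of 12 doubles the frequency (one octave up) and -12 halves it.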
Optional parameters: see the next section
- -a, --auto_predict_f0: automatic pitch prediction for speech conversion. Do not enable this when converting singing, as it will go badly off-key.
- -cm, --cluster_model_path: path to the clustering model, fill in any value if clustering is not trained.
- -cr, --cluster_infer_ratio: proportion of the clustering solution, range 0-1, fill in 0 if the clustering model is not trained.
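The flags listed above could be declared with `argparse` along the following lines. The flag names come from this README; the types, defaults, `nargs` choices and the range check are assumptions for illustration and may differ from the actual inference script.

```python
import argparse


def ratio_0_to_1(text):
    """Parse a float and reject values outside the documented 0-1 range."""
    value = float(text)
    if not 0.0 <= value <= 1.0:
        raise argparse.ArgumentTypeError("cluster_infer_ratio must be between 0 and 1")
    return value


def build_parser():
    parser = argparse.ArgumentParser(description="Sketch of the inference flags listed above")
    # Required parameters
    parser.add_argument("-m", "--model_path", required=True, help="path to the model")
    parser.add_argument("-c", "--config_path", required=True, help="path to the configuration file")
    parser.add_argument("-n", "--clean_names", nargs="+", required=True,
                        help="wav file names located in the raw folder")
    parser.add_argument("-t", "--trans", type=int, required=True,
                        help="pitch adjustment in semitones (positive or negative)")
    parser.add_argument("-s", "--spk_list", nargs="+", required=True,
                        help="target speaker name(s) for synthesis")
    # Optional parameters (defaults here are assumptions)
    parser.add_argument("-a", "--auto_predict_f0", action="store_true",
                        help="automatic pitch prediction; do not enable for singing")
    parser.add_argument("-cm", "--cluster_model_path", default="",
                        help="path to the clustering model")
    parser.add_argument("-cr", "--cluster_infer_ratio", type=ratio_0_to_1, default=0.0,
                        help="proportion of the clustering solution, range 0-1")
    return parser
```

With these declarations, omitting the optional flags leaves automatic pitch prediction off and the clustering ratio at 0, which matches the guidance to use 0 when no clustering model has been trained.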
## Optional Settings
If the results from the previous sections are satisfactory, or if the following section does not make sense to you, you can skip it without affecting normal use of the model. (These optional settings have a relatively small impact: they may help on certain specific data, but in most cases the difference is not noticeable.)
### Automatic f0 prediction
During 4.0 model training, an f0 predictor is trained as well, and it can be used to predict pitch automatically during speech conversion. If the result is not good, set the pitch manually instead. Do not enable this feature when converting singing, as it will go badly off-key!
- Set "auto_predict_f0" to true in inference_main.
### Cluster-based timbre leakage control
Introduction: Clustering can reduce timbre leakage and make the trained model sound more like the target timbre (although the effect is not very pronounced), but using clustering alone lowers the model's articulation (the output may sound slurred). This model therefore uses a fusion approach that linearly controls the proportion of the clustering and non-clustering schemes. In other words, you can manually adjust the ratio between "sounding like the target timbre" and "clear articulation" to find a suitable trade-off.
The existing steps before clustering do not need to be changed. All you need to do is to train an additional clustering model, which has a relatively low training cost.
- Training process:
- Train on a machine with good CPU performance. In my experience, training takes about 4 minutes per speaker on a Tencent Cloud 6-core CPU.
- Execute "python cluster/". The output of the model will be saved in "logs/44k/".
- Inference process:
- Specify "cluster_model_path" in inference_main.
- Specify "cluster_infer_ratio" in inference_main, where 0 means not using clustering at all, 1 means only using clustering, and usually 0.5 is sufficient.
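The linear control between the clustering and non-clustering schemes can be sketched as a simple interpolation between a content feature vector and its nearest cluster center. This is a plain-Python illustration under stated assumptions: the function names are hypothetical, and the real model operates on frame-level feature tensors rather than short lists.

```python
def nearest_center(feature, centers):
    """Return the cluster center closest to `feature` (squared Euclidean distance)."""
    def sq_dist(center):
        return sum((f - c) ** 2 for f, c in zip(feature, center))
    return min(centers, key=sq_dist)


def blend_with_cluster(feature, cluster_center, cluster_infer_ratio):
    """Linearly mix a feature with its cluster center.

    ratio 0 keeps the original feature, ratio 1 uses only the cluster center.
    """
    if not 0.0 <= cluster_infer_ratio <= 1.0:
        raise ValueError("cluster_infer_ratio must be between 0 and 1")
    r = cluster_infer_ratio
    return [r * c + (1.0 - r) * f for f, c in zip(feature, cluster_center)]
```

At a ratio of 0.5 each output value is the midpoint between the original feature and the cluster center, which is the trade-off point suggested above as a usual starting value.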
## Exporting to Onnx
Use [](
- Create a folder named `checkpoints` and open it
- Create a folder in the `checkpoints` folder as your project folder, naming it after your project, for example `aziplayer`
- Rename your model as `model.pth`, the configuration file as `config.json`, and place them in the `aziplayer` folder you just created
- Modify `"NyaruTaffy"` in `path = "NyaruTaffy"` in []( to your project name, `path = "aziplayer"`
- Run [](
- Wait for it to finish running. A `model.onnx` will be generated in your project folder, which is the exported model.
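The folder preparation in the first three steps can also be scripted. The sketch below is one possible way to do it, assuming you prefer to copy the files rather than rename them in place; `prepare_export_folder` is a hypothetical helper and is not part of the repository.

```python
import shutil
from pathlib import Path


def prepare_export_folder(model_file, config_file, project_name, root="checkpoints"):
    """Create <root>/<project_name> and place model.pth and config.json inside it.

    The source files are copied (not moved), so the originals stay untouched.
    Returns the project folder path.
    """
    project_dir = Path(root) / project_name
    project_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(model_file, project_dir / "model.pth")
    shutil.copy2(config_file, project_dir / "config.json")
    return project_dir
```

For example, `prepare_export_folder("logs/44k/G_30400.pth", "configs/config.json", "aziplayer")` would produce `checkpoints/aziplayer/model.pth` and `checkpoints/aziplayer/config.json`; the `path` variable in the export script still has to be edited by hand as described above.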
### UI support for Onnx models
- [MoeSS](
Note: For Hubert Onnx models, please use the models provided by MoeSS. They currently cannot be exported by yourself: Hubert in fairseq uses many operators that ONNX does not support, as well as constructs involving constants, so the export either fails or produces a model whose input/output shapes and results are wrong. [Hubert4.0](
## Some legal provisions for reference
#### Civil Code (《民法典》)
##### Article 1019
##### Article 1024
##### Article 1027

doc/ Normal file

@@ -0,0 +1,157 @@
# SoftVC VITS Singing Voice Conversion
## 使用规约
1. 本项目是基于学术交流目的建立,仅供交流与学习使用,并非为生产环境准备,请自行解决数据集的授权问题,任何由于使用非授权数据集进行训练造成的问题,需自行承担全部责任和一切后果!
2. 任何发布到视频平台的基于sovits制作的视频都必须要在简介明确指明用于变声器转换的输入源歌声、音频例如使用他人发布的视频/音频,通过分离的人声作为输入源进行转换的,必须要给出明确的原视频、音乐链接;若使用是自己的人声,或是使用其他歌声合成引擎合成的声音作为输入源进行转换的,也必须在简介加以说明。
3. 由输入源造成的侵权问题需自行承担全部责任和一切后果。使用其他商用歌声合成软件作为输入源时,请确保遵守该软件的使用条例,注意,许多歌声合成引擎使用条例中明确指明不可用于输入源进行转换!
4. 继续使用视为已同意本仓库README所述相关条例本仓库README已进行劝导义务不对后续可能存在问题负责。
5. 如将本仓库代码二次分发,或将由此项目产出的任何结果公开发表 (包括但不限于视频网站投稿),请注明原作者及代码来源 (此仓库)。
6. 如果将此项目用于任何其他企划,请提前联系并告知本仓库作者,十分感谢。
## update
> 更新了4.0-v2模型全部流程同4.0相比4.0在部分场景下有一定提升,但也有些情况有退步,在[4.0-v2分支](
## 模型简介
歌声音色转换模型通过SoftVC内容编码器提取源音频语音特征与F0同时输入VITS替换原本的文本输入达到歌声转换的效果。同时更换声码器为 [NSF HiFiGAN]( 解决断音问题
### 4.0版本更新内容
+ 特征输入更换为 [Content Vec](
+ 采样率统一使用44100hz
+ 由于更改了hop size等参数以及精简了部分模型结构推理所需显存占用**大幅降低**4.0版本44khz显存占用甚至小于3.0版本的32khz
+ 调整了部分代码结构
+ 数据集制作、训练过程和3.0保持一致,但模型完全不通用,数据集也需要全部重新预处理
+ 增加了可选项 1vc模式自动预测音高f0,即转换语音时不需要手动输入变调key男女声的调能自动转换但仅限语音转换该模式转换歌声会跑调
+ 增加了可选项 2通过kmeans聚类方案减小音色泄漏即使得音色更加像目标音色
## 预先下载的模型文件
+ contentvec [](
+ 放在`hubert`目录下
+ 预训练底模文件: [G_0.pth]( 与 [D_0.pth](
+ 放在`logs/44k`目录下
# 一键下载
# contentvec
wget -P hubert/
# 也可手动下载放在hubert目录
# G与D预训练模型:
wget -P logs/44k/
wget -P logs/44k/
## 数据集准备
│ ├───xxx1-xxx1.wav
│ ├───...
│ └───Lxx-0xx8.wav
## 数据预处理
1. 重采样至 44100hz
2. 自动划分训练集 验证集 测试集 以及自动生成配置文件
3. 生成hubert与f0
执行完以上步骤后 dataset 目录便是预处理完成的数据可以删除dataset_raw文件夹了
## 训练
python -c configs/config.json -m 44k
训练时会自动清除老的模型只保留最新3个模型如果想防止过拟合需要自己手动备份模型记录点,或修改配置文件keep_ckpts 0为永不清除
## 推理
使用 [](
# 例
python -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "君の知らない物語-src.wav" -t 0 -s "nen"
+ -m, --model_path模型路径。
+ -c, --config_path配置文件路径。
+ -n, --clean_nameswav 文件名列表,放在 raw 文件夹下。
+ -t, --trans音高调整支持正负半音
+ -s, --spk_list合成目标说话人名称。
+ -a, --auto_predict_f0语音转换自动预测音高转换歌声时不要打开这个会严重跑调。
+ -cm, --cluster_model_path聚类模型路径如果没有训练聚类则随便填。
+ -cr, --cluster_infer_ratio聚类方案占比范围 0-1若没有训练聚类模型则填 0 即可。
## 可选项
### 自动f0预测
+ 在inference_main中设置auto_predict_f0为true即可
### 聚类音色泄漏控制
可以线性控制聚类方案与非聚类方案的占比,也就是可以手动在"像目标音色" 和 "咬字清晰" 之间调整比例,找到合适的折中点。
+ 训练过程:
+ 使用cpu性能较好的机器训练据我的经验在腾讯云6核cpu训练每个speaker需要约4分钟即可完成训练
+ 执行python cluster/ ,模型的输出会在 logs/44k/
+ 推理过程:
+ inference_main中指定cluster_model_path
+ inference_main中指定cluster_infer_ratio0为完全不使用聚类1为只使用聚类通常设置0.5即可
## Onnx导出
使用 [](
+ 新建文件夹:`checkpoints` 并打开
+ 在`checkpoints`文件夹中新建一个文件夹作为项目文件夹,文件夹名为你的项目名称,比如`aziplayer`
+ 将你的模型更名为`model.pth`,配置文件更名为`config.json`,并放置到刚才创建的`aziplayer`文件夹下
+ 将 []( 中`path = "NyaruTaffy"` 的 `"NyaruTaffy"` 修改为你的项目名称,`path = "aziplayer"`
+ 运行 [](
+ 等待执行完毕,在你的项目文件夹下会生成一个`model.onnx`,即为导出的模型
### Onnx模型支持的UI
+ [MoeSS](
+ 注意Hubert Onnx模型请使用MoeSS提供的模型目前无法自行导出fairseq中Hubert有不少onnx不支持的算子和涉及到常量的东西在导出时会报错或者导出的模型输入输出shape和结果都有问题
## 一些法律条文参考
#### 《民法典》
##### 第一千零一十九条
##### 第一千零二十四条
##### 第一千零二十七条