Compare commits

...

6 Commits

Author SHA1 Message Date
謬紗特 08c70ff3d2
Update README.md 2023-03-12 16:05:05 +08:00
謬紗特 efa1e46cae
update README.md 2023-03-12 16:03:59 +08:00
謬紗特 13a6a645fa
add colab notebook 2023-03-12 15:59:50 +08:00
謬紗特 c092d25716
Increase convenience for colab training 2023-03-12 15:59:29 +08:00
红血球AE3803 d40ae694fe
Add files via upload 2023-03-12 16:39:00 +09:00
红血球AE3803 a109d84483
Add files via upload 2023-03-12 16:37:51 +09:00
5 changed files with 564 additions and 329 deletions

328
README.md

@@ -1,140 +1,188 @@
# SoftVC VITS Singing Voice Conversion
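## Terms of Use
1. Please resolve dataset authorization yourself. You bear full responsibility for any problem caused by training on unauthorized datasets and for all consequences; sovits has nothing to do with it.
2. Any video based on sovits that is published on a video platform must clearly state in its description the input source (singing voice or audio) used for the voice conversion. For example, if you use vocals separated from a video or audio published by someone else as the input source, you must give a clear link to the original video or music. If you use your own voice, or a voice synthesized by another singing voice synthesis engine, as the input source, you must also say so in the description.
3. You bear full responsibility for any infringement caused by the input source and for all consequences. When using other commercial singing voice synthesis software as the input source, make sure you comply with that software's terms of use. Note that the terms of use of many singing voice synthesis engines explicitly state that their output may not be used as an input source for conversion!
## Model Introduction
A singing voice timbre conversion model. It uses [Content Vec](https://github.com/auspicious3000/contentvec) to extract content features and feeds them into the visinger2 model to synthesize the target voice.
### 4.0 v2 update content
+ The model architecture has been completely changed to the [visinger2](https://github.com/zhangyongmao/VISinger2) architecture
+ Everything else is exactly the same as 4.0
### 4.0 v2 features
+ Some improvement over 4.0 in certain scenarios (for example, the electrical-noise problem in breath sounds in some cases)
+ Some regression in other scenarios: for example, results trained on the 猫雷 data are not as good as with 4.0, and in some cases it synthesizes badly glitched sound
+ As for whether to train the older version or v2, try the demo below and the demo on the 4.0 branch, then compare and decide
+ 4.0-v2 is the last version of sovits and will receive no further updates. Once it is basically verified to have no major bugs, sovits will be archived
Online demo: [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/innnky/sovits4.0-v2)
## Note
+ The entire 4.0-v2 workflow is the same as 4.0 and the environment is the same as 4.0; data preprocessed for 4.0 and the 4.0 environment can be used directly
+ The differences from 4.0 are:
+ Models are **completely** incompatible: old models cannot be used, and the pretrained base model must also be the brand-new one. Make sure you load the correct base model, otherwise training will take extremely long!
+ The config file structure is very different, so do not use an old config. If you are using a 4.0 dataset, you only need to run the preprocess_flist_config.py step to generate a new config
## Model files to download in advance
+ contentvec: [checkpoint_best_legacy_500.pt](https://ibm.box.com/s/z1wgl1stco8ffooyatzdwsqn2psd9lrr)
+ Place it in the `hubert` directory
+ Pretrained base model files: [G_0.pth](https://huggingface.co/innnky/sovits_pretrained/resolve/main/sovits4.0-v2/G_0.pth) and [D_0.pth](https://huggingface.co/innnky/sovits_pretrained/resolve/main/sovits4.0-v2/D_0.pth)
+ Place them in the `logs/44k` directory
+ The training data of the pretrained base model covers the common vocal ranges of male and female voices, so it can be regarded as a fairly general-purpose base model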
```shell
# One-step download
# contentvec
wget -P hubert/ http://obs.cstcloud.cn/share/obs/sankagenkeshi/checkpoint_best_legacy_500.pt
# Alternatively, download it manually and place it in the hubert directory
# G and D pretrained models:
wget -P logs/44k/ https://huggingface.co/innnky/sovits_pretrained/resolve/main/sovits4.0-v2/G_0.pth
wget -P logs/44k/ https://huggingface.co/innnky/sovits_pretrained/resolve/main/sovits4.0-v2/D_0.pth
```
[//]: # (## One-click Colab script for dataset creation and training)
[//]: # ([![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/19fxpo-ZoL_ShEUeZIZi6Di-YioWrEyhR#scrollTo=0gQcIZ8RsOkn))
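The rest of this README is the same as for 4.0, with no changes
## Dataset Preparation
Simply place the dataset in the dataset_raw directory with the following file structure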
```shell
dataset_raw
├───speaker0
│   ├───xxx1-xxx1.wav
│   ├───...
│   └───Lxx-0xx8.wav
└───speaker1
    ├───xx2-0xxx2.wav
    ├───...
    └───xxx7-xxx007.wav
```
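## Preprocessing
1. Resample to 44100 Hz
```shell
python resample.py
```
2. Automatically split the dataset into training, validation, and test sets, and generate the configuration file
```shell
python preprocess_flist_config.py
```
3. Generate hubert and f0
```shell
python preprocess_hubert_f0.py
```
After completing the steps above, the dataset directory contains the preprocessed data, and the dataset_raw folder can be deleted
## Training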
```shell
python train.py -c configs/config.json -m 44k
```
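During training, old models are cleared automatically and only the latest 3 models are kept. If you want to guard against overfitting, back up the model checkpoints manually, or set keep_ckpts to 0 in the configuration file so that they are never cleared
## Inference
Use [inference_main.py](inference_main.py)
Up to this point, using 4.0 (training and inference) is exactly the same as 3.0, with no changes (inference has gained command-line support)
```shell
# Example
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "君の知らない物語-src.wav" -t 0 -s "nen"
```
Required parameters:
+ -m, --model_path: path to the model.
+ -c, --config_path: path to the configuration file.
+ -n, --clean_names: list of wav file names, placed in the raw folder.
+ -t, --trans: pitch adjustment, supports positive and negative values (in semitones).
+ -s, --spk_list: name of the target speaker for synthesis.
Optional parameters: see the next section
+ -a, --auto_predict_f0: automatically predict pitch for speech conversion. Do not enable this when converting singing voice, as it will go badly off-key.
+ -cm, --cluster_model_path: path to the clustering model; fill in anything if no clustering model has been trained.
+ -cr, --cluster_infer_ratio: proportion of the clustering scheme, range 0-1; fill in 0 if no clustering model has been trained.
## Optional Settings
If you are already satisfied with the results so far, or do not follow what is discussed below, you can ignore everything that follows without affecting use of the model. (These options have a fairly small impact; they may help a little on certain specific data, but in most cases the difference is hardly noticeable.)
### Automatic f0 prediction
Training a 4.0 model also trains an f0 predictor. For speech conversion you can enable automatic pitch prediction, and if the result is poor you can still set the pitch manually. Do not enable this feature when converting singing voice, as it will go badly off-key!
+ Set auto_predict_f0 to true in inference_main
### Cluster-based timbre leakage control
Introduction: the clustering scheme reduces timbre leakage so that the trained model sounds more like the target timbre (although the effect is not very obvious). Using clustering alone, however, degrades the model's articulation (pronunciation becomes slurred), and that effect is very obvious. This model therefore uses a fusion approach that linearly controls the proportion of the clustering and non-clustering schemes. In other words, you can manually adjust the ratio between "sounding like the target timbre" and "clear articulation" to find a suitable trade-off.
Using clustering requires no change to any of the earlier steps; you only need to train one additional clustering model. The effect is fairly limited, but the training cost is also fairly low
+ Training:
+ Train on a machine with good CPU performance. In my experience, training takes about 4 minutes per speaker on a Tencent Cloud 6-core CPU
+ Run python cluster/train_cluster.py; the model is written to logs/44k/kmeans_10000.pt
+ Inference:
+ Specify cluster_model_path in inference_main
+ Specify cluster_infer_ratio in inference_main: 0 means clustering is not used at all, 1 means only clustering is used, and 0.5 is usually a good setting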
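## Exporting to Onnx
Use [onnx_export.py](onnx_export.py)
+ Create a folder named `checkpoints` and open it
+ Inside the `checkpoints` folder, create a folder to serve as the project folder, named after your project, for example `aziplayer`
+ Rename your model to `model.pth` and the configuration file to `config.json`, and place them in the `aziplayer` folder you just created
+ In [onnx_export.py](onnx_export.py), change `"NyaruTaffy"` in `path = "NyaruTaffy"` to your project name, for example `path = "aziplayer"`
+ Run [onnx_export.py](onnx_export.py)
+ Wait for it to finish. A `model.onnx` file is generated in your project folder; this is the exported model
### UI support for Onnx models
+ [MoeSS](https://github.com/NaruseMioShirakana/MoeSS)
+ I removed all training-only functions and every complicated transpose, without keeping a single line, because I believe that only with these removed can you tell that you are really using Onnx
+ Note: for the Hubert Onnx model, please use the model provided by MoeSS. It cannot currently be exported by yourself (Hubert in fairseq has quite a few operators that onnx does not support, as well as things involving constants, so export either fails with errors or produces a model whose input/output shapes and results are wrong)
[Hubert4.0](https://huggingface.co/NaruseMioShirakana/MoeSS-SUBModel)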
# SoftVC VITS Singing Voice Conversion
[**English**](./README.md) | [**中文简体**](./README_zh_CN.md)
## Terms of Use
1. This project is established for academic exchange purposes only and is intended for communication and learning purposes. It is not intended for production environments. Please solve the authorization problem of the dataset on your own. You shall be solely responsible for any problems caused by the use of non-authorized datasets for training and all consequences thereof.
2. Any video based on sovits that is published on a video platform must clearly state in its description the input source (singing voice or audio) used for the voice conversion. For example, if you use vocals separated from a video or audio published by someone else as the input source, you must give a clear link to the original video or music. If you use your own voice, or a voice synthesized by another singing voice synthesis engine, as the input source, you must also say so in the description.
3. You shall be solely responsible for any infringement problems caused by the input source. When using other commercial vocal synthesis software as the input source, please ensure that you comply with the terms of use of that software. Note that many vocal synthesis engines clearly state in their terms of use that their output may not be used as an input source for conversion.
4. Continuing to use this project is deemed as agreeing to the relevant provisions stated in this repository's README. This README has fulfilled its duty to advise and is not responsible for any problems that may arise later.
5. If you distribute this repository's code or publish any results produced by this project publicly (including but not limited to video sharing platforms), please indicate the original author and code source (this repository).
6. If you use this project for any other project, please contact and inform the author of this repository in advance. Thank you very much.
## Model Introduction
The singing voice conversion model uses SoftVC content encoder to extract source audio speech features, and inputs them together with F0 to replace the original text input to achieve the effect of song conversion. At the same time, the vocoder is changed to [NSF HiFiGAN](https://github.com/openvpi/DiffSinger/tree/refactor/modules/nsf_hifigan) to solve the problem of sound interruption.
### 4.0 v2 update content
+ The model architecture has been completely changed to the [visinger2](https://github.com/zhangyongmao/VISinger2) architecture
+ Others are exactly the same as [4.0](https://github.com/svc-develop-team/so-vits-svc/tree/4.0).
### 4.0 v2 features
+ It improves on 4.0 in some scenarios, for example the electrical-noise problem in breath sounds.
+ It also regresses in some scenarios. For example, a model trained on data from VTuber streams is not as good as with [4.0](https://github.com/svc-develop-team/so-vits-svc/tree/4.0), and in some cases it produces badly distorted sound.
+ [4.0-v2](https://github.com/svc-develop-team/so-vits-svc/tree/4.0-v2) is the last version of sovits; there will be no further updates.
## Note
+ [4.0-v2](https://github.com/svc-develop-team/so-vits-svc/tree/4.0-v2) and [4.0](https://github.com/svc-develop-team/so-vits-svc/tree/4.0) share almost the same workflow, including preprocessing and environment requirements.
+ The differences from 4.0 are:
+ The models are **completely incompatible**: old models cannot be used. If you use pretrained models, check that they match this version.
+ The structure of the config file has changed a lot, so do not reuse an old config. If you are using a dataset preprocessed with 4.0, you only need to run `python preprocess_flist_config.py` to generate a new `config.json`.
## Pre-trained Model Files
#### **Required**
- ContentVec: [checkpoint_best_legacy_500.pt](https://ibm.box.com/s/z1wgl1stco8ffooyatzdwsqn2psd9lrr)
- Place it under the `hubert` directory
```shell
# contentvec
wget -P hubert/ http://obs.cstcloud.cn/share/obs/sankagenkeshi/checkpoint_best_legacy_500.pt
# Alternatively, you can manually download and place it in the hubert directory
```
#### **Optional (strongly recommended)**
- Pre-trained model files: `G_0.pth` `D_0.pth`
- Place them under the `logs/44k` directory
Get them from svc-develop-team (TBD) or anywhere else.
Although pretrained models generally do not cause copyright problems, please stay alert to them: for example, ask the author in advance, or check that the author has clearly stated the permitted uses in the model description.
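As a quick sanity check before training, the file placement described in this section can be verified with a short script. This is only a sketch: the helper name `missing_model_files` and its `repo_root` parameter are made up for illustration, and the paths are simply the ones listed above.

```python
from pathlib import Path

# Paths taken from this section; the first is required, the other two are
# the optional (strongly recommended) pretrained G/D models.
EXPECTED_FILES = [
    "hubert/checkpoint_best_legacy_500.pt",
    "logs/44k/G_0.pth",
    "logs/44k/D_0.pth",
]


def missing_model_files(repo_root="."):
    """Return the expected model files that are not present under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_model_files()` from the repository root returns an empty list when all three files are in place.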
## Dataset Preparation
Simply place the dataset in the `dataset_raw` directory with the following file structure.
```shell
dataset_raw
├───speaker0
│   ├───xxx1-xxx1.wav
│   ├───...
│   └───Lxx-0xx8.wav
└───speaker1
    ├───xx2-0xxx2.wav
    ├───...
    └───xxx7-xxx007.wav
```
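To confirm that a dataset follows this layout, something like the following sketch can list each speaker folder with its number of `.wav` files. The function name `summarize_dataset` is hypothetical and not part of the repository.

```python
from pathlib import Path


def summarize_dataset(root="dataset_raw"):
    """Return {speaker_name: number_of_wav_files} for each speaker folder."""
    summary = {}
    for speaker_dir in sorted(Path(root).iterdir()):
        if not speaker_dir.is_dir():
            continue  # ignore stray files next to the speaker folders
        wav_count = sum(
            1 for f in speaker_dir.iterdir() if f.suffix.lower() == ".wav"
        )
        summary[speaker_dir.name] = wav_count
    return summary
```

A speaker that shows up with zero files, or a speaker folder that is missing from the result, usually means the audio was placed one level too high or too low.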
## Preprocessing
1. Resample to 44100hz
```shell
python resample.py
```
2. Automatically split the dataset into training, validation, and test sets, and generate configuration files
```shell
python preprocess_flist_config.py
```
3. Generate hubert and f0
```shell
python preprocess_hubert_f0.py
```
After completing the above steps, the dataset directory will contain the preprocessed data, and the dataset_raw folder can be deleted.
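For reference, the split performed in step 2 can be sketched as follows. It mirrors the slicing used in `preprocess_flist_config.py` (after shuffling, two files per speaker go to validation, two to test, and the rest to training). The function name and the `seed` parameter are additions for illustration; the script itself shuffles without a fixed seed.

```python
from random import Random


def split_speaker_files(wavs, seed=None):
    """Shuffle one speaker's files, then hold out 2 for validation and 2 for test."""
    files = list(wavs)
    Random(seed).shuffle(files)  # seed is an illustrative extra for repeatability
    return {
        "train": files[2:-2],
        "val": files[:2],
        "test": files[-2:],
    }
```

With this slicing, a speaker needs at least five files to contribute anything to training, and with fewer than four files the validation and test sets overlap.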
## Training
```shell
python train.py -c configs/config.json -m 44k
```
Note: During training, old models are cleared automatically and only the latest three models are kept. If you want to guard against overfitting, manually back up the model checkpoints, or set `keep_ckpts` to 0 in the configuration file so that they are never cleared.
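A minimal sketch of changing that setting programmatically, assuming `keep_ckpts` sits under the `train` section of `configs/config.json` as in the config template of this repository. The helper names are hypothetical.

```python
import json


def set_keep_ckpts(config, keep=0):
    """Return a copy of the config dict with train.keep_ckpts set to `keep`."""
    updated = dict(config)
    updated["train"] = dict(config.get("train", {}), keep_ckpts=keep)
    return updated


def update_config_file(path="configs/config.json", keep=0):
    """Rewrite the config file in place with the new keep_ckpts value."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(set_keep_ckpts(config, keep), f, indent=2)
```

Editing the JSON by hand works just as well; the point is only that `0` disables the automatic cleanup.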
## Inference
Use [inference_main.py](https://github.com/svc-develop-team/so-vits-svc/blob/4.0/inference_main.py)
Up to this point, the usage of version 4.0 (training and inference) is exactly the same as version 3.0, with no changes (inference now has command line support).
```shell
# Example
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "君の知らない物語-src.wav" -t 0 -s "nen"
```
Required parameters:
- -m, --model_path: path to the model.
- -c, --config_path: path to the configuration file.
- -n, --clean_names: a list of wav file names located in the raw folder.
- -t, --trans: pitch adjustment, supports positive and negative (semitone) values.
- -s, --spk_list: target speaker name for synthesis.
Optional parameters: see the next section
- -a, --auto_predict_f0: automatic pitch prediction for speech conversion. Do not enable this when converting singing voice, as it will go badly off-key.
- -cm, --cluster_model_path: path to the clustering model, fill in any value if clustering is not trained.
- -cr, --cluster_infer_ratio: proportion of the clustering solution, range 0-1, fill in 0 if the clustering model is not trained.
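When running many conversions, it can help to assemble the command line from Python. The sketch below only builds the argument list from the flags documented above. The function name is hypothetical, `-t` is passed as a single value as in the example, and `-a` is assumed to be a boolean switch, which this README does not state explicitly.

```python
def build_inference_command(model_path, config_path, clean_names, trans, speaker,
                            cluster_model_path=None, cluster_infer_ratio=0.0,
                            auto_predict_f0=False):
    """Build the argument list for inference_main.py from the documented flags."""
    if not 0.0 <= cluster_infer_ratio <= 1.0:
        raise ValueError("cluster_infer_ratio must be within 0-1")
    cmd = ["python", "inference_main.py",
           "-m", model_path,
           "-c", config_path,
           "-n", *clean_names,      # wav file names located in the raw folder
           "-t", str(trans),        # pitch shift in semitones
           "-s", speaker]
    if cluster_model_path is not None:
        cmd += ["-cm", cluster_model_path, "-cr", str(cluster_infer_ratio)]
    if auto_predict_f0:
        cmd.append("-a")            # assumption: -a takes no value
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from the repository root.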
## Optional Settings
If the results from the previous section are satisfactory, or if you didn't understand what is being discussed in the following section, you can skip it, and it won't affect the model usage. (These optional settings have a relatively small impact, and they may have some effect on certain specific data, but in most cases, the difference may not be noticeable.)
### Automatic f0 prediction
During 4.0 model training, an f0 predictor is also trained, which can be used for automatic pitch prediction during speech conversion. If the result is poor, you can set the pitch manually instead. Do not enable this feature when converting singing voice, as it will go badly off-key!
- Set `auto_predict_f0` to true in inference_main.
### Cluster-based timbre leakage control
Introduction: The clustering scheme reduces timbre leakage and makes the trained model sound more like the target's timbre (although this effect is not very obvious). Using clustering alone, however, noticeably degrades the model's articulation (pronunciation becomes slurred). This model therefore uses a fusion method that linearly controls the proportion of the clustering and non-clustering schemes. In other words, you can manually adjust the ratio between "sounding like the target's timbre" and "being clear and articulate" to find a suitable trade-off.
The steps described earlier do not need to change in order to use clustering. You only need to train one additional clustering model; the effect is fairly limited, but so is the training cost.
- Training process:
- Train on a machine with good CPU performance. In my experience, it takes about 4 minutes to train each speaker on a Tencent Cloud 6-core CPU.
- Run `python cluster/train_cluster.py`. The model is saved to `logs/44k/kmeans_10000.pt`.
- Inference process:
- Specify `cluster_model_path` in inference_main.
- Specify `cluster_infer_ratio` in inference_main, where 0 means not using clustering at all, 1 means only using clustering, and 0.5 is usually sufficient.
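The linear control described above can be illustrated with a small sketch. The README only states that the clustering and non-clustering schemes are mixed linearly by `cluster_infer_ratio`; the function below, its name, and the use of plain Python lists for the feature vectors are assumptions made for illustration, not the repository's actual code.

```python
def blend_features(content, cluster_center, cluster_infer_ratio):
    """Linearly mix a content feature vector with its nearest cluster center.

    ratio = 0 keeps the original features, ratio = 1 uses only the cluster center.
    """
    if not 0.0 <= cluster_infer_ratio <= 1.0:
        raise ValueError("cluster_infer_ratio must be within 0-1")
    if len(content) != len(cluster_center):
        raise ValueError("feature vectors must have the same length")
    r = cluster_infer_ratio
    return [r * k + (1.0 - r) * c for c, k in zip(content, cluster_center)]
```

Intermediate ratios trade timbre similarity against articulation, which is why a middle value such as 0.5 is the suggested starting point.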
### [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/18KxJs7FCPjlTY2l0QUbDNfZnLrS9hL4m?usp=sharing) [sovits4v2 for colab.ipynb](https://colab.research.google.com/drive/18KxJs7FCPjlTY2l0QUbDNfZnLrS9hL4m?usp=sharing)
## Exporting to Onnx
Use [onnx_export.py](https://github.com/svc-develop-team/so-vits-svc/blob/4.0/onnx_export.py)
- Create a folder named `checkpoints` and open it.
- Create a folder in the `checkpoints` folder as your project folder, naming it after your project, for example `aziplayer`.
- Rename your model as `model.pth`, the configuration file as `config.json`, and place them in the `aziplayer` folder you just created.
- Modify `"NyaruTaffy"` in `path = "NyaruTaffy"` in [onnx_export.py](https://github.com/svc-develop-team/so-vits-svc/blob/4.0/onnx_export.py) to your project name, `path = "aziplayer"`.
- Run [onnx_export.py](https://github.com/svc-develop-team/so-vits-svc/blob/4.0/onnx_export.py).
- Wait for it to finish running. A `model.onnx` will be generated in your project folder, which is the exported model.
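The folder preparation in the first three steps can also be scripted. The sketch below copies the files instead of renaming them, so the originals stay in place; the function name and its parameters are hypothetical.

```python
import shutil
from pathlib import Path


def prepare_onnx_project(model_file, config_file, project_name, root="checkpoints"):
    """Create checkpoints/<project_name>/ containing model.pth and config.json."""
    project_dir = Path(root) / project_name
    project_dir.mkdir(parents=True, exist_ok=True)
    # Copies rather than renames, so the trained checkpoint is left untouched.
    shutil.copyfile(model_file, project_dir / "model.pth")
    shutil.copyfile(config_file, project_dir / "config.json")
    return project_dir
```

For example, `prepare_onnx_project("logs/44k/G_30400.pth", "configs/config.json", "aziplayer")` sets up the folder used in the steps above; the `path` variable in `onnx_export.py` still has to be edited by hand.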
### UI support for Onnx models
- [MoeSS](https://github.com/NaruseMioShirakana/MoeSS)
Note: For Hubert Onnx models, please use the models provided by MoeSS. You cannot currently export them yourself: Hubert in fairseq has many operators that onnx does not support, as well as things involving constants, so the export either fails with errors or produces a model whose input/output shapes and results are wrong. [Hubert4.0](https://huggingface.co/NaruseMioShirakana/MoeSS-SUBModel)
## Some legal provisions for reference
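#### Civil Code of the People's Republic of China (《民法典》)
##### Article 1019
No organization or individual may infringe upon another person's right to likeness by defaming or defacing it, or by forging it through information technology or other means. Without the consent of the holder of the right to likeness, no one may produce, use, or publish that person's likeness, except as otherwise provided by law.
Without the consent of the holder of the right to likeness, the rights holder of a portrait work may not use or publish the likeness by publishing, reproducing, distributing, leasing, exhibiting, or similar means.
The protection of a natural person's voice is governed, by reference, by the relevant provisions on the protection of the right to likeness.
##### Article 1024
[Right to reputation] Civil-law subjects enjoy the right to reputation. No organization or individual may infringe upon another person's right to reputation by insult, defamation, or similar means.
##### Article 1027
[Works infringing the right to reputation] Where a literary or artistic work published by an actor depicts real people and real events or a specific person, and contains insulting or defamatory content that infringes upon another person's right to reputation, the victim has the right to request, in accordance with the law, that the actor bear civil liability.
Where a literary or artistic work published by an actor does not depict a specific person, and only its plot resembles that person's circumstances, the actor bears no civil liability.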

187
README_zh_CN.md Normal file

@@ -0,0 +1,187 @@
# SoftVC VITS Singing Voice Conversion
[**English**](./README.md) | [**中文简体**](./README_zh_CN.md)
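## Terms of Use
1. This project was established for academic exchange and is intended only for communication and learning, not for production environments. Please resolve dataset authorization yourself. You bear full responsibility for any problem caused by training on unauthorized datasets and for all consequences!
2. Any video based on sovits that is published on a video platform must clearly state in its description the input source (singing voice or audio) used for the voice conversion. For example, if you use vocals separated from a video or audio published by someone else as the input source, you must give a clear link to the original video or music. If you use your own voice, or a voice synthesized by another singing voice synthesis engine, as the input source, you must also say so in the description.
3. You bear full responsibility for any infringement caused by the input source and for all consequences. When using other commercial singing voice synthesis software as the input source, make sure you comply with that software's terms of use. Note that the terms of use of many singing voice synthesis engines explicitly state that their output may not be used as an input source for conversion!
4. Continued use is deemed acceptance of the terms stated in this repository's README. This README has fulfilled its duty to advise and is not responsible for any problems that may arise later.
5. If you redistribute this repository's code, or publicly publish any result produced by this project (including but not limited to uploads to video sites), please credit the original author and the code source (this repository).
6. If you use this project for any other project, please contact and inform the author of this repository in advance. Thank you very much.
## Model Introduction
A singing voice timbre conversion model. It uses [Content Vec](https://github.com/auspicious3000/contentvec) to extract content features and feeds them into the visinger2 model to synthesize the target voice.
### 4.0 v2 update content
+ The model architecture has been completely changed to the [visinger2](https://github.com/zhangyongmao/VISinger2) architecture
+ Everything else is exactly the same as 4.0
### 4.0 v2 features
+ Some improvement over 4.0 in certain scenarios (for example, the electrical-noise problem in breath sounds in some cases)
+ Some regression in other scenarios: for example, results trained on the 猫雷 data are not as good as with 4.0, and in some cases it synthesizes badly glitched sound
+ 4.0-v2 is the last version of sovits and will receive no further updates
## Note
+ The entire 4.0-v2 workflow is the same as 4.0 and the environment is the same as 4.0; data preprocessed for 4.0 and the 4.0 environment can be used directly
+ The differences from 4.0 are:
+ Models are **completely** incompatible: old models cannot be used, and the pretrained base model must also be the brand-new one. Make sure you load the correct base model, otherwise training will take extremely long!
+ The config file structure is very different, so do not use an old config. If you are using a 4.0 dataset, you only need to run the preprocess_flist_config.py step to generate a new config
## Model files to download in advance
#### **Required**
+ contentvec: [checkpoint_best_legacy_500.pt](https://ibm.box.com/s/z1wgl1stco8ffooyatzdwsqn2psd9lrr)
+ Place it in the `hubert` directory
```shell
# contentvec
wget -P hubert/ http://obs.cstcloud.cn/share/obs/sankagenkeshi/checkpoint_best_legacy_500.pt
# Alternatively, download it manually and place it in the hubert directory
```
#### **Optional (strongly recommended)**
+ Pretrained base model files: `G_0.pth` `D_0.pth`
+ Place them in the `logs/44k` directory
Get them from svc-develop-team (TBD) or anywhere else
Base models generally do not raise copyright issues, but please stay alert: for example, ask the author in advance, or check that the author has clearly stated the permitted uses in the model description
The rest of this README is the same as for 4.0, with no changes
## Dataset Preparation
Simply place the dataset in the dataset_raw directory with the following file structure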
```shell
dataset_raw
├───speaker0
│   ├───xxx1-xxx1.wav
│   ├───...
│   └───Lxx-0xx8.wav
└───speaker1
    ├───xx2-0xxx2.wav
    ├───...
    └───xxx7-xxx007.wav
```
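## Preprocessing
1. Resample to 44100 Hz
```shell
python resample.py
```
2. Automatically split the dataset into training, validation, and test sets, and generate the configuration file
```shell
python preprocess_flist_config.py
```
3. Generate hubert and f0
```shell
python preprocess_hubert_f0.py
```
After completing the steps above, the dataset directory contains the preprocessed data, and the dataset_raw folder can be deleted
## Training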
```shell
python train.py -c configs/config.json -m 44k
```
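During training, old models are cleared automatically and only the latest 3 models are kept. If you want to guard against overfitting, back up the model checkpoints manually, or set keep_ckpts to 0 in the configuration file so that they are never cleared
## Inference
Use [inference_main.py](inference_main.py)
Up to this point, using 4.0 (training and inference) is exactly the same as 3.0, with no changes (inference has gained command-line support)
```shell
# Example
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "君の知らない物語-src.wav" -t 0 -s "nen"
```
Required parameters:
+ -m, --model_path: path to the model.
+ -c, --config_path: path to the configuration file.
+ -n, --clean_names: list of wav file names, placed in the raw folder.
+ -t, --trans: pitch adjustment, supports positive and negative values (in semitones).
+ -s, --spk_list: name of the target speaker for synthesis.
Optional parameters: see the next section
+ -a, --auto_predict_f0: automatically predict pitch for speech conversion. Do not enable this when converting singing voice, as it will go badly off-key.
+ -cm, --cluster_model_path: path to the clustering model; fill in anything if no clustering model has been trained.
+ -cr, --cluster_infer_ratio: proportion of the clustering scheme, range 0-1; fill in 0 if no clustering model has been trained.
## Optional Settings
If you are already satisfied with the results so far, or do not follow what is discussed below, you can ignore everything that follows without affecting use of the model. (These options have a fairly small impact; they may help a little on certain specific data, but in most cases the difference is hardly noticeable.)
### Automatic f0 prediction
Training a 4.0 model also trains an f0 predictor. For speech conversion you can enable automatic pitch prediction, and if the result is poor you can still set the pitch manually. Do not enable this feature when converting singing voice, as it will go badly off-key!
+ Set auto_predict_f0 to true in inference_main
### Cluster-based timbre leakage control
Introduction: the clustering scheme reduces timbre leakage so that the trained model sounds more like the target timbre (although the effect is not very obvious). Using clustering alone, however, degrades the model's articulation (pronunciation becomes slurred), and that effect is very obvious. This model therefore uses a fusion approach that linearly controls the proportion of the clustering and non-clustering schemes. In other words, you can manually adjust the ratio between "sounding like the target timbre" and "clear articulation" to find a suitable trade-off.
Using clustering requires no change to any of the earlier steps; you only need to train one additional clustering model. The effect is fairly limited, but the training cost is also fairly low
+ Training:
+ Train on a machine with good CPU performance. In my experience, training takes about 4 minutes per speaker on a Tencent Cloud 6-core CPU
+ Run python cluster/train_cluster.py; the model is written to logs/44k/kmeans_10000.pt
+ Inference:
+ Specify cluster_model_path in inference_main
+ Specify cluster_infer_ratio in inference_main: 0 means clustering is not used at all, 1 means only clustering is used, and 0.5 is usually a good setting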
### [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/18KxJs7FCPjlTY2l0QUbDNfZnLrS9hL4m?usp=sharing) [sovits4v2 for colab.ipynb](https://colab.research.google.com/drive/18KxJs7FCPjlTY2l0QUbDNfZnLrS9hL4m?usp=sharing)
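## Exporting to Onnx
Use [onnx_export.py](onnx_export.py)
+ Create a folder named `checkpoints` and open it
+ Inside the `checkpoints` folder, create a folder to serve as the project folder, named after your project, for example `aziplayer`
+ Rename your model to `model.pth` and the configuration file to `config.json`, and place them in the `aziplayer` folder you just created
+ In [onnx_export.py](onnx_export.py), change `"NyaruTaffy"` in `path = "NyaruTaffy"` to your project name, for example `path = "aziplayer"`
+ Run [onnx_export.py](onnx_export.py)
+ Wait for it to finish. A `model.onnx` file is generated in your project folder; this is the exported model
### UI support for Onnx models
+ [MoeSS](https://github.com/NaruseMioShirakana/MoeSS)
+ I removed all training-only functions and every complicated transpose, without keeping a single line, because I believe that only with these removed can you tell that you are really using Onnx
+ Note: for the Hubert Onnx model, please use the model provided by MoeSS. It cannot currently be exported by yourself (Hubert in fairseq has quite a few operators that onnx does not support, as well as things involving constants, so export either fails with errors or produces a model whose input/output shapes and results are wrong)
[Hubert4.0](https://huggingface.co/NaruseMioShirakana/MoeSS-SUBModel)
## Some legal provisions for reference
#### Civil Code of the People's Republic of China (《民法典》)
##### Article 1019
No organization or individual may infringe upon another person's right to likeness by defaming or defacing it, or by forging it through information technology or other means. Without the consent of the holder of the right to likeness, no one may produce, use, or publish that person's likeness, except as otherwise provided by law.
Without the consent of the holder of the right to likeness, the rights holder of a portrait work may not use or publish the likeness by publishing, reproducing, distributing, leasing, exhibiting, or similar means.
The protection of a natural person's voice is governed, by reference, by the relevant provisions on the protection of the right to likeness.
##### Article 1024
[Right to reputation] Civil-law subjects enjoy the right to reputation. No organization or individual may infringe upon another person's right to reputation by insult, defamation, or similar means.
##### Article 1027
[Works infringing the right to reputation] Where a literary or artistic work published by an actor depicts real people and real events or a specific person, and contains insulting or defamatory content that infringes upon another person's right to reputation, the victim has the right to request, in accordance with the law, that the actor bear civil liability.
Where a literary or artistic work published by an actor does not depict a specific person, and only its plot resembles that person's circumstances, the actor bears no civil liability.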


@@ -1,106 +0,0 @@
{
"train": {
"log_interval": 50,
"eval_interval": 1000,
"seed": 1234,
"port": 8001,
"epochs": 10000,
"learning_rate": 0.0002,
"betas": [
0.8,
0.99
],
"eps": 1e-09,
"batch_size": 6,
"accumulation_steps": 1,
"fp16_run": false,
"lr_decay": 0.998,
"segment_size": 10240,
"init_lr_ratio": 1,
"warmup_epochs": 0,
"c_mel": 45,
"keep_ckpts":4
},
"data": {
"data_dir": "dataset",
"dataset_type": "SingDataset",
"collate_type": "SingCollate",
"training_filelist": "filelists/train.txt",
"validation_filelist": "filelists/val.txt",
"max_wav_value": 32768.0,
"sampling_rate": 44100,
"n_fft": 2048,
"fmin": 0,
"fmax": 22050,
"hop_length": 512,
"win_size": 2048,
"acoustic_dim": 80,
"c_dim": 256,
"min_level_db": -115,
"ref_level_db": 20,
"min_db": -115,
"max_abs_value": 4.0,
"n_speakers": 200
},
"model": {
"hidden_channels": 192,
"spk_channels": 192,
"filter_channels": 768,
"n_heads": 2,
"n_layers": 4,
"kernel_size": 3,
"p_dropout": 0.1,
"prior_hidden_channels": 192,
"prior_filter_channels": 768,
"prior_n_heads": 2,
"prior_n_layers": 4,
"prior_kernel_size": 3,
"prior_p_dropout": 0.1,
"resblock": "1",
"use_spectral_norm": false,
"resblock_kernel_sizes": [
3,
7,
11
],
"resblock_dilation_sizes": [
[
1,
3,
5
],
[
1,
3,
5
],
[
1,
3,
5
]
],
"upsample_rates": [
8,
8,
4,
2
],
"upsample_initial_channel": 256,
"upsample_kernel_sizes": [
16,
16,
8,
4
],
"n_harmonic": 64,
"n_bands": 65
},
"spk": {
"jishuang": 0,
"huiyu": 1,
"nen": 2,
"paimon": 3,
"yunhao": 4
}
}


@@ -0,0 +1,106 @@
{
"train": {
"log_interval": 50,
"eval_interval": 1000,
"seed": 1234,
"port": 8001,
"epochs": 10000,
"learning_rate": 0.0002,
"betas": [
0.8,
0.99
],
"eps": 1e-09,
"batch_size": 6,
"accumulation_steps": 1,
"fp16_run": false,
"lr_decay": 0.998,
"segment_size": 10240,
"init_lr_ratio": 1,
"warmup_epochs": 0,
"c_mel": 45,
"keep_ckpts":4
},
"data": {
"data_dir": "dataset",
"dataset_type": "SingDataset",
"collate_type": "SingCollate",
"training_filelist": "filelists/train.txt",
"validation_filelist": "filelists/val.txt",
"max_wav_value": 32768.0,
"sampling_rate": 44100,
"n_fft": 2048,
"fmin": 0,
"fmax": 22050,
"hop_length": 512,
"win_size": 2048,
"acoustic_dim": 80,
"c_dim": 256,
"min_level_db": -115,
"ref_level_db": 20,
"min_db": -115,
"max_abs_value": 4.0,
"n_speakers": 200
},
"model": {
"hidden_channels": 192,
"spk_channels": 192,
"filter_channels": 768,
"n_heads": 2,
"n_layers": 4,
"kernel_size": 3,
"p_dropout": 0.1,
"prior_hidden_channels": 192,
"prior_filter_channels": 768,
"prior_n_heads": 2,
"prior_n_layers": 4,
"prior_kernel_size": 3,
"prior_p_dropout": 0.1,
"resblock": "1",
"use_spectral_norm": false,
"resblock_kernel_sizes": [
3,
7,
11
],
"resblock_dilation_sizes": [
[
1,
3,
5
],
[
1,
3,
5
],
[
1,
3,
5
]
],
"upsample_rates": [
8,
8,
4,
2
],
"upsample_initial_channel": 256,
"upsample_kernel_sizes": [
16,
16,
8,
4
],
"n_harmonic": 64,
"n_bands": 65
},
"spk": {
"jishuang": 0,
"huiyu": 1,
"nen": 2,
"paimon": 3,
"yunhao": 4
}
}


@@ -1,83 +1,83 @@
import os
import argparse
import re

from tqdm import tqdm
from random import shuffle
import json
import wave

config_template = json.load(open("configs/config.json"))

pattern = re.compile(r'^[\.a-zA-Z0-9_\/]+$')


def get_wav_duration(file_path):
    with wave.open(file_path, 'rb') as wav_file:
        # Number of audio frames
        n_frames = wav_file.getnframes()
        # Sample rate
        framerate = wav_file.getframerate()
        # Duration in seconds
        duration = n_frames / float(framerate)
    return duration


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_list", type=str, default="./filelists/train.txt", help="path to train list")
    parser.add_argument("--val_list", type=str, default="./filelists/val.txt", help="path to val list")
    parser.add_argument("--test_list", type=str, default="./filelists/test.txt", help="path to test list")
    parser.add_argument("--source_dir", type=str, default="./dataset/44k", help="path to source dir")
    args = parser.parse_args()

    train = []
    val = []
    test = []
    idx = 0
    spk_dict = {}
    spk_id = 0
    for speaker in tqdm(os.listdir(args.source_dir)):
        spk_dict[speaker] = spk_id
        spk_id += 1
        wavs = ["/".join([args.source_dir, speaker, i]) for i in os.listdir(os.path.join(args.source_dir, speaker))]
        new_wavs = []
        for file in wavs:
            if not file.endswith("wav"):
                continue
            if not pattern.match(file):
                print(f"warning: file name {file} contains characters other than letters, digits, and underscores, which may cause errors (or may not).")
            if get_wav_duration(file) < 0.3:
                print("skip too short audio:", file)
                continue
            new_wavs.append(file)
        wavs = new_wavs
        shuffle(wavs)
        train += wavs[2:-2]
        val += wavs[:2]
        test += wavs[-2:]

    shuffle(train)
    shuffle(val)
    shuffle(test)

    print("Writing", args.train_list)
    with open(args.train_list, "w") as f:
        for fname in tqdm(train):
            wavpath = fname
            f.write(wavpath + "\n")

    print("Writing", args.val_list)
    with open(args.val_list, "w") as f:
        for fname in tqdm(val):
            wavpath = fname
            f.write(wavpath + "\n")

    print("Writing", args.test_list)
    with open(args.test_list, "w") as f:
        for fname in tqdm(test):
            wavpath = fname
            f.write(wavpath + "\n")

    config_template["spk"] = spk_dict
    print("Writing configs/config.json")
    with open("configs/config.json", "w") as f:
        json.dump(config_template, f, indent=2)
import os
import argparse
import re

from tqdm import tqdm
from random import shuffle
import json
import wave

config_template = json.load(open("configs_template/config_template.json"))

pattern = re.compile(r'^[\.a-zA-Z0-9_\/]+$')


def get_wav_duration(file_path):
    with wave.open(file_path, 'rb') as wav_file:
        # Number of audio frames
        n_frames = wav_file.getnframes()
        # Sample rate
        framerate = wav_file.getframerate()
        # Duration in seconds
        duration = n_frames / float(framerate)
    return duration


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_list", type=str, default="./filelists/train.txt", help="path to train list")
    parser.add_argument("--val_list", type=str, default="./filelists/val.txt", help="path to val list")
    parser.add_argument("--test_list", type=str, default="./filelists/test.txt", help="path to test list")
    parser.add_argument("--source_dir", type=str, default="./dataset/44k", help="path to source dir")
    args = parser.parse_args()

    train = []
    val = []
    test = []
    idx = 0
    spk_dict = {}
    spk_id = 0
    for speaker in tqdm(os.listdir(args.source_dir)):
        spk_dict[speaker] = spk_id
        spk_id += 1
        wavs = ["/".join([args.source_dir, speaker, i]) for i in os.listdir(os.path.join(args.source_dir, speaker))]
        new_wavs = []
        for file in wavs:
            if not file.endswith("wav"):
                continue
            if not pattern.match(file):
                print(f"warning: file name {file} contains characters other than letters, digits, and underscores, which may cause errors (or may not).")
            if get_wav_duration(file) < 0.3:
                print("skip too short audio:", file)
                continue
            new_wavs.append(file)
        wavs = new_wavs
        shuffle(wavs)
        train += wavs[2:-2]
        val += wavs[:2]
        test += wavs[-2:]

    shuffle(train)
    shuffle(val)
    shuffle(test)

    print("Writing", args.train_list)
    with open(args.train_list, "w") as f:
        for fname in tqdm(train):
            wavpath = fname
            f.write(wavpath + "\n")

    print("Writing", args.val_list)
    with open(args.val_list, "w") as f:
        for fname in tqdm(val):
            wavpath = fname
            f.write(wavpath + "\n")

    print("Writing", args.test_list)
    with open(args.test_list, "w") as f:
        for fname in tqdm(test):
            wavpath = fname
            f.write(wavpath + "\n")

    config_template["spk"] = spk_dict
    print("Writing configs/config.json")
    with open("configs/config.json", "w") as f:
        json.dump(config_template, f, indent=2)