Update README.md

Miuzarte 2023-05-23 15:37:20 +08:00
parent b1064420cd
commit d0d8ec2f36
2 changed files with 21 additions and 34 deletions

View File

@ -8,7 +8,7 @@
#### ✨ A client that supports real-time conversion: [w-okada/voice-changer](https://github.com/w-okada/voice-changer)
**This project is fundamentally different from VITS: VITS is TTS, and this project is SVC. This project cannot perform TTS, VITS cannot perform SVC, and the models of the two projects are not interchangeable.**
## Announcement
@ -16,7 +16,7 @@ The project was developed to allow the developers' favorite anime characters to
## Disclaimer
This project is an open-source, offline project. All members of SvcDevelopTeam and all developers and maintainers of this project (hereinafter referred to as contributors) have no control over this project. The contributors have never provided any organization or individual with any form of assistance, including but not limited to dataset extraction, dataset processing, computing support, training support, inference, and so on. Contributors do not and cannot know what users use the project for. Therefore, all AI models and synthesized audio produced by training with this project are unrelated to its contributors, and all problems arising therefrom shall be borne by the user.
This project runs completely offline and cannot collect any user information or obtain user input data. Contributors are therefore unaware of users' inputs and models, and are not responsible for any user input.
@ -33,10 +33,6 @@ This project is only a framework project, which does not have the function of sp
5. Continuing to use this project is deemed as agreeing to the terms stated in this repository's README. This README has fulfilled its duty to advise and is not responsible for any problems that may arise later.
6. If you use this project for any other project or plan, please contact and inform the author of this repository in advance. Thank you very much.
## 🆕 Update!
> The 4.0-v2 model has been updated; the entire workflow is the same as in 4.0. Compared with 4.0, it improves in some scenarios but regresses in others; see the [4.0-v2 branch](https://github.com/svc-develop-team/so-vits-svc/tree/4.0-v2) for details.
## 📝 Model Introduction
The singing voice conversion model uses the SoftVC content encoder to extract source audio speech features, then feeds the vectors directly into VITS instead of converting them to a text-based intermediate, so pitch and intonation are preserved. Additionally, the vocoder is changed to [NSF HiFiGAN](https://github.com/openvpi/DiffSinger/tree/refactor/modules/nsf_hifigan) to solve the problem of interrupted sound.
@ -45,7 +41,7 @@ The singing voice conversion model uses SoftVC content encoder to extract source
- Feature input has been changed to the 12th-layer Transformer output of [ContentVec](https://github.com/auspicious3000/contentvec), while remaining compatible with the 4.0 branches.
- Shallow diffusion has been added; a shallow diffusion model can be used to improve sound quality.
### 🆕 Questions about compatibility with the 4.0 model
- You can support a 4.0 model by modifying its config.json: add a speech_encoder field to the model section of config.json, as sketched below.
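A minimal sketch of that edit, assuming the section key is `model` (lowercase, as in this project's config.json convention) and that the 4.0-era encoder value is `vec256l9`; verify both against your model before applying:

```shell
# hedged sketch: add the speech_encoder field to a 4.0 model's config.json
python - <<'EOF'
import json

path = "configs/config.json"  # path to the 4.0 model's config; adjust as needed
with open(path) as f:
    cfg = json.load(f)
# "vec256l9" is an assumed value for 4.0-era models; check your model's encoder
cfg["model"]["speech_encoder"] = "vec256l9"
with open(path, "w") as f:
    json.dump(cfg, f, indent=2, ensure_ascii=False)
EOF
```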
@ -73,7 +69,6 @@ After conducting tests, we believe that the project runs stably on `Python 3.8.9
**Select one of the following encoders to use:**
##### **1. If using ContentVec as the speech encoder**
- ContentVec: [checkpoint_best_legacy_500.pt](https://ibm.box.com/s/z1wgl1stco8ffooyatzdwsqn2psd9lrr)
- Place it under the `pretrain` directory
@ -111,7 +106,7 @@ wget -P pretrain/ http://obs.cstcloud.cn/share/obs/sankagenkeshi/checkpoint_best
Obtain the Sovits pre-trained model from svc-develop-team (TBD) or anywhere else.
The diffusion model is based on the Diffusion Model from [DDSP-SVC](https://github.com/yxlllc/DDSP-SVC), and its pre-trained model is interchangeable with DDSP-SVC's, so you can obtain the pre-trained diffusion model from [DDSP-SVC](https://github.com/yxlllc/DDSP-SVC).
Although pre-trained models generally do not cause copyright problems, please remain mindful of them: for example, ask the author in advance, or make sure the author has clearly indicated the permitted uses in the model's description.
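As a minimal placement sketch (the `model_0.pt` name and `logs/44k/diffusion` path follow the pre-trained diffusion model notes elsewhere in this commit; the `G_0.pth`/`D_0.pth` names and the download locations are assumptions):

```shell
# assumed layout; adjust file names and source paths to your downloads
mkdir -p logs/44k/diffusion
mv G_0.pth D_0.pth logs/44k/       # sovits pre-trained model (names assumed)
mv model_0.pt logs/44k/diffusion/  # pre-trained diffusion model from DDSP-SVC
```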
@ -125,6 +120,7 @@ If you are using the `NSF-HIFIGAN enhancer` or `shallow diffusion`, you will nee
```shell
# nsf_hifigan
wget -P pretrain/ https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip
unzip -od pretrain/nsf_hifigan pretrain/nsf_hifigan_20221211.zip
# Alternatively, you can manually download and place it in the pretrain/nsf_hifigan directory
# URL: https://github.com/openvpi/vocoders/releases/tag/nsf-hifigan-v1
```
@ -189,7 +185,6 @@ hubertsoft
If the `speech_encoder` argument is omitted, the default value is `vec768l12`.
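For illustration, selecting the encoder explicitly (the same command form as the preprocessing step shown in this commit):

```shell
# explicitly selects the default encoder; omitting --speech_encoder gives the same result
python preprocess_flist_config.py --speech_encoder vec768l12
```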
### 3. Generate hubert and f0
```shell
@ -267,13 +262,12 @@ Optional parameters: see the next section
- `-eh` | `--enhance`: Whether to use the NSF_HIFIGAN enhancer. It can improve sound quality somewhat for models with small training sets, but has a negative effect on well-trained models, so it is off by default.
- `-shd` | `--shallow_diffusion`: Whether to use shallow diffusion, which can resolve some electrical-sound artifacts. Off by default. When this option is enabled, the NSF_HIFIGAN enhancer is disabled.

Shallow diffusion settings:
+ `-dm` | `--diffusion_model_path`: Diffusion model path
+ `-dc` | `--diffusion_config_path`: Diffusion model config file path
+ `-ks` | `--k_step`: Number of diffusion steps; the larger it is, the closer the result is to the diffusion model's output. The default is 100
+ `-od` | `--only_diffusion`: Diffusion-only mode, which does not load the sovits model and runs inference with the diffusion model alone
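For example, a shallow-diffusion inference call combining the base command shown later in this commit with the flags above; the `-dm`/`-dc` paths are illustrative assumptions:

```shell
# shallow diffusion inference; the diffusion model/config paths are assumptions
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" \
  -n "君の知らない物語-src.wav" -t 0 -s "nen" \
  -shd -dm "logs/44k/diffusion/model_0.pt" -dc "configs/diffusion.yaml" -ks 100
```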
## 🤔 Optional Settings
If the results from the previous section are satisfactory, or if you didn't follow the discussion below, you can skip this section; it won't affect model usage. (These optional settings have a relatively small impact; they may help on some specific data, but in most cases the difference is not obvious.)
@ -297,11 +291,7 @@ The existing steps before clustering do not need to be changed. All you need to
- Specify `cluster_model_path` in `inference_main.py`.
- Specify `cluster_infer_ratio` in `inference_main.py`, where `0` means not using clustering at all and `1` means using only clustering; usually `0.5` is sufficient (see the sketch below).
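A minimal sketch, assuming these settings map to `-cm`/`-cr` command-line flags of `inference_main.py` (the flag names are assumptions; the model path and the `0.5` ratio follow this README):

```shell
# clustering at inference time; the -cm/-cr flag names are assumptions
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" \
  -n "君の知らない物語-src.wav" -t 0 -s "nen" \
  -cm "logs/44k/kmeans_10000.pt" -cr 0.5
```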
### [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/svc-develop-team/so-vits-svc/blob/4.1-Stable/sovits4_for_colab.ipynb) [sovits4_for_colab.ipynb](https://colab.research.google.com/github/svc-develop-team/so-vits-svc/blob/4.1-Stable/sovits4_for_colab.ipynb)
## 📤 Exporting to ONNX

View File

@ -8,6 +8,8 @@
#### ✨ A client that supports real-time conversion: [w-okada/voice-changer](https://github.com/w-okada/voice-changer)
**This project is fundamentally different from VITS: VITS is TTS, and this project is SVC. This project cannot perform TTS, VITS cannot perform SVC, and the models of the two projects are not interchangeable at all.**
## Important Notice
This project was developed to let the developers' favorite anime characters sing; anything involving real people runs counter to the developers' intentions.
@ -18,7 +20,7 @@
This project runs completely offline and cannot collect any user information or obtain user input data. Contributors are therefore unaware of users' inputs and models, and are not responsible for any user input.
This project is only a framework and does not itself provide speech-synthesis functionality; all functionality requires users to train models themselves. Moreover, this project ships no models, and any secondarily distributed project is unrelated to this project's contributors.
## 📏 Terms of Use
@ -31,13 +33,9 @@
5. Continuing to use this project is deemed as agreeing to the terms stated in this repository's README. This README has fulfilled its duty to advise and is not responsible for any problems that may arise later.
6. If you use this project for any other project or plan, please contact and inform the author of this repository in advance. Thank you very much.
## 🆕 Update!
> The 4.0-v2 model has been updated; the entire workflow is the same as in 4.0. Compared with 4.0, it improves in some scenarios but regresses in others; see the [4.0-v2 branch](https://github.com/svc-develop-team/so-vits-svc/tree/4.0-v2) for details.
## 📝 Model Introduction
The singing voice conversion model extracts source-audio speech features with the SoftVC content encoder and feeds them into VITS together with F0, replacing the original text input to achieve singing voice conversion. The vocoder is also replaced with [NSF HiFiGAN](https://github.com/openvpi/DiffSinger/tree/refactor/modules/nsf_hifigan) to solve the problem of interrupted sound.
### 🆕 What's new in 4.1-Stable
@ -106,7 +104,7 @@ wget -P pretrain/ http://obs.cstcloud.cn/share/obs/sankagenkeshi/checkpoint_best
+ Pre-trained diffusion model file: `model_0.pt`
+ Place it under the `logs/44k/diffusion` directory
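For instance, a minimal placement sketch (the file's download location is an assumption):

```shell
# place the pre-trained diffusion model where the project expects it
mkdir -p logs/44k/diffusion
mv model_0.pt logs/44k/diffusion/  # assumes the file sits in the current directory
```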
Obtain the Sovits pre-trained model from svc-develop-team (TBD) or anywhere else.
The diffusion model is based on the Diffusion Model from [DDSP-SVC](https://github.com/yxlllc/DDSP-SVC); its pre-trained model is interchangeable with DDSP-SVC's, and you can obtain the pre-trained diffusion model from [DDSP-SVC](https://github.com/yxlllc/DDSP-SVC).
@ -122,6 +120,7 @@ Obtain the Sovits pre-trained model from svc-develop-team (TBD) or anywhere else
```shell
# nsf_hifigan
wget -P pretrain/ https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip
unzip -od pretrain/nsf_hifigan pretrain/nsf_hifigan_20221211.zip
# Alternatively, download manually and place it in the pretrain/nsf_hifigan directory
# URL: https://github.com/openvpi/vocoders/releases/tag/nsf-hifigan-v1
```
@ -156,7 +155,7 @@ dataset_raw
### 0. Audio slicing
Slice the audio into `5s - 15s` segments. Slightly longer is fine, but overly long segments may cause out-of-memory errors during training or even during preprocessing.
You can use [audio-slicer-GUI](https://github.com/flutydeer/audio-slicer) or [audio-slicer-CLI](https://github.com/openvpi/audio-slicer).
@ -177,6 +176,7 @@ python preprocess_flist_config.py --speech_encoder vec768l12
```
speech_encoder has three options:
```
vec768l12
vec256l9
@ -192,6 +192,7 @@ python preprocess_hubert_f0.py --f0_predictor dio
```
f0_predictor has four options:
```
crepe
dio
@ -244,7 +245,7 @@ python train.py -c configs/config.json -m 44k
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "君の知らない物語-src.wav" -t 0 -s "nen"
```
Required parameters:
+ `-m` | `--model_path`: Model path
+ `-c` | `--config_path`: Config file path
+ `-n` | `--clean_names`: A list of wav file names, placed under the raw folder
@ -261,7 +262,7 @@ python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "
+ `-eh` | `--enhance`: Whether to use the NSF_HIFIGAN enhancer. It can improve sound quality somewhat for models with small training sets, but has a negative effect on well-trained models. Off by default
+ `-shd` | `--shallow_diffusion`: Whether to use shallow diffusion, which can resolve some electrical-sound artifacts. Off by default; when enabled, the NSF_HIFIGAN enhancer is disabled

Shallow diffusion settings:
+ `-dm` | `--diffusion_model_path`: Diffusion model path
+ `-dc` | `--diffusion_config_path`: Diffusion model config file path
+ `-ks` | `--k_step`: Number of diffusion steps; the larger it is, the closer the result is to the diffusion model's output. The default is 100
@ -278,23 +279,19 @@ python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "
### Cluster-based timbre leakage control
Introduction: clustering can reduce timbre leakage, making the trained model sound more like the target timbre (though the effect is not especially obvious); however, pure clustering degrades the model's articulation (making it slurred, which is quite noticeable). This model therefore adopts a fusion approach that linearly controls the ratio between the clustering and non-clustering schemes, so you can manually tune the balance between "sounding like the target timbre" and "clear articulation" to find a suitable trade-off.
None of the existing steps before clustering need to change; you only need to train an additional clustering model. Its benefit is fairly limited, but so is its training cost.
+ Training:
  + Train on a machine with a good CPU. In my experience, on a 6-core CPU on Tencent Cloud, training takes about 4 minutes per speaker.
  + Run `python cluster/train_cluster.py`; the model will be output to `logs/44k/kmeans_10000.pt`.
  + The clustering model can currently also be trained on GPU by running `python cluster/train_cluster.py --gpu`.
+ Inference:
  + Specify `cluster_model_path` in `inference_main.py`.
  + Specify `cluster_infer_ratio` in `inference_main.py`, where `0` means not using clustering at all and `1` means using only clustering; usually `0.5` is sufficient (see the sketch below).
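A minimal sketch, assuming these settings map to the `-cm`/`-cr` command-line flags of `inference_main.py` (the flag names are assumptions; the model path and the `0.5` ratio follow this README):

```shell
# clustering at inference time; the -cm/-cr flag names are assumptions
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" \
  -n "君の知らない物語-src.wav" -t 0 -s "nen" \
  -cm "logs/44k/kmeans_10000.pt" -cr 0.5
```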
### [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/svc-develop-team/so-vits-svc/blob/4.1-Stable/sovits4_for_colab.ipynb) [sovits4_for_colab.ipynb](https://colab.research.google.com/github/svc-develop-team/so-vits-svc/blob/4.1-Stable/sovits4_for_colab.ipynb)
## 📤 Exporting to ONNX