Update README.md

This commit is contained in:
Miuzarte 2023-04-09 11:36:55 +08:00
parent ec3ed074b8
commit 7fd193a1f8
2 changed files with 47 additions and 20 deletions

View File

@@ -38,7 +38,7 @@ The singing voice conversion model uses SoftVC content encoder to extract source
## 💬 About Python Version
- After conducting tests, we believe that the project runs stably on Python version 3.8.9.
+ After conducting tests, we believe that the project runs stably on `Python 3.8.9`.
## 📥 Pre-trained Model Files
@@ -64,10 +64,10 @@ Although the pretrained model generally does not cause any copyright problems, p
#### **Optional (Select as Required)**
- If you are using the NSF-HIFIGAN enhancer, you will need to download the pre-trained NSF-HIFIGAN model, or not if you do not need to download.
+ If you are using the NSF-HIFIGAN enhancer, you will need to download the pre-trained NSF-HIFIGAN model; otherwise you can skip this.
- Pre-trained NSF-HIFIGAN Vocoder: [nsf_hifigan_20221211.zip](https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip)
- - After unzipping, place putting the four files in the 'pretrain/nsf_hifigan' directory
+ - Unzip and place the four files under the `pretrain/nsf_hifigan` directory
```shell
# nsf_hifigan
@@ -104,13 +104,23 @@ dataset_raw
## 🛠️ Preprocessing
+ 0. Slice audio
+ Slice the audio into `5s - 15s` clips; slightly longer is fine, but clips that are too long may lead to `torch.cuda.OutOfMemoryError` during training or even preprocessing.
+ You can use [audio-slicer-GUI](https://github.com/flutydeer/audio-slicer) or [audio-slicer-CLI](https://github.com/openvpi/audio-slicer).
+ In general, only the `Minimum Interval` needs to be adjusted: for spoken material it usually stays at the default, while for singing material it can be lowered to `100` or even `50`.
+ After slicing, delete audio that is too long or too short.
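A quick duration check makes that cleanup easier. The sketch below is illustrative only: it assumes SoX (`soxi`) and `bc` are available and that the sliced clips live under `dataset_raw/` in the layout described above.
```shell
# Sketch: list clips outside the recommended 5s-15s range for manual review.
for f in dataset_raw/*/*.wav; do
    d=$(soxi -D "$f")   # clip duration in seconds
    if [ "$(echo "$d < 5 || $d > 15" | bc)" -eq 1 ]; then
        echo "$f: ${d}s"
    fi
done
```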
1. Resample to 44100Hz and mono
```shell
python resample.py
```
- 2. Automatically split the dataset into training and validation sets, and generate configuration files
+ 2. Automatically split the dataset into training and validation sets, and generate configuration files.
```shell
python preprocess_flist_config.py
@@ -155,11 +165,11 @@ Required parameters:
Optional parameters: see the next section
- `-lg` | `--linear_gradient`: The cross fade length of two audio slices in seconds. If there is a discontinuous voice after forced slicing, you can adjust this value. Otherwise, it is recommended to use the default value of 0.
- - `-fmp` | `--f0_mean_pooling`: Apply mean filter (pooling) to f0which may improve some hoarse sounds. Enabling this option will reduce inference speed.
+ - `-fmp` | `--f0_mean_pooling`: Apply mean filter (pooling) to f0, which may improve some hoarse sounds. Enabling this option will reduce inference speed.
- `-a` | `--auto_predict_f0`: Automatic pitch prediction for voice conversion; do not enable this when converting songs, as it can cause serious pitch issues.
- `-cm` | `--cluster_model_path`: Path to the clustering model; fill in any value if clustering is not trained.
- `-cr` | `--cluster_infer_ratio`: Proportion of the clustering solution, range 0-1; fill in 0 if the clustering model is not trained.
- - `-eh` | `--enhance`: Whether to use NSF_HIFIGAN enhancer, this option has certain effect on sound quality enhancement for some models with few training sets, but has negative effect on well-trained models, so it is turned off by default
+ - `-eh` | `--enhance`: Whether to use the NSF_HIFIGAN enhancer. This option can noticeably improve sound quality for models trained on small datasets, but degrades well-trained models, so it is turned off by default.
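As a rough sketch, the optional flags above can be combined with the required ones in a single call. The paths follow the `logs/44k` and `configs` layout used elsewhere in this README; the input name is illustrative, other required parameters are omitted, and whether boolean switches such as `-eh` take an explicit value may vary by version.
```shell
# Illustrative invocation: 1s crossfade between forced slices, enhancer on.
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" \
    -n "input.wav" -lg 1 -eh
```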
## 🤔 Optional Settings
@@ -177,16 +187,17 @@ Introduction: The clustering scheme can reduce timbre leakage and make the train
The existing steps before clustering do not need to be changed. All you need to do is to train an additional clustering model, which has a relatively low training cost.
- Training process:
- - Train on a machine with a good CPU performance. According to my experience, it takes about 4 minutes to train each speaker on a Tencent Cloud 6-core CPU.
- - Execute "python cluster/train_cluster.py". The output of the model will be saved in "logs/44k/kmeans_10000.pt".
+ - Train on a machine with good CPU performance. In my experience, it takes about 4 minutes to train each speaker on a Tencent Cloud machine with a 6-core CPU.
+ - Execute `python cluster/train_cluster.py`. The output model will be saved in `logs/44k/kmeans_10000.pt`.
- Inference process:
- - Specify "cluster_model_path" in inference_main.
- - Specify "cluster_infer_ratio" in inference_main, where 0 means not using clustering at all, 1 means only using clustering, and usually 0.5 is sufficient.
+ - Specify `cluster_model_path` in `inference_main.py`.
+ - Specify `cluster_infer_ratio` in `inference_main.py`, where `0` means not using clustering at all, `1` means only using clustering, and usually `0.5` is sufficient.
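These two settings correspond to the `-cm` and `-cr` flags from the parameter list above, so they can also be passed on the command line; a minimal sketch with an illustrative input name:
```shell
# Use the trained clustering model, blending clustering at a 0.5 ratio.
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" \
    -n "input.wav" -cm "logs/44k/kmeans_10000.pt" -cr 0.5
```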
### F0 mean filtering
Introduction: Mean filtering of F0 can effectively reduce hoarse sounds caused by fluctuations in pitch prediction (hoarseness caused by reverb or harmony cannot be eliminated yet). This feature brings a large improvement on some songs, but causes others to go off pitch. If the song sounds hoarse after inference, consider enabling it.
- - Set f0_mean_pooling to true in inference_main
+ - Set `f0_mean_pooling` to `true` in `inference_main.py`
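This corresponds to the `-fmp` flag from the parameter list above, so it can also be enabled per run rather than by editing the script; a sketch with an illustrative input name:
```shell
# Enable F0 mean pooling for a single inference run.
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" \
    -n "input.wav" -fmp
```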
### [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1kv-3y2DmZo0uya8pEr1xk7cSB-4e_Pct?usp=sharing) [sovits4_for_colab.ipynb](https://colab.research.google.com/drive/1kv-3y2DmZo0uya8pEr1xk7cSB-4e_Pct?usp=sharing)
@@ -206,8 +217,11 @@ Use [onnx_export.py](https://github.com/svc-develop-team/so-vits-svc/blob/4.0/on
### UI support for Onnx models
- [MoeSS](https://github.com/NaruseMioShirakana/MoeSS)
- [Hubert4.0](https://huggingface.co/NaruseMioShirakana/MoeSS-SUBModel)
- Note: For Hubert Onnx models, please use the models provided by MoeSS. Currently, they cannot be exported on their own (Hubert in fairseq has many unsupported operators and things involving constants that can cause errors or result in problems with the input/output shape and results when exported.) [Hubert4.0](https://huggingface.co/NaruseMioShirakana/MoeSS-SUBModel)。CppDataProcess are some functions to preprocess data used in MoeSS
+ Note: For Hubert Onnx models, please use the models provided by MoeSS. They currently cannot be exported on your own (Hubert in fairseq has many operators unsupported by onnx and constructs involving constants, which cause errors on export or produce models with incorrect input/output shapes and results).
+ CppDataProcess contains some functions for preprocessing audio used in MoeSS.
## ☀️ Previous contributors

View File

@@ -38,7 +38,7 @@
## 💬 About Python Version
- After conducting tests, we believe that Python 3.8.9 can run this project stably
+ After conducting tests, we believe that `Python 3.8.9` can run this project stably
## 📥 Pre-trained Model Files
@@ -104,6 +104,16 @@ dataset_raw
## 🛠️ Preprocessing
+ 0. Slice audio
+ Slice the audio into `5s - 15s` clips; slightly longer is fine, but clips that are too long may cause out-of-memory errors during training or even preprocessing.
+ You can use [audio-slicer-GUI](https://github.com/flutydeer/audio-slicer) or [audio-slicer-CLI](https://github.com/openvpi/audio-slicer).
+ In general, only the `Minimum Interval` needs to be adjusted: for ordinary spoken material the default is usually fine, while for singing material it can be lowered to `100` or even `50`.
+ After slicing, manually delete audio that is too long or too short.
1. Resample to 44100Hz mono
```shell
@@ -178,16 +188,16 @@ python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "
+ Training process:
+ Train on a machine with good CPU performance; in my experience, on a Tencent Cloud 6-core CPU, training each speaker takes about 4 minutes
- + Execute python cluster/train_cluster.py; the model output will be at logs/44k/kmeans_10000.pt
+ + Execute `python cluster/train_cluster.py`; the model output will be at `logs/44k/kmeans_10000.pt`
+ Inference process:
- + Specify cluster_model_path in inference_main
- + Specify cluster_infer_ratio in inference_main, where 0 means not using clustering at all and 1 means using only clustering; usually 0.5 is sufficient
+ + Specify `cluster_model_path` in `inference_main.py`
+ + Specify `cluster_infer_ratio` in `inference_main.py`, where `0` means not using clustering at all and `1` means using only clustering; usually `0.5` is sufficient
### F0 mean filtering
Introduction: Applying mean filtering to F0 can effectively reduce hoarse sounds caused by fluctuations in pitch prediction (hoarseness caused by reverb or harmony cannot be eliminated yet). This feature brings a large improvement on some songs, but causes others to go off pitch. If the song sounds hoarse after inference, consider enabling it.
- + Set f0_mean_pooling to true in inference_main
+ + Set `f0_mean_pooling` to `true` in `inference_main.py`
### [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1kv-3y2DmZo0uya8pEr1xk7cSB-4e_Pct?usp=sharing) [sovits4_for_colab.ipynb](https://colab.research.google.com/drive/1kv-3y2DmZo0uya8pEr1xk7cSB-4e_Pct?usp=sharing)
@@ -196,6 +206,7 @@ python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "
## 📤 Onnx Export
Use [onnx_export.py](onnx_export.py)
+ Create a new folder: `checkpoints` and open it
+ In the `checkpoints` folder, create a project folder named after your project, e.g. `aziplayer`
+ Rename your model to `model.pth` and your configuration file to `config.json`, and place them in the `aziplayer` folder you just created
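Taken together, the steps above amount to something like the sketch below; `aziplayer` is the example project name from this README, and the `G_30400.pth` checkpoint name is borrowed from the inference examples (use your own files).
```shell
# Prepare the project folder for Onnx export, then run the export script.
mkdir -p checkpoints/aziplayer
cp logs/44k/G_30400.pth checkpoints/aziplayer/model.pth
cp configs/config.json  checkpoints/aziplayer/config.json
python onnx_export.py   # depending on the version, the project name may need to be set inside the script
```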
@@ -206,9 +217,11 @@ python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "
### UI support for Onnx models
+ [MoeSS](https://github.com/NaruseMioShirakana/MoeSS)
+ I removed all the training functions and every complicated transpose, keeping none of it, because I believe only with those removed can you be sure the model you are using is Onnx
- + Note: For Hubert Onnx models, please use the models provided by MoeSS; they currently cannot be exported on your own (Hubert in fairseq has many operators unsupported by onnx and constructs involving constants, which cause errors on export or produce models with incorrect input/output shapes and results)
- [Hubert4.0](https://huggingface.co/NaruseMioShirakana/MoeSS-SUBModel)
+ + [Hubert4.0](https://huggingface.co/NaruseMioShirakana/MoeSS-SUBModel)
+ Note: For Hubert Onnx models, please use the models provided by MoeSS; they currently cannot be exported on your own (Hubert in fairseq has many operators unsupported by onnx and constructs involving constants, which cause errors on export or produce models with incorrect input/output shapes and results)
+ CppDataProcess contains some functions for processing audio in MoeSS
## ☀️ Previous contributors