Update Readme

ylzz1997 2023-05-30 00:40:29 +08:00
parent 1664f1157e
commit 807bb2adfb
2 changed files with 101 additions and 26 deletions


@ -42,6 +42,8 @@ The singing voice conversion model uses SoftVC content encoder to extract source
- Feature input has been changed to the 12th-layer Transformer output of [Content Vec](https://github.com/auspicious3000/contentvec), and it remains compatible with the 4.0 branches.
- Updated shallow diffusion; a shallow diffusion model can now be used to improve sound quality.
- Added Whisper speech encoder support
- Added static/dynamic timbre mixing
- Added loudness embedding
### 🆕 Questions about compatibility with the 4.0 model
@ -69,7 +71,7 @@ After conducting tests, we believe that the project runs stably on `Python 3.8.9
**Select one of the following encoders to use**
##### **1. If using contentvec as the speech encoder (recommended)**
- ContentVec: [checkpoint_best_legacy_500.pt](https://ibm.box.com/s/z1wgl1stco8ffooyatzdwsqn2psd9lrr)
- Place it under the `pretrain` directory
@ -84,11 +86,11 @@ wget -P pretrain/ http://obs.cstcloud.cn/share/obs/sankagenkeshi/checkpoint_best
- Place it under the `pretrain` directory
##### **3. If using Whisper-PPG as the speech encoder**
- Download the model from [medium.pt](https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt)
- Place it under the `pretrain` directory
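For example, a minimal sketch of downloading the checkpoint, following the same `wget -P pretrain/` pattern used for ContentVec above:
```shell
# Download the Whisper medium checkpoint straight into pretrain/ (URL as listed above)
wget -P pretrain/ https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt
```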
##### **4. If using OnnxHubert/ContentVec as the speech encoder**
- Download the models from [MoeSS-SUBModel](https://huggingface.co/NaruseMioShirakana/MoeSS-SUBModel/tree/main)
- Place it under the `pretrain` directory
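One way to fetch the repository is to clone it with git-lfs and copy the needed files into `pretrain` (a sketch; the exact file name depends on which encoder you picked, so check the repository contents first):
```shell
# Requires git-lfs; this clones the whole MoeSS-SUBModel repository
git lfs install
git clone https://huggingface.co/NaruseMioShirakana/MoeSS-SUBModel
# Copy the ONNX model you need into pretrain/ (the file name here is a placeholder)
cp MoeSS-SUBModel/<model-you-need>.onnx pretrain/
```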
#### **List of Encoders**
@ -224,6 +226,18 @@ whisper-ppg
If the `speech_encoder` argument is omitted, the default value is `vec768l12`
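For instance, the default above corresponds to a call like the following (a sketch, assuming the usual `preprocess_flist_config.py` entry point; run it from the project root):
```shell
# Explicitly selecting the default encoder; omitting --speech_encoder gives the same result
python preprocess_flist_config.py --speech_encoder vec768l12
```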
#### You can modify some parameters in the generated `config.json` and `diffusion.yaml`
* `keep_ckpts`: Keep the last `keep_ckpts` models during training. Setting it to `0` keeps them all. Default is `3`.
* `all_in_mem`, `cache_all_data`: Load the entire dataset into RAM. This can be enabled when disk IO on some platforms is too slow and system memory is **much larger** than your dataset.
* `batch_size`: The amount of data loaded onto the GPU in a single step; adjust it to a value below your GPU memory capacity.
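As a quick way to inspect these values, here is a sketch using `jq` (not part of this project; it assumes the keys sit under the `train` section of the generated config, which may differ in your version):
```shell
# Print the current training parameters from the generated config
jq '.train | {keep_ckpts, batch_size, all_in_mem}' configs/config.json
```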
**Using loudness embedding**
To use loudness embedding, set `vol_aug` and `vol_embedding` in `config.json` to `true`. The trained model will then match the loudness of the input source; otherwise it will follow the loudness of the training set.
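A hedged sketch of flipping both flags with `jq` (the key locations are assumptions: in recent generated configs `vol_aug` usually sits under `train` and `vol_embedding` under `model`; verify against your own config before editing):
```shell
# Enable loudness embedding, writing the result back to config.json
jq '.train.vol_aug = true | .model.vol_embedding = true' configs/config.json > configs/config.tmp.json \
  && mv configs/config.tmp.json configs/config.json
```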
### 3. Generate hubert and f0
```shell
@ -251,12 +265,6 @@ python preprocess_hubert_f0.py --f0_predictor dio --use_diff
After completing the above steps, the dataset directory will contain the preprocessed data, and the dataset_raw folder can be deleted.
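Once you have confirmed the preprocessed data under the dataset directory, the raw folder can be removed, e.g.:
```shell
# Optional cleanup; only run this after verifying the preprocessed data looks complete
rm -rf dataset_raw
```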
## 🏋️‍♀️ Training
### Diffusion Model (optional)
@ -299,13 +307,14 @@ Optional parameters: see the next section
- `-cm` | `--cluster_model_path`: path to the clustering model, fill in any value if clustering is not trained.
- `-cr` | `--cluster_infer_ratio`: proportion of the clustering solution, range 0-1, fill in 0 if the clustering model is not trained.
- `-eh` | `--enhance`: Whether to use the NSF_HIFIGAN enhancer. It can improve sound quality for models trained on small datasets, but has a negative effect on well-trained models, so it is turned off by default.
- `-shd` | `--shallow_diffusion`: Whether to use shallow diffusion, which can resolve some electronic-sounding artifacts. Turned off by default. When this option is enabled, the NSF_HIFIGAN enhancer is disabled.
- `-usm` | `--use_spk_mix`: Whether to use character fusion / dynamic timbre mixing
Shallow diffusion settings:
- `-dm` | `--diffusion_model_path`: Diffusion model path
- `-dc` | `--diffusion_config_path`: Diffusion model profile path
- `-ks` | `--k_step`: The larger the number of diffusion steps, the closer it is to the result of the diffusion model. The default is 100
- `-od` | `--only_diffusion`: Diffusion-only mode; the sovits model is not loaded and only the diffusion model is used for inference
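A hedged example of a shallow-diffusion inference call combining these options (all paths, the input name, and the speaker flag are illustrative; adapt them to your own checkpoints):
```shell
# Run inference with shallow diffusion; the NSF_HIFIGAN enhancer is disabled automatically
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" \
  -n "sample.wav" -s "speaker0" -shd \
  -dm "logs/44k/diffusion/model_0.pt" -dc "configs/diffusion.yaml" -ks 100
```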
### Attention
@ -345,6 +354,35 @@ The generated model contains data that is needed for further training. If you co
python compress_model.py -c="configs/config.json" -i="logs/44k/G_30400.pth" -o="logs/44k/release.pth"
```
## 👨‍🔧 Timbre mixing
### Static timbre mixing
**Refer to the static timbre mixing feature under the gadgets/lab features in `webui.py`.**
Introduction: This function can combine multiple voice models into one voice model (a convex or linear combination of the models' parameters) to create voices that do not exist in reality
**Note:**
1. This function only supports single-speaker models
2. If a multi-speaker model is forced to be used, make sure the number of speakers is the same across all models, so that voices under the same SpeakerID can be mixed
3. Ensure that the `model` field in `config.json` is the same for all models to be mixed
4. The output mixed model can use the `config.json` of any of the mixed models, but the clustering model will not be usable
5. When batch-uploading models, it is best to put them into one folder and upload them together after selecting them
6. It is recommended to keep the mixing ratio between 0 and 100; other values can be used, but they produce unpredictable results in linear combination mode
7. After mixing, a file named output.pth will be saved in the root directory of the project
8. Convex combination mode applies Softmax so that the mixing ratios sum to 1, while linear combination mode does not
### Dynamic timbre mixing
**Refer to the `spkmix.py` file for an introduction to dynamic timbre mixing**
Rules for writing character mix tracks:
Role ID: \[\[Start time 1, end time 1, start value 1, end value 1], [Start time 2, end time 2, start value 2, end value 2]]
The start time must be the same as the end time of the previous segment. The first start time must be 0, and the last end time must be 1 (time ranges from 0 to 1).
All roles must be filled in. For unused roles, fill in \[\[0., 1., 0., 0.]]
The fusion values can be filled in arbitrarily; within the specified time span they change linearly from the start value to the end value. The values are normalized internally so that their combination sums to 1 (the convex-combination condition), so they can be used safely
Use the `--use_spk_mix` parameter at inference time to enable dynamic timbre mixing (see the sketch below)
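A hypothetical sketch of a two-speaker mix table plus the matching inference call. The variable name `spk_mix_map` and the exact layout are assumptions based on the template shipped in `spkmix.py`; edit that file directly rather than overwriting it if yours differs, and treat the paths below as illustrative:
```shell
# Write an illustrative two-speaker mix table (back up the original spkmix.py first)
cat > spkmix.py << 'EOF'
spk_mix_map = {
    0: [[0.0, 0.5, 1.0, 0.5], [0.5, 1.0, 0.5, 1.0]],  # speaker 0: 1.0 -> 0.5, then back to 1.0
    1: [[0.0, 0.5, 0.0, 0.5], [0.5, 1.0, 0.5, 0.0]],  # speaker 1: the complementary ramp
}
EOF
# Run inference with dynamic timbre mixing enabled
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "sample.wav" --use_spk_mix
```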
## 📤 Exporting to Onnx
Use [onnx_export.py](https://github.com/svc-develop-team/so-vits-svc/blob/4.0/onnx_export.py)


@ -44,7 +44,9 @@
+ Feature input has been changed to the 12th-layer Transformer output of [Content Vec](https://github.com/auspicious3000/contentvec), and it remains compatible with the 4.0 branches
+ Updated shallow diffusion; a shallow diffusion model can now be used to improve sound quality
+ Added Whisper speech encoder support
+ Added static/dynamic timbre mixing
+ Added loudness embedding
### 🆕 Questions about compatibility with the 4.0 model
+ 4.0 models can be supported by modifying their config.json: add a speech_encoder field to the model section of config.json, see below for details
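A hedged sketch of that edit using `jq` (the `vec256l9` value is an assumption about what a 4.0 model was trained with; set it to whichever encoder your model actually used):
```shell
# Add the speech_encoder field to the model section of an old 4.0 config
jq '.model.speech_encoder = "vec256l9"' configs/config.json > configs/config.tmp.json \
  && mv configs/config.tmp.json configs/config.json
```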
@ -71,7 +73,7 @@
**Select one of the following encoders to use**
##### **1. If using contentvec as the speech encoder (recommended)**
+ contentvec: [checkpoint_best_legacy_500.pt](https://ibm.box.com/s/z1wgl1stco8ffooyatzdwsqn2psd9lrr)
+ Place it under the `pretrain` directory
@ -86,11 +88,11 @@ wget -P pretrain/ http://obs.cstcloud.cn/share/obs/sankagenkeshi/checkpoint_best
+ Place it under the `pretrain` directory
##### **3. If using Whisper-PPG as the speech encoder**
- Download the model from [medium.pt](https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt)
- Place it under the `pretrain` directory
##### **4. If using OnnxHubert/ContentVec as the speech encoder**
- Download the models from [MoeSS-SUBModel](https://huggingface.co/NaruseMioShirakana/MoeSS-SUBModel/tree/main)
- Place it under the `pretrain` directory
#### **List of encoders**
@ -226,6 +228,18 @@ whisper-ppg
If the `speech_encoder` argument is omitted, the default value is `vec768l12`
#### At this point you can modify some parameters in the generated config.json and diffusion.yaml
* `keep_ckpts`: keep the last few checkpoints during training; `0` keeps them all, and the default keeps only the last `3`
* `all_in_mem`, `cache_all_data`: load the entire dataset into RAM; can be enabled when disk IO on some platforms is too slow and system memory is **much larger** than the dataset
* `batch_size`: the amount of data loaded onto the GPU in a single step; adjust it to a value below your GPU memory capacity
**Using loudness embedding**
To use loudness embedding, set `vol_aug` and `vol_embedding` in config.json to `true`. The trained model will then match the loudness of the input source; otherwise it will follow the loudness of the training set.
### 3. Generate hubert and f0
```shell
@ -253,12 +267,6 @@ python preprocess_hubert_f0.py --f0_predictor dio --use_diff
After completing the above steps, the dataset directory contains the preprocessed data, and the dataset_raw folder can be deleted
## 🏋️‍♀️ Training
### Diffusion model (optional)
@ -302,7 +310,8 @@ python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "
+ `-cr` | `--cluster_infer_ratio`: proportion of the clustering solution, range 0-1; if no clustering model has been trained, just use the default of 0
+ `-eh` | `--enhance`: whether to use the NSF_HIFIGAN enhancer; it can improve sound quality for models trained on small datasets but has a negative effect on well-trained models, so it is off by default
+ `-shd` | `--shallow_diffusion`: whether to use shallow diffusion, which can resolve some electronic-sounding artifacts; off by default. When this option is enabled, the NSF_HIFIGAN enhancer is disabled
+ `-usm` | `--use_spk_mix`: whether to use character fusion / dynamic timbre mixing
Shallow diffusion settings:
+ `-dm` | `--diffusion_model_path`: diffusion model path
+ `-dc` | `--diffusion_config_path`: diffusion model config file path
@ -349,6 +358,34 @@ python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "
python compress_model.py -c="configs/config.json" -i="logs/44k/G_30400.pth" -o="logs/44k/release.pth"
```
## 👨‍🔧 Timbre mixing
### Static timbre mixing
**Refer to the static timbre mixing feature under the gadgets/lab features in `webUI.py`.**
Introduction: This function can combine multiple voice models into one voice model (a convex or linear combination of the models' parameters) to create voices that do not exist in reality
**Note:**
1. This function only supports single-speaker models
2. If a multi-speaker model is forced to be used, make sure the number of speakers is the same across all models, so that voices under the same SpeakerID can be mixed
3. Ensure that the model field in config.json is the same for all models to be mixed
4. The output mixed model can use the config.json of any of the mixed models, but the clustering model will not be usable
5. When batch-uploading models, it is best to put them into one folder and upload them together after selecting them
6. It is recommended to keep the mixing ratio between 0 and 100; other values can be used, but they produce unpredictable results in linear combination mode
7. After mixing, a file named output.pth will be saved in the root directory of the project
8. Convex combination mode applies Softmax so that the mixing ratios sum to 1, while linear combination mode does not
### Dynamic timbre mixing
**Refer to the `spkmix.py` file for an introduction to dynamic timbre mixing**
Rules for writing character mix tracks:
Role ID: \[\[start time 1, end time 1, start value 1, end value 1], [start time 2, end time 2, start value 2, end value 2]]
The start time must be the same as the end time of the previous segment. The first start time must be 0, and the last end time must be 1 (time ranges from 0 to 1)
All roles must be filled in; for unused roles, just fill in \[\[0., 1., 0., 0.]]
The fusion values can be filled in arbitrarily; within the specified time span they change linearly from the start value to the end value. The values are normalized internally so that their combination sums to 1 (the convex-combination condition), so they can be used safely
Use the `--use_spk_mix` parameter at inference time to enable dynamic timbre mixing
## 📤 Exporting to Onnx
Use [onnx_export.py](onnx_export.py)