Merge pull request #154 from ylzz1997/4.0-768

4.0-768
2023-04-15 11:34:58 +08:00 · 2023-04-15 11:34:58 +08:00 · 57c75e3ecb
parent 237f351089 6386a68eee
commit 57c75e3ecb
4 changed files with 8 additions and 23 deletions
--- a/README.md
+++ b/README.md
@@ -29,16 +29,9 @@ This project is an open source, offline project, and all members of SvcDevelopTe

 The singing voice conversion model uses SoftVC content encoder to extract source audio speech features, then the vectors are directly fed into VITS instead of converting to a text based intermediate; thus the pitch and intonations are conserved. Additionally, the vocoder is changed to [NSF HiFiGAN](https://github.com/openvpi/DiffSinger/tree/refactor/modules/nsf_hifigan) to solve the problem of sound interruption.

-### 🆕 4.0 Version Update Content
+### 🆕 4.0-Vec768-Layer12 Version Update Content

- Feature input is changed to [Content Vec](https://github.com/auspicious3000/contentvec)
- The sampling rate is unified to use 44100Hz
- Due to the change of hop size and other parameters, as well as the streamlining of some model structures, the required GPU memory for inference is **significantly reduced**. The 44kHz GPU memory usage of version 4.0 is even smaller than the 32kHz usage of version 3.0.
- Some code structures have been adjusted
- The dataset creation and training process are consistent with version 3.0, but the model is completely non-universal, and the data set needs to be fully pre-processed again.
- Added an option 1: automatic pitch prediction for vc mode, which means that you don't need to manually enter the pitch key when converting speech, and the pitch of male and female voices can be automatically converted. However, this mode will cause pitch shift when converting songs.
- Added option 2: reduce timbre leakage through k-means clustering scheme, making the timbre more similar to the target timbre.
- Added option 3: Added [NSF-HIFIGAN Enhancer](https://github.com/yxlllc/DDSP-SVC), which has certain sound quality enhancement effect on some models with few train-sets, but has negative effect on well-trained models, so it is closed by default
+- Feature input is changed to [Content Vec](https://github.com/auspicious3000/contentvec) Transformer output of 12 layer, the branch is not compatible with 4.0 model
  
 ## 💬 About Python Version

--- a/README_zh_CN.md
+++ b/README_zh_CN.md
@@ -29,16 +29,9 @@

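 The singing voice conversion model uses the SoftVC content encoder to extract speech features from the source audio, which are fed into VITS together with F0 in place of the original text input to achieve singing voice conversion. In addition, the vocoder is replaced with [NSF HiFiGAN](https://github.com/openvpi/DiffSinger/tree/refactor/modules/nsf_hifigan) to solve the problem of sound interruption

-### 🆕 4.0 Version Update Content
+### 🆕 4.0-Vec768-Layer12 Version Update Content

-+ Feature input is changed to [Content Vec](https://github.com/auspicious3000/contentvec)
-+ The sampling rate is unified to 44100 Hz
-+ Because the hop size and other parameters were changed and some model structures were streamlined, the GPU memory required for inference is **significantly reduced**; the 44 kHz GPU memory usage of version 4.0 is even lower than the 32 kHz usage of version 3.0
-+ Some code structures have been adjusted
-+ Dataset creation and the training process are the same as in 3.0, but the models are completely incompatible, and the dataset must be fully preprocessed again
-+ Added option 1: automatic f0 pitch prediction in vc mode, so there is no need to manually enter the pitch-shift key when converting speech, and male and female pitch is converted automatically; this applies to speech conversion only, and converting singing in this mode will go off-key
-+ Added option 2: reduce timbre leakage through a k-means clustering scheme, making the timbre closer to the target timbre
-+ Added option 3: added the [NSF-HIFIGAN Enhancer](https://github.com/yxlllc/DDSP-SVC), which improves sound quality somewhat for some models trained on small datasets but has a negative effect on well-trained models; disabled by default
+ Feature input is changed to the 12th-layer Transformer output of [Content Vec](https://github.com/auspicious3000/contentvec); this branch is not compatible with 4.0 models

 ## 💬 About the Python Version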

--- a/configs_template/config_template.json
+++ b/configs_template/config_template.json
@@ -52,8 +52,8 @@
    "upsample_kernel_sizes": [16,16, 4, 4, 4],
    "n_layers_q": 3,
    "use_spectral_norm": false,
-    "gin_channels": 256,
-    "ssl_dim": 256,
+    "gin_channels": 768,
+    "ssl_dim": 768,
    "n_speakers": 200
  },
  "spk": {
--- a/utils.py
+++ b/utils.py
@@ -232,12 +232,11 @@ def get_hubert_content(hmodel, wav_16k_tensor):
  inputs = {
    "source": feats.to(wav_16k_tensor.device),
    "padding_mask": padding_mask.to(wav_16k_tensor.device),
-    "output_layer": 9,  # layer 9
+    "output_layer": 12,  # layer 12
  }
  with torch.no_grad():
    logits = hmodel.extract_features(**inputs)
-    feats = hmodel.final_proj(logits[0])
-  return feats.transpose(1, 2)
+  return logits[0].transpose(1, 2)


 def get_content(cmodel, y):
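
Note on the config and utils.py hunks above: the two changes go together. With the final_proj call removed, get_hubert_content returns the layer-12 output directly, so the feature width the model is configured for (ssl_dim) has to follow. Below is a minimal sketch of that consistency check in plain Python. The helper names expected_ssl_dim and check_ssl_dim are hypothetical and not part of the repository, and the 768/256 widths are taken from the before/after values in the config hunk, on the assumption that they correspond to the unprojected and projected feature sizes.

```python
# Assumed feature widths, read off the config hunk in this commit:
# 256 before (final_proj applied), 768 after (raw layer output).
PROJECTED_DIM = 256
HIDDEN_DIM = 768


def expected_ssl_dim(use_final_proj):
    """Feature width the extractor is assumed to produce.

    use_final_proj=True models the old code path (final_proj applied),
    use_final_proj=False models the new one (raw layer output returned).
    """
    return PROJECTED_DIM if use_final_proj else HIDDEN_DIM


def check_ssl_dim(section, use_final_proj):
    """Return True when the config section's ssl_dim matches the extractor.

    `section` is the dict that holds "ssl_dim"; where that dict sits in the
    full config file is not visible in the diff, so it is passed in directly.
    """
    return section.get("ssl_dim") == expected_ssl_dim(use_final_proj)


if __name__ == "__main__":
    old_section = {"gin_channels": 256, "ssl_dim": 256}
    new_section = {"gin_channels": 768, "ssl_dim": 768}
    # A 4.0 config paired with the new extractor is flagged as inconsistent.
    print(check_ssl_dim(old_section, use_final_proj=False))
    print(check_ssl_dim(new_section, use_final_proj=False))
```

A check like this could run before training or inference to catch a 4.0 config being used on this branch, which is the incompatibility the README change warns about.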