zwa73 2023-03-31 01:29:25 +08:00
commit 0098c87d1e
18 changed files with 643 additions and 113 deletions

124
.github/ISSUE_TEMPLATE/ask_for_help.yaml vendored Normal file

@ -0,0 +1,124 @@
name: 请求帮助
description: 遇到了无法自行解决的错误
title: '[Help]: '
labels: [ "help wanted" ]
body:
- type: markdown
attributes:
value: |
#### 提问前请先自己去尝试解决,比如查看[本仓库wiki中的Quick solution](https://github.com/svc-develop-team/so-vits-svc/wiki/Quick-solution),也可以借助chatgpt或一些搜索引擎(谷歌/必应/New Bing/StackOverflow等等)。如果实在无法自己解决再发issue,在提issue之前请先了解《[提问的智慧](https://github.com/ryanhanwu/How-To-Ask-Questions-The-Smart-Way/blob/main/README-zh_CN.md)》。
---
### 什么样的issue会被直接close
1. 伸手党
2. 一键包/环境包相关
3. 提供的信息不全
4. 低级的如缺少依赖而导致无法运行的问题
5. 所用的数据集是无授权数据集(游戏角色/二次元人物暂不归为此类,但是训练时候也要小心谨慎。如果能联系到官方,必须先和官方联系并核实清楚)
---
- type: checkboxes
id: Clause
attributes:
label: 请勾选下方的确认框。
options:
- label: "我已仔细阅读[README.md](https://github.com/svc-develop-team/so-vits-svc/blob/4.0/README_zh_CN.md)和[wiki中的Quick solution](https://github.com/svc-develop-team/so-vits-svc/wiki/Quick-solution)。"
required: true
- label: "我已通过各种搜索引擎排查问题,我要提出的问题并不常见。"
required: true
- label: "我未在使用由第三方用户提供的一键包/环境包。"
required: true
- type: markdown
attributes:
value: |
# 请根据实际使用环境填写以下信息
- type: input
id: System
attributes:
label: 系统平台版本号
description: Windows执行`winver` | Linux执行`uname -a`
validations:
required: true
- type: input
id: GPU
attributes:
label: GPU 型号
description: 执行`nvidia-smi`
validations:
required: true
- type: input
id: PythonVersion
attributes:
label: Python版本
description: 执行`python -V`
validations:
required: true
- type: input
id: PyTorchVersion
attributes:
label: PyTorch版本
description: 执行`pip show torch`
validations:
required: true
- type: dropdown
id: Branch
attributes:
label: sovits分支
options:
- 4.0(默认)
- 4.0-v2
- 3.0-32k
- 3.0-48k
validations:
required: true
- type: input
id: DatasetSource
attributes:
label: 数据集来源(用于判断数据集质量)
description: UVR处理过的vtb直播音频、录音棚录制
validations:
required: true
- type: input
id: WhereOccurs
attributes:
label: 出现问题的环节或执行的命令
description: 如:预处理、训练、`python preprocess_hubert_f0.py`
validations:
required: true
- type: textarea
id: Description
attributes:
label: 问题描述
description: 在这里描述自己的问题,越详细越好
validations:
required: true
- type: textarea
id: Log
attributes:
label: 日志
description: 将从执行命令到执行完毕输出的所有信息(包括你所执行的命令)粘贴到[pastebin.com](https://pastebin.com/)并把剪贴板链接贴到这里
render: python
validations:
required: true
- type: textarea
id: ValidOneClick
attributes:
label: 截图`so-vits-svc`、`logs/44k`文件夹并粘贴到此处
validations:
required: true
- type: textarea
id: Supplementary
attributes:
label: 补充说明


@ -0,0 +1,124 @@
name: Ask for help
description: Encountered an error that cannot be resolved on your own
title: '[Help]: '
labels: [ "help wanted" ]
body:
- type: markdown
attributes:
value: |
#### Please try to solve the problem yourself before asking for help. Start with the *[Quick solution in wiki](https://github.com/svc-develop-team/so-vits-svc/wiki/Quick-solution)*, and use ChatGPT or search engines such as Google, Bing, New Bing, or StackOverflow until you are sure you cannot solve it on your own. Before raising an issue, please also read *[How To Ask Questions The Smart Way](http://www.catb.org/~esr/faqs/smart-questions.html)*.
---
### What kind of issue will be closed immediately
1. Beggars or Free Riders
2. One-click package / environment package (i.e. not installing dependencies via `pip install -r requirements.txt`)
3. Incomplete information
4. Low-level issues such as a missing dependency package
5. Using an unlicensed dataset (game characters / anime characters are temporarily not included in this category, but you still need to be careful during training. If you can contact the rights holder, you must contact them and verify it first.)
---
- type: checkboxes
id: Clause
attributes:
label: Please check the checkboxes below.
options:
- label: "I have read *[README.md](https://github.com/svc-develop-team/so-vits-svc/blob/4.0/README.md)* and *[Quick solution in wiki](https://github.com/svc-develop-team/so-vits-svc/wiki/Quick-solution)* carefully."
required: true
- label: "I have been troubleshooting issues through various search engines. The questions I want to ask are not common."
required: true
- label: "I am NOT using one click package / environment package."
required: true
- type: markdown
attributes:
value: |
# Please fill in the following information according to your actual environment
- type: input
id: System
attributes:
label: OS version
description: Windows run `winver` | Linux run `uname -a`
validations:
required: true
- type: input
id: GPU
attributes:
label: GPU
description: Run `nvidia-smi`
validations:
required: true
- type: input
id: PythonVersion
attributes:
label: Python version
description: Run `python -V`
validations:
required: true
- type: input
id: PyTorchVersion
attributes:
label: PyTorch version
description: Run `pip show torch`
validations:
required: true
- type: dropdown
id: Branch
attributes:
label: Branch of sovits
options:
- 4.0(Default)
- 4.0-v2
- 3.0-32k
- 3.0-48k
validations:
required: true
- type: input
id: DatasetSource
attributes:
label: Dataset source (Used to judge the dataset quality)
description: Such as UVR-processed streaming audio / recorded in a recording studio
validations:
required: true
- type: input
id: WhereOccurs
attributes:
label: Where the problem occurs or the command you executed
description: Such as Preprocessing / Training / `python preprocess_hubert_f0.py`
validations:
required: true
- type: textarea
id: Description
attributes:
label: Problem description
description: Describe your problem here, the more detailed the better.
validations:
required: true
- type: textarea
id: Log
attributes:
label: Log
description: All information output from the command you executed, from start to finish (including the command itself)
render: python
validations:
required: true
- type: textarea
id: ValidOneClick
attributes:
label: Take screenshots of the `so-vits-svc` and `logs/44k` folders and paste them here
validations:
required: true
- type: textarea
id: Supplementary
attributes:
label: Supplementary description

5
.github/ISSUE_TEMPLATE/config.yml vendored Normal file

@ -0,0 +1,5 @@
blank_issues_enabled: false
contact_links:
- name: 讨论区 / Discussions
url: https://github.com/svc-develop-team/so-vits-svc/discussions
about: 简单的询问/讨论请转至讨论区或发起一个低优先级的Default issue / For simple inquiries / discussions, please go to the discussions or raise a low priority Default issue

7
.github/ISSUE_TEMPLATE/default.md vendored Normal file

@ -0,0 +1,7 @@
---
name: Default issue
about: 如果模板中没有你想发起的issue类型,可以选择此项,但这个issue也许会获得较低的处理优先级 / If there is no issue type that matches what you want to raise, you can start with this one, but it may be handled with a lower priority.
title: ''
labels: 'not urgent'
assignees: ''
---

3
.github/no-response.yml vendored Normal file

@ -0,0 +1,3 @@
daysUntilClose: 7
responseRequiredLabel: waiting response
closeComment: 由于缺少必要信息且没有回应,该 issue 已被自动关闭,如有需要补充的内容请回复并自行重新打开该 issue


@ -2,24 +2,30 @@
[**English**](./README.md) | [**中文简体**](./README_zh_CN.md)
## Terms of Use
#### ✨ A fork with a greatly improved interface: [34j/so-vits-svc-fork](https://github.com/34j/so-vits-svc-fork)
1. This project is established for academic exchange purposes only and is intended for communication and learning purposes. It is not intended for production environments. Please solve the authorization problem of the dataset on your own. You shall be solely responsible for any problems caused by the use of non-authorized datasets for training and all consequences thereof.
#### ✨ A client supports real-time conversion: [w-okada/voice-changer](https://github.com/w-okada/voice-changer)
## 📏 Terms of Use
# Warning: Please solve the authorization problem of the dataset on your own. You shall be solely responsible for any problems caused by the use of non-authorized datasets for training and all consequences thereof. The repository and its maintainer, svc develop team, have nothing to do with those consequences!
1. This project is established for academic exchange purposes only and is intended for communication and learning purposes. It is not intended for production environments.
2. Any videos based on sovits that are published on video platforms must clearly state in the description that they are used for voice conversion, and must specify the input source of the voice or audio. For example, if you use a video or audio published by others and separate the vocals as the input source, you must provide a clear link to the original video or music; if you use your own voice, or a voice synthesized by other commercial vocal synthesis software, as the input source, you must also state this in the description.
3. You shall be solely responsible for any infringement problems caused by the input source. When using other commercial vocal synthesis software as input source, please ensure that you comply with the terms of use of the software. Note that many vocal synthesis engines clearly state in their terms of use that they cannot be used for input source conversion.
4. Continued use of this project is deemed as agreement to the terms stated in this repository's README. This README has fulfilled its duty to advise and is not responsible for any problems that may arise later.
5. If you distribute this repository's code or publish any results produced by this project publicly (including but not limited to video sharing platforms), please indicate the original author and code source (this repository).
6. If you use this project for any other plan, please contact and inform the author of this repository in advance. Thank you very much.
## Update
## 🆕 Update!
> Updated the 4.0-v2 model, the entire process is the same as 4.0. Compared to 4.0, there is some improvement in certain scenarios, but there are also some cases where it has regressed. Please refer to the [4.0-v2 branch](https://github.com/svc-develop-team/so-vits-svc/tree/4.0-v2) for more information.
## Model Introduction
## 📝 Model Introduction
The singing voice conversion model uses SoftVC content encoder to extract source audio speech features, then the vectors are directly fed into VITS instead of converting to a text based intermediate; thus the pitch and intonations are conserved. Additionally, the vocoder is changed to [NSF HiFiGAN](https://github.com/openvpi/DiffSinger/tree/refactor/modules/nsf_hifigan) to solve the problem of sound interruption.
### 4.0 Version Update Content
### 🆕 4.0 Version Update Content
- Feature input is changed to [Content Vec](https://github.com/auspicious3000/contentvec)
- The sampling rate is unified to use 44100Hz
@ -29,7 +35,11 @@ The singing voice conversion model uses SoftVC content encoder to extract source
- Added option 1: automatic pitch prediction for vc mode, which means you don't need to manually enter the pitch key when converting speech, and male and female pitches are transposed automatically. However, this mode will cause pitch shift when converting songs.
- Added option 2: reduce timbre leakage through k-means clustering scheme, making the timbre more similar to the target timbre.
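The idea behind option 2 can be sketched as follows: each frame's content vector is snapped to its nearest k-means cluster centre and linearly blended with the raw vector, trading timbre similarity against articulation. The snippet below is only a rough illustration using a generic scikit-learn `KMeans`; it is not the repository's own cluster module, and the shapes and ratio value are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustration only: blend raw content features with their nearest k-means centre
# to suppress residual speaker (timbre) information, as the clustering option describes.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 256))               # (frames, feature_dim) content vectors

kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(features)  # stand-in for the trained cluster model
centres = kmeans.cluster_centers_[kmeans.predict(features)]              # nearest centre per frame

cluster_infer_ratio = 0.5                             # 0 = raw features, 1 = fully clustered
blended = cluster_infer_ratio * centres + (1 - cluster_infer_ratio) * features
```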
## Pre-trained Model Files
## 💬 About Python Version
After conducting tests, we believe that the project runs stably on Python version 3.8.9.
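If you want to confirm the interpreter you are about to use, a small optional check (nothing repository-specific) is:

```python
import sys

# Optional sanity check: the project is reported to run stably on Python 3.8.9.
if sys.version_info[:2] != (3, 8):
    print(f"Note: tested on Python 3.8.9, you are running {sys.version.split()[0]}")
```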
## 📥 Pre-trained Model Files
#### **Required**
@ -51,11 +61,11 @@ Get them from svc-develop-team(TBD) or anywhere else.
Although the pretrained model generally does not cause copyright problems, please still pay attention to it, for example by asking the author in advance, or checking that the author has clearly indicated the permitted uses in the model description.
## Dataset Preparation
## 📊 Dataset Preparation
Simply place the dataset in the `dataset_raw` directory with the following file structure.
```shell
```
dataset_raw
├───speaker0
│ ├───xxx1-xxx1.wav
@ -67,15 +77,25 @@ dataset_raw
└───xxx7-xxx007.wav
```
## Preprocessing
You can customize the speaker name.
1. Resample to 44100hz
```
dataset_raw
└───suijiSUI
├───1.wav
├───...
└───25788785-20221210-200143-856_01_(Vocals)_0_0.wav
```
## 🛠️ Preprocessing
1. Resample to 44100Hz and mono
```shell
python resample.py
```
2. Automatically split the dataset into training, validation, and test sets, and generate configuration files
2. Automatically split the dataset into training and validation sets, and generate configuration files
```shell
python preprocess_flist_config.py
@ -89,7 +109,7 @@ python preprocess_hubert_f0.py
After completing the above steps, the dataset directory will contain the preprocessed data, and the dataset_raw folder can be deleted.
## Training
## 🏋️‍♀️ Training
```shell
python train.py -c configs/config.json -m 44k
@ -97,7 +117,7 @@ python train.py -c configs/config.json -m 44k
Note: During training, old models are automatically cleared and only the latest three are kept. If you want to guard against overfitting, manually back up the model checkpoints, or set `keep_ckpts` in the configuration file to 0 to never clear them.
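For example, assuming `keep_ckpts` sits under the `train` section of `configs/config.json` as in the default template (adjust the key if your config differs), it can be switched off programmatically; editing the JSON by hand works just as well:

```python
import json

# Assumption: keep_ckpts lives under the "train" section of configs/config.json.
with open("configs/config.json", "r", encoding="utf-8") as f:
    cfg = json.load(f)

cfg.setdefault("train", {})["keep_ckpts"] = 0   # 0 = never delete old checkpoints
with open("configs/config.json", "w", encoding="utf-8") as f:
    json.dump(cfg, f, indent=2)
```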
## Inference
## 🤖 Inference
Use [inference_main.py](https://github.com/svc-develop-team/so-vits-svc/blob/4.0/inference_main.py)
@ -109,27 +129,25 @@ python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "
```
Required parameters:
- -m, --model_path: path to the model.
- -c, --config_path: path to the configuration file.
- -n, --clean_names: a list of wav file names located in the raw folder.
- -t, --trans: pitch adjustment, supports positive and negative (semitone) values.
- -s, --spk_list: target speaker name for synthesis.
- -cl, --clip: voice auto-split; set to 0 to disable; unit: seconds.
Optional parameters: see the next section
- -a, --auto_predict_f0: automatic pitch prediction for voice conversion, do not enable this when converting songs as it can cause serious pitch issues.
- -cm, --cluster_model_path: path to the clustering model, fill in any value if clustering is not trained.
- -cr, --cluster_infer_ratio: proportion of the clustering solution, range 0-1, fill in 0 if the clustering model is not trained.
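The same options can also be driven from Python via `inference/infer_tool.py`. The sketch below follows the `Svc.slice_inference` signature added in this commit; the constructor arguments mirror how `webUI.py` calls it, and any other details are assumptions rather than a documented API.

```python
import soundfile
from inference.infer_tool import Svc

# Rough sketch only: keyword names follow slice_inference as added in this commit.
model = Svc("logs/44k/G_30400.pth", "configs/config.json",
            cluster_model_path="")              # leave empty if no cluster model was trained

audio = model.slice_inference(
    "raw/君の知らない物語-src.wav",             # a wav file placed in the raw folder
    spk="nen", tran=0,                          # target speaker and pitch shift in semitones
    slice_db=-40, cluster_infer_ratio=0,
    auto_predict_f0=False,                      # keep False for songs to avoid pitch issues
    noice_scale=0.4, pad_seconds=0.5,
    clip_seconds=30, lg_num=1, lgr_num=0.75,    # auto-clip long segments with a 1 s crossfade
)
soundfile.write("results/output.flac", audio, model.target_sample)
```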
## Optional Settings
## 🤔 Optional Settings
If the results from the previous section are satisfactory, or if you didn't understand what is being discussed in the following section, you can skip it, and it won't affect the model usage. (These optional settings have a relatively small impact, and they may have some effect on certain specific data, but in most cases, the difference may not be noticeable.)
### Automatic f0 prediction
During the 4.0 model training, an f0 predictor is also trained, which can be used for automatic pitch prediction during voice conversion. However, if the effect is not good, manual pitch prediction can be used instead. But please do not enable this feature when converting singing voice as it may cause serious pitch shifting!
- Set "auto_predict_f0" to true in inference_main.
### Cluster-based timbre leakage control
@ -149,7 +167,7 @@ The existing steps before clustering do not need to be changed. All you need to
#### [23/03/16] No longer need to download hubert manually
## Exporting to Onnx
## 📤 Exporting to Onnx
Use [onnx_export.py](https://github.com/svc-develop-team/so-vits-svc/blob/4.0/onnx_export.py)
@ -166,7 +184,27 @@ Use [onnx_export.py](https://github.com/svc-develop-team/so-vits-svc/blob/4.0/on
Note: For Hubert Onnx models, please use the models provided by MoeSS. Currently, they cannot be exported on their own (Hubert in fairseq has many unsupported operators and things involving constants that can cause errors or result in problems with the input/output shape and results when exported.) [Hubert4.0](https://huggingface.co/NaruseMioShirakana/MoeSS-SUBModel)
## Some legal provisions for reference
## ☀️ Previous contributors
For some reason the original author deleted the original repository. Due to an oversight by the organization members, all files were re-uploaded directly when this repository was rebuilt, which wiped the contributor list; a list of previous contributors is therefore added back to README.md.
*Some members are not listed, in accordance with their personal wishes.*
<table>
<tr>
<td align="center"><a href="https://github.com/MistEO"><img src="https://avatars.githubusercontent.com/u/18511905?v=4" width="100px;" alt=""/><br /><sub><b>MistEO</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/XiaoMiku01"><img src="https://avatars.githubusercontent.com/u/54094119?v=4" width="100px;" alt=""/><br /><sub><b>XiaoMiku01</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/ForsakenRei"><img src="https://avatars.githubusercontent.com/u/23041178?v=4" width="100px;" alt=""/><br /><sub><b>しぐれ</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/TomoGaSukunai"><img src="https://avatars.githubusercontent.com/u/25863522?v=4" width="100px;" alt=""/><br /><sub><b>TomoGaSukunai</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/Plachtaa"><img src="https://avatars.githubusercontent.com/u/112609742?v=4" width="100px;" alt=""/><br /><sub><b>Plachtaa</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/zdxiaoda"><img src="https://avatars.githubusercontent.com/u/45501959?v=4" width="100px;" alt=""/><br /><sub><b>zd小达</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/Archivoice"><img src="https://avatars.githubusercontent.com/u/107520869?v=4" width="100px;" alt=""/><br /><sub><b>凍聲響世</b></sub></a><br /></td>
</tr>
</table>
## 📚 Some legal provisions for reference
#### Any country, region, organization, or individual using this project must comply with the following laws.
#### 《民法典》
@ -184,3 +222,14 @@ Note: For Hubert Onnx models, please use the models provided by MoeSS. Currently
【作品侵害名誉权】行为人发表的文学、艺术作品以真人真事或者特定人为描述对象,含有侮辱、诽谤内容,侵害他人名誉权的,受害人有权依法请求该行为人承担民事责任。
行为人发表的文学、艺术作品不以特定人为描述对象,仅其中的情节与该特定人的情况相似的,不承担民事责任。
#### 《[中华人民共和国宪法](http://www.gov.cn/guoqing/2018-03/22/content_5276318.htm)》
#### 《[中华人民共和国刑法](http://gongbao.court.gov.cn/Details/f8e30d0689b23f57bfc782d21035c3.html?sw=%E4%B8%AD%E5%8D%8E%E4%BA%BA%E6%B0%91%E5%85%B1%E5%92%8C%E5%9B%BD%E5%88%91%E6%B3%95)》
#### 《[中华人民共和国民法典](http://gongbao.court.gov.cn/Details/51eb6750b8361f79be8f90d09bc202.html)》
## 💪 Thanks to all contributors for their efforts
<a href="https://github.com/svc-develop-team/so-vits-svc/graphs/contributors" target="_blank">
<img src="https://contrib.rocks/image?repo=svc-develop-team/so-vits-svc" />
</a>


@ -2,24 +2,30 @@
[**English**](./README.md) | [**中文简体**](./README_zh_CN.md)
## 使用规约
#### ✨ 改善了交互的一个分支推荐:[34j/so-vits-svc-fork](https://github.com/34j/so-vits-svc-fork)
1. 本项目是基于学术交流目的建立,仅供交流与学习使用,并非为生产环境准备,请自行解决数据集的授权问题,任何由于使用非授权数据集进行训练造成的问题,需自行承担全部责任和一切后果!
#### ✨ 支持实时转换的一个客户端:[w-okada/voice-changer](https://github.com/w-okada/voice-changer)
## 📏 使用规约
# Warning请自行解决数据集授权问题禁止使用非授权数据集进行训练任何由于使用非授权数据集进行训练造成的问题需自行承担全部责任和后果与仓库、仓库维护者、svc develop team 无关!
1. 本项目是基于学术交流目的建立,仅供交流与学习使用,并非为生产环境准备。
2. 任何发布到视频平台的基于 sovits 制作的视频,都必须要在简介明确指明用于变声器转换的输入源歌声、音频,例如:使用他人发布的视频 / 音频,通过分离的人声作为输入源进行转换的,必须要给出明确的原视频、音乐链接;若使用是自己的人声,或是使用其他歌声合成引擎合成的声音作为输入源进行转换的,也必须在简介加以说明。
3. 由输入源造成的侵权问题需自行承担全部责任和一切后果。使用其他商用歌声合成软件作为输入源时,请确保遵守该软件的使用条例,注意,许多歌声合成引擎使用条例中明确指明不可用于输入源进行转换!
4. 继续使用视为已同意本仓库 README 所述相关条例,本仓库 README 已进行劝导义务,不对后续可能存在问题负责。
5. 如将本仓库代码二次分发,或将由此项目产出的任何结果公开发表 (包括但不限于视频网站投稿),请注明原作者及代码来源 (此仓库)。
6. 如果将此项目用于任何其他企划,请提前联系并告知本仓库作者,十分感谢。
## update
## 🆕 Update!
> 更新了4.0-v2模型全部流程同4.0相比4.0在部分场景下有一定提升,但也有些情况有退步,具体可移步[4.0-v2分支](https://github.com/svc-develop-team/so-vits-svc/tree/4.0-v2)
## 模型简介
## 📝 模型简介
歌声音色转换模型通过SoftVC内容编码器提取源音频语音特征与F0同时输入VITS替换原本的文本输入达到歌声转换的效果。同时更换声码器为 [NSF HiFiGAN](https://github.com/openvpi/DiffSinger/tree/refactor/modules/nsf_hifigan) 解决断音问题
### 4.0版本更新内容
### 🆕 4.0 版本更新内容
+ 特征输入更换为 [Content Vec](https://github.com/auspicious3000/contentvec)
+ 采样率统一使用44100hz
@ -29,7 +35,11 @@
+ 增加了可选项 1vc模式自动预测音高f0,即转换语音时不需要手动输入变调key男女声的调能自动转换但仅限语音转换该模式转换歌声会跑调
+ 增加了可选项 2通过kmeans聚类方案减小音色泄漏即使得音色更加像目标音色
## 预先下载的模型文件
## 💬 关于 Python 版本问题
我们在进行测试后,认为 Python 3.8.9 版本能够稳定地运行该项目
## 📥 预先下载的模型文件
#### **必须项**
@ -51,11 +61,11 @@ http://obs.cstcloud.cn/share/obs/sankagenkeshi/checkpoint_best_legacy_500.pt
虽然底模一般不会引起什么版权问题,但还是请注意一下,比如事先询问作者,又或者作者在模型描述中明确写明了可行的用途
## 数据集准备
## 📊 数据集准备
仅需要以以下文件结构将数据集放入dataset_raw目录即可
```shell
```
dataset_raw
├───speaker0
│ ├───xxx1-xxx1.wav
@ -67,15 +77,25 @@ dataset_raw
└───xxx7-xxx007.wav
```
## 数据预处理
可以自定义说话人名称
1. 重采样至 44100hz
```
dataset_raw
└───suijiSUI
├───1.wav
├───...
└───25788785-20221210-200143-856_01_(Vocals)_0_0.wav
```
## 🛠️ 数据预处理
1. 重采样至44100Hz单声道
```shell
python resample.py
```
2. 自动划分训练集 验证集 测试集 以及自动生成配置文件
2. 自动划分训练集、验证集,以及自动生成配置文件
```shell
python preprocess_flist_config.py
@ -89,14 +109,15 @@ python preprocess_hubert_f0.py
执行完以上步骤后 dataset 目录便是预处理完成的数据可以删除dataset_raw文件夹了
## 训练
## 🏋️‍♀️ 训练
```shell
python train.py -c configs/config.json -m 44k
```
训练时会自动清除老的模型只保留最新3个模型如果想防止过拟合需要自己手动备份模型记录点,或修改配置文件keep_ckpts 0为永不清除
## 推理
## 🤖 推理
使用 [inference_main.py](inference_main.py)
@ -113,13 +134,14 @@ python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "
+ -n, --clean_nameswav 文件名列表,放在 raw 文件夹下。
+ -t, --trans音高调整支持正负半音
+ -s, --spk_list合成目标说话人名称。
+ -cl, --clip音频自动切片0为不切片单位为秒/s。
可选项部分:见下一节
+ -a, --auto_predict_f0语音转换自动预测音高转换歌声时不要打开这个会严重跑调。
+ -cm, --cluster_model_path聚类模型路径如果没有训练聚类则随便填。
+ -cr, --cluster_infer_ratio聚类方案占比范围 0-1若没有训练聚类模型则填 0 即可。
## 可选项
## 🤔 可选项
如果前面的效果已经满意,或者没看明白下面在讲啥,那后面的内容都可以忽略,不影响模型使用(这些可选项影响比较小,可能在某些特定数据上有点效果,但大部分情况似乎都感知不太明显)
@ -130,8 +152,7 @@ python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "
### 聚类音色泄漏控制
介绍:聚类方案可以减小音色泄漏,使得模型训练出来更像目标的音色(但其实不是特别明显),但是单纯的聚类方案会降低模型的咬字(会口齿不清)(这个很明显),本模型采用了融合的方式,
可以线性控制聚类方案与非聚类方案的占比,也就是可以手动在"像目标音色" 和 "咬字清晰" 之间调整比例,找到合适的折中点。
介绍:聚类方案可以减小音色泄漏,使得模型训练出来更像目标的音色(但其实不是特别明显),但是单纯的聚类方案会降低模型的咬字(会口齿不清)(这个很明显),本模型采用了融合的方式,可以线性控制聚类方案与非聚类方案的占比,也就是可以手动在"像目标音色" 和 "咬字清晰" 之间调整比例,找到合适的折中点。
使用聚类前面的已有步骤不用进行任何的变动,只需要额外训练一个聚类模型,虽然效果比较有限,但训练成本也比较低
@ -146,7 +167,7 @@ python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "
#### [23/03/16] 不再需要手动下载hubert
## Onnx导出
## 📤 Onnx导出
使用 [onnx_export.py](onnx_export.py)
+ 新建文件夹:`checkpoints` 并打开
@ -163,21 +184,49 @@ python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "
+ 注意Hubert Onnx模型请使用MoeSS提供的模型目前无法自行导出fairseq中Hubert有不少onnx不支持的算子和涉及到常量的东西在导出时会报错或者导出的模型输入输出shape和结果都有问题
[Hubert4.0](https://huggingface.co/NaruseMioShirakana/MoeSS-SUBModel)
## 一些法律条例参考
## ☀️ 旧贡献者
因为某些原因原作者进行了删库处理本仓库重建之初由于组织成员疏忽直接重新上传了所有文件导致以前的contributors全部木大现在在README里重新添加一个旧贡献者列表
*某些成员已根据其个人意愿不将其列出*
<table>
<tr>
<td align="center"><a href="https://github.com/MistEO"><img src="https://avatars.githubusercontent.com/u/18511905?v=4" width="100px;" alt=""/><br /><sub><b>MistEO</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/XiaoMiku01"><img src="https://avatars.githubusercontent.com/u/54094119?v=4" width="100px;" alt=""/><br /><sub><b>XiaoMiku01</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/ForsakenRei"><img src="https://avatars.githubusercontent.com/u/23041178?v=4" width="100px;" alt=""/><br /><sub><b>しぐれ</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/TomoGaSukunai"><img src="https://avatars.githubusercontent.com/u/25863522?v=4" width="100px;" alt=""/><br /><sub><b>TomoGaSukunai</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/Plachtaa"><img src="https://avatars.githubusercontent.com/u/112609742?v=4" width="100px;" alt=""/><br /><sub><b>Plachtaa</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/zdxiaoda"><img src="https://avatars.githubusercontent.com/u/45501959?v=4" width="100px;" alt=""/><br /><sub><b>zd小达</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/Archivoice"><img src="https://avatars.githubusercontent.com/u/107520869?v=4" width="100px;" alt=""/><br /><sub><b>凍聲響世</b></sub></a><br /></td>
</tr>
</table>
## 📚 一些法律条例参考
#### 任何国家,地区,组织和个人使用此项目必须遵守以下法律
#### 《民法典》
##### 第一千零一十九条
##### 第一千零一十九条
任何组织或者个人不得以丑化、污损,或者利用信息技术手段伪造等方式侵害他人的肖像权。未经肖像权人同意,不得制作、使用、公开肖像权人的肖像,但是法律另有规定的除外。
未经肖像权人同意,肖像作品权利人不得以发表、复制、发行、出租、展览等方式使用或者公开肖像权人的肖像。
对自然人声音的保护,参照适用肖像权保护的有关规定。
任何组织或者个人不得以丑化、污损,或者利用信息技术手段伪造等方式侵害他人的肖像权。未经肖像权人同意,不得制作、使用、公开肖像权人的肖像,但是法律另有规定的除外。 未经肖像权人同意,肖像作品权利人不得以发表、复制、发行、出租、展览等方式使用或者公开肖像权人的肖像。 对自然人声音的保护,参照适用肖像权保护的有关规定。
##### 第一千零二十四条
##### 第一千零二十四条
【名誉权】民事主体享有名誉权。任何组织或者个人不得以侮辱、诽谤等方式侵害他人的名誉权。
【名誉权】民事主体享有名誉权。任何组织或者个人不得以侮辱、诽谤等方式侵害他人的名誉权。
##### 第一千零二十七条
##### 第一千零二十七条
【作品侵害名誉权】行为人发表的文学、艺术作品以真人真事或者特定人为描述对象,含有侮辱、诽谤内容,侵害他人名誉权的,受害人有权依法请求该行为人承担民事责任。
行为人发表的文学、艺术作品不以特定人为描述对象,仅其中的情节与该特定人的情况相似的,不承担民事责任。
【作品侵害名誉权】行为人发表的文学、艺术作品以真人真事或者特定人为描述对象,含有侮辱、诽谤内容,侵害他人名誉权的,受害人有权依法请求该行为人承担民事责任。 行为人发表的文学、艺术作品不以特定人为描述对象,仅其中的情节与该特定人的情况相似的,不承担民事责任。
#### 《[中华人民共和国宪法](http://www.gov.cn/guoqing/2018-03/22/content_5276318.htm)》
#### 《[中华人民共和国刑法](http://gongbao.court.gov.cn/Details/f8e30d0689b23f57bfc782d21035c3.html?sw=中华人民共和国刑法)》
#### 《[中华人民共和国民法典](http://gongbao.court.gov.cn/Details/51eb6750b8361f79be8f90d09bc202.html)》
## 💪 感谢所有的贡献者
<a href="https://github.com/svc-develop-team/so-vits-svc/graphs/contributors" target="_blank">
<img src="https://contrib.rocks/image?repo=svc-develop-team/so-vits-svc" />
</a>


@ -47,6 +47,8 @@ class TextAudioSpeakerLoader(torch.utils.data.Dataset):
audio_norm = audio / self.max_wav_value
audio_norm = audio_norm.unsqueeze(0)
spec_filename = filename.replace(".wav", ".spec.pt")
# Ideally, all data generated after Mar 25 should have .spec.pt
if os.path.exists(spec_filename):
spec = torch.load(spec_filename)
else:


@ -102,6 +102,10 @@ def pad_array(arr, target_length):
pad_right = pad_width - pad_left
padded_arr = np.pad(arr, (pad_left, pad_right), 'constant', constant_values=(0, 0))
return padded_arr
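# Example (assuming the usual centre-padding computed above):
#   pad_array(np.array([1, 2, 3]), 7) -> array([0, 0, 1, 2, 3, 0, 0])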
def split_list_by_n(list_collection, n, pre=0):
for i in range(0, len(list_collection), n):
yield list_collection[i-pre if i-pre>=0 else i: i + n]
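# Usage sketch: yields chunks of n items where every chunk after the first also
# re-reads the previous `pre` items, so neighbouring chunks overlap for crossfading, e.g.
#   list(split_list_by_n(list(range(10)), 4, pre=2))
#   -> [[0, 1, 2, 3], [2, 3, 4, 5, 6, 7], [6, 7, 8, 9]]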
class F0FilterException(Exception):
@ -194,38 +198,59 @@ class Svc(object):
# 清理显存
torch.cuda.empty_cache()
def slice_inference(self,raw_audio_path, spk, tran, slice_db,cluster_infer_ratio, auto_predict_f0,noice_scale, pad_seconds=0.5):
def slice_inference(self,raw_audio_path, spk, tran, slice_db,cluster_infer_ratio, auto_predict_f0,noice_scale, pad_seconds=0.5, clip_seconds=0,lg_num=0,lgr_num =0.75):
wav_path = raw_audio_path
chunks = slicer.cut(wav_path, db_thresh=slice_db)
audio_data, audio_sr = slicer.chunks2audio(wav_path, chunks)
per_size = int(clip_seconds*audio_sr)  # forced clip length in samples (0 = no extra clipping)
lg_size = int(lg_num*audio_sr)  # crossfade window length in samples
lg_size_r = int(lg_size*lgr_num)  # retained part of the crossfade window
lg_size_c_l = (lg_size-lg_size_r)//2  # samples trimmed on the left of the window
lg_size_c_r = lg_size-lg_size_r-lg_size_c_l  # samples trimmed on the right of the window
lg = np.linspace(0,1,lg_size_r) if lg_size!=0 else 0  # linear 0->1 ramp used to blend neighbouring chunks
audio = []
for (slice_tag, data) in audio_data:
print(f'#=====segment start, {round(len(data) / audio_sr, 3)}s======')
# padd
pad_len = int(audio_sr * pad_seconds)
data = np.concatenate([np.zeros([pad_len]), data, np.zeros([pad_len])])
length = int(np.ceil(len(data) / audio_sr * self.target_sample))
raw_path = io.BytesIO()
soundfile.write(raw_path, data, audio_sr, format="wav")
raw_path.seek(0)
if slice_tag:
print('jump empty segment')
_audio = np.zeros(length)
audio.extend(list(pad_array(_audio, length)))
continue
if per_size != 0:
datas = split_list_by_n(data, per_size,lg_size)
else:
datas = [data]
for k,dat in enumerate(datas):
per_length = int(np.ceil(len(dat) / audio_sr * self.target_sample)) if clip_seconds!=0 else length
if clip_seconds!=0: print(f'###=====segment clip start, {round(len(dat) / audio_sr, 3)}s======')
# padd
pad_len = int(audio_sr * pad_seconds)
dat = np.concatenate([np.zeros([pad_len]), dat, np.zeros([pad_len])])
raw_path = io.BytesIO()
soundfile.write(raw_path, dat, audio_sr, format="wav")
raw_path.seek(0)
out_audio, out_sr = self.infer(spk, tran, raw_path,
cluster_infer_ratio=cluster_infer_ratio,
auto_predict_f0=auto_predict_f0,
noice_scale=noice_scale
)
_audio = out_audio.cpu().numpy()
pad_len = int(self.target_sample * pad_seconds)
_audio = _audio[pad_len:-pad_len]
audio.extend(list(_audio))
pad_len = int(self.target_sample * pad_seconds)
_audio = _audio[pad_len:-pad_len]
_audio = pad_array(_audio, per_length)
if lg_size!=0 and k!=0:
lg1 = audio[-(lg_size_r+lg_size_c_r):-lg_size_c_r] if lgr_num != 1 else audio[-lg_size:]
lg2 = _audio[lg_size_c_l:lg_size_c_l+lg_size_r] if lgr_num != 1 else _audio[0:lg_size]
lg_pre = lg1*(1-lg)+lg2*lg
audio = audio[0:-(lg_size_r+lg_size_c_r)] if lgr_num != 1 else audio[0:-lg_size]
audio.extend(lg_pre)
_audio = _audio[lg_size_c_l+lg_size_r:] if lgr_num != 1 else _audio[lg_size:]
audio.extend(list(_audio))
return np.array(audio)
class RealTimeVC:
def __init__(self):
self.last_chunk = None


@ -25,6 +25,7 @@ def main():
# 一定要设置的部分
parser.add_argument('-m', '--model_path', type=str, default="logs/44k/G_0.pth", help='模型路径')
parser.add_argument('-c', '--config_path', type=str, default="configs/config.json", help='配置文件路径')
parser.add_argument('-cl', '--clip', type=float, default=0, help='音频自动切片0为不切片单位为秒/s')
parser.add_argument('-n', '--clean_names', type=str, nargs='+', default=["君の知らない物語-src.wav"], help='wav文件名列表放在raw文件夹下')
parser.add_argument('-t', '--trans', type=int, nargs='+', default=[0], help='音高调整,支持正负(半音)')
parser.add_argument('-s', '--spk_list', type=str, nargs='+', default=['nen'], help='合成目标说话人名称')
@ -34,6 +35,7 @@ def main():
help='语音转换自动预测音高,转换歌声时不要打开这个会严重跑调')
parser.add_argument('-cm', '--cluster_model_path', type=str, default="logs/44k/kmeans_10000.pt", help='聚类模型路径,如果没有训练聚类则随便填')
parser.add_argument('-cr', '--cluster_infer_ratio', type=float, default=0, help='聚类方案占比范围0-1若没有训练聚类模型则填0即可')
parser.add_argument('-lg', '--linear_gradient', type=float, default=0, help='两段音频切片的交叉淡入长度如果自动切片后出现人声不连贯可调整该数值如果连贯建议采用默认值0单位为秒/s')
# 不用动的部分
parser.add_argument('-sd', '--slice_db', type=int, default=-40, help='默认-40嘈杂的音频可以-30干声保留呼吸可以-50')
@ -41,6 +43,7 @@ def main():
parser.add_argument('-ns', '--noice_scale', type=float, default=0.4, help='噪音级别,会影响咬字和音质,较为玄学')
parser.add_argument('-p', '--pad_seconds', type=float, default=0.5, help='推理音频pad秒数由于未知原因开头结尾会有异响pad一小段静音段后就不会出现')
parser.add_argument('-wf', '--wav_format', type=str, default='flac', help='音频输出格式')
parser.add_argument('-lgr', '--linear_gradient_retain', type=float, default=0.75, help='自动音频切片后需要舍弃每段切片的头尾。该参数设置交叉长度保留的比例范围0-1,左开右闭')
args = parser.parse_args()
@ -55,6 +58,9 @@ def main():
cluster_infer_ratio = args.cluster_infer_ratio
noice_scale = args.noice_scale
pad_seconds = args.pad_seconds
clip = args.clip
lg = args.linear_gradient
lgr = args.linear_gradient_retain
infer_tool.fill_a_to_b(trans, clean_names)
for clean_name, tran in zip(clean_names, trans):
@ -65,22 +71,36 @@ def main():
wav_path = Path(raw_audio_path).with_suffix('.wav')
chunks = slicer.cut(wav_path, db_thresh=slice_db)
audio_data, audio_sr = slicer.chunks2audio(wav_path, chunks)
per_size = int(clip*audio_sr)
lg_size = int(lg*audio_sr)
lg_size_r = int(lg_size*lgr)
lg_size_c_l = (lg_size-lg_size_r)//2
lg_size_c_r = lg_size-lg_size_r-lg_size_c_l
lg = np.linspace(0,1,lg_size_r) if lg_size!=0 else 0
for spk in spk_list:
audio = []
for (slice_tag, data) in audio_data:
print(f'#=====segment start, {round(len(data) / audio_sr, 3)}s======')
length = int(np.ceil(len(data) / audio_sr * svc_model.target_sample))
if slice_tag:
print('jump empty segment')
_audio = np.zeros(length)
audio.extend(list(infer_tool.pad_array(_audio, length)))
continue
if per_size != 0:
datas = infer_tool.split_list_by_n(data, per_size,lg_size)
else:
datas = [data]
for k,dat in enumerate(datas):
per_length = int(np.ceil(len(dat) / audio_sr * svc_model.target_sample)) if clip!=0 else length
if clip!=0: print(f'###=====segment clip start, {round(len(dat) / audio_sr, 3)}s======')
# padd
pad_len = int(audio_sr * pad_seconds)
data = np.concatenate([np.zeros([pad_len]), data, np.zeros([pad_len])])
dat = np.concatenate([np.zeros([pad_len]), dat, np.zeros([pad_len])])
raw_path = io.BytesIO()
soundfile.write(raw_path, data, audio_sr, format="wav")
soundfile.write(raw_path, dat, audio_sr, format="wav")
raw_path.seek(0)
out_audio, out_sr = svc_model.infer(spk, tran, raw_path,
cluster_infer_ratio=cluster_infer_ratio,
@ -90,8 +110,15 @@ def main():
_audio = out_audio.cpu().numpy()
pad_len = int(svc_model.target_sample * pad_seconds)
_audio = _audio[pad_len:-pad_len]
audio.extend(list(infer_tool.pad_array(_audio, length)))
_audio = infer_tool.pad_array(_audio, per_length)
if lg_size!=0 and k!=0:
lg1 = audio[-(lg_size_r+lg_size_c_r):-lg_size_c_r] if lgr != 1 else audio[-lg_size:]
lg2 = _audio[lg_size_c_l:lg_size_c_l+lg_size_r] if lgr != 1 else _audio[0:lg_size]
lg_pre = lg1*(1-lg)+lg2*lg
audio = audio[0:-(lg_size_r+lg_size_c_r)] if lgr != 1 else audio[0:-lg_size]
audio.extend(lg_pre)
_audio = _audio[lg_size_c_l+lg_size_r:] if lgr != 1 else _audio[lg_size:]
audio.extend(list(_audio))
key = "auto" if auto_predict_f0 else f"{tran}key"
cluster_name = "" if cluster_infer_ratio == 0 else f"_{cluster_infer_ratio}"
res_path = f'./results/{clean_name}_{key}_{spk}{cluster_name}.{wav_format}'


@ -16,11 +16,12 @@ def main(NetExport):
for i in SVCVITS.parameters():
i.requires_grad = False
test_hidden_unit = torch.rand(1, 10, 256)
test_pitch = torch.rand(1, 10)
test_mel2ph = torch.LongTensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]).unsqueeze(0)
test_uv = torch.ones(1, 10, dtype=torch.float32)
test_noise = torch.randn(1, 192, 10)
n_frame = 10
test_hidden_unit = torch.rand(1, n_frame, 256)
test_pitch = torch.rand(1, n_frame)
test_mel2ph = torch.arange(0, n_frame, dtype=torch.int64)[None] # torch.LongTensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]).unsqueeze(0)
test_uv = torch.ones(1, n_frame, dtype=torch.float32)
test_noise = torch.randn(1, 192, n_frame)
test_sid = torch.LongTensor([0])
input_names = ["c", "f0", "mel2ph", "uv", "noise", "sid"]
output_names = ["audio", ]


@ -25,13 +25,11 @@ if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--train_list", type=str, default="./filelists/train.txt", help="path to train list")
parser.add_argument("--val_list", type=str, default="./filelists/val.txt", help="path to val list")
parser.add_argument("--test_list", type=str, default="./filelists/test.txt", help="path to test list")
parser.add_argument("--source_dir", type=str, default="./dataset/44k", help="path to source dir")
args = parser.parse_args()
train = []
val = []
test = []
idx = 0
spk_dict = {}
spk_id = 0
@ -51,13 +49,11 @@ if __name__ == "__main__":
new_wavs.append(file)
wavs = new_wavs
shuffle(wavs)
train += wavs[2:-2]
train += wavs[2:]
val += wavs[:2]
test += wavs[-2:]
shuffle(train)
shuffle(val)
shuffle(test)
print("Writing", args.train_list)
with open(args.train_list, "w") as f:
@ -70,14 +66,10 @@ if __name__ == "__main__":
for fname in tqdm(val):
wavpath = fname
f.write(wavpath + "\n")
print("Writing", args.test_list)
with open(args.test_list, "w") as f:
for fname in tqdm(test):
wavpath = fname
f.write(wavpath + "\n")
config_template["spk"] = spk_dict
config_template["model"]["n_speakers"] = spk_id
print("Writing configs/config.json")
with open("configs/config.json", "w") as f:
json.dump(config_template, f, indent=2)


@ -7,10 +7,12 @@ from random import shuffle
import torch
from glob import glob
from tqdm import tqdm
from modules.mel_processing import spectrogram_torch
import utils
import logging
logging.getLogger('numba').setLevel(logging.WARNING)
logging.getLogger("numba").setLevel(logging.WARNING)
import librosa
import numpy as np
@ -24,16 +26,47 @@ def process_one(filename, hmodel):
wav, sr = librosa.load(filename, sr=sampling_rate)
soft_path = filename + ".soft.pt"
if not os.path.exists(soft_path):
devive = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
wav16k = librosa.resample(wav, orig_sr=sampling_rate, target_sr=16000)
wav16k = torch.from_numpy(wav16k).to(devive)
wav16k = torch.from_numpy(wav16k).to(device)
c = utils.get_hubert_content(hmodel, wav_16k_tensor=wav16k)
torch.save(c.cpu(), soft_path)
f0_path = filename + ".f0.npy"
if not os.path.exists(f0_path):
f0 = utils.compute_f0_dio(wav, sampling_rate=sampling_rate, hop_length=hop_length)
f0 = utils.compute_f0_dio(
wav, sampling_rate=sampling_rate, hop_length=hop_length
)
np.save(f0_path, f0)
spec_path = filename.replace(".wav", ".spec.pt")
if not os.path.exists(spec_path):
# Process spectrogram
# The following code can't be replaced by torch.FloatTensor(wav)
# because load_wav_to_torch returns a tensor that needs to be normalized
audio, sr = utils.load_wav_to_torch(filename)
if sr != hps.data.sampling_rate:
raise ValueError(
"{} SR doesn't match target {} SR".format(
sr, hps.data.sampling_rate
)
)
audio_norm = audio / hps.data.max_wav_value
audio_norm = audio_norm.unsqueeze(0)
spec = spectrogram_torch(
audio_norm,
hps.data.filter_length,
hps.data.sampling_rate,
hps.data.hop_length,
hps.data.win_length,
center=False,
)
spec = torch.squeeze(spec, 0)
torch.save(spec, spec_path)
def process_batch(filenames):
print("Loading hubert for content...")
@ -46,17 +79,23 @@ def process_batch(filenames):
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--in_dir", type=str, default="dataset/44k", help="path to input dir")
parser.add_argument(
"--in_dir", type=str, default="dataset/44k", help="path to input dir"
)
args = parser.parse_args()
filenames = glob(f'{args.in_dir}/*/*.wav', recursive=True) # [:10]
filenames = glob(f"{args.in_dir}/*/*.wav", recursive=True) # [:10]
shuffle(filenames)
multiprocessing.set_start_method('spawn',force=True)
multiprocessing.set_start_method("spawn", force=True)
num_processes = 1
chunk_size = int(math.ceil(len(filenames) / num_processes))
chunks = [filenames[i:i + chunk_size] for i in range(0, len(filenames), chunk_size)]
chunks = [
filenames[i : i + chunk_size] for i in range(0, len(filenames), chunk_size)
]
print([len(c) for c in chunks])
processes = [multiprocessing.Process(target=process_batch, args=(chunk,)) for chunk in chunks]
processes = [
multiprocessing.Process(target=process_batch, args=(chunk,)) for chunk in chunks
]
for p in processes:
p.start()


@ -16,3 +16,4 @@ onnxoptimizer
fairseq==0.12.2
librosa==0.8.1
tensorboard
tensorboardX


@ -2,8 +2,8 @@ librosa==0.9.2
fairseq==0.12.2
Flask==2.1.2
Flask_Cors==3.0.10
gradio==3.4.1
numpy==1.20.0
gradio
numpy
playsound==1.3.0
PyAudio==0.2.12
pydub==0.25.1
@ -19,3 +19,4 @@ praat-parselmouth
onnx
onnxsim
onnxoptimizer
tensorboardX


@ -1,22 +0,0 @@
from data_utils import TextAudioSpeakerLoader
import json
from tqdm import tqdm
from utils import HParams
config_path = 'configs/config.json'
with open(config_path, "r") as f:
data = f.read()
config = json.loads(data)
hps = HParams(**config)
train_dataset = TextAudioSpeakerLoader("filelists/train.txt", hps)
test_dataset = TextAudioSpeakerLoader("filelists/test.txt", hps)
eval_dataset = TextAudioSpeakerLoader("filelists/val.txt", hps)
for _ in tqdm(train_dataset):
pass
for _ in tqdm(eval_dataset):
pass
for _ in tqdm(test_dataset):
pass


@ -3,6 +3,8 @@ import multiprocessing
import time
logging.getLogger('matplotlib').setLevel(logging.WARNING)
logging.getLogger('numba').setLevel(logging.WARNING)
import os
import json
import argparse

101
webUI.py Normal file

@ -0,0 +1,101 @@
import io
import os
# os.system("wget -P cvec/ https://huggingface.co/spaces/innnky/nanami/resolve/main/checkpoint_best_legacy_500.pt")
import gradio as gr
import librosa
import numpy as np
import soundfile
from inference.infer_tool import Svc
import logging
import torch
logging.getLogger('numba').setLevel(logging.WARNING)
logging.getLogger('markdown_it').setLevel(logging.WARNING)
logging.getLogger('urllib3').setLevel(logging.WARNING)
logging.getLogger('matplotlib').setLevel(logging.WARNING)
logging.getLogger('multipart').setLevel(logging.WARNING)
model = None
spk = None
cuda = []
if torch.cuda.is_available():
for i in range(torch.cuda.device_count()):
cuda.append("cuda:{}".format(i))
def vc_fn(sid, input_audio, vc_transform, auto_f0,cluster_ratio, slice_db, noise_scale,pad_seconds,cl_num,lg_num,lgr_num):
global model
try:
if input_audio is None:
return "You need to upload an audio", None
if model is None:
return "You need to upload an model", None
sampling_rate, audio = input_audio
# print(audio.shape,sampling_rate)
audio = (audio / np.iinfo(audio.dtype).max).astype(np.float32)
if len(audio.shape) > 1:
audio = librosa.to_mono(audio.transpose(1, 0))
temp_path = "temp.wav"
soundfile.write(temp_path, audio, sampling_rate, format="wav")
_audio = model.slice_inference(temp_path, sid, vc_transform, slice_db, cluster_ratio, auto_f0, noise_scale,pad_seconds,cl_num,lg_num,lgr_num)
model.clear_empty()
os.remove(temp_path)
return "Success", (model.target_sample, _audio)
except Exception as e:
return "异常信息:"+str(e)+"\n请排障后重试",None
app = gr.Blocks()
with app:
with gr.Tabs():
with gr.TabItem("Sovits4.0"):
gr.Markdown(value="""
Sovits4.0 WebUI
""")
gr.Markdown(value="""
<font size=3>下面是模型文件选择</font>
""")
model_path = gr.File(label="模型文件")
gr.Markdown(value="""
<font size=3>下面是配置文件选择</font>
""")
config_path = gr.File(label="配置文件")
gr.Markdown(value="""
<font size=3>下面是聚类模型文件选择没有可以不填</font>
""")
cluster_model_path = gr.File(label="聚类模型文件")
device = gr.Dropdown(label="推理设备默认为自动选择cpu和gpu",choices=["Auto",*cuda,"cpu"],value="Auto")
gr.Markdown(value="""
<font size=3>全部上传完毕后(全部文件模块显示download),点击模型解析进行解析</font>
""")
model_analysis_button = gr.Button(value="模型解析")
sid = gr.Dropdown(label="音色(说话人)")
sid_output = gr.Textbox(label="Output Message")
vc_input3 = gr.Audio(label="上传音频")
vc_transform = gr.Number(label="变调整数可以正负半音数量升高八度就是12", value=0)
cluster_ratio = gr.Number(label="聚类模型混合比例0-1之间默认为0不启用聚类能提升音色相似度但会导致咬字下降如果使用建议0.5左右)", value=0)
auto_f0 = gr.Checkbox(label="自动f0预测配合聚类模型f0预测效果更好,会导致变调功能失效(仅限转换语音,歌声不要勾选此项会究极跑调)", value=False)
slice_db = gr.Number(label="切片阈值", value=-40)
noise_scale = gr.Number(label="noise_scale 建议不要动,会影响音质,玄学参数", value=0.4)
cl_num = gr.Number(label="音频自动切片0为不切片单位为秒/s", value=0)
pad_seconds = gr.Number(label="推理音频pad秒数由于未知原因开头结尾会有异响pad一小段静音段后就不会出现", value=0.5)
lg_num = gr.Number(label="两端音频切片的交叉淡入长度如果自动切片后出现人声不连贯可调整该数值如果连贯建议采用默认值0注意该设置会影响推理速度单位为秒/s", value=0)
lgr_num = gr.Number(label="自动音频切片后需要舍弃每段切片的头尾。该参数设置交叉长度保留的比例范围0-1,左开右闭", value=0.75,interactive=True)
vc_submit = gr.Button("转换", variant="primary")
vc_output1 = gr.Textbox(label="Output Message")
vc_output2 = gr.Audio(label="Output Audio")
def modelAnalysis(model_path,config_path,cluster_model_path,device):
try:
global model
model = Svc(model_path.name, config_path.name,device=device if device!="Auto" else None,cluster_model_path= cluster_model_path.name if cluster_model_path!=None else "")
spks = list(model.spk2id.keys())
device_name = torch.cuda.get_device_properties(model.dev).name if "cuda" in str(model.dev) else str(model.dev)
return sid.update(choices = spks,value=spks[0]),"ok,模型被加载到了设备{}之上".format(device_name)
except Exception as e:
return "","异常信息:"+str(e)+"\n请排障后重试"
vc_submit.click(vc_fn, [sid, vc_input3, vc_transform,auto_f0,cluster_ratio, slice_db, noise_scale,pad_seconds,cl_num,lg_num,lgr_num], [vc_output1, vc_output2])
model_analysis_button.click(modelAnalysis,[model_path,config_path,cluster_model_path,device],[sid,sid_output])
app.launch()