zwa73 2023-03-31 01:29:25 +08:00
commit 0098c87d1e
18 changed files with 643 additions and 113 deletions

124
.github/ISSUE_TEMPLATE/ask_for_help.yaml vendored Normal file

@ -0,0 +1,124 @@
name: 请求帮助
description: 遇到了无法自行解决的错误
title: '[Help]: '
labels: [ "help wanted" ]
body:
- type: markdown
attributes:
value: |
#### 提问前请先自己去尝试解决,比如查看[本仓库wiki中的Quick solution](https://github.com/svc-develop-team/so-vits-svc/wiki/Quick-solution),也可以借助chatgpt或一些搜索引擎(谷歌/必应/New Bing/StackOverflow等等)。如果实在无法自己解决再发issue,在提issue之前请先了解《[提问的智慧](https://github.com/ryanhanwu/How-To-Ask-Questions-The-Smart-Way/blob/main/README-zh_CN.md)》。
---
### 什么样的issue会被直接close
1. 伸手党
2. 一键包/环境包相关
3. 提供的信息不全
4. 低级的如缺少依赖而导致无法运行的问题
5. 所用的数据集是无授权数据集(游戏角色/二次元人物暂不归为此类,但是训练时候也要小心谨慎。如果能联系到官方,必须先和官方联系并核实清楚)
---
- type: checkboxes
id: Clause
attributes:
label: 请勾选下方的确认框。
options:
- label: "我已仔细阅读[README.md](https://github.com/svc-develop-team/so-vits-svc/blob/4.0/README_zh_CN.md)和[wiki中的Quick solution](https://github.com/svc-develop-team/so-vits-svc/wiki/Quick-solution)。"
required: true
- label: "我已通过各种搜索引擎排查问题,我要提出的问题并不常见。"
required: true
- label: "我未在使用由第三方用户提供的一键包/环境包。"
required: true
- type: markdown
attributes:
value: |
# 请根据实际使用环境填写以下信息
- type: input
id: System
attributes:
label: 系统平台版本号
description: Windows执行`winver` | Linux执行`uname -a`
validations:
required: true
- type: input
id: GPU
attributes:
label: GPU 型号
description: 执行`nvidia-smi`
validations:
required: true
- type: input
id: PythonVersion
attributes:
label: Python版本
description: 执行`python -V`
validations:
required: true
- type: input
id: PyTorchVersion
attributes:
label: PyTorch版本
description: 执行`pip show torch`
validations:
required: true
- type: dropdown
id: Branch
attributes:
label: sovits分支
options:
- 4.0(默认)
- 4.0-v2
- 3.0-32k
- 3.0-48k
validations:
required: true
- type: input
id: DatasetSource
attributes:
label: 数据集来源(用于判断数据集质量)
description: UVR处理过的vtb直播音频、录音棚录制
validations:
required: true
- type: input
id: WhereOccurs
attributes:
label: 出现问题的环节或执行的命令
description: 如:预处理、训练、`python preprocess_hubert_f0.py`
validations:
required: true
- type: textarea
id: Description
attributes:
label: 问题描述
description: 在这里描述自己的问题,越详细越好
validations:
required: true
- type: textarea
id: Log
attributes:
label: 日志
description: 将从执行命令到执行完毕输出的所有信息(包括你所执行的命令)粘贴到[pastebin.com](https://pastebin.com/)并把剪贴板链接贴到这里
render: python
validations:
required: true
- type: textarea
id: ValidOneClick
attributes:
label: 截图`so-vits-svc`、`logs/44k`文件夹并粘贴到此处
validations:
required: true
- type: textarea
id: Supplementary
attributes:
label: 补充说明


@ -0,0 +1,124 @@
name: Ask for help
description: Encountered an error that cannot be resolved on your own
title: '[Help]: '
labels: [ "help wanted" ]
body:
- type: markdown
attributes:
value: |
#### Please try to solve the problem yourself before asking for help. Start with the *[Quick solution in wiki](https://github.com/svc-develop-team/so-vits-svc/wiki/Quick-solution)*, and use ChatGPT or search engines such as Google, Bing, New Bing, or StackOverflow until you are sure you cannot solve it on your own. Before raising an issue, please also read *[How To Ask Questions The Smart Way](http://www.catb.org/~esr/faqs/smart-questions.html)*.
---
### What kind of issue will be closed immediately
1. Beggars or Free Riders
2. One-click package / environment package (i.e. not installing dependencies via `pip install -r requirements.txt`)
3. Incomplete information
4. Low-level issues such as a missing dependency package
5. Using an unlicensed dataset (game characters / anime characters are temporarily not included in this category, but you still need to be careful during training. If you can contact the rights holder, you must contact them and verify it first.)
---
- type: checkboxes
id: Clause
attributes:
label: Please check the checkboxes below.
options:
- label: "I have read *[README.md](https://github.com/svc-develop-team/so-vits-svc/blob/4.0/README.md)* and *[Quick solution in wiki](https://github.com/svc-develop-team/so-vits-svc/wiki/Quick-solution)* carefully."
required: true
- label: "I have been troubleshooting issues through various search engines. The questions I want to ask are not common."
required: true
- label: "I am NOT using one click package / environment package."
required: true
- type: markdown
attributes:
value: |
# Please fill in the following information according to your actual environment
- type: input
id: System
attributes:
label: OS version
description: Windows run `winver` | Linux run `uname -a`
validations:
required: true
- type: input
id: GPU
attributes:
label: GPU
description: Run `nvidia-smi`
validations:
required: true
- type: input
id: PythonVersion
attributes:
label: Python version
description: Run `python -V`
validations:
required: true
- type: input
id: PyTorchVersion
attributes:
label: PyTorch version
description: Run `pip show torch`
validations:
required: true
- type: dropdown
id: Branch
attributes:
label: Branch of sovits
options:
- 4.0(Default)
- 4.0-v2
- 3.0-32k
- 3.0-48k
validations:
required: true
- type: input
id: DatasetSource
attributes:
label: Dataset source (Used to judge the dataset quality)
description: Such as UVR-processed streaming audio / recorded in a recording studio
validations:
required: true
- type: input
id: WhereOccurs
attributes:
label: Where the problem occurs or the command you executed
description: Such as Preprocessing / Training / `python preprocess_hubert_f0.py`
validations:
required: true
- type: textarea
id: Description
attributes:
label: Problem description
description: Describe your problem here, the more detailed the better.
validations:
required: true
- type: textarea
id: Log
attributes:
label: Log
description: All information output from the command you executed, from start to finish (including the command itself)
render: python
validations:
required: true
- type: textarea
id: ValidOneClick
attributes:
label: Take screenshots of the `so-vits-svc` and `logs/44k` folders and paste them here
validations:
required: true
- type: textarea
id: Supplementary
attributes:
label: Supplementary description

5
.github/ISSUE_TEMPLATE/config.yml vendored Normal file

@ -0,0 +1,5 @@
blank_issues_enabled: false
contact_links:
- name: 讨论区 / Discussions
url: https://github.com/svc-develop-team/so-vits-svc/discussions
about: 简单的询问/讨论请转至讨论区或发起一个低优先级的Default issue / For simple inquiries / discussions, please go to the discussions or raise a low priority Default issue

7
.github/ISSUE_TEMPLATE/default.md vendored Normal file

@ -0,0 +1,7 @@
---
name: Default issue
about: 如果模板中没有你想发起的issue类型,可以选择此项,但这个issue也许会获得较低的处理优先级 / If there is no issue type that matches what you want to raise, you can start with this one, but it may be handled with a lower priority.
title: ''
labels: 'not urgent'
assignees: ''
---

3
.github/no-response.yml vendored Normal file

@ -0,0 +1,3 @@
daysUntilClose: 7
responseRequiredLabel: waiting response
closeComment: 由于缺少必要信息且没有回应,该 issue 已被自动关闭,如有需要补充的内容请回复并自行重新打开该 issue


@ -2,24 +2,30 @@
[**English**](./README.md) | [**中文简体**](./README_zh_CN.md)
## Terms of Use
#### ✨ A fork with a greatly improved interface: [34j/so-vits-svc-fork](https://github.com/34j/so-vits-svc-fork)
1. This project is established for academic exchange purposes only and is intended for communication and learning purposes. It is not intended for production environments. Please solve the authorization problem of the dataset on your own. You shall be solely responsible for any problems caused by the use of non-authorized datasets for training and all consequences thereof.
#### ✨ A client supports real-time conversion: [w-okada/voice-changer](https://github.com/w-okada/voice-changer)
## 📏 Terms of Use
# Warning: Please solve the authorization problem of the dataset on your own. You shall be solely responsible for any problems caused by the use of non-authorized datasets for training and all consequences thereof. The repository and its maintainer, svc develop team, have nothing to do with those consequences!
1. This project is established for academic exchange purposes only and is intended for communication and learning purposes. It is not intended for production environments.
2. Any videos based on sovits that are published on video platforms must clearly state in the description that they are used for voice conversion, and must specify the input source of the voice or audio. For example, if you use a video or audio published by others and separate the vocals as the input source, you must provide a clear link to the original video or music; if you use your own voice, or a voice synthesized by other commercial vocal synthesis software, as the input source, you must also state this in the description.
3. You shall be solely responsible for any infringement problems caused by the input source. When using other commercial vocal synthesis software as input source, please ensure that you comply with the terms of use of the software. Note that many vocal synthesis engines clearly state in their terms of use that they cannot be used for input source conversion.
4. Continued use of this project is deemed as agreement to the terms stated in this repository's README. This README has fulfilled its duty to advise and is not responsible for any problems that may arise later.
5. If you distribute this repository's code or publish any results produced by this project publicly (including but not limited to video sharing platforms), please indicate the original author and code source (this repository).
6. If you use this project for any other plan, please contact and inform the author of this repository in advance. Thank you very much.
## Update
## 🆕 Update!
> Updated the 4.0-v2 model, the entire process is the same as 4.0. Compared to 4.0, there is some improvement in certain scenarios, but there are also some cases where it has regressed. Please refer to the [4.0-v2 branch](https://github.com/svc-develop-team/so-vits-svc/tree/4.0-v2) for more information.
## Model Introduction
## 📝 Model Introduction
The singing voice conversion model uses SoftVC content encoder to extract source audio speech features, then the vectors are directly fed into VITS instead of converting to a text based intermediate; thus the pitch and intonations are conserved. Additionally, the vocoder is changed to [NSF HiFiGAN](https://github.com/openvpi/DiffSinger/tree/refactor/modules/nsf_hifigan) to solve the problem of sound interruption.
### 4.0 Version Update Content
### 🆕 4.0 Version Update Content
- Feature input is changed to [Content Vec](https://github.com/auspicious3000/contentvec)
- The sampling rate is unified to use 44100Hz
@ -29,7 +35,11 @@ The singing voice conversion model uses SoftVC content encoder to extract source
- Added option 1: automatic pitch prediction for vc mode, which means you don't need to manually enter the pitch key when converting speech, and male and female pitches are transposed automatically. However, this mode will cause pitch shift when converting songs.
- Added option 2: reduce timbre leakage through k-means clustering scheme, making the timbre more similar to the target timbre.
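The idea behind option 2 can be sketched as follows: each frame's content vector is snapped to its nearest k-means cluster centre and linearly blended with the raw vector, trading timbre similarity against articulation. The snippet below is only a rough illustration using a generic scikit-learn `KMeans`; it is not the repository's own cluster module, and the shapes and ratio value are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustration only: blend raw content features with their nearest k-means centre
# to suppress residual speaker (timbre) information, as the clustering option describes.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 256))               # (frames, feature_dim) content vectors

kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(features)  # stand-in for the trained cluster model
centres = kmeans.cluster_centers_[kmeans.predict(features)]              # nearest centre per frame

cluster_infer_ratio = 0.5                             # 0 = raw features, 1 = fully clustered
blended = cluster_infer_ratio * centres + (1 - cluster_infer_ratio) * features
```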
## Pre-trained Model Files
## 💬 About Python Version
After conducting tests, we believe that the project runs stably on Python version 3.8.9.
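If you want to confirm the interpreter you are about to use, a small optional check (nothing repository-specific) is:

```python
import sys

# Optional sanity check: the project is reported to run stably on Python 3.8.9.
if sys.version_info[:2] != (3, 8):
    print(f"Note: tested on Python 3.8.9, you are running {sys.version.split()[0]}")
```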
## 📥 Pre-trained Model Files
#### **Required**
@ -51,11 +61,11 @@ Get them from svc-develop-team(TBD) or anywhere else.
Although the pretrained model generally does not cause copyright problems, please still pay attention to it, for example by asking the author in advance, or checking that the author has clearly indicated the permitted uses in the model description.
## Dataset Preparation
## 📊 Dataset Preparation
Simply place the dataset in the `dataset_raw` directory with the following file structure.
```shell
```
dataset_raw
├───speaker0
│ ├───xxx1-xxx1.wav
@ -67,15 +77,25 @@ dataset_raw
└───xxx7-xxx007.wav
```
## Preprocessing
You can customize the speaker name.
1. Resample to 44100hz
```
dataset_raw
└───suijiSUI
├───1.wav
├───...
└───25788785-20221210-200143-856_01_(Vocals)_0_0.wav
```
## 🛠️ Preprocessing
1. Resample to 44100Hz and mono
```shell
python resample.py
```
2. Automatically split the dataset into training, validation, and test sets, and generate configuration files
2. Automatically split the dataset into training and validation sets, and generate configuration files
```shell
python preprocess_flist_config.py
@ -89,7 +109,7 @@ python preprocess_hubert_f0.py
After completing the above steps, the dataset directory will contain the preprocessed data, and the dataset_raw folder can be deleted.
## Training
## 🏋️‍♀️ Training
```shell
python train.py -c configs/config.json -m 44k
@ -97,7 +117,7 @@ python train.py -c configs/config.json -m 44k
Note: During training, old models are automatically cleared and only the latest three are kept. If you want to guard against overfitting, manually back up the model checkpoints, or set `keep_ckpts` in the configuration file to 0 to never clear them.
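For example, assuming `keep_ckpts` sits under the `train` section of `configs/config.json` as in the default template (adjust the key if your config differs), it can be switched off programmatically; editing the JSON by hand works just as well:

```python
import json

# Assumption: keep_ckpts lives under the "train" section of configs/config.json.
with open("configs/config.json", "r", encoding="utf-8") as f:
    cfg = json.load(f)

cfg.setdefault("train", {})["keep_ckpts"] = 0   # 0 = never delete old checkpoints
with open("configs/config.json", "w", encoding="utf-8") as f:
    json.dump(cfg, f, indent=2)
```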
## Inference
## 🤖 Inference
Use [inference_main.py](https://github.com/svc-develop-team/so-vits-svc/blob/4.0/inference_main.py)
@ -109,27 +129,25 @@ python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "
```
Required parameters:
- -m, --model_path: path to the model.
- -c, --config_path: path to the configuration file.
- -n, --clean_names: a list of wav file names located in the raw folder.
- -t, --trans: pitch adjustment, supports positive and negative (semitone) values.
- -s, --spk_list: target speaker name for synthesis.
- -cl, --clip: voice auto-split; set to 0 to disable; unit: seconds.
Optional parameters: see the next section
- -a, --auto_predict_f0: automatic pitch prediction for voice conversion, do not enable this when converting songs as it can cause serious pitch issues.
- -cm, --cluster_model_path: path to the clustering model, fill in any value if clustering is not trained.
- -cr, --cluster_infer_ratio: proportion of the clustering solution, range 0-1, fill in 0 if the clustering model is not trained.
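The same options can also be driven from Python via `inference/infer_tool.py`. The sketch below follows the `Svc.slice_inference` signature added in this commit; the constructor arguments mirror how `webUI.py` calls it, and any other details are assumptions rather than a documented API.

```python
import soundfile
from inference.infer_tool import Svc

# Rough sketch only: keyword names follow slice_inference as added in this commit.
model = Svc("logs/44k/G_30400.pth", "configs/config.json",
            cluster_model_path="")              # leave empty if no cluster model was trained

audio = model.slice_inference(
    "raw/君の知らない物語-src.wav",             # a wav file placed in the raw folder
    spk="nen", tran=0,                          # target speaker and pitch shift in semitones
    slice_db=-40, cluster_infer_ratio=0,
    auto_predict_f0=False,                      # keep False for songs to avoid pitch issues
    noice_scale=0.4, pad_seconds=0.5,
    clip_seconds=30, lg_num=1, lgr_num=0.75,    # auto-clip long segments with a 1 s crossfade
)
soundfile.write("results/output.flac", audio, model.target_sample)
```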
## Optional Settings
## 🤔 Optional Settings
If the results from the previous section are satisfactory, or if you didn't understand what is being discussed in the following section, you can skip it, and it won't affect the model usage. (These optional settings have a relatively small impact, and they may have some effect on certain specific data, but in most cases, the difference may not be noticeable.)
### Automatic f0 prediction
During the 4.0 model training, an f0 predictor is also trained, which can be used for automatic pitch prediction during voice conversion. However, if the effect is not good, manual pitch prediction can be used instead. But please do not enable this feature when converting singing voice as it may cause serious pitch shifting!
- Set "auto_predict_f0" to true in inference_main.
### Cluster-based timbre leakage control
@ -149,7 +167,7 @@ The existing steps before clustering do not need to be changed. All you need to
#### [23/03/16] No longer need to download hubert manually
## Exporting to Onnx
## 📤 Exporting to Onnx
Use [onnx_export.py](https://github.com/svc-develop-team/so-vits-svc/blob/4.0/onnx_export.py)
@ -166,7 +184,27 @@ Use [onnx_export.py](https://github.com/svc-develop-team/so-vits-svc/blob/4.0/on
Note: For Hubert Onnx models, please use the models provided by MoeSS. Currently, they cannot be exported on their own (Hubert in fairseq has many unsupported operators and things involving constants that can cause errors or result in problems with the input/output shape and results when exported.) [Hubert4.0](https://huggingface.co/NaruseMioShirakana/MoeSS-SUBModel)
## Some legal provisions for reference
## ☀️ Previous contributors
For some reason the original author deleted the original repository. Due to an oversight by the organization members, all files were re-uploaded directly when this repository was rebuilt, which wiped the contributor list; a list of previous contributors is therefore added back to README.md.
*Some members are not listed, in accordance with their personal wishes.*
<table>
<tr>
<td align="center"><a href="https://github.com/MistEO"><img src="https://avatars.githubusercontent.com/u/18511905?v=4" width="100px;" alt=""/><br /><sub><b>MistEO</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/XiaoMiku01"><img src="https://avatars.githubusercontent.com/u/54094119?v=4" width="100px;" alt=""/><br /><sub><b>XiaoMiku01</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/ForsakenRei"><img src="https://avatars.githubusercontent.com/u/23041178?v=4" width="100px;" alt=""/><br /><sub><b>しぐれ</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/TomoGaSukunai"><img src="https://avatars.githubusercontent.com/u/25863522?v=4" width="100px;" alt=""/><br /><sub><b>TomoGaSukunai</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/Plachtaa"><img src="https://avatars.githubusercontent.com/u/112609742?v=4" width="100px;" alt=""/><br /><sub><b>Plachtaa</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/zdxiaoda"><img src="https://avatars.githubusercontent.com/u/45501959?v=4" width="100px;" alt=""/><br /><sub><b>zd小达</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/Archivoice"><img src="https://avatars.githubusercontent.com/u/107520869?v=4" width="100px;" alt=""/><br /><sub><b>凍聲響世</b></sub></a><br /></td>
</tr>
</table>
## 📚 Some legal provisions for reference
#### Any country, region, organization, or individual using this project must comply with the following laws.
#### 《民法典》
@ -184,3 +222,14 @@ Note: For Hubert Onnx models, please use the models provided by MoeSS. Currently
【作品侵害名誉权】行为人发表的文学、艺术作品以真人真事或者特定人为描述对象,含有侮辱、诽谤内容,侵害他人名誉权的,受害人有权依法请求该行为人承担民事责任。
行为人发表的文学、艺术作品不以特定人为描述对象,仅其中的情节与该特定人的情况相似的,不承担民事责任。
#### 《[中华人民共和国宪法](http://www.gov.cn/guoqing/2018-03/22/content_5276318.htm)》
#### 《[中华人民共和国刑法](http://gongbao.court.gov.cn/Details/f8e30d0689b23f57bfc782d21035c3.html?sw=%E4%B8%AD%E5%8D%8E%E4%BA%BA%E6%B0%91%E5%85%B1%E5%92%8C%E5%9B%BD%E5%88%91%E6%B3%95)》
#### 《[中华人民共和国民法典](http://gongbao.court.gov.cn/Details/51eb6750b8361f79be8f90d09bc202.html)》
## 💪 Thanks to all contributors for their efforts
<a href="https://github.com/svc-develop-team/so-vits-svc/graphs/contributors" target="_blank">
<img src="https://contrib.rocks/image?repo=svc-develop-team/so-vits-svc" />
</a>


@ -2,24 +2,30 @@
[**English**](./README.md) | [**中文简体**](./README_zh_CN.md)
## 使用规约
#### ✨ 改善了交互的一个分支推荐:[34j/so-vits-svc-fork](https://github.com/34j/so-vits-svc-fork)
1. 本项目是基于学术交流目的建立,仅供交流与学习使用,并非为生产环境准备,请自行解决数据集的授权问题,任何由于使用非授权数据集进行训练造成的问题,需自行承担全部责任和一切后果!
#### ✨ 支持实时转换的一个客户端:[w-okada/voice-changer](https://github.com/w-okada/voice-changer)
## 📏 使用规约
# Warning请自行解决数据集授权问题禁止使用非授权数据集进行训练任何由于使用非授权数据集进行训练造成的问题需自行承担全部责任和后果与仓库、仓库维护者、svc develop team 无关!
1. 本项目是基于学术交流目的建立,仅供交流与学习使用,并非为生产环境准备。
2. 任何发布到视频平台的基于 sovits 制作的视频,都必须要在简介明确指明用于变声器转换的输入源歌声、音频,例如:使用他人发布的视频 / 音频,通过分离的人声作为输入源进行转换的,必须要给出明确的原视频、音乐链接;若使用是自己的人声,或是使用其他歌声合成引擎合成的声音作为输入源进行转换的,也必须在简介加以说明。
3. 由输入源造成的侵权问题需自行承担全部责任和一切后果。使用其他商用歌声合成软件作为输入源时,请确保遵守该软件的使用条例,注意,许多歌声合成引擎使用条例中明确指明不可用于输入源进行转换!
4. 继续使用视为已同意本仓库 README 所述相关条例,本仓库 README 已进行劝导义务,不对后续可能存在问题负责。
5. 如将本仓库代码二次分发,或将由此项目产出的任何结果公开发表 (包括但不限于视频网站投稿),请注明原作者及代码来源 (此仓库)。
6. 如果将此项目用于任何其他企划,请提前联系并告知本仓库作者,十分感谢。
## update
## 🆕 Update!
> 更新了4.0-v2模型全部流程同4.0相比4.0在部分场景下有一定提升,但也有些情况有退步,具体可移步[4.0-v2分支](https://github.com/svc-develop-team/so-vits-svc/tree/4.0-v2)
## 模型简介
## 📝 模型简介
歌声音色转换模型通过SoftVC内容编码器提取源音频语音特征与F0同时输入VITS替换原本的文本输入达到歌声转换的效果。同时更换声码器为 [NSF HiFiGAN](https://github.com/openvpi/DiffSinger/tree/refactor/modules/nsf_hifigan) 解决断音问题
### 4.0版本更新内容
### 🆕 4.0 版本更新内容
+ 特征输入更换为 [Content Vec](https://github.com/auspicious3000/contentvec)
+ 采样率统一使用44100hz
@ -29,7 +35,11 @@
+ 增加了可选项 1vc模式自动预测音高f0,即转换语音时不需要手动输入变调key男女声的调能自动转换但仅限语音转换该模式转换歌声会跑调
+ 增加了可选项 2通过kmeans聚类方案减小音色泄漏即使得音色更加像目标音色
## 预先下载的模型文件
## 💬 关于 Python 版本问题
我们在进行测试后,认为 Python 3.8.9 版本能够稳定地运行该项目
## 📥 预先下载的模型文件
#### **必须项**
@ -51,11 +61,11 @@ http://obs.cstcloud.cn/share/obs/sankagenkeshi/checkpoint_best_legacy_500.pt
虽然底模一般不会引起什么版权问题,但还是请注意一下,比如事先询问作者,又或者作者在模型描述中明确写明了可行的用途
## 数据集准备
## 📊 数据集准备
仅需要以以下文件结构将数据集放入dataset_raw目录即可
```shell
```
dataset_raw
├───speaker0
│ ├───xxx1-xxx1.wav
@ -67,15 +77,25 @@ dataset_raw
└───xxx7-xxx007.wav
```
## 数据预处理
可以自定义说话人名称
1. 重采样至 44100hz
```
dataset_raw
└───suijiSUI
├───1.wav
├───...
└───25788785-20221210-200143-856_01_(Vocals)_0_0.wav
```
## 🛠️ 数据预处理
1. 重采样至44100Hz单声道
```shell
python resample.py
```
2. 自动划分训练集 验证集 测试集 以及自动生成配置文件
2. 自动划分训练集、验证集,以及自动生成配置文件
```shell
python preprocess_flist_config.py
@ -89,14 +109,15 @@ python preprocess_hubert_f0.py
执行完以上步骤后 dataset 目录便是预处理完成的数据可以删除dataset_raw文件夹了
## 训练
## 🏋️‍♀️ 训练
```shell
python train.py -c configs/config.json -m 44k
```
训练时会自动清除老的模型只保留最新3个模型如果想防止过拟合需要自己手动备份模型记录点,或修改配置文件keep_ckpts 0为永不清除
## 推理
## 🤖 推理
使用 [inference_main.py](inference_main.py)
@ -113,13 +134,14 @@ python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "
+ -n, --clean_nameswav 文件名列表,放在 raw 文件夹下。
+ -t, --trans音高调整支持正负半音
+ -s, --spk_list合成目标说话人名称。
+ -cl, --clip音频自动切片0为不切片单位为秒/s。
可选项部分:见下一节
+ -a, --auto_predict_f0语音转换自动预测音高转换歌声时不要打开这个会严重跑调。
+ -cm, --cluster_model_path聚类模型路径如果没有训练聚类则随便填。
+ -cr, --cluster_infer_ratio聚类方案占比范围 0-1若没有训练聚类模型则填 0 即可。
## 可选项
## 🤔 可选项
如果前面的效果已经满意,或者没看明白下面在讲啥,那后面的内容都可以忽略,不影响模型使用(这些可选项影响比较小,可能在某些特定数据上有点效果,但大部分情况似乎都感知不太明显)
@ -130,8 +152,7 @@ python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "
### 聚类音色泄漏控制
介绍:聚类方案可以减小音色泄漏,使得模型训练出来更像目标的音色(但其实不是特别明显),但是单纯的聚类方案会降低模型的咬字(会口齿不清)(这个很明显),本模型采用了融合的方式,
可以线性控制聚类方案与非聚类方案的占比,也就是可以手动在"像目标音色" 和 "咬字清晰" 之间调整比例,找到合适的折中点。
介绍:聚类方案可以减小音色泄漏,使得模型训练出来更像目标的音色(但其实不是特别明显),但是单纯的聚类方案会降低模型的咬字(会口齿不清)(这个很明显),本模型采用了融合的方式,可以线性控制聚类方案与非聚类方案的占比,也就是可以手动在"像目标音色" 和 "咬字清晰" 之间调整比例,找到合适的折中点。
使用聚类前面的已有步骤不用进行任何的变动,只需要额外训练一个聚类模型,虽然效果比较有限,但训练成本也比较低
@ -146,7 +167,7 @@ python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "
#### [23/03/16] 不再需要手动下载hubert
## Onnx导出
## 📤 Onnx导出
使用 [onnx_export.py](onnx_export.py)
+ 新建文件夹:`checkpoints` 并打开
@ -163,21 +184,49 @@ python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "
+ 注意Hubert Onnx模型请使用MoeSS提供的模型目前无法自行导出fairseq中Hubert有不少onnx不支持的算子和涉及到常量的东西在导出时会报错或者导出的模型输入输出shape和结果都有问题
[Hubert4.0](https://huggingface.co/NaruseMioShirakana/MoeSS-SUBModel)
## 一些法律条例参考
## ☀️ 旧贡献者
因为某些原因原作者进行了删库处理本仓库重建之初由于组织成员疏忽直接重新上传了所有文件导致以前的contributors全部木大现在在README里重新添加一个旧贡献者列表
*某些成员已根据其个人意愿不将其列出*
<table>
<tr>
<td align="center"><a href="https://github.com/MistEO"><img src="https://avatars.githubusercontent.com/u/18511905?v=4" width="100px;" alt=""/><br /><sub><b>MistEO</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/XiaoMiku01"><img src="https://avatars.githubusercontent.com/u/54094119?v=4" width="100px;" alt=""/><br /><sub><b>XiaoMiku01</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/ForsakenRei"><img src="https://avatars.githubusercontent.com/u/23041178?v=4" width="100px;" alt=""/><br /><sub><b>しぐれ</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/TomoGaSukunai"><img src="https://avatars.githubusercontent.com/u/25863522?v=4" width="100px;" alt=""/><br /><sub><b>TomoGaSukunai</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/Plachtaa"><img src="https://avatars.githubusercontent.com/u/112609742?v=4" width="100px;" alt=""/><br /><sub><b>Plachtaa</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/zdxiaoda"><img src="https://avatars.githubusercontent.com/u/45501959?v=4" width="100px;" alt=""/><br /><sub><b>zd小达</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/Archivoice"><img src="https://avatars.githubusercontent.com/u/107520869?v=4" width="100px;" alt=""/><br /><sub><b>凍聲響世</b></sub></a><br /></td>
</tr>
</table>
## 📚 一些法律条例参考
#### 任何国家,地区,组织和个人使用此项目必须遵守以下法律
#### 《民法典》
##### 第一千零一十九条
##### 第一千零一十九条
任何组织或者个人不得以丑化、污损,或者利用信息技术手段伪造等方式侵害他人的肖像权。未经肖像权人同意,不得制作、使用、公开肖像权人的肖像,但是法律另有规定的除外。
未经肖像权人同意,肖像作品权利人不得以发表、复制、发行、出租、展览等方式使用或者公开肖像权人的肖像。
对自然人声音的保护,参照适用肖像权保护的有关规定。
任何组织或者个人不得以丑化、污损,或者利用信息技术手段伪造等方式侵害他人的肖像权。未经肖像权人同意,不得制作、使用、公开肖像权人的肖像,但是法律另有规定的除外。 未经肖像权人同意,肖像作品权利人不得以发表、复制、发行、出租、展览等方式使用或者公开肖像权人的肖像。 对自然人声音的保护,参照适用肖像权保护的有关规定。
##### 第一千零二十四条
##### 第一千零二十四条
【名誉权】民事主体享有名誉权。任何组织或者个人不得以侮辱、诽谤等方式侵害他人的名誉权。
【名誉权】民事主体享有名誉权。任何组织或者个人不得以侮辱、诽谤等方式侵害他人的名誉权。
##### 第一千零二十七条
##### 第一千零二十七条
【作品侵害名誉权】行为人发表的文学、艺术作品以真人真事或者特定人为描述对象,含有侮辱、诽谤内容,侵害他人名誉权的,受害人有权依法请求该行为人承担民事责任。
行为人发表的文学、艺术作品不以特定人为描述对象,仅其中的情节与该特定人的情况相似的,不承担民事责任。
【作品侵害名誉权】行为人发表的文学、艺术作品以真人真事或者特定人为描述对象,含有侮辱、诽谤内容,侵害他人名誉权的,受害人有权依法请求该行为人承担民事责任。 行为人发表的文学、艺术作品不以特定人为描述对象,仅其中的情节与该特定人的情况相似的,不承担民事责任。
#### 《[中华人民共和国宪法](http://www.gov.cn/guoqing/2018-03/22/content_5276318.htm)》
#### 《[中华人民共和国刑法](http://gongbao.court.gov.cn/Details/f8e30d0689b23f57bfc782d21035c3.html?sw=中华人民共和国刑法)》
#### 《[中华人民共和国民法典](http://gongbao.court.gov.cn/Details/51eb6750b8361f79be8f90d09bc202.html)》
## 💪 感谢所有的贡献者
<a href="https://github.com/svc-develop-team/so-vits-svc/graphs/contributors" target="_blank">
<img src="https://contrib.rocks/image?repo=svc-develop-team/so-vits-svc" />
</a>


@ -47,6 +47,8 @@ class TextAudioSpeakerLoader(torch.utils.data.Dataset):
audio_norm = audio / self.max_wav_value
audio_norm = audio_norm.unsqueeze(0)
spec_filename = filename.replace(".wav", ".spec.pt")
# Ideally, all data generated after Mar 25 should have .spec.pt
if os.path.exists(spec_filename):
spec = torch.load(spec_filename)
else:


@ -102,6 +102,10 @@ def pad_array(arr, target_length):
pad_right = pad_width - pad_left
padded_arr = np.pad(arr, (pad_left, pad_right), 'constant', constant_values=(0, 0))
return padded_arr
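# Example (assuming the usual centre-padding computed above):
#   pad_array(np.array([1, 2, 3]), 7) -> array([0, 0, 1, 2, 3, 0, 0])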
def split_list_by_n(list_collection, n, pre=0):
for i in range(0, len(list_collection), n):
yield list_collection[i-pre if i-pre>=0 else i: i + n]
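# Usage sketch: yields chunks of n items where every chunk after the first also
# re-reads the previous `pre` items, so neighbouring chunks overlap for crossfading, e.g.
#   list(split_list_by_n(list(range(10)), 4, pre=2))
#   -> [[0, 1, 2, 3], [2, 3, 4, 5, 6, 7], [6, 7, 8, 9]]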
class F0FilterException(Exception):
@ -194,38 +198,59 @@ class Svc(object):
# 清理显存
torch.cuda.empty_cache()
def slice_inference(self,raw_audio_path, spk, tran, slice_db,cluster_infer_ratio, auto_predict_f0,noice_scale, pad_seconds=0.5):
def slice_inference(self,raw_audio_path, spk, tran, slice_db,cluster_infer_ratio, auto_predict_f0,noice_scale, pad_seconds=0.5, clip_seconds=0,lg_num=0,lgr_num =0.75):
wav_path = raw_audio_path
chunks = slicer.cut(wav_path, db_thresh=slice_db)
audio_data, audio_sr = slicer.chunks2audio(wav_path, chunks)
per_size = int(clip_seconds*audio_sr)  # forced clip length in samples (0 = no extra clipping)
lg_size = int(lg_num*audio_sr)  # crossfade window length in samples
lg_size_r = int(lg_size*lgr_num)  # retained part of the crossfade window
lg_size_c_l = (lg_size-lg_size_r)//2  # samples trimmed on the left of the window
lg_size_c_r = lg_size-lg_size_r-lg_size_c_l  # samples trimmed on the right of the window
lg = np.linspace(0,1,lg_size_r) if lg_size!=0 else 0  # linear 0->1 ramp used to blend neighbouring chunks
audio = []
for (slice_tag, data) in audio_data:
print(f'#=====segment start, {round(len(data) / audio_sr, 3)}s======')
# padd
pad_len = int(audio_sr * pad_seconds)
data = np.concatenate([np.zeros([pad_len]), data, np.zeros([pad_len])])
length = int(np.ceil(len(data) / audio_sr * self.target_sample))
raw_path = io.BytesIO()
soundfile.write(raw_path, data, audio_sr, format="wav")
raw_path.seek(0)
if slice_tag:
print('jump empty segment')
_audio = np.zeros(length)
audio.extend(list(pad_array(_audio, length)))
continue
if per_size != 0:
datas = split_list_by_n(data, per_size,lg_size)
else:
datas = [data]
for k,dat in enumerate(datas):
per_length = int(np.ceil(len(dat) / audio_sr * self.target_sample)) if clip_seconds!=0 else length
if clip_seconds!=0: print(f'###=====segment clip start, {round(len(dat) / audio_sr, 3)}s======')
# padd
pad_len = int(audio_sr * pad_seconds)
dat = np.concatenate([np.zeros([pad_len]), dat, np.zeros([pad_len])])
raw_path = io.BytesIO()
soundfile.write(raw_path, dat, audio_sr, format="wav")
raw_path.seek(0)
out_audio, out_sr = self.infer(spk, tran, raw_path,
cluster_infer_ratio=cluster_infer_ratio,
auto_predict_f0=auto_predict_f0,
noice_scale=noice_scale
)
_audio = out_audio.cpu().numpy()
pad_len = int(self.target_sample * pad_seconds)
_audio = _audio[pad_len:-pad_len]
audio.extend(list(_audio))
pad_len = int(self.target_sample * pad_seconds)
_audio = _audio[pad_len:-pad_len]
_audio = pad_array(_audio, per_length)
if lg_size!=0 and k!=0:
lg1 = audio[-(lg_size_r+lg_size_c_r):-lg_size_c_r] if lgr_num != 1 else audio[-lg_size:]
lg2 = _audio[lg_size_c_l:lg_size_c_l+lg_size_r] if lgr_num != 1 else _audio[0:lg_size]
lg_pre = lg1*(1-lg)+lg2*lg
audio = audio[0:-(lg_size_r+lg_size_c_r)] if lgr_num != 1 else audio[0:-lg_size]
audio.extend(lg_pre)
_audio = _audio[lg_size_c_l+lg_size_r:] if lgr_num != 1 else _audio[lg_size:]
audio.extend(list(_audio))
return np.array(audio)
class RealTimeVC:
def __init__(self):
self.last_chunk = None


@ -25,6 +25,7 @@ def main():
# 一定要设置的部分
parser.add_argument('-m', '--model_path', type=str, default="logs/44k/G_0.pth", help='模型路径')
parser.add_argument('-c', '--config_path', type=str, default="configs/config.json", help='配置文件路径')
parser.add_argument('-cl', '--clip', type=float, default=0, help='音频自动切片0为不切片单位为秒/s')
parser.add_argument('-n', '--clean_names', type=str, nargs='+', default=["君の知らない物語-src.wav"], help='wav文件名列表放在raw文件夹下')
parser.add_argument('-t', '--trans', type=int, nargs='+', default=[0], help='音高调整,支持正负(半音)')
parser.add_argument('-s', '--spk_list', type=str, nargs='+', default=['nen'], help='合成目标说话人名称')
@ -34,6 +35,7 @@ def main():
help='语音转换自动预测音高,转换歌声时不要打开这个会严重跑调')
parser.add_argument('-cm', '--cluster_model_path', type=str, default="logs/44k/kmeans_10000.pt", help='聚类模型路径,如果没有训练聚类则随便填')
parser.add_argument('-cr', '--cluster_infer_ratio', type=float, default=0, help='聚类方案占比范围0-1若没有训练聚类模型则填0即可')
parser.add_argument('-lg', '--linear_gradient', type=float, default=0, help='两段音频切片的交叉淡入长度如果自动切片后出现人声不连贯可调整该数值如果连贯建议采用默认值0单位为秒/s')
# 不用动的部分
parser.add_argument('-sd', '--slice_db', type=int, default=-40, help='默认-40嘈杂的音频可以-30干声保留呼吸可以-50')
@ -41,6 +43,7 @@ def main():
parser.add_argument('-ns', '--noice_scale', type=float, default=0.4, help='噪音级别,会影响咬字和音质,较为玄学')
parser.add_argument('-p', '--pad_seconds', type=float, default=0.5, help='推理音频pad秒数由于未知原因开头结尾会有异响pad一小段静音段后就不会出现')
parser.add_argument('-wf', '--wav_format', type=str, default='flac', help='音频输出格式')
parser.add_argument('-lgr', '--linear_gradient_retain', type=float, default=0.75, help='自动音频切片后需要舍弃每段切片的头尾。该参数设置交叉长度保留的比例范围0-1,左开右闭')
args = parser.parse_args()
@ -55,6 +58,9 @@ def main():
cluster_infer_ratio = args.cluster_infer_ratio
noice_scale = args.noice_scale
pad_seconds = args.pad_seconds
clip = args.clip
lg = args.linear_gradient
lgr = args.linear_gradient_retain
infer_tool.fill_a_to_b(trans, clean_names)
for clean_name, tran in zip(clean_names, trans):
@ -65,22 +71,36 @@ def main():
wav_path = Path(raw_audio_path).with_suffix('.wav')
chunks = slicer.cut(wav_path, db_thresh=slice_db)
audio_data, audio_sr = slicer.chunks2audio(wav_path, chunks)
per_size = int(clip*audio_sr)
lg_size = int(lg*audio_sr)
lg_size_r = int(lg_size*lgr)
lg_size_c_l = (lg_size-lg_size_r)//2
lg_size_c_r = lg_size-lg_size_r-lg_size_c_l
lg = np.linspace(0,1,lg_size_r) if lg_size!=0 else 0
for spk in spk_list:
audio = []
for (slice_tag, data) in audio_data:
print(f'#=====segment start, {round(len(data) / audio_sr, 3)}s======')
length = int(np.ceil(len(data) / audio_sr * svc_model.target_sample))
if slice_tag:
print('jump empty segment')
_audio = np.zeros(length)
audio.extend(list(infer_tool.pad_array(_audio, length)))
continue
if per_size != 0:
datas = infer_tool.split_list_by_n(data, per_size,lg_size)
else:
datas = [data]
for k,dat in enumerate(datas):
per_length = int(np.ceil(len(dat) / audio_sr * svc_model.target_sample)) if clip!=0 else length
if clip!=0: print(f'###=====segment clip start, {round(len(dat) / audio_sr, 3)}s======')
# padd
pad_len = int(audio_sr * pad_seconds)
data = np.concatenate([np.zeros([pad_len]), data, np.zeros([pad_len])])
dat = np.concatenate([np.zeros([pad_len]), dat, np.zeros([pad_len])])
raw_path = io.BytesIO()
soundfile.write(raw_path, data, audio_sr, format="wav")
soundfile.write(raw_path, dat, audio_sr, format="wav")
raw_path.seek(0)
out_audio, out_sr = svc_model.infer(spk, tran, raw_path,
cluster_infer_ratio=cluster_infer_ratio,
@ -90,8 +110,15 @@ def main():
_audio = out_audio.cpu().numpy()
pad_len = int(svc_model.target_sample * pad_seconds)
_audio = _audio[pad_len:-pad_len]
audio.extend(list(infer_tool.pad_array(_audio, length)))
_audio = infer_tool.pad_array(_audio, per_length)
if lg_size!=0 and k!=0:
lg1 = audio[-(lg_size_r+lg_size_c_r):-lg_size_c_r] if lgr != 1 else audio[-lg_size:]
lg2 = _audio[lg_size_c_l:lg_size_c_l+lg_size_r] if lgr != 1 else _audio[0:lg_size]
lg_pre = lg1*(1-lg)+lg2*lg
audio = audio[0:-(lg_size_r+lg_size_c_r)] if lgr != 1 else audio[0:-lg_size]
audio.extend(lg_pre)
_audio = _audio[lg_size_c_l+lg_size_r:] if lgr != 1 else _audio[lg_size:]
audio.extend(list(_audio))
key = "auto" if auto_predict_f0 else f"{tran}key"
cluster_name = "" if cluster_infer_ratio == 0 else f"_{cluster_infer_ratio}"
res_path = f'./results/{clean_name}_{key}_{spk}{cluster_name}.{wav_format}'


@ -16,11 +16,12 @@ def main(NetExport):
for i in SVCVITS.parameters():
i.requires_grad = False
test_hidden_unit = torch.rand(1, 10, 256)
test_pitch = torch.rand(1, 10)
test_mel2ph = torch.LongTensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]).unsqueeze(0)
test_uv = torch.ones(1, 10, dtype=torch.float32)
test_noise = torch.randn(1, 192, 10)
n_frame = 10
test_hidden_unit = torch.rand(1, n_frame, 256)
test_pitch = torch.rand(1, n_frame)
test_mel2ph = torch.arange(0, n_frame, dtype=torch.int64)[None] # torch.LongTensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]).unsqueeze(0)
test_uv = torch.ones(1, n_frame, dtype=torch.float32)
test_noise = torch.randn(1, 192, n_frame)
test_sid = torch.LongTensor([0])
input_names = ["c", "f0", "mel2ph", "uv", "noise", "sid"]
output_names = ["audio", ]


@ -25,13 +25,11 @@ if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--train_list", type=str, default="./filelists/train.txt", help="path to train list")
parser.add_argument("--val_list", type=str, default="./filelists/val.txt", help="path to val list")
parser.add_argument("--test_list", type=str, default="./filelists/test.txt", help="path to test list")
parser.add_argument("--source_dir", type=str, default="./dataset/44k", help="path to source dir")
args = parser.parse_args()
train = []
val = []
test = []
idx = 0
spk_dict = {}
spk_id = 0
@ -51,13 +49,11 @@ if __name__ == "__main__":
new_wavs.append(file)
wavs = new_wavs
shuffle(wavs)
train += wavs[2:-2]
train += wavs[2:]
val += wavs[:2]
test += wavs[-2:]
shuffle(train)
shuffle(val)
shuffle(test)
print("Writing", args.train_list)
with open(args.train_list, "w") as f:
@ -70,14 +66,10 @@ if __name__ == "__main__":
for fname in tqdm(val):
wavpath = fname
f.write(wavpath + "\n")
print("Writing", args.test_list)
with open(args.test_list, "w") as f:
for fname in tqdm(test):
wavpath = fname
f.write(wavpath + "\n")
config_template["spk"] = spk_dict
config_template["model"]["n_speakers"] = spk_id
print("Writing configs/config.json")
with open("configs/config.json", "w") as f:
json.dump(config_template, f, indent=2)


@ -7,10 +7,12 @@ from random import shuffle
import torch
from glob import glob
from tqdm import tqdm
from modules.mel_processing import spectrogram_torch
import utils
import logging
logging.getLogger('numba').setLevel(logging.WARNING)
logging.getLogger("numba").setLevel(logging.WARNING)
import librosa
import numpy as np
@ -24,16 +26,47 @@ def process_one(filename, hmodel):
wav, sr = librosa.load(filename, sr=sampling_rate)
soft_path = filename + ".soft.pt"
if not os.path.exists(soft_path):
devive = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
wav16k = librosa.resample(wav, orig_sr=sampling_rate, target_sr=16000)
wav16k = torch.from_numpy(wav16k).to(devive)
wav16k = torch.from_numpy(wav16k).to(device)
c = utils.get_hubert_content(hmodel, wav_16k_tensor=wav16k)
torch.save(c.cpu(), soft_path)
f0_path = filename + ".f0.npy"
if not os.path.exists(f0_path):
f0 = utils.compute_f0_dio(wav, sampling_rate=sampling_rate, hop_length=hop_length)
f0 = utils.compute_f0_dio(
wav, sampling_rate=sampling_rate, hop_length=hop_length
)
np.save(f0_path, f0)
spec_path = filename.replace(".wav", ".spec.pt")
if not os.path.exists(spec_path):
# Process spectrogram
# The following code can't be replaced by torch.FloatTensor(wav)
# because load_wav_to_torch returns a tensor that needs to be normalized
audio, sr = utils.load_wav_to_torch(filename)
if sr != hps.data.sampling_rate:
raise ValueError(
"{} SR doesn't match target {} SR".format(
sr, hps.data.sampling_rate
)
)
audio_norm = audio / hps.data.max_wav_value
audio_norm = audio_norm.unsqueeze(0)
spec = spectrogram_torch(
audio_norm,
hps.data.filter_length,
hps.data.sampling_rate,
hps.data.hop_length,
hps.data.win_length,
center=False,
)
spec = torch.squeeze(spec, 0)
torch.save(spec, spec_path)
def process_batch(filenames):
print("Loading hubert for content...")
@ -46,17 +79,23 @@ def process_batch(filenames):
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--in_dir", type=str, default="dataset/44k", help="path to input dir")
parser.add_argument(
"--in_dir", type=str, default="dataset/44k", help="path to input dir"
)
args = parser.parse_args()
filenames = glob(f'{args.in_dir}/*/*.wav', recursive=True) # [:10]
filenames = glob(f"{args.in_dir}/*/*.wav", recursive=True) # [:10]
shuffle(filenames)
multiprocessing.set_start_method('spawn',force=True)
multiprocessing.set_start_method("spawn", force=True)
num_processes = 1
chunk_size = int(math.ceil(len(filenames) / num_processes))
chunks = [filenames[i:i + chunk_size] for i in range(0, len(filenames), chunk_size)]
chunks = [
filenames[i : i + chunk_size] for i in range(0, len(filenames), chunk_size)
]
print([len(c) for c in chunks])
processes = [multiprocessing.Process(target=process_batch, args=(chunk,)) for chunk in chunks]
processes = [
multiprocessing.Process(target=process_batch, args=(chunk,)) for chunk in chunks
]
for p in processes:
p.start()


@ -16,3 +16,4 @@ onnxoptimizer
fairseq==0.12.2
librosa==0.8.1
tensorboard
tensorboardX


@ -2,8 +2,8 @@ librosa==0.9.2
fairseq==0.12.2
Flask==2.1.2
Flask_Cors==3.0.10
gradio==3.4.1
numpy==1.20.0
gradio
numpy
playsound==1.3.0
PyAudio==0.2.12
pydub==0.25.1
@ -19,3 +19,4 @@ praat-parselmouth
onnx
onnxsim
onnxoptimizer
tensorboardX


@ -1,22 +0,0 @@
from data_utils import TextAudioSpeakerLoader
import json
from tqdm import tqdm
from utils import HParams
config_path = 'configs/config.json'
with open(config_path, "r") as f:
data = f.read()
config = json.loads(data)
hps = HParams(**config)
train_dataset = TextAudioSpeakerLoader("filelists/train.txt", hps)
test_dataset = TextAudioSpeakerLoader("filelists/test.txt", hps)
eval_dataset = TextAudioSpeakerLoader("filelists/val.txt", hps)
for _ in tqdm(train_dataset):
pass
for _ in tqdm(eval_dataset):
pass
for _ in tqdm(test_dataset):
pass


@ -3,6 +3,8 @@ import multiprocessing
import time
logging.getLogger('matplotlib').setLevel(logging.WARNING)
logging.getLogger('numba').setLevel(logging.WARNING)
import os
import json
import argparse

101
webUI.py Normal file

@ -0,0 +1,101 @@
import io
import os
# os.system("wget -P cvec/ https://huggingface.co/spaces/innnky/nanami/resolve/main/checkpoint_best_legacy_500.pt")
import gradio as gr
import librosa
import numpy as np
import soundfile
from inference.infer_tool import Svc
import logging
import torch
logging.getLogger('numba').setLevel(logging.WARNING)
logging.getLogger('markdown_it').setLevel(logging.WARNING)
logging.getLogger('urllib3').setLevel(logging.WARNING)
logging.getLogger('matplotlib').setLevel(logging.WARNING)
logging.getLogger('multipart').setLevel(logging.WARNING)
model = None
spk = None
cuda = []
if torch.cuda.is_available():
for i in range(torch.cuda.device_count()):
cuda.append("cuda:{}".format(i))
def vc_fn(sid, input_audio, vc_transform, auto_f0,cluster_ratio, slice_db, noise_scale,pad_seconds,cl_num,lg_num,lgr_num):
global model
try:
if input_audio is None:
return "You need to upload an audio", None
if model is None:
return "You need to upload an model", None
sampling_rate, audio = input_audio
# print(audio.shape,sampling_rate)
audio = (audio / np.iinfo(audio.dtype).max).astype(np.float32)
if len(audio.shape) > 1:
audio = librosa.to_mono(audio.transpose(1, 0))
temp_path = "temp.wav"
soundfile.write(temp_path, audio, sampling_rate, format="wav")
_audio = model.slice_inference(temp_path, sid, vc_transform, slice_db, cluster_ratio, auto_f0, noise_scale,pad_seconds,cl_num,lg_num,lgr_num)
model.clear_empty()
os.remove(temp_path)
return "Success", (model.target_sample, _audio)
except Exception as e:
return "异常信息:"+str(e)+"\n请排障后重试",None
app = gr.Blocks()
with app:
with gr.Tabs():
with gr.TabItem("Sovits4.0"):
gr.Markdown(value="""
Sovits4.0 WebUI
""")
gr.Markdown(value="""
<font size=3>下面是模型文件选择</font>
""")
model_path = gr.File(label="模型文件")
gr.Markdown(value="""
<font size=3>下面是配置文件选择</font>
""")
config_path = gr.File(label="配置文件")
gr.Markdown(value="""
<font size=3>下面是聚类模型文件选择没有可以不填</font>
""")
cluster_model_path = gr.File(label="聚类模型文件")
device = gr.Dropdown(label="推理设备默认为自动选择cpu和gpu",choices=["Auto",*cuda,"cpu"],value="Auto")
gr.Markdown(value="""
<font size=3>全部上传完毕后(全部文件模块显示download),点击模型解析进行解析</font>
""")
model_analysis_button = gr.Button(value="模型解析")
sid = gr.Dropdown(label="音色(说话人)")
sid_output = gr.Textbox(label="Output Message")
vc_input3 = gr.Audio(label="上传音频")
vc_transform = gr.Number(label="变调整数可以正负半音数量升高八度就是12", value=0)
cluster_ratio = gr.Number(label="聚类模型混合比例0-1之间默认为0不启用聚类能提升音色相似度但会导致咬字下降如果使用建议0.5左右)", value=0)
auto_f0 = gr.Checkbox(label="自动f0预测配合聚类模型f0预测效果更好,会导致变调功能失效(仅限转换语音,歌声不要勾选此项会究极跑调)", value=False)
slice_db = gr.Number(label="切片阈值", value=-40)
noise_scale = gr.Number(label="noise_scale 建议不要动,会影响音质,玄学参数", value=0.4)
cl_num = gr.Number(label="音频自动切片0为不切片单位为秒/s", value=0)
pad_seconds = gr.Number(label="推理音频pad秒数由于未知原因开头结尾会有异响pad一小段静音段后就不会出现", value=0.5)
lg_num = gr.Number(label="两端音频切片的交叉淡入长度如果自动切片后出现人声不连贯可调整该数值如果连贯建议采用默认值0注意该设置会影响推理速度单位为秒/s", value=0)
lgr_num = gr.Number(label="自动音频切片后需要舍弃每段切片的头尾。该参数设置交叉长度保留的比例范围0-1,左开右闭", value=0.75,interactive=True)
vc_submit = gr.Button("转换", variant="primary")
vc_output1 = gr.Textbox(label="Output Message")
vc_output2 = gr.Audio(label="Output Audio")
def modelAnalysis(model_path,config_path,cluster_model_path,device):
try:
global model
model = Svc(model_path.name, config_path.name,device=device if device!="Auto" else None,cluster_model_path= cluster_model_path.name if cluster_model_path!=None else "")
spks = list(model.spk2id.keys())
device_name = torch.cuda.get_device_properties(model.dev).name if "cuda" in str(model.dev) else str(model.dev)
return sid.update(choices = spks,value=spks[0]),"ok,模型被加载到了设备{}之上".format(device_name)
except Exception as e:
return "","异常信息:"+str(e)+"\n请排障后重试"
vc_submit.click(vc_fn, [sid, vc_input3, vc_transform,auto_f0,cluster_ratio, slice_db, noise_scale,pad_seconds,cl_num,lg_num,lgr_num], [vc_output1, vc_output2])
model_analysis_button.click(modelAnalysis,[model_path,config_path,cluster_model_path,device],[sid,sid_output])
app.launch()