---
language:
- en
- de
- es
- fr
- hi
- it
- ja
- ko
- pl
- pt
- ru
- tr
- zh
thumbnail: >-
  https://user-images.githubusercontent.com/5068315/230698495-cbb1ced9-c911-4c9a-941d-a1a4a1286ac6.png
library: bark
license: mit
tags:
- bark
- audio
- text-to-speech
pipeline_tag: text-to-speech
inference: true
---

# Bark

Bark is a transformer-based text-to-audio model created by [Suno](https://www.suno.ai).
Bark can generate highly realistic, multilingual speech as well as other audio, including music,
background noise and simple sound effects. The model can also produce nonverbal
communications like laughing, sighing and crying. To support the research community,
we are providing access to pretrained model checkpoints ready for inference.

The original GitHub repo and model card can be found [here](https://github.com/suno-ai/bark).

This model is meant for research purposes only.
The model output is not censored and the authors do not endorse the opinions in the generated content.
Use at your own risk.

Two checkpoints are released:
- [small](https://huggingface.co/suno/bark-small)
- [**large** (this checkpoint)](https://huggingface.co/suno/bark)

## Example

Try out Bark yourself!

* Bark Colab:

<a target="_blank" href="https://colab.research.google.com/drive/1eJfA2XUa-mXwdMy7DoYKVYHI1iTd9Vkt?usp=sharing">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

* Hugging Face Colab:

<a target="_blank" href="https://colab.research.google.com/drive/1dWWkZzvu7L9Bunq9zvD-W02RFUXoW-Pd?usp=sharing">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

* Hugging Face Demo:

<a target="_blank" href="https://huggingface.co/spaces/suno/bark">
  <img src="https://huggingface.co/datasets/huggingface/badges/raw/main/open-in-hf-spaces-sm.svg" alt="Open in HuggingFace"/>
</a>

## 🤗 Transformers Usage

You can run Bark locally with the 🤗 Transformers library from version 4.31.0 onwards.

1. First install the 🤗 [Transformers library](https://github.com/huggingface/transformers) and scipy:

```
pip install --upgrade pip
pip install --upgrade transformers scipy
```

2. Run inference via the `text-to-speech` (TTS) pipeline. You can use the Bark model with this pipeline in just a few lines of code!

```python
from transformers import pipeline
import scipy.io.wavfile

synthesiser = pipeline("text-to-speech", "suno/bark")

speech = synthesiser("Hello, my dog is cooler than you!", forward_params={"do_sample": True})

# The pipeline returns a dict with the generated waveform and its sampling rate.
scipy.io.wavfile.write("bark_out.wav", rate=speech["sampling_rate"], data=speech["audio"])
```
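
If a GPU is available, generation is typically much faster with the pipeline placed on it. A minimal sketch, assuming a single CUDA device (`device=0` is the standard `pipeline` argument for this):

```python
from transformers import pipeline

# Load the pipeline onto the first CUDA GPU; omit `device` to stay on CPU.
synthesiser = pipeline("text-to-speech", "suno/bark", device=0)

speech = synthesiser("Hello, my dog is cooler than you!", forward_params={"do_sample": True})
```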

3. Run inference via the Transformers modelling code. You can use the processor + generate code to convert text into a mono 24 kHz speech waveform for more fine-grained control.

```python
from transformers import AutoProcessor, AutoModel

processor = AutoProcessor.from_pretrained("suno/bark")
model = AutoModel.from_pretrained("suno/bark")

inputs = processor(
    text=["Hello, my name is Suno. And, uh — and I like pizza. [laughs] But I also have other interests such as playing tic tac toe."],
    return_tensors="pt",
)

speech_values = model.generate(**inputs, do_sample=True)
```
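
You can also steer the generated voice with one of Bark's bundled speaker presets. A short sketch, assuming the `voice_preset` argument of the Bark processor and the `v2/en_speaker_6` preset name:

```python
# Condition generation on a named speaker preset bundled with Bark.
inputs = processor(
    text=["Hello, my name is Suno and I like pizza. [laughs]"],
    voice_preset="v2/en_speaker_6",
    return_tensors="pt",
)

speech_values = model.generate(**inputs, do_sample=True)
```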

4. Listen to the speech samples either in an ipynb notebook:

```python
from IPython.display import Audio

sampling_rate = model.generation_config.sample_rate
Audio(speech_values.cpu().numpy().squeeze(), rate=sampling_rate)
```

Or save them as a `.wav` file using a third-party library, e.g. `scipy`:

```python
import scipy.io.wavfile

sampling_rate = model.generation_config.sample_rate
scipy.io.wavfile.write("bark_out.wav", rate=sampling_rate, data=speech_values.cpu().numpy().squeeze())
```

For more details on using the Bark model for inference using the 🤗 Transformers library, refer to the [Bark docs](https://huggingface.co/docs/transformers/model_doc/bark).

## Suno Usage

You can also run Bark locally through the original [Bark library](https://github.com/suno-ai/bark):

1. First install the [`bark` library](https://github.com/suno-ai/bark).

2. Run the following Python code:

```python
from bark import SAMPLE_RATE, generate_audio, preload_models
from IPython.display import Audio

# download and load all models
preload_models()

# generate audio from text
text_prompt = """
Hello, my name is Suno. And, uh — and I like pizza. [laughs]
But I also have other interests such as playing tic tac toe.
"""
speech_array = generate_audio(text_prompt)

# play text in notebook
Audio(speech_array, rate=SAMPLE_RATE)
```

[pizza.webm](https://user-images.githubusercontent.com/5068315/230490503-417e688d-5115-4eee-9550-b46a2b465ee3.webm)

To save `speech_array` as a WAV file:

```python
from scipy.io.wavfile import write as write_wav

write_wav("/path/to/audio.wav", SAMPLE_RATE, speech_array)
```
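
The `bark` library can also condition generation on a speaker prompt. A minimal sketch, assuming the `history_prompt` argument of `generate_audio` and the `v2/en_speaker_1` preset name from the Bark speaker library:

```python
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()

# Condition generation on a named speaker preset shipped with Bark.
speech_array = generate_audio(
    "Hello, my name is Suno and I like pizza.",
    history_prompt="v2/en_speaker_1",
)
write_wav("bark_speaker_out.wav", SAMPLE_RATE, speech_array)
```
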
## Model Details

The following is additional information about the models released here.

Bark is a series of three transformer models that turn text into audio.

### Text to semantic tokens
- Input: text, tokenized with [BERT tokenizer from Hugging Face](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer)
- Output: semantic tokens that encode the audio to be generated

### Semantic to coarse tokens
- Input: semantic tokens
- Output: tokens from the first two codebooks of the [EnCodec](https://github.com/facebookresearch/encodec) codec from Facebook

### Coarse to fine tokens
- Input: the first two codebooks from EnCodec
- Output: 8 codebooks from EnCodec

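Putting the three stages together, the end-to-end flow is roughly the following. This is a schematic sketch only; the function names are hypothetical stand-ins for the three models and the EnCodec decoder, not the actual Bark API:

```python
import numpy as np

# Schematic stand-ins for the three Bark transformers and the EnCodec decoder.
# All names here are hypothetical, for illustration only.

def text_to_semantic(text: str) -> np.ndarray:
    """Stage 1: BERT-tokenized text -> semantic tokens (causal)."""
    ...

def semantic_to_coarse(semantic_tokens: np.ndarray) -> np.ndarray:
    """Stage 2: semantic tokens -> first two EnCodec codebooks (causal)."""
    ...

def coarse_to_fine(coarse_tokens: np.ndarray) -> np.ndarray:
    """Stage 3: two coarse codebooks -> all eight EnCodec codebooks (non-causal)."""
    ...

def encodec_decode(fine_tokens: np.ndarray) -> np.ndarray:
    """EnCodec decoder: eight codebooks -> mono 24 kHz waveform."""
    ...

def bark_text_to_audio(text: str) -> np.ndarray:
    return encodec_decode(coarse_to_fine(semantic_to_coarse(text_to_semantic(text))))
```
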
### Architecture
| Model                     | Parameters | Attention  | Output Vocab size |
|:-------------------------:|:----------:|:----------:|:-----------------:|
| Text to semantic tokens   | 80/300 M   | Causal     | 10,000            |
| Semantic to coarse tokens | 80/300 M   | Causal     | 2x 1,024          |
| Coarse to fine tokens     | 80/300 M   | Non-causal | 6x 1,024          |

### Release date
April 2023

## Broader Implications
We anticipate that this model's text-to-audio capabilities can be used to improve accessibility tools in a variety of languages.

While we hope that this release will enable users to express their creativity and build applications that are a force
for good, we acknowledge that any text-to-audio model has the potential for dual use. While it is not straightforward
to voice clone known people with Bark, it can still be used for nefarious purposes. To further reduce the chances of unintended use of Bark,
we also release a simple classifier to detect Bark-generated audio with high accuracy (see notebooks section of the main repository).