Update README.md (#144)

- Update README.md (29aee6b8a7c9c86b46f63bf6fc2331151935026b)
Sanchit Gandhi Β· 2024-08-06 15:41:16 +00:00

# Whisper
Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper
[Robust Speech Recognition via Large-Scale Weak Supervision](https://huggingface.co/papers/2212.04356) by Alec Radford
et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many
datasets and domains in a zero-shot setting.

Whisper large-v3 has the same architecture as the previous [large](https://huggingface.co/openai/whisper-large) and [large-v2](https://huggingface.co/openai/whisper-large-v2)
models, except for the following minor differences:

1. The spectrogram input uses 128 Mel frequency bins instead of 80 (see the quick check below)
2. A new language token for Cantonese
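As a quick sanity check, the change in the number of Mel bins is visible from the preprocessor configuration. The sketch below is illustrative only: it assumes both checkpoints are downloaded from the Hub, and relies on `feature_size` being the feature extractor's count of Mel bins:

```python
from transformers import WhisperFeatureExtractor

# feature_size is the number of Mel bins used to build the log-Mel spectrogram input
extractor_v3 = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")
extractor_v2 = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v2")

print(extractor_v3.feature_size, extractor_v2.feature_size)  # expected: 128 80
```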
The Whisper large-v3 model was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled
audio collected using Whisper [large-v2](https://huggingface.co/openai/whisper-large-v2). The model was trained for 2.0 epochs over this mixture dataset.

The large-v3 model shows improved performance over a wide variety of languages, showing a 10% to 20% reduction in errors
compared to Whisper [large-v2](https://huggingface.co/openai/whisper-large-v2). For more details on the different checkpoints available, refer to the section [Model details](#model-details).

**Disclaimer**: Content for this model card has partly been written by the πŸ€— Hugging Face team, and partly copied and
pasted from the original model card.

## Usage
Whisper large-v3 is supported in Hugging Face πŸ€— Transformers. To run the model, first install the Transformers
library. For this example, we'll also install πŸ€— Datasets to load a toy audio dataset from the Hugging Face Hub, and
πŸ€— Accelerate to reduce the model loading time:

```bash
pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate
```
The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
class to transcribe audio files of arbitrary length:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```

To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
```python
result = pipe("audio.mp3")
```
Multiple audio files can be transcribed in parallel by specifying them as a list and setting the `batch_size` parameter:
```python
result = pipe(["audio_1.mp3", "audio_2.mp3"], batch_size=2)
```
Transformers is compatible with all Whisper decoding strategies, such as temperature fallback and conditioning on
previous tokens. The following example demonstrates how to enable these heuristics:
```python
generate_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}
result = pipe(sample, generate_kwargs=generate_kwargs)
```

Whisper predicts the language of the source audio automatically. If the source audio language is known *a-priori*, it
can be passed as an argument to the pipeline.
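For example, the following minimal sketch forces English decoding. It assumes the `pipe` and `sample` objects from the example above, and relies on `generate_kwargs` being forwarded to the model's `generate` call:

```python
# Force the model to transcribe in English rather than auto-detecting the language
result = pipe(sample, generate_kwargs={"language": "english"})
print(result["text"])
```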
<details>

<summary> For more control over the generation parameters, use the model + processor API directly: </summary>
Ad-hoc generation arguments can be passed to `model.generate`, including `num_beams` for beam-search, `return_timestamps`
for segment-level timestamps, and `prompt_ids` for prompting. See the [docstrings](https://huggingface.co/docs/transformers/en/model_doc/whisper#transformers.WhisperForConditionalGeneration.generate)
for more details.
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]
input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features
input_features = input_features.to(device, dtype=torch_dtype)
gen_kwargs = {
    "max_new_tokens": 128,
    "num_beams": 1,
    "return_timestamps": False,
}
pred_ids = model.generate(input_features, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=gen_kwargs["return_timestamps"])
print(pred_text)
```
</details>
## Additional Speed & Memory Improvements
You can apply additional speed and memory improvements to Whisper to further reduce inference time and VRAM
requirements.

### Chunked Long-Form
Whisper has a receptive field of 30 seconds. To transcribe audios longer than this, one of two long-form algorithms is
required:

1. **Sequential:** uses a "sliding window" for buffered inference, transcribing 30-second slices one after the other
2. **Chunked:** splits long audio files into shorter ones (with a small overlap between segments), transcribes each segment independently, and stitches the resulting transcriptions at the boundaries

The sequential long-form algorithm should be used in either of the following scenarios:

1. Transcription accuracy is the most important factor, and speed is less of a consideration
2. You are transcribing **batches** of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate

Conversely, the chunked algorithm should be used when:

1. Transcription speed is the most important factor
2. You are transcribing a **single** long audio file

By default, Transformers uses the sequential algorithm. To enable the chunked algorithm, pass the `chunk_length_s`
parameter to the `pipeline`. For large-v3, a chunk length of 30 seconds is optimal. To activate batching over long
audio files, pass the argument `batch_size`:
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=16,  # batch size for inference - set based on your device
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```

#### Torch compile

The Whisper forward pass is compatible with [`torch.compile`](https://pytorch.org/docs/stable/generated/torch.compile.html)
for 4.5x speed-ups.

**Note:** `torch.compile` is currently not compatible with the chunked long-form algorithm or Flash Attention 2 ⚠️
```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
from tqdm import tqdm
torch.set_float32_matmul_precision("high")
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)
# Enable static cache and compile the forward pass
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
# 2 warmup steps
for _ in tqdm(range(2), desc="Warm-up step"):
    with sdpa_kernel(SDPBackend.MATH):
        result = pipe(sample.copy())
# fast run
with sdpa_kernel(SDPBackend.MATH):
    result = pipe(sample.copy())
print(result["text"])
```

#### Flash Attention 2

We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2) if your GPU supports it and you are not using [torch.compile](#torch-compile).
To do so, first install [Flash Attention](https://github.com/Dao-AILab/flash-attention):
```
pip install flash-attn --no-build-isolation
```
Then pass `attn_implementation="flash_attention_2"` to `from_pretrained`:

```python
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="flash_attention_2")
```

#### Torch Scaled Dot-Product Attention (SDPA)
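To check whether your environment ships SDPA, you can query the utility below (a minimal check; `is_torch_sdpa_available` is exposed in `transformers.utils`):

```python
from transformers.utils import is_torch_sdpa_available

# True if the installed PyTorch version ships scaled dot-product attention
print(is_torch_sdpa_available())
```

If this returns `False`, you will need a newer PyTorch release (SDPA support in Transformers requires PyTorch 2.1.1 or greater).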
Once a valid PyTorch version is installed, SDPA is activated by default. It can also be set explicitly by specifying
`attn_implementation="sdpa"` as follows:
```python
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="sdpa")
```
For more information about how to use SDPA, refer to the [Transformers SDPA documentation](https://huggingface.co/docs/transformers/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention).

## Model details

Whisper is a Transformer-based encoder-decoder model, also referred to as a _sequence-to-sequence_ model. There are two
flavours of Whisper model: English-only and multilingual. The English-only models were trained on the task of English
speech recognition. The multilingual models were trained simultaneously on multilingual speech recognition and speech
translation. For speech recognition, the model predicts transcriptions in the *same* language as the audio. For speech
translation, the model predicts transcriptions in a *different* language from the audio.
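As an illustration of the two tasks, the sketch below re-uses the `pipe` and `sample` objects from the Usage section above (an assumption; any multilingual checkpoint behaves the same way) and switches between transcription and translation via `generate_kwargs`:

```python
# Speech recognition: the predicted text is in the same language as the audio
result = pipe(sample, generate_kwargs={"task": "transcribe"})

# Speech translation: the predicted text is translated to English
result = pipe(sample, generate_kwargs={"task": "translate"})
```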
Whisper checkpoints come in five configurations of varying model sizes. The smallest four are available as English-only
and multilingual. The largest checkpoints are multilingual only. All eleven of the pre-trained checkpoints
are available on the [Hugging Face Hub](https://huggingface.co/models?search=openai/whisper). The
checkpoints are summarised in the following table with links to the models on the Hub:
| Size | Parameters | English-only | Multilingual |
|----------|------------|------------------------------------------------------|-----------------------------------------------------|
| tiny | 39 M | [βœ“](https://huggingface.co/openai/whisper-tiny.en) | [βœ“](https://huggingface.co/openai/whisper-tiny) |
| base | 74 M | [βœ“](https://huggingface.co/openai/whisper-base.en) | [βœ“](https://huggingface.co/openai/whisper-base) |
| small | 244 M | [βœ“](https://huggingface.co/openai/whisper-small.en) | [βœ“](https://huggingface.co/openai/whisper-small) |
| medium | 769 M | [βœ“](https://huggingface.co/openai/whisper-medium.en) | [βœ“](https://huggingface.co/openai/whisper-medium) |
| large | 1550 M | x | [βœ“](https://huggingface.co/openai/whisper-large) |
| large-v2 | 1550 M | x | [βœ“](https://huggingface.co/openai/whisper-large-v2) |
| large-v3 | 1550 M | x | [βœ“](https://huggingface.co/openai/whisper-large-v3) |
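Any of these checkpoints can be dropped into the pipeline API shown earlier simply by swapping the model identifier. The sketch below is illustrative only: `openai/whisper-tiny` and the file name `audio.mp3` are placeholder choices, not recommendations:

```python
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Swap the checkpoint id for any of the models in the table above
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-tiny",
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe("audio.mp3")
print(result["text"])
```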

## Fine-Tuning
The pre-trained Whisper model demonstrates a strong ability to generalise to different datasets and domains. However,
its predictive capabilities can be improved further for certain languages and tasks through *fine-tuning*. The blog post
[Fine-Tune Whisper with πŸ€— Transformers](https://huggingface.co/blog/fine-tune-whisper) provides a step-by-step guide to
fine-tuning the Whisper model with as little as 5 hours of labelled data.

## Training Data

The large-v3 checkpoint is trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio collected using Whisper large-v2.

As discussed in [the accompanying paper](https://cdn.openai.com/papers/whisper.pdf), we see that performance on transcription in a given language is directly correlated with the amount of training data we employ in that language.