Update README.md (#144)

- Update README.md (29aee6b8a7c9c86b46f63bf6fc2331151935026b)
This commit is contained in:
Sanchit Gandhi 2024-08-06 15:41:16 +00:00 committed by system
parent 3d0618527a
commit afda370583
No known key found for this signature in database
GPG Key ID: 6A528E38E0733467

331
README.md

@ -114,68 +114,39 @@ license: apache-2.0
# Whisper
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours
of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains **without** the need
for fine-tuning.
Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper
[Robust Speech Recognition via Large-Scale Weak Supervision](https://huggingface.co/papers/2212.04356) by Alec Radford
et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many
datasets and domains in a zero-shot setting.
Whisper was proposed in the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356)
by Alec Radford et al. from OpenAI. The original code repository can be found [here](https://github.com/openai/whisper).
Whisper large-v3 has the same architecture as the previous [large](https://huggingface.co/openai/whisper-large) and [large-v2](https://huggingface.co/openai/whisper-large-v2)
models, except for the following minor differences:
Whisper `large-v3` has the same architecture as the previous large models except the following minor differences:
1. The input uses 128 Mel frequency bins instead of 80
1. The spectrogram input uses 128 Mel frequency bins instead of 80
2. A new language token for Cantonese
The Whisper `large-v3` model is trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper `large-v2`.
The model was trained for 2.0 epochs over this mixture dataset.
The Whisper large-v3 model was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled
audio collected using Whisper [large-v2](https://huggingface.co/openai/whisper-large-v2) . The model was trained for 2.0 epochs over this mixture dataset.
The `large-v3` model shows improved performance over a wide variety of languages, showing 10% to 20% reduction of errors compared to Whisper `large-v2`.
The large-v3 model shows improved performance over a wide variety of languages, showing 10% to 20% reduction of errors
compared to Whisper [large-v2](https://huggingface.co/openai/whisper-large-v2) . For more details on the different checkpoints available, refer to the section [Model details](#model-details).
**Disclaimer**: Content for this model card has partly been written by the Hugging Face team, and parts of it were
copied and pasted from the original model card.
## Model details
Whisper is a Transformer based encoder-decoder model, also referred to as a _sequence-to-sequence_ model.
It was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper `large-v2`.
The models were trained on either English-only data or multilingual data. The English-only models were trained
on the task of speech recognition. The multilingual models were trained on both speech recognition and speech
translation. For speech recognition, the model predicts transcriptions in the *same* language as the audio.
For speech translation, the model predicts transcriptions to a *different* language to the audio.
Whisper checkpoints come in five configurations of varying model sizes.
The smallest four are trained on either English-only or multilingual data.
The largest checkpoints are multilingual only. All ten of the pre-trained checkpoints
are available on the [Hugging Face Hub](https://huggingface.co/models?search=openai/whisper). The
checkpoints are summarised in the following table with links to the models on the Hub:
| Size | Parameters | English-only | Multilingual |
|----------|------------|------------------------------------------------------|-----------------------------------------------------|
| tiny | 39 M | [](https://huggingface.co/openai/whisper-tiny.en) | [](https://huggingface.co/openai/whisper-tiny) |
| base | 74 M | [](https://huggingface.co/openai/whisper-base.en) | [](https://huggingface.co/openai/whisper-base) |
| small | 244 M | [](https://huggingface.co/openai/whisper-small.en) | [](https://huggingface.co/openai/whisper-small) |
| medium | 769 M | [](https://huggingface.co/openai/whisper-medium.en) | [](https://huggingface.co/openai/whisper-medium) |
| large | 1550 M | x | [](https://huggingface.co/openai/whisper-large) |
| large-v2 | 1550 M | x | [](https://huggingface.co/openai/whisper-large-v2) |
| large-v3 | 1550 M | x | [](https://huggingface.co/openai/whisper-large-v3) |
**Disclaimer**: Content for this model card has partly been written by the 🤗 Hugging Face team, and partly copied and
pasted from the original model card.
## Usage
Whisper `large-v3` is supported in Hugging Face 🤗 Transformers. To run the model, first
install the Transformers library through the GitHub repo. For this example, we'll also install 🤗 Datasets to load toy
audio dataset from the Hugging Face Hub:
Whisper large-v3 is supported in Hugging Face 🤗 Transformers. To run the model, first install the Transformers
library. For this example, we'll also install 🤗 Datasets to load toy audio dataset from the Hugging Face Hub, and
🤗 Accelerate to reduce the model loading time:
```bash
pip install --upgrade pip
pip install --upgrade git+https://github.com/huggingface/transformers.git accelerate datasets[audio]
pip install --upgrade transformers datasets[audio] accelerate
```
### Short-Form Transcription
The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
class to transcribe short-form audio files (< 30-seconds) as follows:
class to transcribe audios of arbitrary length:
```python
import torch
@ -200,10 +171,6 @@ pipe = pipeline(
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
max_new_tokens=128,
chunk_length_s=30,
batch_size=16,
return_timestamps=True,
torch_dtype=torch_dtype,
device=device,
)
@ -216,9 +183,33 @@ print(result["text"])
```
To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
```diff
- result = pipe(sample)
+ result = pipe("audio.mp3")
```python
result = pipe("audio.mp3")
```
Multiple audio files can be transcribed in parallel by specifying them as a list and setting the `batch_size` parameter:
```python
result = pipe(["audio_1.mp3", "audio_2.mp3"], batch_size=2)
```
Transformers is compatible with all Whisper decoding strategies, such as temperature fallback and condition on previous
tokens. The following example demonstrates how to enable these heuristics:
```python
generate_kwargs = {
"max_new_tokens": 448,
"num_beams": 1,
"condition_on_prev_tokens": False,
"compression_ratio_threshold": 1.35, # zlib compression ratio threshold (in token space)
"temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
"logprob_threshold": -1.0,
"no_speech_threshold": 0.6,
"return_timestamps": True,
}
result = pipe(sample, generate_kwargs=generate_kwargs)
```
Whisper predicts the language of the source audio automatically. If the source audio language is known *a-priori*, it
@ -261,10 +252,6 @@ print(result["chunks"])
<summary> For more control over the generation parameters, use the model + processor API directly: </summary>
Ad-hoc generation arguments can be passed to `model.generate`, including `num_beams` for beam-search, `return_timestamps`
for segment-level timestamps, and `prompt_ids` for prompting. See the [docstrings](https://huggingface.co/docs/transformers/en/model_doc/whisper#transformers.WhisperForConditionalGeneration.generate)
for more details.
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
@ -277,100 +264,7 @@ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]
input_features = processor(
sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features
input_features = input_features.to(device, dtype=torch_dtype)
gen_kwargs = {
"max_new_tokens": 128,
"num_beams": 1,
"return_timestamps": False,
}
pred_ids = model.generate(input_features, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=gen_kwargs["return_timestamps"])
print(pred_text)
```
</details>
### Sequential Long-Form
This algorithm uses a sliding window for buffered inference of long audio files (> 30-seconds),
and returns more accurate transcriptions compared to the [chunked long-form algorithm](#chunked-long-form).
The sequential long-form algorithm should be used in either of the following scenarios:
1. Transcription accuracy is the most important factor, and latency is less of a consideration
2. You are transcribing **batches** of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate
The [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
class can be used to transcribe long audio files with the sequential algorithm as follows:
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
"automatic-speech-recognition",
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
max_new_tokens=128,
torch_dtype=torch_dtype,
device=device,
)
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])
```
<details>
<summary> For more control over the generation parameters, use the model + processor API directly: </summary>
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)
@ -401,7 +295,7 @@ gen_kwargs = {
"return_timestamps": True,
}
pred_ids = model.generate(**i nputs, **gen_kwargs)
pred_ids = model.generate(**inputs, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)
print(pred_text)
@ -409,15 +303,29 @@ print(pred_text)
</details>
## Additional Speed & Memory Improvements
You can apply additional speed and memory improvements to Whisper to further reduce the inference speed and VRAM
requirements.
### Chunked Long-Form
large-v3 remains compatible with the Transformers chunked long-form algorithm. This algorithm should be used when
a single large audio file is being transcribed and the fastest possible inference is required. In such circumstances,
the chunked algorithm is up to 9x faster than OpenAI's sequential long-form implementation (see Table 7 of the
[Distil-Whisper paper](https://arxiv.org/pdf/2311.00430.pdf)).
Whisper has a receptive field of 30-seconds. To transcribe audios longer than this, one of two long-form algorithms are
required:
1. **Sequential:** uses a "sliding window" for buffered inference, transcribing 30-second slices one after the other
2. **Chunked:** splits long audio files into shorter ones (with a small overlap between segments), transcribes each segment independently, and stitches the resulting transcriptions at the boundaries
To enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. For distil-large-v3, a chunk length of 25-seconds
is optimal. To activate batching over long audio files, pass the argument `batch_size`:
The sequential long-form algorithm should be used in either of the following scenarios:
1. Transcription accuracy is the most important factor, and speed is less of a consideration
2. You are transcribing **batches** of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate
Conversely, the chunked algorithm should be used when:
1. Transcription speed is the most important factor
2. You are transcribing a **single** long audio file
By default, Transformers uses the sequential algorithm. To enable the chunked algorithm, pass the `chunk_length_s`
parameter to the `pipeline`. For large-v3, a chunk length of 30-seconds is optimal. To activate batching over long
audio files, pass the argument `batch_size`:
```python
import torch
@ -431,7 +339,7 @@ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)
@ -442,9 +350,8 @@ pipe = pipeline(
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
max_new_tokens=128,
chunk_length_s=25,
batch_size=16,
chunk_length_s=30,
batch_size=16, # batch size for inference - set based on your device
torch_dtype=torch_dtype,
device=device,
)
@ -456,16 +363,65 @@ result = pipe(sample)
print(result["text"])
```
### Additional Speed & Memory Improvements
#### Torch compile
You can apply additional speed and memory improvements to Distil-Whisper to further reduce the inference speed and VRAM
requirements. These optimisations primarily target the attention kernel, swapping it from an eager implementation to a
more efficient flash attention version.
The Whisper forward pass is compatible with [`torch.compile`](https://pytorch.org/docs/stable/generated/torch.compile.html)
for 4.5x speed-ups.
**Note:** `torch.compile` is currently not compatible with the Chunked long-form algorithm or Flash Attention 2 ⚠️
```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
from tqdm import tqdm
torch.set_float32_matmul_precision("high")
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)
# Enable static cache and compile the forward pass
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
"automatic-speech-recognition",
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
torch_dtype=torch_dtype,
device=device,
)
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
# 2 warmup steps
for _ in tqdm(range(2), desc="Warm-up step"):
with sdpa_kernel(SDPBackend.MATH):
result = pipe(sample.copy())
# fast run
with sdpa_kernel(SDPBackend.MATH):
result = pipe(sample.copy())
print(result["text"])
```
#### Flash Attention 2
We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2)
if your GPU allows for it. To do so, you first need to install [Flash Attention](https://github.com/Dao-AILab/flash-attention):
We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2) if your GPU supports it and you are not using [torch.compile](#torch-compile).
To do so, first install [Flash Attention](https://github.com/Dao-AILab/flash-attention):
```
pip install flash-attn --no-build-isolation
@ -473,9 +429,8 @@ pip install flash-attn --no-build-isolation
Then pass `attn_implementation="flash_attention_2"` to `from_pretrained`:
```diff
- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, attn_implementation="flash_attention_2")
```python
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="flash_attention_2")
```
#### Torch Scale-Product-Attention (SDPA)
@ -496,20 +451,36 @@ returns `False`, you need to upgrade your PyTorch version according to the [offi
Once a valid PyTorch version is installed, SDPA is activated by default. It can also be set explicitly by specifying
`attn_implementation="sdpa"` as follows:
```diff
- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, attn_implementation="sdpa")
```python
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="sdpa")
```
For more information about how to use the SDPA refer to the [Transformers SDPA documentation](https://huggingface.co/docs/transformers/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention).
#### Torch compile
Coming soon...
## Model details
#### 4-bit and 8-bit Inference
Whisper is a Transformer based encoder-decoder model, also referred to as a _sequence-to-sequence_ model. There are two
flavours of Whisper model: English-only and multilingual. The English-only models were trained on the task of English
speech recognition. The multilingual models were trained simultaneously on multilingual speech recognition and speech
translation. For speech recognition, the model predicts transcriptions in the *same* language as the audio. For speech
translation, the model predicts transcriptions to a *different* language to the audio.
Whisper checkpoints come in five configurations of varying model sizes. The smallest four are available as English-only
and multilingual. The largest checkpoints are multilingual only. All ten of the pre-trained checkpoints
are available on the [Hugging Face Hub](https://huggingface.co/models?search=openai/whisper). The
checkpoints are summarised in the following table with links to the models on the Hub:
| Size | Parameters | English-only | Multilingual |
|----------|------------|------------------------------------------------------|-----------------------------------------------------|
| tiny | 39 M | [](https://huggingface.co/openai/whisper-tiny.en) | [](https://huggingface.co/openai/whisper-tiny) |
| base | 74 M | [](https://huggingface.co/openai/whisper-base.en) | [](https://huggingface.co/openai/whisper-base) |
| small | 244 M | [](https://huggingface.co/openai/whisper-small.en) | [](https://huggingface.co/openai/whisper-small) |
| medium | 769 M | [](https://huggingface.co/openai/whisper-medium.en) | [](https://huggingface.co/openai/whisper-medium) |
| large | 1550 M | x | [](https://huggingface.co/openai/whisper-large) |
| large-v2 | 1550 M | x | [](https://huggingface.co/openai/whisper-large-v2) |
| large-v3 | 1550 M | x | [](https://huggingface.co/openai/whisper-large-v3) |
Coming soon...
## Fine-Tuning
@ -529,7 +500,7 @@ In particular, we caution against using Whisper models to transcribe recordings
## Training Data
The models are trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper `large-v2`.
The large-v3 checkpoint is trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio collected using Whisper large-v2.
As discussed in [the accompanying paper](https://cdn.openai.com/papers/whisper.pdf), we see that performance on transcription in a given language is directly correlated with the amount of training data we employ in that language.