diff --git a/README.md b/README.md
index 0fb10eb..93abeca 100644
--- a/README.md
+++ b/README.md
@@ -114,68 +114,39 @@ license: apache-2.0
# Whisper
-Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours
-of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains **without** the need
-for fine-tuning.
+Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper
+[Robust Speech Recognition via Large-Scale Weak Supervision](https://huggingface.co/papers/2212.04356) by Alec Radford
+et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many
+datasets and domains in a zero-shot setting.
-Whisper was proposed in the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356)
-by Alec Radford et al. from OpenAI. The original code repository can be found [here](https://github.com/openai/whisper).
+Whisper large-v3 has the same architecture as the previous [large](https://huggingface.co/openai/whisper-large) and [large-v2](https://huggingface.co/openai/whisper-large-v2)
+models, except for the following minor differences:
-Whisper `large-v3` has the same architecture as the previous large models except the following minor differences:
-
-1. The input uses 128 Mel frequency bins instead of 80
+1. The spectrogram input uses 128 Mel frequency bins instead of 80
2. A new language token for Cantonese
-The Whisper `large-v3` model is trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper `large-v2`.
-The model was trained for 2.0 epochs over this mixture dataset.
+The Whisper large-v3 model was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled
+audio collected using Whisper [large-v2](https://huggingface.co/openai/whisper-large-v2). The model was trained for 2.0 epochs over this mixture dataset.
-The `large-v3` model shows improved performance over a wide variety of languages, showing 10% to 20% reduction of errors compared to Whisper `large-v2`.
+The large-v3 model shows improved performance over a wide variety of languages, with a 10% to 20% reduction in errors
+compared to Whisper [large-v2](https://huggingface.co/openai/whisper-large-v2). For more details on the different checkpoints available, refer to the section [Model details](#model-details).
-
-**Disclaimer**: Content for this model card has partly been written by the Hugging Face team, and parts of it were
-copied and pasted from the original model card.
-
-## Model details
-
-Whisper is a Transformer based encoder-decoder model, also referred to as a _sequence-to-sequence_ model.
-It was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper `large-v2`.
-
-The models were trained on either English-only data or multilingual data. The English-only models were trained
-on the task of speech recognition. The multilingual models were trained on both speech recognition and speech
-translation. For speech recognition, the model predicts transcriptions in the *same* language as the audio.
-For speech translation, the model predicts transcriptions to a *different* language to the audio.
-
-Whisper checkpoints come in five configurations of varying model sizes.
-The smallest four are trained on either English-only or multilingual data.
-The largest checkpoints are multilingual only. All ten of the pre-trained checkpoints
-are available on the [Hugging Face Hub](https://huggingface.co/models?search=openai/whisper). The
-checkpoints are summarised in the following table with links to the models on the Hub:
-
-| Size | Parameters | English-only | Multilingual |
-|----------|------------|------------------------------------------------------|-----------------------------------------------------|
-| tiny | 39 M | [✓](https://huggingface.co/openai/whisper-tiny.en) | [✓](https://huggingface.co/openai/whisper-tiny) |
-| base | 74 M | [✓](https://huggingface.co/openai/whisper-base.en) | [✓](https://huggingface.co/openai/whisper-base) |
-| small | 244 M | [✓](https://huggingface.co/openai/whisper-small.en) | [✓](https://huggingface.co/openai/whisper-small) |
-| medium | 769 M | [✓](https://huggingface.co/openai/whisper-medium.en) | [✓](https://huggingface.co/openai/whisper-medium) |
-| large | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large) |
-| large-v2 | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large-v2) |
-| large-v3 | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large-v3) |
+**Disclaimer**: Content for this model card has partly been written by the 🤗 Hugging Face team, and partly copied and
+pasted from the original model card.
## Usage
-Whisper `large-v3` is supported in Hugging Face 🤗 Transformers. To run the model, first
-install the Transformers library through the GitHub repo. For this example, we'll also install 🤗 Datasets to load toy
-audio dataset from the Hugging Face Hub:
+Whisper large-v3 is supported in Hugging Face 🤗 Transformers. To run the model, first install the Transformers
+library. For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub, and
+🤗 Accelerate to reduce the model loading time:
```bash
pip install --upgrade pip
-pip install --upgrade git+https://github.com/huggingface/transformers.git accelerate datasets[audio]
+pip install --upgrade transformers datasets[audio] accelerate
```
-### Short-Form Transcription
-
The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
-class to transcribe short-form audio files (< 30-seconds) as follows:
+class to transcribe audio of arbitrary length:
```python
import torch
@@ -200,10 +171,6 @@ pipe = pipeline(
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
- max_new_tokens=128,
- chunk_length_s=30,
- batch_size=16,
- return_timestamps=True,
torch_dtype=torch_dtype,
device=device,
)
@@ -216,9 +183,33 @@ print(result["text"])
```
To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
-```diff
-- result = pipe(sample)
-+ result = pipe("audio.mp3")
+
+```python
+result = pipe("audio.mp3")
+```
+
+Multiple audio files can be transcribed in parallel by specifying them as a list and setting the `batch_size` parameter:
+
+```python
+result = pipe(["audio_1.mp3", "audio_2.mp3"], batch_size=2)
+```
+
+Transformers is compatible with all Whisper decoding strategies, such as temperature fallback and conditioning on previous
+tokens. The following example demonstrates how to enable these heuristics:
+
+```python
+generate_kwargs = {
+ "max_new_tokens": 448,
+ "num_beams": 1,
+ "condition_on_prev_tokens": False,
+ "compression_ratio_threshold": 1.35, # zlib compression ratio threshold (in token space)
+ "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
+ "logprob_threshold": -1.0,
+ "no_speech_threshold": 0.6,
+ "return_timestamps": True,
+}
+
+result = pipe(sample, generate_kwargs=generate_kwargs)
```
Whisper predicts the language of the source audio automatically. If the source audio language is known *a priori*, it
@@ -261,10 +252,6 @@ print(result["chunks"])
For more control over the generation parameters, use the model + processor API directly:
-Ad-hoc generation arguments can be passed to `model.generate`, including `num_beams` for beam-search, `return_timestamps`
-for segment-level timestamps, and `prompt_ids` for prompting. See the [docstrings](https://huggingface.co/docs/transformers/en/model_doc/whisper#transformers.WhisperForConditionalGeneration.generate)
-for more details.
-
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
@@ -277,100 +264,7 @@ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
- model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
-)
-model.to(device)
-
-processor = AutoProcessor.from_pretrained(model_id)
-
-dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
-dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
-sample = dataset[0]["audio"]
-
-input_features = processor(
- sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
-).input_features
-
-input_features = input_features.to(device, dtype=torch_dtype)
-
-gen_kwargs = {
- "max_new_tokens": 128,
- "num_beams": 1,
- "return_timestamps": False,
-}
-
-pred_ids = model.generate(input_features, **gen_kwargs)
-pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=gen_kwargs["return_timestamps"])
-
-print(pred_text)
-```
-
-
-
-### Sequential Long-Form
-
-This algorithm uses a sliding window for buffered inference of long audio files (> 30-seconds),
-and returns more accurate transcriptions compared to the [chunked long-form algorithm](#chunked-long-form).
-
-The sequential long-form algorithm should be used in either of the following scenarios:
-1. Transcription accuracy is the most important factor, and latency is less of a consideration
-2. You are transcribing **batches** of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate
-
-The [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
-class can be used to transcribe long audio files with the sequential algorithm as follows:
-
-```python
-import torch
-from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
-from datasets import load_dataset
-
-
-device = "cuda:0" if torch.cuda.is_available() else "cpu"
-torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
-
-model_id = "openai/whisper-large-v3"
-
-model = AutoModelForSpeechSeq2Seq.from_pretrained(
- model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
-)
-model.to(device)
-
-processor = AutoProcessor.from_pretrained(model_id)
-
-pipe = pipeline(
- "automatic-speech-recognition",
- model=model,
- tokenizer=processor.tokenizer,
- feature_extractor=processor.feature_extractor,
- max_new_tokens=128,
- torch_dtype=torch_dtype,
- device=device,
-)
-
-dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
-sample = dataset[0]["audio"]
-
-result = pipe(sample)
-print(result["text"])
-```
-
-
-
- For more control over the generation parameters, use the model + processor API directly:
-
-```python
-import torch
-from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
-from datasets import Audio, load_dataset
-
-
-device = "cuda:0" if torch.cuda.is_available() else "cpu"
-torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
-
-model_id = "openai/whisper-large-v3"
-
-model = AutoModelForSpeechSeq2Seq.from_pretrained(
- model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
+ model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)
@@ -401,7 +295,7 @@ gen_kwargs = {
"return_timestamps": True,
}
-pred_ids = model.generate(**i nputs, **gen_kwargs)
+pred_ids = model.generate(**inputs, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)
print(pred_text)
@@ -409,15 +303,29 @@ print(pred_text)
+## Additional Speed & Memory Improvements
+
+You can apply additional speed and memory improvements to Whisper to further reduce inference time and VRAM
+requirements.
+
### Chunked Long-Form
-large-v3 remains compatible with the Transformers chunked long-form algorithm. This algorithm should be used when
-a single large audio file is being transcribed and the fastest possible inference is required. In such circumstances,
-the chunked algorithm is up to 9x faster than OpenAI's sequential long-form implementation (see Table 7 of the
-[Distil-Whisper paper](https://arxiv.org/pdf/2311.00430.pdf)).
+Whisper has a receptive field of 30 seconds. To transcribe audio longer than this, one of two long-form algorithms is
+required:
+1. **Sequential:** uses a "sliding window" for buffered inference, transcribing 30-second slices one after the other
+2. **Chunked:** splits long audio files into shorter ones (with a small overlap between segments), transcribes each segment independently, and stitches the resulting transcriptions at the boundaries
-To enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. For distil-large-v3, a chunk length of 25-seconds
-is optimal. To activate batching over long audio files, pass the argument `batch_size`:
+The sequential long-form algorithm should be used in either of the following scenarios:
+1. Transcription accuracy is the most important factor, and speed is less of a consideration
+2. You are transcribing **batches** of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate
+
+Conversely, the chunked algorithm should be used when:
+1. Transcription speed is the most important factor
+2. You are transcribing a **single** long audio file
+
+By default, Transformers uses the sequential algorithm. To enable the chunked algorithm, pass the `chunk_length_s`
+parameter to the `pipeline`. For large-v3, a chunk length of 30-seconds is optimal. To activate batching over long
+audio files, pass the argument `batch_size`:
```python
import torch
@@ -431,7 +339,7 @@ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
- model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
+ model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)
@@ -442,9 +350,8 @@ pipe = pipeline(
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
- max_new_tokens=128,
- chunk_length_s=25,
- batch_size=16,
+ chunk_length_s=30,
+ batch_size=16, # batch size for inference - set based on your device
torch_dtype=torch_dtype,
device=device,
)
@@ -456,16 +363,65 @@ result = pipe(sample)
print(result["text"])
```
-### Additional Speed & Memory Improvements
+#### Torch compile
-You can apply additional speed and memory improvements to Distil-Whisper to further reduce the inference speed and VRAM
-requirements. These optimisations primarily target the attention kernel, swapping it from an eager implementation to a
-more efficient flash attention version.
+The Whisper forward pass is compatible with [`torch.compile`](https://pytorch.org/docs/stable/generated/torch.compile.html)
+for 4.5x speed-ups.
+
+**Note:** `torch.compile` is currently not compatible with the chunked long-form algorithm or Flash Attention 2 ⚠️
+
+```python
+import torch
+from torch.nn.attention import SDPBackend, sdpa_kernel
+from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
+from datasets import load_dataset
+from tqdm import tqdm
+
+torch.set_float32_matmul_precision("high")
+
+device = "cuda:0" if torch.cuda.is_available() else "cpu"
+torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
+
+model_id = "openai/whisper-large-v3"
+
+model = AutoModelForSpeechSeq2Seq.from_pretrained(
+ model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
+).to(device)
+
+# Enable static cache and compile the forward pass
+model.generation_config.cache_implementation = "static"
+model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
+
+processor = AutoProcessor.from_pretrained(model_id)
+
+pipe = pipeline(
+ "automatic-speech-recognition",
+ model=model,
+ tokenizer=processor.tokenizer,
+ feature_extractor=processor.feature_extractor,
+ torch_dtype=torch_dtype,
+ device=device,
+)
+
+dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
+sample = dataset[0]["audio"]
+
+# 2 warmup steps
+for _ in tqdm(range(2), desc="Warm-up step"):
+ with sdpa_kernel(SDPBackend.MATH):
+ result = pipe(sample.copy())
+
+# fast run
+with sdpa_kernel(SDPBackend.MATH):
+ result = pipe(sample.copy())
+
+print(result["text"])
+```
#### Flash Attention 2
-We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2)
-if your GPU allows for it. To do so, you first need to install [Flash Attention](https://github.com/Dao-AILab/flash-attention):
+We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2) if your GPU supports it and you are not using [torch.compile](#torch-compile).
+To do so, first install [Flash Attention](https://github.com/Dao-AILab/flash-attention):
```
pip install flash-attn --no-build-isolation
@@ -473,9 +429,8 @@ pip install flash-attn --no-build-isolation
Then pass `attn_implementation="flash_attention_2"` to `from_pretrained`:
-```diff
-- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
-+ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, attn_implementation="flash_attention_2")
+```python
+model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="flash_attention_2")
```
#### Torch Scaled Dot-Product Attention (SDPA)
@@ -496,20 +451,36 @@ returns `False`, you need to upgrade your PyTorch version according to the [offi
Once a valid PyTorch version is installed, SDPA is activated by default. It can also be set explicitly by specifying
`attn_implementation="sdpa"` as follows:
-```diff
-- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
-+ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, attn_implementation="sdpa")
+```python
+model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="sdpa")
```
For more information about how to use SDPA, refer to the [Transformers SDPA documentation](https://huggingface.co/docs/transformers/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention).
-#### Torch compile
-Coming soon...
+## Model details
-#### 4-bit and 8-bit Inference
+Whisper is a Transformer based encoder-decoder model, also referred to as a _sequence-to-sequence_ model. There are two
+flavours of Whisper model: English-only and multilingual. The English-only models were trained on the task of English
+speech recognition. The multilingual models were trained simultaneously on multilingual speech recognition and speech
+translation. For speech recognition, the model predicts transcriptions in the *same* language as the audio. For speech
+translation, the model predicts transcriptions in a *different* language from the audio.
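+
+As a minimal sketch of switching between the two tasks (re-using the `pipe` and `sample` objects from the pipeline
+example above), the task for the multilingual checkpoints can be selected through `generate_kwargs`:
+
+```python
+# transcribe in the source language (the default behaviour)
+result = pipe(sample, generate_kwargs={"task": "transcribe"})
+
+# translate the source audio to English text
+result = pipe(sample, generate_kwargs={"task": "translate"})
+```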
+
+Whisper checkpoints come in five configurations of varying model sizes. The smallest four are available as English-only
+and multilingual. The largest checkpoints are multilingual only. All eleven of the pre-trained checkpoints
+are available on the [Hugging Face Hub](https://huggingface.co/models?search=openai/whisper). The
+checkpoints are summarised in the following table with links to the models on the Hub:
+
+| Size | Parameters | English-only | Multilingual |
+|----------|------------|------------------------------------------------------|-----------------------------------------------------|
+| tiny | 39 M | [✓](https://huggingface.co/openai/whisper-tiny.en) | [✓](https://huggingface.co/openai/whisper-tiny) |
+| base | 74 M | [✓](https://huggingface.co/openai/whisper-base.en) | [✓](https://huggingface.co/openai/whisper-base) |
+| small | 244 M | [✓](https://huggingface.co/openai/whisper-small.en) | [✓](https://huggingface.co/openai/whisper-small) |
+| medium | 769 M | [✓](https://huggingface.co/openai/whisper-medium.en) | [✓](https://huggingface.co/openai/whisper-medium) |
+| large | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large) |
+| large-v2 | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large-v2) |
+| large-v3 | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large-v3) |
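+
+As an illustrative sketch (assuming a local `audio.mp3` file), any checkpoint from the table can be dropped into the
+same pipeline API by changing the model identifier, with smaller checkpoints trading accuracy for speed and memory:
+
+```python
+from transformers import pipeline
+
+# English-only checkpoint; use e.g. "openai/whisper-small" for the multilingual equivalent
+pipe = pipeline("automatic-speech-recognition", model="openai/whisper-tiny.en")
+
+result = pipe("audio.mp3")
+print(result["text"])
+```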
-Coming soon...
## Fine-Tuning
@@ -529,7 +500,7 @@ In particular, we caution against using Whisper models to transcribe recordings
## Training Data
-The models are trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper `large-v2`.
+The large-v3 checkpoint is trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio collected using Whisper large-v2.
As discussed in [the accompanying paper](https://cdn.openai.com/papers/whisper.pdf), we see that performance on transcription in a given language is directly correlated with the amount of training data we employ in that language.