diff --git a/README.md b/README.md
index 0fb10eb..93abeca 100644
--- a/README.md
+++ b/README.md
@@ -114,68 +114,39 @@ license: apache-2.0
 
 # Whisper
 
-Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours
-of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains **without** the need
-for fine-tuning.
+Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper
+[Robust Speech Recognition via Large-Scale Weak Supervision](https://huggingface.co/papers/2212.04356) by Alec Radford
+et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many
+datasets and domains in a zero-shot setting.
 
-Whisper was proposed in the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356)
-by Alec Radford et al. from OpenAI. The original code repository can be found [here](https://github.com/openai/whisper).
+Whisper large-v3 has the same architecture as the previous [large](https://huggingface.co/openai/whisper-large) and [large-v2](https://huggingface.co/openai/whisper-large-v2)
+models, except for the following minor differences:
 
-Whisper `large-v3` has the same architecture as the previous large models except the following minor differences:
-
-1. The input uses 128 Mel frequency bins instead of 80
+1. The spectrogram input uses 128 Mel frequency bins instead of 80
 2. A new language token for Cantonese
 
-The Whisper `large-v3` model is trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper `large-v2`.
-The model was trained for 2.0 epochs over this mixture dataset.
+The Whisper large-v3 model was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled
+audio collected using Whisper [large-v2](https://huggingface.co/openai/whisper-large-v2). The model was trained for 2.0 epochs over this mixture dataset.
 
-The `large-v3` model shows improved performance over a wide variety of languages, showing 10% to 20% reduction of errors compared to Whisper `large-v2`.
+The large-v3 model shows improved performance over a wide variety of languages, with a 10% to 20% reduction in errors
+compared to Whisper [large-v2](https://huggingface.co/openai/whisper-large-v2).
 
 For more details on the different checkpoints available, refer to the section [Model details](#model-details).
-
-**Disclaimer**: Content for this model card has partly been written by the Hugging Face team, and parts of it were
-copied and pasted from the original model card.
-
-## Model details
-
-Whisper is a Transformer based encoder-decoder model, also referred to as a _sequence-to-sequence_ model.
-It was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper `large-v2`.
-
-The models were trained on either English-only data or multilingual data. The English-only models were trained
-on the task of speech recognition. The multilingual models were trained on both speech recognition and speech
-translation. For speech recognition, the model predicts transcriptions in the *same* language as the audio.
-For speech translation, the model predicts transcriptions to a *different* language to the audio.
-
-Whisper checkpoints come in five configurations of varying model sizes. 
-The smallest four are trained on either English-only or multilingual data.
-The largest checkpoints are multilingual only. All ten of the pre-trained checkpoints
-are available on the [Hugging Face Hub](https://huggingface.co/models?search=openai/whisper). The
-checkpoints are summarised in the following table with links to the models on the Hub:
-
-| Size     | Parameters | English-only                                         | Multilingual                                        |
-|----------|------------|------------------------------------------------------|-----------------------------------------------------|
-| tiny     | 39 M       | [✓](https://huggingface.co/openai/whisper-tiny.en)   | [✓](https://huggingface.co/openai/whisper-tiny)     |
-| base     | 74 M       | [✓](https://huggingface.co/openai/whisper-base.en)   | [✓](https://huggingface.co/openai/whisper-base)     |
-| small    | 244 M      | [✓](https://huggingface.co/openai/whisper-small.en)  | [✓](https://huggingface.co/openai/whisper-small)    |
-| medium   | 769 M      | [✓](https://huggingface.co/openai/whisper-medium.en) | [✓](https://huggingface.co/openai/whisper-medium)   |
-| large    | 1550 M     | x                                                    | [✓](https://huggingface.co/openai/whisper-large)    |
-| large-v2 | 1550 M     | x                                                    | [✓](https://huggingface.co/openai/whisper-large-v2) |
-| large-v3 | 1550 M     | x                                                    | [✓](https://huggingface.co/openai/whisper-large-v3) |
+**Disclaimer**: Content for this model card has partly been written by the 🤗 Hugging Face team, and partly copied and
+pasted from the original model card.
 
 ## Usage
 
-Whisper `large-v3` is supported in Hugging Face 🤗 Transformers. To run the model, first
-install the Transformers library through the GitHub repo. For this example, we'll also install 🤗 Datasets to load toy
-audio dataset from the Hugging Face Hub:
+Whisper large-v3 is supported in Hugging Face 🤗 Transformers. To run the model, first install the Transformers
+library. For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub, and
+🤗 Accelerate to reduce the model loading time:
 
 ```bash
 pip install --upgrade pip
-pip install --upgrade git+https://github.com/huggingface/transformers.git accelerate datasets[audio]
+pip install --upgrade transformers datasets[audio] accelerate
 ```
 
-### Short-Form Transcription
-
 The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
-class to transcribe short-form audio files (< 30-seconds) as follows:
+class to transcribe audio of arbitrary length:
 
 ```python
 import torch
@@ -200,10 +171,6 @@ pipe = pipeline(
     model=model,
     tokenizer=processor.tokenizer,
     feature_extractor=processor.feature_extractor,
-    max_new_tokens=128,
-    chunk_length_s=30,
-    batch_size=16,
-    return_timestamps=True,
     torch_dtype=torch_dtype,
     device=device,
 )
@@ -216,9 +183,33 @@ print(result["text"])
 ```
 
 To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
-```diff
-- result = pipe(sample)
-+ result = pipe("audio.mp3")
+
+```python
+result = pipe("audio.mp3")
+```
+
+Multiple audio files can be transcribed in parallel by specifying them as a list and setting the `batch_size` parameter:
+
+```python
+result = pipe(["audio_1.mp3", "audio_2.mp3"], batch_size=2)
+```
+
+Transformers is compatible with all Whisper decoding strategies, such as temperature fallback and conditioning on previous
+tokens. 
The following example demonstrates how to enable these heuristics: + +```python +generate_kwargs = { + "max_new_tokens": 448, + "num_beams": 1, + "condition_on_prev_tokens": False, + "compression_ratio_threshold": 1.35, # zlib compression ratio threshold (in token space) + "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0), + "logprob_threshold": -1.0, + "no_speech_threshold": 0.6, + "return_timestamps": True, +} + +result = pipe(sample, generate_kwargs=generate_kwargs) ``` Whisper predicts the language of the source audio automatically. If the source audio language is known *a-priori*, it @@ -261,10 +252,6 @@ print(result["chunks"]) For more control over the generation parameters, use the model + processor API directly: -Ad-hoc generation arguments can be passed to `model.generate`, including `num_beams` for beam-search, `return_timestamps` -for segment-level timestamps, and `prompt_ids` for prompting. See the [docstrings](https://huggingface.co/docs/transformers/en/model_doc/whisper#transformers.WhisperForConditionalGeneration.generate) -for more details. - ```python import torch from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor @@ -277,100 +264,7 @@ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32 model_id = "openai/whisper-large-v3" model = AutoModelForSpeechSeq2Seq.from_pretrained( - model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True -) -model.to(device) - -processor = AutoProcessor.from_pretrained(model_id) - -dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation") -dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate)) -sample = dataset[0]["audio"] - -input_features = processor( - sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt" -).input_features - -input_features = input_features.to(device, dtype=torch_dtype) - -gen_kwargs = { - "max_new_tokens": 128, - "num_beams": 1, - "return_timestamps": False, -} - -pred_ids = model.generate(input_features, **gen_kwargs) -pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=gen_kwargs["return_timestamps"]) - -print(pred_text) -``` - - - -### Sequential Long-Form - -This algorithm uses a sliding window for buffered inference of long audio files (> 30-seconds), -and returns more accurate transcriptions compared to the [chunked long-form algorithm](#chunked-long-form). - -The sequential long-form algorithm should be used in either of the following scenarios: -1. Transcription accuracy is the most important factor, and latency is less of a consideration -2. 
You are transcribing **batches** of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate - -The [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) -class can be used to transcribe long audio files with the sequential algorithm as follows: - -```python -import torch -from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline -from datasets import load_dataset - - -device = "cuda:0" if torch.cuda.is_available() else "cpu" -torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32 - -model_id = "openai/whisper-large-v3" - -model = AutoModelForSpeechSeq2Seq.from_pretrained( - model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True -) -model.to(device) - -processor = AutoProcessor.from_pretrained(model_id) - -pipe = pipeline( - "automatic-speech-recognition", - model=model, - tokenizer=processor.tokenizer, - feature_extractor=processor.feature_extractor, - max_new_tokens=128, - torch_dtype=torch_dtype, - device=device, -) - -dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation") -sample = dataset[0]["audio"] - -result = pipe(sample) -print(result["text"]) -``` - -
- - For more control over the generation parameters, use the model + processor API directly: - -```python -import torch -from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor -from datasets import Audio, load_dataset - - -device = "cuda:0" if torch.cuda.is_available() else "cpu" -torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32 - -model_id = "openai/whisper-large-v3" - -model = AutoModelForSpeechSeq2Seq.from_pretrained( - model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True + model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True ) model.to(device) @@ -401,7 +295,7 @@ gen_kwargs = { "return_timestamps": True, } -pred_ids = model.generate(**i nputs, **gen_kwargs) +pred_ids = model.generate(**inputs, **gen_kwargs) pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False) print(pred_text) @@ -409,15 +303,29 @@ print(pred_text)
+## Additional Speed & Memory Improvements
+
+You can apply additional speed and memory improvements to Whisper to further reduce the inference time and VRAM
+requirements.
+
 ### Chunked Long-Form
 
-large-v3 remains compatible with the Transformers chunked long-form algorithm. This algorithm should be used when
-a single large audio file is being transcribed and the fastest possible inference is required. In such circumstances,
-the chunked algorithm is up to 9x faster than OpenAI's sequential long-form implementation (see Table 7 of the
-[Distil-Whisper paper](https://arxiv.org/pdf/2311.00430.pdf)).
+Whisper has a receptive field of 30-seconds. To transcribe audio longer than this, one of two long-form algorithms is
+required:
+1. **Sequential:** uses a "sliding window" for buffered inference, transcribing 30-second slices one after the other
+2. **Chunked:** splits long audio files into shorter ones (with a small overlap between segments), transcribes each segment independently, and stitches the resulting transcriptions at the boundaries
 
-To enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. For distil-large-v3, a chunk length of 25-seconds
-is optimal. To activate batching over long audio files, pass the argument `batch_size`:
+The sequential long-form algorithm should be used in either of the following scenarios:
+1. Transcription accuracy is the most important factor, and speed is less of a consideration
+2. You are transcribing **batches** of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate
+
+Conversely, the chunked algorithm should be used when:
+1. Transcription speed is the most important factor
+2. You are transcribing a **single** long audio file
+
+By default, Transformers uses the sequential algorithm. To enable the chunked algorithm, pass the `chunk_length_s`
+parameter to the `pipeline`. For large-v3, a chunk length of 30-seconds is optimal. To activate batching over long
+audio files, pass the argument `batch_size`:
 
 ```python
 import torch
@@ -431,7 +339,7 @@ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
 
 model_id = "openai/whisper-large-v3"
 
 model = AutoModelForSpeechSeq2Seq.from_pretrained(
-    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
+    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
 )
 model.to(device)
@@ -442,9 +350,8 @@ pipe = pipeline(
     model=model,
     tokenizer=processor.tokenizer,
     feature_extractor=processor.feature_extractor,
-    max_new_tokens=128,
-    chunk_length_s=25,
-    batch_size=16,
+    chunk_length_s=30,
+    batch_size=16,  # batch size for inference - set based on your device
     torch_dtype=torch_dtype,
     device=device,
 )
@@ -456,16 +363,65 @@ result = pipe(sample)
 print(result["text"])
 ```
 
-### Additional Speed & Memory Improvements
+#### Torch compile
 
-You can apply additional speed and memory improvements to Distil-Whisper to further reduce the inference speed and VRAM
-requirements. These optimisations primarily target the attention kernel, swapping it from an eager implementation to a
-more efficient flash attention version.
+The Whisper forward pass is compatible with [`torch.compile`](https://pytorch.org/docs/stable/generated/torch.compile.html)
+for 4.5x speed-ups. 
+ +**Note:** `torch.compile` is currently not compatible with the Chunked long-form algorithm or Flash Attention 2 ⚠️ + +```python +import torch +from torch.nn.attention import SDPBackend, sdpa_kernel +from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline +from datasets import load_dataset +from tqdm import tqdm + +torch.set_float32_matmul_precision("high") + +device = "cuda:0" if torch.cuda.is_available() else "cpu" +torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32 + +model_id = "openai/whisper-large-v3" + +model = AutoModelForSpeechSeq2Seq.from_pretrained( + model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True +).to(device) + +# Enable static cache and compile the forward pass +model.generation_config.cache_implementation = "static" +model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True) + +processor = AutoProcessor.from_pretrained(model_id) + +pipe = pipeline( + "automatic-speech-recognition", + model=model, + tokenizer=processor.tokenizer, + feature_extractor=processor.feature_extractor, + torch_dtype=torch_dtype, + device=device, +) + +dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation") +sample = dataset[0]["audio"] + +# 2 warmup steps +for _ in tqdm(range(2), desc="Warm-up step"): + with sdpa_kernel(SDPBackend.MATH): + result = pipe(sample.copy()) + +# fast run +with sdpa_kernel(SDPBackend.MATH): + result = pipe(sample.copy()) + +print(result["text"]) +``` #### Flash Attention 2 -We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2) -if your GPU allows for it. To do so, you first need to install [Flash Attention](https://github.com/Dao-AILab/flash-attention): +We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2) if your GPU supports it and you are not using [torch.compile](#torch-compile). +To do so, first install [Flash Attention](https://github.com/Dao-AILab/flash-attention): ``` pip install flash-attn --no-build-isolation @@ -473,9 +429,8 @@ pip install flash-attn --no-build-isolation Then pass `attn_implementation="flash_attention_2"` to `from_pretrained`: -```diff -- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True) -+ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, attn_implementation="flash_attention_2") +```python +model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="flash_attention_2") ``` #### Torch Scale-Product-Attention (SDPA) @@ -496,20 +451,36 @@ returns `False`, you need to upgrade your PyTorch version according to the [offi Once a valid PyTorch version is installed, SDPA is activated by default. 
It can also be set explicitly by specifying `attn_implementation="sdpa"` as follows: -```diff -- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True) -+ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, attn_implementation="sdpa") +```python +model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="sdpa") ``` For more information about how to use the SDPA refer to the [Transformers SDPA documentation](https://huggingface.co/docs/transformers/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention). -#### Torch compile -Coming soon... +## Model details -#### 4-bit and 8-bit Inference +Whisper is a Transformer based encoder-decoder model, also referred to as a _sequence-to-sequence_ model. There are two +flavours of Whisper model: English-only and multilingual. The English-only models were trained on the task of English +speech recognition. The multilingual models were trained simultaneously on multilingual speech recognition and speech +translation. For speech recognition, the model predicts transcriptions in the *same* language as the audio. For speech +translation, the model predicts transcriptions to a *different* language to the audio. + +Whisper checkpoints come in five configurations of varying model sizes. The smallest four are available as English-only +and multilingual. The largest checkpoints are multilingual only. All ten of the pre-trained checkpoints +are available on the [Hugging Face Hub](https://huggingface.co/models?search=openai/whisper). The +checkpoints are summarised in the following table with links to the models on the Hub: + +| Size | Parameters | English-only | Multilingual | +|----------|------------|------------------------------------------------------|-----------------------------------------------------| +| tiny | 39 M | [✓](https://huggingface.co/openai/whisper-tiny.en) | [✓](https://huggingface.co/openai/whisper-tiny) | +| base | 74 M | [✓](https://huggingface.co/openai/whisper-base.en) | [✓](https://huggingface.co/openai/whisper-base) | +| small | 244 M | [✓](https://huggingface.co/openai/whisper-small.en) | [✓](https://huggingface.co/openai/whisper-small) | +| medium | 769 M | [✓](https://huggingface.co/openai/whisper-medium.en) | [✓](https://huggingface.co/openai/whisper-medium) | +| large | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large) | +| large-v2 | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large-v2) | +| large-v3 | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large-v3) | -Coming soon... ## Fine-Tuning @@ -529,7 +500,7 @@ In particular, we caution against using Whisper models to transcribe recordings ## Training Data -The models are trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper `large-v2`. +The large-v3 checkpoint is trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio collected using Whisper large-v2. As discussed in [the accompanying paper](https://cdn.openai.com/papers/whisper.pdf), we see that performance on transcription in a given language is directly correlated with the amount of training data we employ in that language.