Update README.md (#144)

- Update README.md (29aee6b8a7c9c86b46f63bf6fc2331151935026b)
Sanchit Gandhi Β· 2024-08-06 15:41:16 +00:00

# Whisper
Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper
[Robust Speech Recognition via Large-Scale Weak Supervision](https://huggingface.co/papers/2212.04356) by Alec Radford
et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many
datasets and domains in a zero-shot setting.

Whisper large-v3 has the same architecture as the previous [large](https://huggingface.co/openai/whisper-large) and [large-v2](https://huggingface.co/openai/whisper-large-v2)
models, except for the following minor differences:

1. The spectrogram input uses 128 Mel frequency bins instead of 80 (see the quick check below)
2. A new language token for Cantonese
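As a quick sanity check, the change in the number of Mel bins is visible from the preprocessor configuration. The sketch below is illustrative only: it assumes both checkpoints are downloaded from the Hub, and relies on `feature_size` being the feature extractor's count of Mel bins:

```python
from transformers import WhisperFeatureExtractor

# feature_size is the number of Mel bins used to build the log-Mel spectrogram input
extractor_v3 = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")
extractor_v2 = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v2")

print(extractor_v3.feature_size, extractor_v2.feature_size)  # expected: 128 80
```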
The Whisper large-v3 model was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled
audio collected using Whisper [large-v2](https://huggingface.co/openai/whisper-large-v2). The model was trained for 2.0 epochs over this mixture dataset.

The large-v3 model shows improved performance over a wide variety of languages, showing a 10% to 20% reduction in errors
compared to Whisper [large-v2](https://huggingface.co/openai/whisper-large-v2). For more details on the different checkpoints available, refer to the section [Model details](#model-details).

**Disclaimer**: Content for this model card has partly been written by the πŸ€— Hugging Face team, and partly copied and
pasted from the original model card.

## Usage
Whisper large-v3 is supported in Hugging Face πŸ€— Transformers. To run the model, first install the Transformers
library. For this example, we'll also install πŸ€— Datasets to load a toy audio dataset from the Hugging Face Hub, and
πŸ€— Accelerate to reduce the model loading time:

```bash
pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate
```
The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
class to transcribe audio files of arbitrary length:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```

To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
```python
result = pipe("audio.mp3")
```
Multiple audio files can be transcribed in parallel by specifying them as a list and setting the `batch_size` parameter:
```python
result = pipe(["audio_1.mp3", "audio_2.mp3"], batch_size=2)
```
Transformers is compatible with all Whisper decoding strategies, such as temperature fallback and conditioning on
previous tokens. The following example demonstrates how to enable these heuristics:
```python
generate_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}
result = pipe(sample, generate_kwargs=generate_kwargs)
```

Whisper predicts the language of the source audio automatically. If the source audio language is known *a-priori*, it
can be passed as an argument to the pipeline.
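For example, the following minimal sketch forces English decoding. It assumes the `pipe` and `sample` objects from the example above, and relies on `generate_kwargs` being forwarded to the model's `generate` call:

```python
# Force the model to transcribe in English rather than auto-detecting the language
result = pipe(sample, generate_kwargs={"language": "english"})
print(result["text"])
```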
<details>

<summary> For more control over the generation parameters, use the model + processor API directly: </summary>
Ad-hoc generation arguments can be passed to `model.generate`, including `num_beams` for beam-search, `return_timestamps`
for segment-level timestamps, and `prompt_ids` for prompting. See the [docstrings](https://huggingface.co/docs/transformers/en/model_doc/whisper#transformers.WhisperForConditionalGeneration.generate)
for more details.
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]
input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features
input_features = input_features.to(device, dtype=torch_dtype)
gen_kwargs = {
    "max_new_tokens": 128,
    "num_beams": 1,
    "return_timestamps": False,
}
pred_ids = model.generate(input_features, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=gen_kwargs["return_timestamps"])
print(pred_text)
```
</details>
## Additional Speed & Memory Improvements
You can apply additional speed and memory improvements to Whisper to further reduce inference time and VRAM
requirements.

### Chunked Long-Form
Whisper has a receptive field of 30 seconds. To transcribe audios longer than this, one of two long-form algorithms is
required:

1. **Sequential:** uses a "sliding window" for buffered inference, transcribing 30-second slices one after the other
2. **Chunked:** splits long audio files into shorter ones (with a small overlap between segments), transcribes each segment independently, and stitches the resulting transcriptions at the boundaries

The sequential long-form algorithm should be used in either of the following scenarios:

1. Transcription accuracy is the most important factor, and speed is less of a consideration
2. You are transcribing **batches** of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate

Conversely, the chunked algorithm should be used when:

1. Transcription speed is the most important factor
2. You are transcribing a **single** long audio file

By default, Transformers uses the sequential algorithm. To enable the chunked algorithm, pass the `chunk_length_s`
parameter to the `pipeline`. For large-v3, a chunk length of 30 seconds is optimal. To activate batching over long
audio files, pass the argument `batch_size`:
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=16,  # batch size for inference - set based on your device
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```

#### Torch compile

The Whisper forward pass is compatible with [`torch.compile`](https://pytorch.org/docs/stable/generated/torch.compile.html)
for 4.5x speed-ups.

**Note:** `torch.compile` is currently not compatible with the chunked long-form algorithm or Flash Attention 2 ⚠️
```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
from tqdm import tqdm
torch.set_float32_matmul_precision("high")
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)
# Enable static cache and compile the forward pass
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
# 2 warmup steps
for _ in tqdm(range(2), desc="Warm-up step"):
    with sdpa_kernel(SDPBackend.MATH):
        result = pipe(sample.copy())
# fast run
with sdpa_kernel(SDPBackend.MATH):
    result = pipe(sample.copy())
print(result["text"])
```

#### Flash Attention 2

We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2) if your GPU supports it and you are not using [torch.compile](#torch-compile).
To do so, first install [Flash Attention](https://github.com/Dao-AILab/flash-attention):
```
pip install flash-attn --no-build-isolation
```
Then pass `attn_implementation="flash_attention_2"` to `from_pretrained`:

```python
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="flash_attention_2")
```

#### Torch Scaled Dot-Product Attention (SDPA)
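To check whether your environment ships SDPA, you can query the utility below (a minimal check; `is_torch_sdpa_available` is exposed in `transformers.utils`):

```python
from transformers.utils import is_torch_sdpa_available

# True if the installed PyTorch version ships scaled dot-product attention
print(is_torch_sdpa_available())
```

If this returns `False`, you will need a newer PyTorch release (SDPA support in Transformers requires PyTorch 2.1.1 or greater).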
Once a valid PyTorch version is installed, SDPA is activated by default. It can also be set explicitly by specifying
`attn_implementation="sdpa"` as follows:
```python
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="sdpa")
```
For more information about how to use SDPA, refer to the [Transformers SDPA documentation](https://huggingface.co/docs/transformers/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention).

## Model details

Whisper is a Transformer-based encoder-decoder model, also referred to as a _sequence-to-sequence_ model. There are two
flavours of Whisper model: English-only and multilingual. The English-only models were trained on the task of English
speech recognition. The multilingual models were trained simultaneously on multilingual speech recognition and speech
translation. For speech recognition, the model predicts transcriptions in the *same* language as the audio. For speech
translation, the model predicts transcriptions in a *different* language from the audio.
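As an illustration of the two tasks, the sketch below re-uses the `pipe` and `sample` objects from the Usage section above (an assumption; any multilingual checkpoint behaves the same way) and switches between transcription and translation via `generate_kwargs`:

```python
# Speech recognition: the predicted text is in the same language as the audio
result = pipe(sample, generate_kwargs={"task": "transcribe"})

# Speech translation: the predicted text is translated to English
result = pipe(sample, generate_kwargs={"task": "translate"})
```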
Whisper checkpoints come in five configurations of varying model sizes. The smallest four are available as English-only
and multilingual. The largest checkpoints are multilingual only. All eleven of the pre-trained checkpoints
are available on the [Hugging Face Hub](https://huggingface.co/models?search=openai/whisper). The
checkpoints are summarised in the following table with links to the models on the Hub:
| Size | Parameters | English-only | Multilingual |
|----------|------------|------------------------------------------------------|-----------------------------------------------------|
| tiny | 39 M | [βœ“](https://huggingface.co/openai/whisper-tiny.en) | [βœ“](https://huggingface.co/openai/whisper-tiny) |
| base | 74 M | [βœ“](https://huggingface.co/openai/whisper-base.en) | [βœ“](https://huggingface.co/openai/whisper-base) |
| small | 244 M | [βœ“](https://huggingface.co/openai/whisper-small.en) | [βœ“](https://huggingface.co/openai/whisper-small) |
| medium | 769 M | [βœ“](https://huggingface.co/openai/whisper-medium.en) | [βœ“](https://huggingface.co/openai/whisper-medium) |
| large | 1550 M | x | [βœ“](https://huggingface.co/openai/whisper-large) |
| large-v2 | 1550 M | x | [βœ“](https://huggingface.co/openai/whisper-large-v2) |
| large-v3 | 1550 M | x | [βœ“](https://huggingface.co/openai/whisper-large-v3) |
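Any of these checkpoints can be dropped into the pipeline API shown earlier simply by swapping the model identifier. The sketch below is illustrative only: `openai/whisper-tiny` and the file name `audio.mp3` are placeholder choices, not recommendations:

```python
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Swap the checkpoint id for any of the models in the table above
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-tiny",
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe("audio.mp3")
print(result["text"])
```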

## Fine-Tuning
The pre-trained Whisper model demonstrates a strong ability to generalise to different datasets and domains. However,
its predictive capabilities can be improved further for certain languages and tasks through *fine-tuning*. The blog post
[Fine-Tune Whisper with πŸ€— Transformers](https://huggingface.co/blog/fine-tune-whisper) provides a step-by-step guide to
fine-tuning the Whisper model with as little as 5 hours of labelled data.

## Training Data

The large-v3 checkpoint is trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio collected using Whisper large-v2.

As discussed in [the accompanying paper](https://cdn.openai.com/papers/whisper.pdf), we see that performance on transcription in a given language is directly correlated with the amount of training data we employ in that language.