Update README (#6)
- Update README with language/task/timestamp info (7121cfc0bf874953f9a47845177517abf1e0940d)

Co-authored-by: Sanchit Gandhi <sanchit-gandhi@users.noreply.huggingface.co>
parent 4b1ba054cc
commit 4a83ae28a5

README.md
@@ -172,10 +172,11 @@ pip install --upgrade pip
 pip install --upgrade git+https://github.com/huggingface/transformers.git accelerate datasets[audio]
 ```

-### Short-Form Transcription
-
 The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
-class to transcribe short-form audio files (< 30-seconds) as follows:
+class to transcribe audio files of arbitrary length. Transformers uses a chunked algorithm to transcribe
+long-form audio files, which in-practice is 9x faster than the sequential algorithm proposed by OpenAI
+(see Table 7 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430)). The batch size should
+be set based on the specifications of your device:

 ```python
 import torch
@@ -201,11 +202,14 @@ pipe = pipeline(
     tokenizer=processor.tokenizer,
     feature_extractor=processor.feature_extractor,
     max_new_tokens=128,
+    chunk_length_s=30,
+    batch_size=16,
+    return_timestamps=True,
     torch_dtype=torch_dtype,
     device=device,
 )

-dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
+dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
 sample = dataset[0]["audio"]

 result = pipe(sample)
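Read together, the two hunks above convert the short-form snippet into a single long-form example. The sketch below assembles the new state of that snippet after this commit; the model-loading lines between the hunks are not part of the diff and are assumed to match the unchanged code visible in the removed long-form block in the next hunk.

```python
# Sketch of the full updated example, assembled from the hunks in this commit.
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,       # added in this commit: chunked long-form decoding
    batch_size=16,           # added in this commit: batch the audio chunks
    return_timestamps=True,  # added in this commit: sentence-level timestamps
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```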
@@ -218,59 +222,43 @@ To transcribe a local audio file, simply pass the path to your audio file when y
 + result = pipe("audio.mp3")
 ```

-### Long-Form Transcription
-
-Through Transformers Whisper uses a chunked algorithm to transcribe long-form audio files (> 30-seconds). In practice, this chunked long-form algorithm
-is 9x faster than the sequential algorithm proposed by OpenAI in the Whisper paper (see Table 7 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430)).
-
-To enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. To activate batching, pass the argument `batch_size`:
+Whisper predicts the language of the source audio automatically. If the source audio language is known *a-priori*, it
+can be passed as an argument to the pipeline:

 ```python
-import torch
-from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
-from datasets import load_dataset
-
-
-device = "cuda:0" if torch.cuda.is_available() else "cpu"
-torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
-
-model_id = "openai/whisper-large-v3"
-
-model = AutoModelForSpeechSeq2Seq.from_pretrained(
-    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
-)
-model.to(device)
-
-processor = AutoProcessor.from_pretrained(model_id)
-
-pipe = pipeline(
-    "automatic-speech-recognition",
-    model=model,
-    tokenizer=processor.tokenizer,
-    feature_extractor=processor.feature_extractor,
-    max_new_tokens=128,
-    chunk_length_s=15,
-    batch_size=16,
-    torch_dtype=torch_dtype,
-    device=device,
-)
-
-dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
-sample = dataset[0]["audio"]
-
-result = pipe(sample)
-print(result["text"])
+result = pipe(sample, generate_kwargs={"language": "english"})
 ```

-<!---
-**Tip:** The pipeline can also be used to transcribe an audio file from a remote URL, for example:
+By default, Whisper performs the task of *speech transcription*, where the source audio language is the same as the target
+text language. To perform *speech translation*, where the target text is in English, set the task to `"translate"`:

 ```python
-result = pipe("https://huggingface.co/datasets/sanchit-gandhi/librispeech_long/resolve/main/audio.wav")
+result = pipe(sample, generate_kwargs={"task": "translate"})
 ```
---->

-### Speculative Decoding
+Finally, the model can be made to predict timestamps. For sentence-level timestamps, pass the `return_timestamps` argument:
+
+```python
+result = pipe(sample, return_timestamps=True)
+print(result["chunks"])
+```
+
+And for word-level timestamps:
+
+```python
+result = pipe(sample, return_timestamps="word")
+print(result["chunks"])
+```
+
+The above arguments can be used in isolation or in combination. For example, to perform the task of speech transcription
+where the source audio is in French, and we want to return sentence-level timestamps, the following can be used:
+
+```python
+result = pipe(sample, return_timestamps=True, generate_kwargs={"language": "french", "task": "translate"})
+print(result["chunks"])
+```
+
+## Speculative Decoding

 Whisper `tiny` can be used as an assistant model to Whisper for speculative decoding. Speculative decoding mathematically
 ensures the exact same outputs as Whisper are obtained while being 2 times faster. This makes it the perfect drop-in
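The hunk ends before the speculative-decoding code block itself. As a rough sketch of how an assistant model can be plugged into the same `pipeline` setup shown above (the `openai/whisper-tiny` checkpoint and the loading code are assumptions based on the surrounding text, not part of this commit):

```python
# Minimal sketch: draft tokens come from the small assistant model and are
# verified by the main model, so the output matches the main model exactly
# while decoding runs faster.
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"
assistant_model_id = "openai/whisper-tiny"  # assumption: the "tiny" checkpoint referenced above

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
assistant_model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},  # enable speculative decoding
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```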