
Can batched Streaming Inference be used in real-time streaming? #15273


Description

@azziko

I'm trying to adapt this script for real-time streaming as a PoC, but after processing a couple of sentences the decoder stops processing input and outputs three dots (even before max_generation_length is reached).

The only conceptual change I make to the original script is reading the input audio in chunks from a server. I initialize the batched computer and the buffer once per client. I suspect the batched computer used in the original script is not designed to be used this way.
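For concreteness, here is a minimal sketch of that per-client adaptation, with the network side filled in. The `append`/`step` interface on the computer and buffer is a hypothetical stand-in for however the original script actually drives them, and the 16-bit PCM wire format is likewise an assumption; only the chunk-by-chunk driving loop is the point.

```python
import socket
import numpy as np

SAMPLE_RATE = 16000                     # Canary models take 16 kHz mono audio
CHUNK_SAMPLES = int(1.0 * SAMPLE_RATE)  # chunk_secs=1 from the parameters below
BYTES_PER_CHUNK = CHUNK_SAMPLES * 2     # assuming 16-bit PCM on the wire

def audio_chunks(conn: socket.socket):
    """Yield 1-second float32 chunks read from one client connection."""
    pending = b""
    while True:
        data = conn.recv(4096)
        if not data:                    # client disconnected
            break
        pending += data
        while len(pending) >= BYTES_PER_CHUNK:
            raw, pending = pending[:BYTES_PER_CHUNK], pending[BYTES_PER_CHUNK:]
            pcm = np.frombuffer(raw, dtype=np.int16)
            yield pcm.astype(np.float32) / 32768.0  # normalize to [-1, 1]

def serve_client(conn, computer, buffer):
    # `computer` and `buffer` are the batched computer and audio buffer
    # initialized once for this client, as described above; their
    # append/step interface here is hypothetical.
    for chunk in audio_chunks(conn):
        buffer.append(chunk)
        print(computer.step(buffer))    # emit the partial transcript so far
```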

Is there another decoder well suited to this use case that supports AlignAtt?
If not, what would be a reasonable workaround?

My testing parameters are:
pretrained_name=nvidia/canary-1b-v2
left_context_secs=10
chunk_secs=1
right_context_secs=0.5
batch_size=1
decoding.streaming_policy=alignatt
decoding.alignatt_thr=8
decoding.exclude_sink_frames=8
decoding.xatt_scores_layer=-2
+prompt.task=asr
+prompt.source_lang=en
+prompt.target_lang=en
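For reference, this is the full invocation those overrides correspond to, assembled as a Hydra command line. The script name is a placeholder for the batched streaming inference script in the NeMo checkout; the overrides themselves are verbatim from the list above.

```python
import shlex

# Placeholder name -- substitute the actual batched streaming inference
# script from your NeMo checkout.
SCRIPT = "speech_to_text_streaming_infer.py"

overrides = [
    "pretrained_name=nvidia/canary-1b-v2",
    "left_context_secs=10",
    "chunk_secs=1",
    "right_context_secs=0.5",
    "batch_size=1",
    "decoding.streaming_policy=alignatt",
    "decoding.alignatt_thr=8",
    "decoding.exclude_sink_frames=8",
    "decoding.xatt_scores_layer=-2",
    "+prompt.task=asr",
    "+prompt.source_lang=en",
    "+prompt.target_lang=en",
]

# Print the equivalent shell command; pass the same list to subprocess.run
# to launch it directly.
print(shlex.join(["python", SCRIPT, *overrides]))
```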
