Inference Llama 2 models with real-time response streaming using Amazon SageMaker
AWS Machine Learning
JANUARY 9, 2024
In either case, rather than waiting for the full response, you can use response streaming for your inferences, which sends back chunks of the response as soon as they are generated. This creates an interactive experience: you see partial responses streamed in real time instead of waiting for a delayed, complete response.
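As a rough sketch (not the exact code from this post), the snippet below shows how a streaming invocation might look with boto3's invoke_endpoint_with_response_stream API. The endpoint name, prompt field, and generation parameters are placeholders that depend on how the Llama 2 model is deployed and which serving container is used.

```python
import json

import boto3

# SageMaker runtime client used to invoke a deployed endpoint
sagemaker_runtime = boto3.client("sagemaker-runtime")

# Placeholder payload; the expected fields depend on the serving container
payload = {
    "inputs": "What is response streaming?",
    "parameters": {"max_new_tokens": 256},
}

# Request a streaming response instead of a single, complete one
response = sagemaker_runtime.invoke_endpoint_with_response_stream(
    EndpointName="my-llama2-endpoint",  # placeholder endpoint name
    Body=json.dumps(payload),
    ContentType="application/json",
)

# The Body is an event stream; each PayloadPart carries a chunk of the
# generated text that can be shown to the user as soon as it arrives.
for event in response["Body"]:
    chunk = event.get("PayloadPart", {}).get("Bytes")
    if chunk:
        print(chunk.decode("utf-8"), end="", flush=True)
```

Because each chunk is printed (or forwarded to a client) as it arrives, the user starts seeing output within moments of the request rather than after the full generation completes.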