Unlocking Streaming LLMs Response: Your Complete Guide for Easy Understanding
Just streaming, No screaming
What does streaming an LLM's response mean?
Streaming an LLM's response is like getting a sneak peek into its thought process. You know how with ChatGPT you see the response being generated token by token? That's what we're talking about. But why does it matter? Well, imagine waiting around for 30 seconds just to see the complete response. Not ideal, right? Streaming the response token by token feels far better: you see each part of the answer as it's being generated, so there's no waiting for the whole thing to load. It's all about making the interaction smoother and more responsive.
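To make the idea concrete before touching any LLM library, here's a toy sketch in plain Python: the first function makes you wait for the full sentence, the second yields it word by word, the way a streamed response trickles in (the sentence and the delays are made up for illustration).

import time

SENTENCE = "Streaming shows the answer while it is still being written."

def full_response():
    time.sleep(3)                        # the user stares at a blank screen, then gets everything at once
    return SENTENCE

def streamed_response():
    for token in SENTENCE.split():
        time.sleep(0.3)                  # each token is yielded as soon as it is "ready"
        yield token + " "

for token in streamed_response():
    print(token, end="", flush=True)     # the sentence appears progressively, like ChatGPT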
How to achieve that?
Many LLM wrappers in LangChain, such as ChatOpenAI and OpenAI, support streaming out of the box. They expose a parameter called "streaming" that accepts a boolean value; setting it to True enables streaming. In addition, we can pass callbacks that define what the model should do while it streams. Callback handlers such as AsyncIteratorCallbackHandler(), StreamingStdOutCallbackHandler(), and AsyncCallbackHandler() offer methods like "on_llm_start", "on_llm_end", "on_llm_error", and "on_llm_new_token". We can use the built-in handlers or write custom ones to fit our needs (a custom handler is sketched after the example below). Multiple callbacks can be set by passing them as a list to the "callbacks" parameter.
# Example: a streaming LLM with the built-in stdout callback handler
# (import paths and parameter names vary slightly across LangChain versions)
from langchain_openai import OpenAI
from langchain_core.callbacks import StreamingStdOutCallbackHandler

llm = OpenAI(
    model=model,        # e.g. "gpt-3.5-turbo-instruct"
    api_key=api_key,    # your OpenAI API key
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()],
)
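To give an idea of what a custom handler looks like, here is a minimal sketch. It assumes the standard BaseCallbackHandler interface; the class name and what it prints are made up purely for illustration.

from langchain_core.callbacks import BaseCallbackHandler

class VerboseStreamingHandler(BaseCallbackHandler):
    """Hypothetical custom handler: logs lifecycle events and prints tokens as they arrive."""

    def on_llm_start(self, serialized, prompts, **kwargs):
        print("[llm started]")

    def on_llm_new_token(self, token, **kwargs):
        print(token, end="", flush=True)   # called once per generated token

    def on_llm_end(self, response, **kwargs):
        print("\n[llm finished]")

    def on_llm_error(self, error, **kwargs):
        print(f"\n[llm error] {error}")

An instance of this class can be passed in the callbacks list exactly like StreamingStdOutCallbackHandler() above.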
However, merely enabling streaming and setting callbacks may not solve all our problems. Those parameters let us receive the response token by token, but real-world applications usually need more: we typically give the model memory and wrap everything in a chain like ConversationalRetrievalChain(), which organizes retrieval, chat history, and answering. For streaming, the crucial parameter to know about is "condense_question_llm". It matters as soon as we add memory, because the chain first uses an LLM to rephrase (condense) the current question in light of the previous ones. If we don't set this parameter and the same streaming LLM handles both the rephrasing and the answering, the model will stream the reframed question into the response, which is not what we want. Passing a separate LLM as condense_question_llm resolves this: it takes care of reframing the question, leaving the main streaming LLM to focus solely on answering the query passed to it.
from langchain.chains import ConversationalRetrievalChain

# a separate, non-streaming LLM that only rephrases the follow-up question
condense_question_llm = OpenAI(model=model, api_key=api_key)

chain = ConversationalRetrievalChain.from_llm(
    llm=llm,                                   # the streaming LLM defined above
    retriever=retriever,
    memory=memory,
    get_chat_history=lambda h: h,
    output_key="response",
    condense_question_llm=condense_question_llm,
    combine_docs_chain_kwargs={"prompt": prompt},
)
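The retriever, memory, and prompt above are whatever your application already uses; they aren't specific to streaming. Just so the example is self-contained, here is one possible way to build the memory and retriever (a sketch assuming an in-memory FAISS vector store and ConversationBufferMemory; swap in your own documents and components):

from langchain.memory import ConversationBufferMemory
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# chat history is stored here and handed back to the chain on every call
memory = ConversationBufferMemory(
    memory_key="chat_history", output_key="response", return_messages=True
)

# any retriever works; a small FAISS index over your own texts is one option
vectorstore = FAISS.from_texts(["your documents go here"], OpenAIEmbeddings())
retriever = vectorstore.as_retriever()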
Our journey doesn't end here. The setup above streams responses to the terminal, but what about streaming from an endpoint, say one built with FastAPI? We can achieve that too, by combining the same streaming setup with async functions.
For the chain we built earlier, we can use its async method and invoke it with a query via chain.ainvoke(query). That kicks off generation. We also need a way to collect the tokens as the model produces them, and this is where async iterators come into play: the AsyncIteratorCallbackHandler exposes callback.aiter(), which lets us consume the response token by token. With these two pieces, we can seamlessly stream responses not only in the terminal but also from endpoints like FastAPI.
import asyncio

async def res(query, chain, callback):
    # run the chain in the background while we read tokens off the callback
    task = asyncio.create_task(chain.ainvoke(query))
    try:
        async for token in callback.aiter():
            yield token
    finally:
        callback.done.set()   # release the iterator even if the client disconnects
    await task
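One thing the snippet glosses over is where callback comes from: it must be the AsyncIteratorCallbackHandler instance attached to the streaming LLM, so that the tokens the LLM emits are exactly the ones aiter() yields. A sketch of that wiring, reusing the setup from earlier (in a real server you'd typically create a fresh handler per request):

from langchain.callbacks import AsyncIteratorCallbackHandler
from langchain_openai import OpenAI

callback = AsyncIteratorCallbackHandler()

# attach the handler to the answering LLM, not to condense_question_llm
llm = OpenAI(model=model, api_key=api_key, streaming=True, callbacks=[callback])

# ...then build the ConversationalRetrievalChain exactly as before, passing this llm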
On the FastAPI side, the endpoint can use StreamingResponse to stream this generator back to the client: StreamingResponse(res(query, chain, callback)).
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/stream")
async def stream_response(query: str):
    # "chain" and "callback" are the objects wired up above
    ans = res(query, chain, callback)
    return StreamingResponse(ans, media_type="text/event-stream")
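To see it working end to end, hit the endpoint from any HTTP client that can read a streamed body, for example with requests (a quick test snippet, assuming the app is served on localhost:8000):

import requests

with requests.get(
    "http://localhost:8000/stream",
    params={"query": "What is streaming?"},
    stream=True,
) as r:
    for chunk in r.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)   # tokens print as the server sends them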
We've reached the finish line, and now we can watch the response stream seamlessly from the endpoint. Congratulations on reaching this milestone!