Overview
LLMs are machine learning models that can interpret and generate text like humans. They're versatile enough to write content, translate languages, summarize, and answer questions without needing special training for each task. In addition to text generation, many models support:
- Tool calling - where models call external tools (like database queries or API calls) and use the results in their responses.
- Structured output - where the model's response is constrained to match a schema.
- Multimodal - where models can process and return data other than text, such as images, audio, and video.
- Reasoning - where models are able to perform multi-step reasoning to arrive at a conclusion.
Basic usage
The easiest way to get started with a model in LangChain is to use init_chat_model
to initialize one from a provider of your choice. See the init_chat_model
reference for more detail.
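A minimal sketch of initializing and calling a model with init_chat_model (the model name and provider are illustrative; substitute your own and make sure the matching integration package and API key are set up):

```python
from langchain.chat_models import init_chat_model

# Assumes the langchain-openai package is installed and OPENAI_API_KEY is set.
# "gpt-4o-mini" is just an example model identifier.
model = init_chat_model("gpt-4o-mini", model_provider="openai")

response = model.invoke("Why do parrots talk?")
print(response.content)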
Key methods
Invoke
The model takes messages as input and returns messages after generating a
full response.
Stream
Invoke the model, but stream the response as it is generated in real-time.
Batch
Send multiple requests to a model in a batch for more efficient processing.
In addition to chat models, LangChain provides support for other adjacent
technologies, such as embedding models and vector stores. See the
integrations page for details.
Parameters
A chat model takes parameters that can be used to configure its behavior. The full set of supported parameters varies by model and provider, but common ones include:
- model: The name or identifier of the specific model you want to use with a provider.
- api_key: The key required for authenticating with the model's provider. This is usually issued when you sign up for access to the model and can often be supplied by setting an environment variable.
- temperature: Controls the randomness of the model's output. A higher number makes responses more creative, while a lower one makes them more deterministic.
- stop: A sequence of characters that indicates when the model should stop generating its output. Accepts a string or a list of strings.
- timeout: The maximum time (in seconds) to wait for a response from the model before canceling the request.
- max_tokens: Limits the total number of tokens in the response, effectively controlling how long the output can be.
- max_retries: The maximum number of attempts the system will make to resend a request if it fails due to issues like network timeouts or rate limits.
To find all the parameters supported by a given chat model, head to the
reference docs.
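As a sketch, a few of these parameters can be passed through init_chat_model (the keyword names shown follow the common provider integrations; individual providers may use different names):

```python
from langchain.chat_models import init_chat_model

# Example values only; adjust the model, provider, and limits for your use case.
model = init_chat_model(
    "gpt-4o-mini",
    model_provider="openai",
    temperature=0.3,  # lower = more deterministic
    max_tokens=512,   # cap on output length
    timeout=30,       # seconds to wait before canceling
    max_retries=2,    # retry transient failures
)
```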
Invocation
A chat model must be invoked to generate an output. There are three main invocation methods, each suited to different use cases.
Each invocation method has an async equivalent, typically prefixed with the letter 'a'. For example: ainvoke(), astream(), abatch(). A full list of async methods can be found in the reference docs.
Invoke
The most straightforward way to call a model is to use invoke() with a single message or a list of messages.
Single message
Conversation history
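As a sketch (the model object is assumed to come from init_chat_model as above):

```python
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage

# Single message: a plain string is treated as a human message.
response = model.invoke("Summarize the plot of Hamlet in one sentence.")

# Conversation history: pass a list of messages to provide context.
messages = [
    SystemMessage("You are a concise assistant."),
    HumanMessage("What's the capital of France?"),
    AIMessage("Paris."),
    HumanMessage("And its population, roughly?"),
]
response = model.invoke(messages)
print(response.content)
```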
Stream
Most models can stream their output content while it is being generated. By displaying output progressively, streaming significantly improves user experience, particularly for longer responses. Calling stream()
returns an iterator that yields output chunks as they are produced. You can use a loop to process each chunk in real time:
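For example (a sketch, reusing the model from above):

```python
# Stream the response and print each chunk's content as it arrives.
for chunk in model.stream("Write a haiku about autumn."):
    print(chunk.content, end="", flush=True)
```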
Unlike invoke(), which returns a single AIMessage after the model has finished generating its full response, stream() returns multiple AIMessageChunk objects, each containing a portion of the output text.
Importantly, each AIMessageChunk in a stream is designed to be gathered into a full message via summation:
Construct AIMessage
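A sketch of accumulating chunks into a full message:

```python
full = None

for chunk in model.stream("What color is the sky?"):
    # AIMessageChunk supports `+`, so chunks can be summed into one message.
    full = chunk if full is None else full + chunk

print(full.content)  # the complete text of the response
```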
The aggregated message can be treated the same as one returned by invoke() - for example, it can be appended to a message history and passed back to the model as conversational context.
Streaming only works if all steps in the program know how to process a stream of chunks. For instance, an application that isn't streaming-capable would be one that needs to store the entire output in memory before it can be processed.
Advanced streaming topics
"Auto-Streaming" Chat Models
LangChain simplifies streaming from chat models by automatically enabling
streaming mode in certain cases, even when you’re not explicitly calling the
streaming methods. This is particularly useful when you use the
non-streaming invoke method but still want to stream the entire application,
including intermediate results from the chat model.
In LangGraph agents, for example, you can call model.invoke() within nodes, but LangChain will automatically delegate to streaming if running in a streaming mode.
How it works
When you invoke() a chat model, LangChain will automatically switch to an internal streaming mode if it detects that you are trying to stream the overall application. The result of the invocation will be the same as far as the code that was using invoke is concerned; however, while the chat model is being streamed, LangChain will take care of invoking on_llm_new_token events in LangChain's callback system.
Callback events allow LangGraph stream() and astream_events() to surface the chat model's output in real-time.
Streaming events
LangChain chat models can also stream semantic events using astream_events(). This simplifies filtering based on event types and other metadata, and will aggregate the full message in the background. See below for an example, and see the astream_events() reference for event types and other details.
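A sketch of consuming these events asynchronously (event names and payload keys follow the standard astream_events schema):

```python
import asyncio

async def main():
    async for event in model.astream_events("Hello"):
        if event["event"] == "on_chat_model_start":
            print(f"Input: {event['data']['input']}")
        elif event["event"] == "on_chat_model_stream":
            print(f"Token: {event['data']['chunk'].content}")
        elif event["event"] == "on_chat_model_end":
            print(f"Full message: {event['data']['output'].content}")

asyncio.run(main())
```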
Batch
This section describes a chat model method, batch(), which parallelizes model calls client-side. It is distinct from batch APIs supported by inference providers.
Batch
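A sketch of batching several independent prompts (reusing the model from above):

```python
prompts = [
    "Why do parrots have colorful feathers?",
    "How do airplanes fly?",
    "What is quantum computing?",
]

responses = model.batch(prompts)
for response in responses:
    print(response.content)
```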
batch() will only return the final output for the entire batch. If you want to receive the output for each individual input as it finishes generating, you can stream results with batch_as_completed():
Yield responses upon completion
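For example (a sketch, reusing the prompts list above; the index lets you map each result back to its input):

```python
# Yields (index, output) pairs as each call completes, not in input order.
for index, response in model.batch_as_completed(prompts):
    print(index, response.content)
```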
When using batch_as_completed(), results may arrive out of order. Each result includes the input index, which you can use to reconstruct the original order if needed.
When processing a large number of inputs using batch() or batch_as_completed(), you may want to control the maximum number of parallel calls. This can be done by setting the max_concurrency attribute in the RunnableConfig dictionary.
Batch with max concurrency
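A sketch of capping parallelism (reusing the prompts list above):

```python
# Limit to at most 5 concurrent model calls.
responses = model.batch(prompts, config={"max_concurrency": 5})
```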
See the RunnableConfig reference for a full list of supported attributes.
Tool calling
Models can request to call tools that perform tasks such as fetching data from a database, searching the web, or running code. Tools are pairings of:
- A schema, including the name of the tool, a description, and/or argument definitions (often a JSON schema)
- A function or coroutine to execute
You may hear the term function calling. We use this interchangeably with tool calling.
To make tools available to a model, bind them with bind_tools(). In subsequent invocations, the model can choose to call any of the bound tools as needed.
Some model providers offer built-in tools that can be enabled via model
parameters. Check the respective provider reference for details.
See the tools guide for details and other options for
creating tools.
Binding user tools
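A sketch of defining and binding a tool (the @tool decorator and bind_tools are standard LangChain APIs; the weather tool itself is a made-up placeholder):

```python
from langchain_core.tools import tool

@tool
def get_weather(location: str) -> str:
    """Get the current weather for a location."""
    # Placeholder implementation; call a real weather API here.
    return f"It's sunny in {location}."

model_with_tools = model.bind_tools([get_weather])

response = model_with_tools.invoke("What's the weather in Boston?")
for tool_call in response.tool_calls:
    print(tool_call["name"], tool_call["args"])
```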
Tool execution loop
When a model returns tool calls, you need to execute the tools and pass the results back to the model. This creates a conversation loop where the model can use tool results to generate its final response.
Tool execution loop
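A sketch of that loop, continuing from the get_weather tool bound above:

```python
from langchain_core.messages import HumanMessage

messages = [HumanMessage("What's the weather in Boston?")]

# Ask the model; it may respond with tool calls instead of a final answer.
ai_message = model_with_tools.invoke(messages)
messages.append(ai_message)

# Execute each requested tool and append the result as a ToolMessage.
for tool_call in ai_message.tool_calls:
    tool_message = get_weather.invoke(tool_call)  # returns a ToolMessage
    messages.append(tool_message)

# Pass the tool results back so the model can produce its final response.
final_response = model_with_tools.invoke(messages)
print(final_response.content)
```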
Each ToolMessage returned by the tool includes a tool_call_id that matches the original tool call, helping the model correlate results with requests.
Forcing tool calls
By default, the model has the freedom to choose which bound tool to use
based on the user’s input. However, you might want to force choosing a
tool, ensuring the model uses either a particular tool or any tool
from a given list:
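A sketch using the common tool_choice parameter (support and accepted values vary by provider; "any" forces some tool, while a tool name forces that specific tool):

```python
# Force the model to call the get_weather tool specifically.
force_weather = model.bind_tools([get_weather], tool_choice="get_weather")

# Or force the model to call at least one of the bound tools.
force_any = model.bind_tools([get_weather], tool_choice="any")

response = force_weather.invoke("Tell me something interesting.")
print(response.tool_calls)
```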
Parallel tool calls
Many models support calling multiple tools in parallel when appropriate. This allows the model to gather information from different sources simultaneously.
The model intelligently determines when parallel execution is appropriate based on the independence of the requested operations.
Parallel tool calls
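For example (a sketch; whether the model actually issues parallel calls depends on the model and prompt):

```python
response = model_with_tools.invoke(
    "What's the weather in Boston and in San Francisco?"
)

# Independent requests may come back as multiple tool calls in one response.
for tool_call in response.tool_calls:
    print(tool_call["name"], tool_call["args"])
```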
Streaming tool calls
Streaming tool calls
When streaming responses, tool calls are progressively built through
You can accumulate chunks to build complete tool calls:
ToolCallChunk
. This allows you to see tool calls as they’re being
generated rather than waiting for the complete response.Streaming tool calls
Accumulate tool calls
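A sketch of accumulating tool call chunks via message summation:

```python
accumulated = None

for chunk in model_with_tools.stream("What's the weather in Boston?"):
    # Partial tool calls arrive on chunk.tool_call_chunks.
    accumulated = chunk if accumulated is None else accumulated + chunk

# After the stream ends, the summed chunks expose complete tool calls.
print(accumulated.tool_calls)
```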
Structured outputs
Models can be requested to provide their response in a format matching a given schema. This is useful for ensuring the output can be easily parsed and used in subsequent processing. LangChain supports multiple schema types and methods for enforcing structured outputs.
Pydantic models provide the richest feature set with field validation, descriptions, and nested structures.
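A sketch using with_structured_output with a Pydantic schema (the Movie schema is illustrative):

```python
from pydantic import BaseModel, Field

class Movie(BaseModel):
    """A movie with details."""
    title: str = Field(description="The title of the movie")
    year: int = Field(description="The year the movie was released")
    rating: float = Field(description="A rating out of 10")

structured_model = model.with_structured_output(Movie)

movie = structured_model.invoke("Describe the movie Inception.")
print(movie.title, movie.year, movie.rating)  # a validated Movie instance
```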
Key considerations for structured outputs:
- Method parameter: Some providers support different methods ('json_schema', 'function_calling', 'json_mode')
- Include raw: Use include_raw=True to get both the parsed output and the raw AI message
- Validation: Pydantic models provide automatic validation, while TypedDict and JSON Schema require manual validation
Example: Message output alongside parsed structure
It can be useful to return the raw AIMessage object alongside the parsed representation to access response metadata such as token counts:
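A sketch using include_raw=True (the result is a dict with "raw", "parsed", and "parsing_error" keys), reusing the Movie schema from above:

```python
structured_model = model.with_structured_output(Movie, include_raw=True)

result = structured_model.invoke("Describe the movie Inception.")
print(result["parsed"])              # the validated Movie instance
print(result["raw"].usage_metadata)  # token counts from the raw AIMessage
print(result["parsing_error"])       # None if parsing succeeded
```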
Example: Nested structures
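And a sketch of a nested schema (illustrative classes only):

```python
from pydantic import BaseModel, Field

class Actor(BaseModel):
    name: str
    role: str

class DetailedMovie(BaseModel):
    title: str
    year: int
    cast: list[Actor] = Field(description="Main cast members")

detailed = model.with_structured_output(DetailedMovie).invoke(
    "Describe the movie Inception, including its main cast."
)
print(detailed.cast[0].name)
```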
Supported models
LangChain supports all major model providers, including OpenAI, Anthropic, Google, Azure, AWS Bedrock, and more. Each provider offers a variety of models with different capabilities. For a full list of supported models in LangChain, see the integrations page.
Advanced configuration
Multimodal
Certain models can process and return non-textual data such as images, audio, and video. You can pass non-textual data to a model by providing content blocks.
All LangChain chat models with underlying multimodal capabilities support:
- Data in the cross-provider standard format
- OpenAI chat completions format
- Any format that is native to that specific provider (e.g., Anthropic models accept Anthropic native format)
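For example, a sketch that passes an image by URL using the OpenAI chat completions format (one of the supported input formats listed above; the URL is a placeholder, and see the multimodal docs for the cross-provider standard format):

```python
from langchain_core.messages import HumanMessage

message = HumanMessage(
    content=[
        {"type": "text", "text": "Describe the weather in this image."},
        # OpenAI chat completions-style image block; the URL is a placeholder.
        {"type": "image_url", "image_url": {"url": "https://example.com/sky.png"}},
    ]
)
response = model.invoke([message])
print(response.content)
```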
When a model returns multimodal data, the resulting AIMessage will have content blocks with multimodal types.
Multimodal output
Reasoning
Newer models are capable of performing multi-step reasoning to arrive at a conclusion. This involves breaking down complex problems into smaller, more manageable steps. If supported by the underlying model, you can surface this reasoning process to better understand how the model arrived at its final answer.
Depending on the provider, the reasoning effort can typically be configured using either categorical levels (such as 'low' or 'high') or integer token budgets.
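As an illustrative sketch only (parameter names differ by provider; reasoning_effort here follows the OpenAI integration, while other providers use a thinking/token-budget style parameter instead; "o4-mini" is an example model identifier):

```python
from langchain.chat_models import init_chat_model

model = init_chat_model(
    "o4-mini",
    model_provider="openai",
    reasoning_effort="low",  # categorical effort level: "low", "medium", or "high"
)
response = model.invoke("How many r's are in the word 'strawberry'?")
```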
For details, see the relevant chat model in the
integrations page.
Local models
LangChain supports running models locally on your own hardware. This is useful for scenarios where data privacy is critical, or when you want to avoid the cost of using a cloud-based model. Ollama is one of the easiest ways to run models locally. See the full list of local integrations on the integrations page.
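A sketch using Ollama through init_chat_model (assumes the langchain-ollama package is installed, a local Ollama server is running, and the model has been pulled; the model name is an example):

```python
from langchain.chat_models import init_chat_model

# e.g. after running `ollama pull llama3.1`
model = init_chat_model("llama3.1", model_provider="ollama")

response = model.invoke("Why is the sky blue?")
print(response.content)
```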
Caching
Chat model APIs can be slow and expensive to call. To help mitigate this, LangChain provides an optional caching layer for chat model integrations.
Enable caching for your model
By default, caching is disabled. To enable it, import the global cache setter, set_llm_cache. Next, choose a cache:
In Memory Cache
An ephemeral cache that stores model calls in memory. Wiped when
your environment restarts. Not shared across processes.
InMemoryCache
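A minimal sketch:

```python
from langchain_core.globals import set_llm_cache
from langchain_core.caches import InMemoryCache

set_llm_cache(InMemoryCache())

# First call hits the API; an identical second call is served from the cache.
model.invoke("Tell me a joke")
model.invoke("Tell me a joke")
```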
SQLite Cache
Uses a SQLite database to store responses, and will last across process restarts.
SQLite Cache
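A sketch (SQLiteCache lives in the langchain_community package; the database path is an example):

```python
from langchain_core.globals import set_llm_cache
from langchain_community.cache import SQLiteCache

# Responses are stored in this SQLite file and survive process restarts.
set_llm_cache(SQLiteCache(database_path="langchain_cache.db"))

model.invoke("Tell me a joke")
model.invoke("Tell me a joke")  # served from the cache on later runs too
```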
Rate limiting
Many chat model providers impose a limit on the number of invocations that can be made in a given time period. If you hit a rate limit, you will typically receive a rate limit error response from the provider, and will need to wait before making more requests. To help manage rate limits, chat model integrations accept a rate_limiter
parameter that can be provided during initialization to control the rate at
which requests are made.
Initialize and use a rate limiter
LangChain comes with a built-in, in-memory rate limiter. This rate limiter is thread safe and can be shared by multiple threads in the same process.
Define a rate limiter
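A sketch using InMemoryRateLimiter (the values are examples; the settings below allow roughly one request every ten seconds):

```python
from langchain_core.rate_limiters import InMemoryRateLimiter
from langchain.chat_models import init_chat_model

rate_limiter = InMemoryRateLimiter(
    requests_per_second=0.1,    # one request every 10 seconds
    check_every_n_seconds=0.1,  # how often to check whether a request may proceed
    max_bucket_size=10,         # maximum burst size
)

model = init_chat_model(
    "gpt-4o-mini", model_provider="openai", rate_limiter=rate_limiter
)
```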
The provided rate limiter can only limit the number of requests per unit
time. It will not help if you need to also limit based on the size of
the requests.
Base URL or proxy
For many chat model integrations, you can configure the base URL for API requests, which allows you to use model providers that have OpenAI-compatible APIs or to use a proxy server.
Base URL
Many model providers offer OpenAI-compatible APIs (e.g., Together AI, vLLM). You can use init_chat_model with these providers by specifying the appropriate base_url parameter, as in the sketch below.
When using direct chat model class instantiation, the parameter name may vary by provider. Check the respective reference for details.
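A sketch (the base URL, API key variable, and model name are placeholders for whatever OpenAI-compatible endpoint you are targeting):

```python
import os
from langchain.chat_models import init_chat_model

model = init_chat_model(
    "meta-llama/Llama-3.1-8B-Instruct",      # example model name
    model_provider="openai",                 # reuse the OpenAI-compatible client
    base_url="https://api.together.xyz/v1",  # example OpenAI-compatible endpoint
    api_key=os.environ["TOGETHER_API_KEY"],  # example environment variable
)
```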
Proxy configuration
For deployments requiring HTTP proxies, some model integrations support proxy configuration.
Proxy support varies by integration. Check the specific model provider’s
reference for proxy configuration options.
Log probabilities
Certain models can be configured to return token-level log probabilities representing the likelihood of a given token. Accessing them is as simple as setting the logprobs parameter when initializing a model:
Log probs
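A sketch using the OpenAI integration (parameter support and the shape of the returned metadata vary by provider):

```python
from langchain.chat_models import init_chat_model

model = init_chat_model("gpt-4o-mini", model_provider="openai", logprobs=True)

response = model.invoke("Hi!")
# With the OpenAI integration, log probabilities appear in response metadata.
print(response.response_metadata["logprobs"])
```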
Token usage
A number of model providers return token usage information as part of the invocation response. When available, this information will be included on the AIMessage
objects produced by the corresponding model. For more details, see
the messages guide.
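For example, a sketch of reading usage data off the response (field availability depends on the provider):

```python
response = model.invoke("Hello")

# usage_metadata is populated when the provider reports token counts.
print(response.usage_metadata)
# e.g. {'input_tokens': 8, 'output_tokens': 9, 'total_tokens': 17}
```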
Some provider APIs, notably OpenAI and Azure OpenAI chat completions, require
users opt-in to receiving token usage data in streaming contexts. See
this section of the
integration guide for details.
Callback handler
Invocation config
When invoking a model, you can pass additional configuration through the config
parameter using a RunnableConfig
dictionary. This provides run-time control over execution behavior, callbacks, and metadata tracking.
Common configuration options include:
Invocation with config
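A sketch (the run name, tags, and metadata values are arbitrary examples):

```python
response = model.invoke(
    "Tell me a joke",
    config={
        "run_name": "joke_generation",        # label for this specific run
        "tags": ["humor", "demo"],            # inherited by sub-calls
        "metadata": {"user_id": "user-123"},  # custom key-value context
    },
)
```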
Key configuration attributes
- run_name: Identifies this specific invocation in logs and traces. Not inherited by sub-calls.
- tags: Labels inherited by all sub-calls for filtering and organization in debugging tools.
- metadata: Custom key-value pairs for tracking additional context, inherited by all sub-calls.
- max_concurrency: Controls the maximum number of parallel calls when using batch() or batch_as_completed().
- callbacks: Handlers for monitoring and responding to events during execution. See callbacks for details.
- recursion_limit: Maximum recursion depth for chains to prevent infinite loops in complex pipelines.
These options are commonly used for:
- Debugging with LangSmith tracing
- Implementing custom logging or monitoring
- Controlling resource usage in production
- Tracking invocations across complex pipelines
For a full list of supported RunnableConfig attributes, see the RunnableConfig reference.
Configurable models
You can also create a runtime-configurable model by specifying configurable_fields. If you don't specify a model value, then 'model' and 'model_provider' will be configurable by default.
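A sketch (the model names in the config are examples):

```python
from langchain.chat_models import init_chat_model

# No model specified, so "model" and "model_provider" are configurable at runtime.
configurable_model = init_chat_model(temperature=0)

configurable_model.invoke(
    "What's your name?",
    config={"configurable": {"model": "gpt-4o-mini"}},
)
configurable_model.invoke(
    "What's your name?",
    config={"configurable": {"model": "claude-3-5-sonnet-latest"}},
)
```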
Configurable model with default values
We can create a configurable model with default model values, specify which
parameters are configurable, and add prefixes to configurable params:
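A sketch of this (the config_prefix and model names are examples):

```python
first_model = init_chat_model(
    model="gpt-4o-mini",
    model_provider="openai",
    temperature=0,
    configurable_fields=("model", "model_provider", "temperature", "max_tokens"),
    config_prefix="first",  # configurable keys become e.g. "first_model"
)

# Uses the defaults above.
first_model.invoke("What's your name?")

# Overrides the defaults at runtime via the prefixed keys.
first_model.invoke(
    "What's your name?",
    config={
        "configurable": {
            "first_model": "claude-3-5-sonnet-latest",
            "first_temperature": 0.5,
        }
    },
)
```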
Using a configurable model declaratively
We can call declarative operations like bind_tools, with_structured_output, with_config, etc. on a configurable model and chain a configurable model in the same way that we would a regularly instantiated chat model object.
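A sketch, reusing the get_weather tool defined earlier:

```python
configurable_model = init_chat_model(temperature=0)

# Bind tools declaratively; the result is still runtime-configurable.
configurable_model_with_tools = configurable_model.bind_tools([get_weather])

configurable_model_with_tools.invoke(
    "What's the weather in Boston?",
    config={"configurable": {"model": "gpt-4o-mini"}},
)
```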