Overview
LLMs are machine learning models that can interpret and generate text like humans. They're versatile enough to write content, translate languages, summarize, and answer questions without needing special training for each task. In addition to text generation, many models support:
- Tool calling - where models call external tools (like database queries or API calls) and use the results in their responses.
- Structured output - where the model's response is constrained to match a schema.
- Multimodal - where models can process and return data other than text, such as images, audio, and video.
- Reasoning - where models are able to perform multi-step reasoning to arrive at a conclusion.
Basic usage
The easiest way to get started with a model in LangChain is to use init_chat_model
to initialize one from a provider of your choice. See the init_chat_model
reference for more detail.
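A minimal sketch of initializing and calling a model with init_chat_model (the model name and provider are illustrative; substitute your own and make sure the matching integration package and API key are set up):

```python
from langchain.chat_models import init_chat_model

# Assumes the langchain-openai package is installed and OPENAI_API_KEY is set.
# "gpt-4o-mini" is just an example model identifier.
model = init_chat_model("gpt-4o-mini", model_provider="openai")

response = model.invoke("Why do parrots talk?")
print(response.content)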
Key methods
Invoke
The model takes messages as input and returns messages after generating a
full response.
Stream
Invoke the model, but stream the response as it is generated in real-time.
Batch
Send multiple requests to a model in a batch for more efficient processing.
In addition to chat models, LangChain provides support for other adjacent
technologies, such as embedding models and vector stores. See the
integrations page for details.
Parameters
A chat model takes parameters that can be used to configure its behavior. The full set of supported parameters varies by model and provider, but common ones include:
- model: The name or identifier of the specific model you want to use with a provider.
- api_key: The key required for authenticating with the model's provider. This is usually issued when you sign up for access to the model and can often be supplied by setting an environment variable.
- temperature: Controls the randomness of the model's output. A higher number makes responses more creative, while a lower one makes them more deterministic.
- stop: A sequence of characters that indicates when the model should stop generating its output. Accepts a string or a list of strings.
- timeout: The maximum time (in seconds) to wait for a response from the model before canceling the request.
- max_tokens: Limits the total number of tokens in the response, effectively controlling how long the output can be.
- max_retries: The maximum number of attempts the system will make to resend a request if it fails due to issues like network timeouts or rate limits.
To find all the parameters supported by a given chat model, head to the
reference docs.
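As a sketch, a few of these parameters can be passed through init_chat_model (the keyword names shown follow the common provider integrations; individual providers may use different names):

```python
from langchain.chat_models import init_chat_model

# Example values only; adjust the model, provider, and limits for your use case.
model = init_chat_model(
    "gpt-4o-mini",
    model_provider="openai",
    temperature=0.3,  # lower = more deterministic
    max_tokens=512,   # cap on output length
    timeout=30,       # seconds to wait before canceling
    max_retries=2,    # retry transient failures
)
```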
Invocation
A chat model must be invoked to generate an output. There are three main invocation methods, each suited to different use cases.
Each invocation method has an async equivalent, typically prefixed with the letter 'a'. For example: ainvoke(), astream(), abatch(). A full list of async methods can be found in the reference docs.
Invoke
The most straightforward way to call a model is to use invoke() with a single message or a list of messages.
Single message
Conversation history
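As a sketch (the model object is assumed to come from init_chat_model as above):

```python
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage

# Single message: a plain string is treated as a human message.
response = model.invoke("Summarize the plot of Hamlet in one sentence.")

# Conversation history: pass a list of messages to provide context.
messages = [
    SystemMessage("You are a concise assistant."),
    HumanMessage("What's the capital of France?"),
    AIMessage("Paris."),
    HumanMessage("And its population, roughly?"),
]
response = model.invoke(messages)
print(response.content)
```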
Stream
Most models can stream their output content while it is being generated. By displaying output progressively, streaming significantly improves user experience, particularly for longer responses. Calling stream()
returns an iterator that yields output chunks as they are produced. You can use a loop to process each chunk in real time:
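For example (a sketch, reusing the model from above):

```python
# Stream the response and print each chunk's content as it arrives.
for chunk in model.stream("Write a haiku about autumn."):
    print(chunk.content, end="", flush=True)
```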
Unlike invoke(), which returns a single AIMessage after the model has finished generating its full response, stream() returns multiple AIMessageChunk objects, each containing a portion of the output text.
Importantly, each AIMessageChunk in a stream is designed to be gathered into a full message via summation:
Construct AIMessage
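A sketch of accumulating chunks into a full message:

```python
full = None

for chunk in model.stream("What color is the sky?"):
    # AIMessageChunk supports `+`, so chunks can be summed into one message.
    full = chunk if full is None else full + chunk

print(full.content)  # the complete text of the response
```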
The aggregated message can be treated the same as one returned by invoke() - for example, it can be appended to a message history and passed back to the model as conversational context.
Streaming only works if all steps in the program know how to process a stream of chunks. For instance, an application that isn't streaming-capable would be one that needs to store the entire output in memory before it can be processed.
Advanced streaming topics
"Auto-Streaming" Chat Models
LangChain simplifies streaming from chat models by automatically enabling
streaming mode in certain cases, even when you’re not explicitly calling the
streaming methods. This is particularly useful when you use the
non-streaming invoke method but still want to stream the entire application,
including intermediate results from the chat model.
In LangGraph agents, for example, you can call model.invoke() within nodes, but LangChain will automatically delegate to streaming if running in a streaming mode.
How it works
When you invoke() a chat model, LangChain will automatically switch to an internal streaming mode if it detects that you are trying to stream the overall application. The result of the invocation will be the same as far as the code that was using invoke is concerned; however, while the chat model is being streamed, LangChain will take care of invoking on_llm_new_token events in LangChain's callback system.
Callback events allow LangGraph stream() and astream_events() to surface the chat model's output in real-time.
Streaming events
LangChain chat models can also stream semantic events using astream_events(). This simplifies filtering based on event types and other metadata, and will aggregate the full message in the background. See below for an example, and see the astream_events() reference for event types and other details.
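A sketch of consuming these events asynchronously (event names and payload keys follow the standard astream_events schema):

```python
import asyncio

async def main():
    async for event in model.astream_events("Hello"):
        if event["event"] == "on_chat_model_start":
            print(f"Input: {event['data']['input']}")
        elif event["event"] == "on_chat_model_stream":
            print(f"Token: {event['data']['chunk'].content}")
        elif event["event"] == "on_chat_model_end":
            print(f"Full message: {event['data']['output'].content}")

asyncio.run(main())
```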
Batch
This section describes a chat model method, batch(), which parallelizes model calls client-side. It is distinct from batch APIs supported by inference providers.
Batch
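A sketch of batching several independent prompts (reusing the model from above):

```python
prompts = [
    "Why do parrots have colorful feathers?",
    "How do airplanes fly?",
    "What is quantum computing?",
]

responses = model.batch(prompts)
for response in responses:
    print(response.content)
```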
batch() will only return the final output for the entire batch. If you want to receive the output for each individual input as it finishes generating, you can stream results with batch_as_completed():
Yield responses upon completion
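For example (a sketch, reusing the prompts list above; the index lets you map each result back to its input):

```python
# Yields (index, output) pairs as each call completes, not in input order.
for index, response in model.batch_as_completed(prompts):
    print(index, response.content)
```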
When using batch_as_completed(), results may arrive out of order. Each result includes the input index, which you can use to reconstruct the original order if needed.
When processing a large number of inputs using batch() or batch_as_completed(), you may want to control the maximum number of parallel calls. This can be done by setting the max_concurrency attribute in the RunnableConfig dictionary.
Batch with max concurrency
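A sketch of capping parallelism (reusing the prompts list above):

```python
# Limit to at most 5 concurrent model calls.
responses = model.batch(prompts, config={"max_concurrency": 5})
```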
See the RunnableConfig reference for a full list of supported attributes.
Tool calling
Models can request to call tools that perform tasks such as fetching data from a database, searching the web, or running code. Tools are pairings of:
- A schema, including the name of the tool, a description, and/or argument definitions (often a JSON schema)
- A function or coroutine to execute
You may hear the term function calling. We use this interchangeably with tool calling.
To make tools available to a model, bind them with bind_tools(). In subsequent invocations, the model can choose to call any of the bound tools as needed.
Some model providers offer built-in tools that can be enabled via model
parameters. Check the respective provider reference for details.
See the tools guide for details and other options for
creating tools.
Binding user tools
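A sketch of defining and binding a tool (the @tool decorator and bind_tools are standard LangChain APIs; the weather tool itself is a made-up placeholder):

```python
from langchain_core.tools import tool

@tool
def get_weather(location: str) -> str:
    """Get the current weather for a location."""
    # Placeholder implementation; call a real weather API here.
    return f"It's sunny in {location}."

model_with_tools = model.bind_tools([get_weather])

response = model_with_tools.invoke("What's the weather in Boston?")
for tool_call in response.tool_calls:
    print(tool_call["name"], tool_call["args"])
```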
Tool execution loop
When a model returns tool calls, you need to execute the tools and pass the results back to the model. This creates a conversation loop where the model can use tool results to generate its final response.
Tool execution loop
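A sketch of that loop, continuing from the get_weather tool bound above:

```python
from langchain_core.messages import HumanMessage

messages = [HumanMessage("What's the weather in Boston?")]

# Ask the model; it may respond with tool calls instead of a final answer.
ai_message = model_with_tools.invoke(messages)
messages.append(ai_message)

# Execute each requested tool and append the result as a ToolMessage.
for tool_call in ai_message.tool_calls:
    tool_message = get_weather.invoke(tool_call)  # returns a ToolMessage
    messages.append(tool_message)

# Pass the tool results back so the model can produce its final response.
final_response = model_with_tools.invoke(messages)
print(final_response.content)
```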
Each ToolMessage returned by the tool includes a tool_call_id that matches the original tool call, helping the model correlate results with requests.
Forcing tool calls
By default, the model has the freedom to choose which bound tool to use
based on the user’s input. However, you might want to force choosing a
tool, ensuring the model uses either a particular tool or any tool
from a given list:
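A sketch using the common tool_choice parameter (support and accepted values vary by provider; "any" forces some tool, while a tool name forces that specific tool):

```python
# Force the model to call the get_weather tool specifically.
force_weather = model.bind_tools([get_weather], tool_choice="get_weather")

# Or force the model to call at least one of the bound tools.
force_any = model.bind_tools([get_weather], tool_choice="any")

response = force_weather.invoke("Tell me something interesting.")
print(response.tool_calls)
```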
Parallel tool calls
Many models support calling multiple tools in parallel when appropriate. This allows the model to gather information from different sources simultaneously.
The model intelligently determines when parallel execution is appropriate based on the independence of the requested operations.
Parallel tool calls
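For example (a sketch; whether the model actually issues parallel calls depends on the model and prompt):

```python
response = model_with_tools.invoke(
    "What's the weather in Boston and in San Francisco?"
)

# Independent requests may come back as multiple tool calls in one response.
for tool_call in response.tool_calls:
    print(tool_call["name"], tool_call["args"])
```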
Streaming tool calls
Streaming tool calls
When streaming responses, tool calls are progressively built through
You can accumulate chunks to build complete tool calls:
ToolCallChunk
. This allows you to see tool calls as they’re being
generated rather than waiting for the complete response.Streaming tool calls
Accumulate tool calls
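A sketch of accumulating tool call chunks via message summation:

```python
accumulated = None

for chunk in model_with_tools.stream("What's the weather in Boston?"):
    # Partial tool calls arrive on chunk.tool_call_chunks.
    accumulated = chunk if accumulated is None else accumulated + chunk

# After the stream ends, the summed chunks expose complete tool calls.
print(accumulated.tool_calls)
```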
Structured outputs
Models can be requested to provide their response in a format matching a given schema. This is useful for ensuring the output can be easily parsed and used in subsequent processing. LangChain supports multiple schema types and methods for enforcing structured outputs.
Pydantic models provide the richest feature set with field validation, descriptions, and nested structures.
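A sketch using with_structured_output with a Pydantic schema (the Movie schema is illustrative):

```python
from pydantic import BaseModel, Field

class Movie(BaseModel):
    """A movie with details."""
    title: str = Field(description="The title of the movie")
    year: int = Field(description="The year the movie was released")
    rating: float = Field(description="A rating out of 10")

structured_model = model.with_structured_output(Movie)

movie = structured_model.invoke("Describe the movie Inception.")
print(movie.title, movie.year, movie.rating)  # a validated Movie instance
```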
Key considerations for structured outputs:
- Method parameter: Some providers support different methods ('json_schema', 'function_calling', 'json_mode')
- Include raw: Use include_raw=True to get both the parsed output and the raw AI message
- Validation: Pydantic models provide automatic validation, while TypedDict and JSON Schema require manual validation
Example: Message output alongside parsed structure
It can be useful to return the raw AIMessage object alongside the parsed representation to access response metadata such as token counts:
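A sketch using include_raw=True (the result is a dict with "raw", "parsed", and "parsing_error" keys), reusing the Movie schema from above:

```python
structured_model = model.with_structured_output(Movie, include_raw=True)

result = structured_model.invoke("Describe the movie Inception.")
print(result["parsed"])              # the validated Movie instance
print(result["raw"].usage_metadata)  # token counts from the raw AIMessage
print(result["parsing_error"])       # None if parsing succeeded
```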
Example: Nested structures
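And a sketch of a nested schema (illustrative classes only):

```python
from pydantic import BaseModel, Field

class Actor(BaseModel):
    name: str
    role: str

class DetailedMovie(BaseModel):
    title: str
    year: int
    cast: list[Actor] = Field(description="Main cast members")

detailed = model.with_structured_output(DetailedMovie).invoke(
    "Describe the movie Inception, including its main cast."
)
print(detailed.cast[0].name)
```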
Supported models
LangChain supports all major model providers, including OpenAI, Anthropic, Google, Azure, AWS Bedrock, and more. Each provider offers a variety of models with different capabilities. For a full list of supported models in LangChain, see the integrations page.
Advanced configuration
Multimodal
Certain models can process and return non-textual data such as images, audio, and video. You can pass non-textual data to a model by providing content blocks.
All LangChain chat models with underlying multimodal capabilities support:
- Data in the cross-provider standard format
- OpenAI chat completions format
- Any format that is native to that specific provider (e.g., Anthropic models accept Anthropic native format)
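For example, a sketch that passes an image by URL using the OpenAI chat completions format (one of the supported input formats listed above; the URL is a placeholder, and see the multimodal docs for the cross-provider standard format):

```python
from langchain_core.messages import HumanMessage

message = HumanMessage(
    content=[
        {"type": "text", "text": "Describe the weather in this image."},
        # OpenAI chat completions-style image block; the URL is a placeholder.
        {"type": "image_url", "image_url": {"url": "https://example.com/sky.png"}},
    ]
)
response = model.invoke([message])
print(response.content)
```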
When a model returns multimodal data, the resulting AIMessage will have content blocks with multimodal types.
Multimodal output
Reasoning
Newer models are capable of performing multi-step reasoning to arrive at a conclusion. This involves breaking down complex problems into smaller, more manageable steps. If supported by the underlying model, you can surface this reasoning process to better understand how the model arrived at its final answer.
Depending on the provider, the reasoning effort can typically be configured using either categorical levels (such as 'low' or 'high') or integer token budgets.
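As an illustrative sketch only (parameter names differ by provider; reasoning_effort here follows the OpenAI integration, while other providers use a thinking/token-budget style parameter instead; "o4-mini" is an example model identifier):

```python
from langchain.chat_models import init_chat_model

model = init_chat_model(
    "o4-mini",
    model_provider="openai",
    reasoning_effort="low",  # categorical effort level: "low", "medium", or "high"
)
response = model.invoke("How many r's are in the word 'strawberry'?")
```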
For details, see the relevant chat model in the
integrations page.
Local models
LangChain supports running models locally on your own hardware. This is useful for scenarios where data privacy is critical, or when you want to avoid the cost of using a cloud-based model. Ollama is one of the easiest ways to run models locally. See the full list of local integrations on the integrations page.
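A sketch using Ollama through init_chat_model (assumes the langchain-ollama package is installed, a local Ollama server is running, and the model has been pulled; the model name is an example):

```python
from langchain.chat_models import init_chat_model

# e.g. after running `ollama pull llama3.1`
model = init_chat_model("llama3.1", model_provider="ollama")

response = model.invoke("Why is the sky blue?")
print(response.content)
```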
Caching
Chat model APIs can be slow and expensive to call. To help mitigate this, LangChain provides an optional caching layer for chat model integrations.
Enable caching for your model
By default, caching is disabled. To enable it, import the global cache setter, set_llm_cache. Next, choose a cache:
In Memory Cache
An ephemeral cache that stores model calls in memory. Wiped when
your environment restarts. Not shared across processes.
InMemoryCache
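A minimal sketch:

```python
from langchain_core.globals import set_llm_cache
from langchain_core.caches import InMemoryCache

set_llm_cache(InMemoryCache())

# First call hits the API; an identical second call is served from the cache.
model.invoke("Tell me a joke")
model.invoke("Tell me a joke")
```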
SQLite Cache
Uses a SQLite database to store responses, and will last across process restarts.
SQLite Cache
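A sketch (SQLiteCache lives in the langchain_community package; the database path is an example):

```python
from langchain_core.globals import set_llm_cache
from langchain_community.cache import SQLiteCache

# Responses are stored in this SQLite file and survive process restarts.
set_llm_cache(SQLiteCache(database_path="langchain_cache.db"))

model.invoke("Tell me a joke")
model.invoke("Tell me a joke")  # served from the cache on later runs too
```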
Rate limiting
Many chat model providers impose a limit on the number of invocations that can be made in a given time period. If you hit a rate limit, you will typically receive a rate limit error response from the provider, and will need to wait before making more requests. To help manage rate limits, chat model integrations accept a rate_limiter
parameter that can be provided during initialization to control the rate at
which requests are made.
Initialize and use a rate limiter
LangChain comes with a built-in, in-memory rate limiter. This rate limiter is thread safe and can be shared by multiple threads in the same process.
Define a rate limiter
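A sketch using InMemoryRateLimiter (the values are examples; the settings below allow roughly one request every ten seconds):

```python
from langchain_core.rate_limiters import InMemoryRateLimiter
from langchain.chat_models import init_chat_model

rate_limiter = InMemoryRateLimiter(
    requests_per_second=0.1,    # one request every 10 seconds
    check_every_n_seconds=0.1,  # how often to check whether a request may proceed
    max_bucket_size=10,         # maximum burst size
)

model = init_chat_model(
    "gpt-4o-mini", model_provider="openai", rate_limiter=rate_limiter
)
```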
The provided rate limiter can only limit the number of requests per unit
time. It will not help if you need to also limit based on the size of
the requests.
Base URL or proxy
For many chat model integrations, you can configure the base URL for API requests, which allows you to use model providers that have OpenAI-compatible APIs or to use a proxy server.
Base URL
Many model providers offer OpenAI-compatible APIs (e.g., Together AI, vLLM). You can use init_chat_model with these providers by specifying the appropriate base_url parameter, as in the sketch below.
When using direct chat model class instantiation, the parameter name may vary by provider. Check the respective reference for details.
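A sketch (the base URL, API key variable, and model name are placeholders for whatever OpenAI-compatible endpoint you are targeting):

```python
import os
from langchain.chat_models import init_chat_model

model = init_chat_model(
    "meta-llama/Llama-3.1-8B-Instruct",      # example model name
    model_provider="openai",                 # reuse the OpenAI-compatible client
    base_url="https://api.together.xyz/v1",  # example OpenAI-compatible endpoint
    api_key=os.environ["TOGETHER_API_KEY"],  # example environment variable
)
```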
Proxy configuration
For deployments requiring HTTP proxies, some model integrations support proxy configuration.
Proxy support varies by integration. Check the specific model provider’s
reference for proxy configuration options.
Log probabilities
Certain models can be configured to return token-level log probabilities representing the likelihood of a given token. Accessing them is as simple as setting the logprobs parameter when initializing a model:
Log probs
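A sketch using the OpenAI integration (parameter support and the shape of the returned metadata vary by provider):

```python
from langchain.chat_models import init_chat_model

model = init_chat_model("gpt-4o-mini", model_provider="openai", logprobs=True)

response = model.invoke("Hi!")
# With the OpenAI integration, log probabilities appear in response metadata.
print(response.response_metadata["logprobs"])
```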
Token usage
A number of model providers return token usage information as part of the invocation response. When available, this information will be included on the AIMessage
objects produced by the corresponding model. For more details, see
the messages guide.
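For example, a sketch of reading usage data off the response (field availability depends on the provider):

```python
response = model.invoke("Hello")

# usage_metadata is populated when the provider reports token counts.
print(response.usage_metadata)
# e.g. {'input_tokens': 8, 'output_tokens': 9, 'total_tokens': 17}
```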
Some provider APIs, notably OpenAI and Azure OpenAI chat completions, require
users opt-in to receiving token usage data in streaming contexts. See
this section of the
integration guide for details.
Callback handler
Invocation config
When invoking a model, you can pass additional configuration through the config
parameter using a RunnableConfig
dictionary. This provides run-time control over execution behavior, callbacks, and metadata tracking.
Common configuration options include:
Invocation with config
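A sketch (the run name, tags, and metadata values are arbitrary examples):

```python
response = model.invoke(
    "Tell me a joke",
    config={
        "run_name": "joke_generation",        # label for this specific run
        "tags": ["humor", "demo"],            # inherited by sub-calls
        "metadata": {"user_id": "user-123"},  # custom key-value context
    },
)
```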
Key configuration attributes
- run_name: Identifies this specific invocation in logs and traces. Not inherited by sub-calls.
- tags: Labels inherited by all sub-calls for filtering and organization in debugging tools.
- metadata: Custom key-value pairs for tracking additional context, inherited by all sub-calls.
- max_concurrency: Controls the maximum number of parallel calls when using batch() or batch_as_completed().
- callbacks: Handlers for monitoring and responding to events during execution. See callbacks for details.
- recursion_limit: Maximum recursion depth for chains to prevent infinite loops in complex pipelines.
These options are commonly used for:
- Debugging with LangSmith tracing
- Implementing custom logging or monitoring
- Controlling resource usage in production
- Tracking invocations across complex pipelines
For a full list of supported RunnableConfig attributes, see the RunnableConfig reference.
Configurable models
You can also create a runtime-configurable model by specifying configurable_fields. If you don't specify a model value, then 'model' and 'model_provider' will be configurable by default.
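A sketch (the model names in the config are examples):

```python
from langchain.chat_models import init_chat_model

# No model specified, so "model" and "model_provider" are configurable at runtime.
configurable_model = init_chat_model(temperature=0)

configurable_model.invoke(
    "What's your name?",
    config={"configurable": {"model": "gpt-4o-mini"}},
)
configurable_model.invoke(
    "What's your name?",
    config={"configurable": {"model": "claude-3-5-sonnet-latest"}},
)
```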
Configurable model with default values
We can create a configurable model with default model values, specify which
parameters are configurable, and add prefixes to configurable params:
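A sketch of this (the config_prefix and model names are examples):

```python
first_model = init_chat_model(
    model="gpt-4o-mini",
    model_provider="openai",
    temperature=0,
    configurable_fields=("model", "model_provider", "temperature", "max_tokens"),
    config_prefix="first",  # configurable keys become e.g. "first_model"
)

# Uses the defaults above.
first_model.invoke("What's your name?")

# Overrides the defaults at runtime via the prefixed keys.
first_model.invoke(
    "What's your name?",
    config={
        "configurable": {
            "first_model": "claude-3-5-sonnet-latest",
            "first_temperature": 0.5,
        }
    },
)
```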
Using a configurable model declaratively
We can call declarative operations like bind_tools, with_structured_output, with_config, etc. on a configurable model and chain a configurable model in the same way that we would a regularly instantiated chat model object.
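A sketch, reusing the get_weather tool defined earlier:

```python
configurable_model = init_chat_model(temperature=0)

# Bind tools declaratively; the result is still runtime-configurable.
configurable_model_with_tools = configurable_model.bind_tools([get_weather])

configurable_model_with_tools.invoke(
    "What's the weather in Boston?",
    config={"configurable": {"model": "gpt-4o-mini"}},
)
```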