Reduce Cost and Latency with Amazon Bedrock Intelligent Prompt Routing and Prompt Caching (Preview) | Amazon Web Services

December 5, 2024: Added instructions to request access to the Amazon Bedrock prompt caching preview.

Amazon Bedrock today previewed two features that help reduce the cost and latency of generative AI applications:

Amazon Bedrock Intelligent Prompt Routing – When invoking a model, you can now use a combination of foundation models (FMs) from the same model family to help optimize for quality and cost. For example, with Anthropic’s Claude family of models, Amazon Bedrock can intelligently route requests between Claude 3.5 Sonnet and Claude 3 Haiku depending on the complexity of the prompt. Similarly, Amazon Bedrock can route requests between Meta Llama 3.1 70B and 8B. The prompt router predicts which model will provide the best performance for each request while optimizing the quality of the response and the cost. This is particularly useful for applications such as customer service assistants, where uncomplicated queries can be handled by smaller, faster, and more cost-effective models, and complex queries are routed to more capable models. Intelligent Prompt Routing can reduce costs by up to 30 percent without compromising accuracy.

Amazon Bedrock now supports prompt caching – You can now cache frequently used context across multiple model invocations. This is especially valuable for applications that reuse the same context, such as document Q&A systems where users ask multiple questions about the same document, or coding assistants that need to maintain context about code files. The cached context remains available for up to 5 minutes after each access. Prompt caching in Amazon Bedrock can reduce costs by up to 90% and latency by up to 85% for supported models.

These features make it easy to reduce latency and balance performance with cost efficiency. Let’s see how you can use them in your applications.

Using Amazon Bedrock Intelligent Prompt Routing in the console
Amazon Bedrock Intelligent Prompt Routing uses advanced prompt matching and model understanding techniques to predict the performance of each model for every request, optimizing for quality of responses and cost. During the preview, you can use the default prompt routers for Anthropic’s Claude and Meta Llama model families.

Intelligent Prompt Routing can be accessed through the AWS Management Console, the AWS Command Line Interface (AWS CLI), and the AWS SDKs. In the Amazon Bedrock console, I choose Prompt routers in the Foundation models section of the navigation pane.

Screenshot of the console.

I choose the Anthropic Prompt Router default router to get more information.

Screenshot of the console.

From the prompt router configuration, I can see that it routes requests between Claude 3.5 Sonnet and Claude 3 Haiku using cross-Region inference profiles. The routing criteria defines the quality difference between the response of the largest model and the smallest model for each prompt, as predicted by the router’s internal model at runtime. Anthropic’s Claude 3.5 Sonnet is the fallback model used when none of the chosen models meet the desired performance criteria.

I choose Open in playground to chat using the prompt router and enter this prompt:

Alice has N brothers and she also has M sisters. How many sisters do Alice’s brothers have?

The result comes quickly. I choose the new Router metrics icon on the right to see which model was selected by the prompt router. In this case, because the question is rather complex, Anthropic’s Claude 3.5 Sonnet was used.

Screenshot of the console.

Now I ask a straightforward question to the same prompt router:

Describe the purpose of a 'hello world' program in one line.

This time, the prompt router chose Anthropic’s Claude 3 Haiku.

Screenshot of the console.

I choose Meta Prompt Router to check its configuration. It uses cross-Region inference profiles for Llama 3.1 70B and 8B, with the 70B model as the fallback.

Screenshot of the console.

Prompt routers are integrated with other Amazon Bedrock capabilities, such as Amazon Bedrock Knowledge Bases and Amazon Bedrock Agents, or when performing evaluations. For example, here I create a model evaluation to help me compare a prompt router to another model or prompt router for my use case.

Screenshot of the console.
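A prompt router is referenced by its ARN, much like a model. As a hypothetical sketch of the knowledge base integration mentioned above, the following Python code queries a knowledge base through the RetrieveAndGenerate API and lets the Anthropic Prompt Router pick the model used to generate the answer. The knowledge base ID is a placeholder, and passing a prompt router ARN as modelArn is an assumption based on the integration described in this post.

import boto3

# Hypothetical sketch: answer a question from a knowledge base, letting the
# Anthropic Prompt Router choose the generation model. KB_ID is a placeholder,
# and using a router ARN as modelArn is an assumption, not a documented fact.
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

KB_ID = "YOUR_KNOWLEDGE_BASE_ID"
ROUTER_ARN = "arn:aws:bedrock:us-east-1:123412341234:default-prompt-router/anthropic.claude:1"

response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "Which AWS service should I use for model training?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": KB_ID,
            "modelArn": ROUTER_ARN,
        },
    },
)
print(response["output"]["text"])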

To use a prompt router in my application, I need to set the prompt router Amazon Resource Name (ARN) as the model ID in the Amazon Bedrock API. Let’s see how this works with the AWS CLI and an AWS SDK.

Using Amazon Bedrock Intelligent Prompt Routing with the AWS CLI
The Amazon Bedrock API has been extended to handle prompt routers. For example, I can list the existing prompt routers in an AWS Region using ListPromptRouters:

aws bedrock list-prompt-routers

In the output, I get a summary of the existing prompt routers, similar to what I saw in the console.

Here is the full output of the previous command:

{
    "promptRouterSummaries": (
        {
            "promptRouterName": "Anthropic Prompt Router",
            "routingCriteria": {
                "responseQualityDifference": 0.26
            },
            "description": "Routes requests among models in the Claude family",
            "createdAt": "2024-11-20T00:00:00+00:00",
            "updatedAt": "2024-11-20T00:00:00+00:00",
            "promptRouterArn": "arn:aws:bedrock:us-east-1:123412341234:default-prompt-router/anthropic.claude:1",
            "models": (
                {
                    "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.anthropic.claude-3-haiku-20240307-v1:0"
                },
                {
                    "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.anthropic.claude-3-5-sonnet-20240620-v1:0"
                }
            ],
            "fallbackModel": {
                "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.anthropic.claude-3-5-sonnet-20240620-v1:0"
            },
            "status": "AVAILABLE",
            "type": "default"
        },
        {
            "promptRouterName": "Meta Prompt Router",
            "routingCriteria": {
                "responseQualityDifference": 0.0
            },
            "description": "Routes requests among models in the LLaMA family",
            "createdAt": "2024-11-20T00:00:00+00:00",
            "updatedAt": "2024-11-20T00:00:00+00:00",
            "promptRouterArn": "arn:aws:bedrock:us-east-1:123412341234:default-prompt-router/meta.llama:1",
            "models": (
                {
                    "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.meta.llama3-1-8b-instruct-v1:0"
                },
                {
                    "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.meta.llama3-1-70b-instruct-v1:0"
                }
            ],
            "fallbackModel": {
                "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.meta.llama3-1-70b-instruct-v1:0"
            },
            "status": "AVAILABLE",
            "type": "default"
        }
    ]
}
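The same summary is available from Python. Here is a minimal sketch using boto3, assuming the Bedrock control-plane client exposes ListPromptRouters as list_prompt_routers, mirroring the CLI command above:

import boto3

# List the default prompt routers in the Region and print their names and ARNs
# (assumes a boto3 version recent enough to include the prompt router operations).
bedrock = boto3.client("bedrock", region_name="us-east-1")

for router in bedrock.list_prompt_routers()["promptRouterSummaries"]:
    print(router["promptRouterName"], "->", router["promptRouterArn"])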

I can get information about a specific prompt router using GetPromptRouter with a prompt router ARN. For example, for the Meta Llama family of models:

aws bedrock get-prompt-router --prompt-router-arn arn:aws:bedrock:us-east-1:123412341234:default-prompt-router/meta.llama:1

{
    "promptRouterName": "Meta Prompt Router",
    "routingCriteria": {
        "responseQualityDifference": 0.0
    },
    "description": "Routes requests among models in the LLaMA family",
    "createdAt": "2024-11-20T00:00:00+00:00",
    "updatedAt": "2024-11-20T00:00:00+00:00",
    "promptRouterArn": "arn:aws:bedrock:us-east-1:123412341234:default-prompt-router/meta.llama:1",
    "models": (
        {
            "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.meta.llama3-1-8b-instruct-v1:0"
        },
        {
            "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.meta.llama3-1-70b-instruct-v1:0"
        }
    ],
    "fallbackModel": {
        "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.meta.llama3-1-70b-instruct-v1:0"
    },
    "status": "AVAILABLE",
    "type": "default"
}

To use a prompt router with Amazon Bedrock, I set the prompt router ARN as the model ID when making API calls. For example, here I use the Anthropic Prompt Router with the AWS CLI and the Amazon Bedrock Converse API:

aws bedrock-runtime converse \
    --model-id arn:aws:bedrock:us-east-1:123412341234:default-prompt-router/anthropic.claude:1 \
    --messages '[{ "role": "user", "content": [ { "text": "Alice has N brothers and she also has M sisters. How many sisters do Alice’s brothers have?" } ] }]'

In the output, invocations made through a prompt router include a new trace section that shows which model was actually used. In this case, it’s Anthropic’s Claude 3.5 Sonnet:

{
    "output": {
        "message": {
            "role": "assistant",
            "content": (
                {
                    "text": "To solve this problem, let's think it through step-by-step:\n\n1) First, we need to understand the relationships:\n   - Alice has N brothers\n   - Alice has M sisters\n\n2) Now, we need to consider who Alice's brothers' sisters are:\n   - Alice herself is a sister to all her brothers\n   - All of Alice's sisters are also sisters to Alice's brothers\n\n3) So, the total number of sisters that Alice's brothers have is:\n   - The number of Alice's sisters (M)\n   - Plus Alice herself (+1)\n\n4) Therefore, the answer can be expressed as: M + 1\n\nThus, Alice's brothers have M + 1 sisters."
                }
            ]
        }
    },
    . . .
    "trace": {
        "promptRouter": {
            "invokedModelId": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.anthropic.claude-3-5-sonnet-20240620-v1:0"
        }
    }
}
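If you prefer Python for this step, here is a minimal sketch of the equivalent non-streaming Converse call. It uses the same example router ARN and prompt, and reads the trace from the response to see which model was invoked:

import boto3

# Invoke the Anthropic Prompt Router with the Converse API and inspect the trace
# to see which model actually served the request.
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    modelId="arn:aws:bedrock:us-east-1:123412341234:default-prompt-router/anthropic.claude:1",
    messages=[{
        "role": "user",
        "content": [{"text": "Alice has N brothers and she also has M sisters. "
                             "How many sisters do Alice's brothers have?"}],
    }],
)

print(response["output"]["message"]["content"][0]["text"])
print(response["trace"]["promptRouter"]["invokedModelId"])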

Using Amazon Bedrock Intelligent Prompt Routing with an AWS SDK
Using an AWS SDK with a prompt router is similar to the previous command line experience. When invoking the model, I set the model ID to the prompt router ARN. For example, in this Python code I’m using the Meta Llama router with the ConverseStream API:

import json
import boto3

bedrock_runtime = boto3.client(
    "bedrock-runtime",
    region_name="us-east-1",
)

MODEL_ID = "arn:aws:bedrock:us-east-1:123412341234:default-prompt-router/meta.llama:1"

user_message = "Describe the purpose of a 'hello world' program in one line."
messages = [
    {
        "role": "user",
        "content": [{"text": user_message}],
    }
]

streaming_response = bedrock_runtime.converse_stream(
    modelId=MODEL_ID,
    messages=messages,
)

for chunk in streaming_response["stream"]:
    if "contentBlockDelta" in chunk:
        text = chunk["contentBlockDelta"]["delta"]["text"]
        print(text, end="")
    if "messageStop" in chunk:
        print()
    if "metadata" in chunk:
        if "trace" in chunk["metadata"]:
            print(json.dumps(chunk["metadata"]["trace"], indent=2))

This script prints the response text and the content of the trace in the response metadata. For this straightforward request, the faster and more affordable model was chosen by the prompt router:

A "Hello World" program is a simple, introductory program that serves as a basic example to demonstrate the fundamental syntax and functionality of a programming language, typically used to verify that a development environment is set up correctly.
{
  "promptRouter": {
    "invokedModelId": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.meta.llama3-1-8b-instruct-v1:0"
  }
}

Using Prompt Caching with an AWS SDK
You can use prompt caching with the Amazon Bedrock Converse API. When you mark content for caching and send it to the model for the first time, the model processes the input and caches the intermediate results. For subsequent requests containing the same content, the model fetches preprocessed results from the cache, significantly reducing both cost and latency.

You can implement prompt caching in your applications in a few steps:

  1. Identify the portions of your prompts that are frequently reused.
  2. Mark these sections for caching in the list of messages using the new cachePoint block (a minimal sketch follows this list).
  3. Monitor cache usage and latency improvements in the usage section of the response metadata.
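Here is a minimal sketch of step 2: a cachePoint block is just another element in the message content, and the content that precedes it in the conversation becomes eligible for caching (the context text below is a placeholder):

# Hypothetical message layout: the long, reused context comes first, followed by a
# cachePoint block that marks the prefix for caching, then the actual question.
REUSED_CONTEXT = "…long document text or instructions reused across requests…"

message = {
    "role": "user",
    "content": [
        {"text": REUSED_CONTEXT},
        {"cachePoint": {"type": "default"}},
        {"text": "First question about the context above."},
    ],
}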

Here is an example of implementing prompt caching when working with documents.

First, I download three PDF decision guides from the AWS website. These guides help you choose the AWS services that fit your use case.

I then use a Python script to ask three questions about the documents. In the code, I create a converse() function to handle the conversation with the model. When the function is first called, I include a list of documents and a flag to add a cachePoint block.

import json

import boto3

MODEL_ID = "us.anthropic.claude-3-5-sonnet-20241022-v2:0"
AWS_REGION = "us-west-2"

bedrock_runtime = boto3.client(
    "bedrock-runtime",
    region_name=AWS_REGION,
)

DOCS = [
    "bedrock-or-sagemaker.pdf",
    "generative-ai-on-aws-how-to-choose.pdf",
    "machine-learning-on-aws-how-to-choose.pdf",
]

messages = []


def converse(new_message, docs=[], cache=False):

    if len(messages) == 0 or messages[-1]["role"] != "user":
        messages.append({"role": "user", "content": []})

    for doc in docs:
        print(f"Adding document: {doc}")
        name, format = doc.rsplit('.', maxsplit=1)
        with open(doc, "rb") as f:
            bytes = f.read()
        messages[-1]["content"].append({
            "document": {
                "name": name,
                "format": format,
                "source": {"bytes": bytes},
            }
        })

    messages[-1]["content"].append({"text": new_message})

    if cache:
        messages[-1]["content"].append({"cachePoint": {"type": "default"}})

    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        messages=messages,
    )

    output_message = response["output"]["message"]
    response_text = output_message["content"][0]["text"]

    print("Response text:")
    print(response_text)

    print("Usage:")
    print(json.dumps(response["usage"], indent=2))

    messages.append(output_message)


converse("Compare AWS Trainium and AWS Inferentia in 20 words or less.", docs=DOCS, cache=True)
converse("Compare Amazon Textract and Amazon Transcribe in 20 words or less.")
converse("Compare Amazon Q Business and Amazon Q Developer in 20 words or less.")

For each invocation, the script prints the response and the usage counters.

Adding document: bedrock-or-sagemaker.pdf
Adding document: generative-ai-on-aws-how-to-choose.pdf
Adding document: machine-learning-on-aws-how-to-choose.pdf
Response text:
AWS Trainium is optimized for machine learning training, while AWS Inferentia is designed for low-cost, high-performance machine learning inference.
Usage:
{
  "inputTokens": 4,
  "outputTokens": 34,
  "totalTokens": 29879,
  "cacheReadInputTokenCount": 0,
  "cacheWriteInputTokenCount": 29841
}
Response text:
Amazon Textract extracts text and data from documents, while Amazon Transcribe converts speech to text from audio or video files.
Usage:
{
  "inputTokens": 59,
  "outputTokens": 30,
  "totalTokens": 29930,
  "cacheReadInputTokenCount": 29841,
  "cacheWriteInputTokenCount": 0
}
Response text:
Amazon Q Business answers questions using enterprise data, while Amazon Q Developer assists with building and operating AWS applications and services.
Usage:
{
  "inputTokens": 108,
  "outputTokens": 26,
  "totalTokens": 29975,
  "cacheReadInputTokenCount": 29841,
  "cacheWriteInputTokenCount": 0
}

The usage section of the response contains two new counters: cacheReadInputTokenCount and cacheWriteInputTokenCount. The total number of tokens for an invocation is the sum of the input and output tokens plus the tokens read from and written to the cache.
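As a quick check against the first invocation in the output above, the counters add up exactly as described:

# Usage from the first invocation in the example output above.
usage = {
    "inputTokens": 4,
    "outputTokens": 34,
    "totalTokens": 29879,
    "cacheReadInputTokenCount": 0,
    "cacheWriteInputTokenCount": 29841,
}

# totalTokens = input + output + cache reads + cache writes
total = (usage["inputTokens"] + usage["outputTokens"]
         + usage["cacheReadInputTokenCount"] + usage["cacheWriteInputTokenCount"])
assert total == usage["totalTokens"]  # 4 + 34 + 0 + 29841 = 29879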

Each invocation processes a list of messages. The messages in the first invocation contain the documents, the first question, and the cache point. Because the messages preceding the cache point aren’t currently in the cache, they are written to the cache. According to the usage counters, 29,841 tokens were written to the cache.

"cacheWriteInputTokenCount": 29841

For the following invocations, the previous response and the new question are appended to the list of messages. The messages before the cachePoint are unchanged and are found in the cache.

As expected, we can tell from the usage counters that the same number of tokens previously written is now read from the cache.

"cacheReadInputTokenCount": 29841

In my tests, subsequent invocations take 55 percent less time to complete than the first one. Depending on your use case (for example, with more cached content), prompt caching can improve latency by up to 85 percent.

Depending on the model, you can set more than one cache point in a list of messages. To find the right cache points for your use case, try different configurations and look at the effect on the reported usage.
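For example, the layout below is a hypothetical sketch with two cache points, one after instructions shared by every conversation and one after the per-session document context; how many cache points a given model supports is something to verify in the documentation:

# Hypothetical content layout with two cache points; the text values are placeholders.
SHARED_INSTRUCTIONS = "…instructions reused by every conversation…"
DOCUMENT_CONTEXT = "…document text reused within one session…"

content = [
    {"text": SHARED_INSTRUCTIONS},
    {"cachePoint": {"type": "default"}},   # first cache point
    {"text": DOCUMENT_CONTEXT},
    {"cachePoint": {"type": "default"}},   # second cache point
    {"text": "The user question goes here."},
]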

Things you should know
Amazon Bedrock Intelligent Prompt Routing is available in preview today in the US East (N. Virginia) and US West (Oregon) AWS Regions. During the preview, you can use the default prompt routers, and there is no additional cost for using a prompt router. You pay the cost of the selected model. You can use prompt routers with other Amazon Bedrock capabilities such as performing evaluations, using knowledge bases, and configuring agents.

Because the internal model used by the prompt routers needs to understand the complexity of a prompt, intelligent prompt routing currently supports only English language prompts.

Amazon Bedrock support for prompt caching is available in preview in US West (Oregon) for Anthropic’s Claude 3.5 Sonnet V2 and Claude 3.5 Haiku. Prompt caching is also available in US East (N. Virginia) for Amazon Nova Micro, Amazon Nova Lite, and Amazon Nova Pro. You can request access to the Amazon Bedrock prompt caching preview here.

With prompt caching, cache reads receive a 90 percent discount compared to noncached input tokens. There are no additional infrastructure charges for caching. When using Anthropic models, you pay an additional cost for tokens written to the cache. There are no additional costs for cache writes with Amazon Nova models. For more information, see Amazon Bedrock pricing.
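To get a feel for the effect on input costs, here is a small sketch that uses purely hypothetical per-token prices (look up the actual rates on the Amazon Bedrock pricing page) and only encodes the 90 percent discount on cache reads described above:

# Hypothetical price per 1,000 input tokens; replace with the real rate for your model.
PRICE_PER_1K_INPUT_TOKENS = 0.003
CACHE_READ_DISCOUNT = 0.90  # cache reads cost 90 percent less than regular input tokens

def input_cost(regular_input_tokens: int, cache_read_tokens: int) -> float:
    regular = regular_input_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
    cached = cache_read_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS * (1 - CACHE_READ_DISCOUNT)
    return regular + cached

# Second invocation from the example above: 59 regular input tokens, 29,841 read from cache.
print(f"{input_cost(59, 29_841):.6f}")       # input cost with caching
print(f"{input_cost(59 + 29_841, 0):.6f}")   # input cost if all tokens were regular input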

When using prompt caching, content is cached for up to 5 minutes, with each cache hit resetting this countdown. Prompt caching has been implemented to transparently support cross-Region inference. In this way, your applications can get the cost and latency benefits of prompt caching together with the flexibility of cross-Region inference.

These new features make it easier to build cost-effective and high-performance generative AI applications. By intelligently routing requests and caching frequently used content, you can significantly reduce your costs while maintaining and even improving application performance.

To learn more and start using these new features today, visit the Amazon Bedrock documentation and submit feedback to AWS re:Post for Amazon Bedrock. On community.aws, you can find detailed technical content and see how our Builder communities use Amazon Bedrock.

Danilo
