Amazon SageMaker JumpStart lets you deploy pretrained machine learning models with a few clicks. In this blog, we'll walk you through using JumpStart to deploy the Llama 2 model and expose it as an API endpoint using an AWS Lambda function.

Step 1: Set Up SageMaker Domain and User

Begin by navigating to the Amazon SageMaker service in the AWS Console. Here, create a domain and user profile. This will allow you to manage access and track your experiments.

Step 2a: Deploy the Llama2 Model

After setting up your domain and user, head over to the SageMaker console. Look for the "SageMaker JumpStart" section, and from there, choose to deploy the Llama 2 7B model. Deployment usually takes about seven minutes, and at the end of it, you'll be provided with an endpoint name. Make a note of this, as you will need it in Step 3.
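
The same deployment can also be scripted instead of clicked through. The sketch below assumes the SageMaker Python SDK's JumpStartModel class and the JumpStart model ID meta-textgeneration-llama-2-7b. The function name deploy_llama2_jumpstart and its accept_eula guard are illustrative, and the accept_eula keyword on deploy() may only be available in newer SDK versions.

```python
DEFAULT_MODEL_ID = "meta-textgeneration-llama-2-7b"  # assumed JumpStart model ID


def deploy_llama2_jumpstart(model_id=DEFAULT_MODEL_ID, accept_eula=False):
    """Deploy a JumpStart model and return the endpoint name (hypothetical helper)."""
    if not accept_eula:
        # Llama 2 is licensed; refuse to deploy until the caller opts in explicitly
        raise ValueError("Set accept_eula=True after reading the Llama 2 license")

    # Imported lazily so this module loads even without the SDK installed
    from sagemaker.jumpstart.model import JumpStartModel

    model = JumpStartModel(model_id=model_id)
    predictor = model.deploy(accept_eula=True)
    return predictor.endpoint_name
```

Calling deploy_llama2_jumpstart(accept_eula=True) from an environment with AWS credentials would start the deployment and return the endpoint name to note down.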

Step 2b: Integrating any Hugging Face Model with Amazon SageMaker

Instead of choosing from the Amazon SageMaker JumpStart model options, you can also bring any Hugging Face model and set it up yourself. Here's how.

1. Set up the development environment

We are going to use the SageMaker Python SDK to deploy Llama 2 to Amazon SageMaker. Make sure you have an AWS account configured and the sagemaker Python SDK installed.

import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

def setup_sagemaker_session(default_bucket=None):
    """
    Create a SageMaker session and resolve the IAM execution role.

    Args:
    - default_bucket: Default bucket name to use for the session

    Returns:
    - session: SageMaker session object
    - role_arn: ARN of the IAM execution role
    """
    # stored globally so later steps can reuse the role
    global sagemaker_execution_role
    session = sagemaker.Session(default_bucket=default_bucket)

    try:
        # works inside SageMaker notebooks and Studio
        sagemaker_execution_role = sagemaker.get_execution_role()
    except ValueError:
        # elsewhere, look the role up by name
        iam = boto3.client('iam')
        sagemaker_execution_role = iam.get_role(RoleName="sagemaker_execution_role")['Role']['Arn']

    return session, sagemaker_execution_role

def mask_account_id(account_id):
    return '*' * len(account_id)

def main():
    session, sagemaker_execution_role = setup_sagemaker_session(default_bucket=sagemaker_session_bucket)

    # Mask the account ID (fifth field of the ARN) before printing
    account_id = sagemaker_execution_role.split(':')[4]
    masked_account_id = mask_account_id(account_id)
    masked_role = sagemaker_execution_role.replace(account_id, masked_account_id)

    print(f"SageMaker role ARN: {masked_role}")
    print(f"SageMaker session region: {session.boto_region_name}")

if __name__ == "__main__":
    main()
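
To see what the masking step does without touching AWS, the sketch below applies the same split-on-colon logic to a made-up role ARN. The ARN value and the helper name mask_role_arn are illustrative only.

```python
def mask_role_arn(role_arn):
    """Replace the account ID (fifth colon-separated ARN field) with asterisks."""
    fields = role_arn.split(":")
    fields[4] = "*" * len(fields[4])
    return ":".join(fields)


example_arn = "arn:aws:iam::123456789012:role/sagemaker_execution_role"  # made-up ARN
print(mask_role_arn(example_arn))
```

Masking by field position rather than by string replacement avoids accidentally masking the same digits if they appear elsewhere in the ARN.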

2. Retrieve the new Hugging Face LLM DLC

Compared to deploying regular Hugging Face models, we first need to retrieve the container URI and pass it to the HuggingFaceModel class as image_uri. To retrieve the new Hugging Face LLM DLC in Amazon SageMaker, we can use the get_huggingface_llm_image_uri method provided by the sagemaker SDK. This method returns the URI for the desired Hugging Face LLM DLC based on the specified backend, session, region, and version. You can find the available versions here.

from sagemaker.huggingface import get_huggingface_llm_image_uri

## Fetch docker image URI for the Hugging Face DLC:
# 1. backend name
# 2. Hugging Face DLC version

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
    "huggingface",  # backend name
    # pass version="<DLC version>" to pin a release; if omitted, the SDK picks its default
)

# print ecr image uri
print(f"llm image uri: {llm_image}")

3. Hardware Requirements

Llama 2 comes in 3 different sizes: 7B, 13B and 70B parameters. The hardware requirements vary with the model size deployed to SageMaker. Below is the set of minimum requirements for each model size we tested.

Note: We haven't tested GPTQ models yet.

Model       Instance Type       Quantization    # of GPUs per replica
Llama 7B    (ml.)g5.2xlarge     -               1
Llama 13B   (ml.)g5.12xlarge    -               4
Llama 70B   (ml.)g5.48xlarge    bitsandbytes    8
Llama 70B   (ml.)p4d.24xlarge   -               8

Note: Amazon SageMaker currently doesn't support instance slicing. For Llama 70B, for example, you cannot run multiple replicas on a single instance.

These are the minimum setups we have validated for the 7B, 13B and 70B Llama 2 models on SageMaker. In the coming weeks, we plan to run detailed benchmarking covering latency and throughput numbers across different hardware configurations. We currently do not recommend deploying Llama 70B to g5.48xlarge instances, since long requests can time out due to the 60s request timeout limit for SageMaker. Use p4d instances for deploying Llama 70B.

It might be possible to run Llama 70B on g5.48xlarge instances without quantization by reducing the MAX_TOTAL_TOKENS and MAX_BATCH_TOTAL_TOKENS parameters. We haven't tested this yet.
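
As a rough cross-check of the table, the weights of a model served in 16-bit precision need about 2 bytes per parameter, before any memory for activations or the KV cache. The sketch below encodes the validated setups from the table and compares that estimate with total GPU memory. The per-GPU memory figures (24GB for the A10G in g5 instances, 40GB for the A100 in p4d.24xlarge) and all names are assumptions added for illustration.

```python
# Validated minimum setups from the table: (instance type, GPUs per replica)
MIN_SETUP = {
    "7b": ("ml.g5.2xlarge", 1),
    "13b": ("ml.g5.12xlarge", 4),
    "70b": ("ml.p4d.24xlarge", 8),
}
GPU_MEMORY_GB = {"g5": 24, "p4d": 40}  # assumed memory per GPU


def fp16_weight_gb(params_in_billions):
    """Approximate weight size: 2 bytes per parameter in 16-bit precision."""
    return params_in_billions * 2


def total_gpu_memory_gb(instance_type, num_gpus):
    family = instance_type.split(".")[1]  # "ml.g5.12xlarge" -> "g5"
    return GPU_MEMORY_GB[family] * num_gpus


for size, (instance_type, gpus) in MIN_SETUP.items():
    weights = fp16_weight_gb(int(size.rstrip("b")))
    memory = total_gpu_memory_gb(instance_type, gpus)
    print(f"Llama {size.upper()}: ~{weights}GB of weights, {memory}GB on {instance_type}")
```

By the same estimate, the roughly 140GB of 16-bit weights for Llama 70B leave limited headroom within the 192GB of a g5.48xlarge, which is consistent with the table pairing that instance with bitsandbytes quantization.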

# confirm requirements met for kernel
import json

def get_instance_type_from_metadata():
    # metadata file written by SageMaker for the running resource
    with open('/opt/ml/metadata/resource-metadata.json') as f:
        metadata = json.load(f)
        resource_name = metadata.get('ResourceName', '')
    return resource_name

def main():
    resource_name = get_instance_type_from_metadata()

    # List valid instance types
    valid_instance_types = ['ml.g5.12xlarge', 'ml.g5.48xlarge']

    # this check assumes the resource name contains the instance type
    if any(instance_type in resource_name for instance_type in valid_instance_types):
        print("Instance configured correctly")
    else:
        print("Need to upgrade to at least 'ml.g5.12xlarge' instance")

if __name__ == "__main__":
    main()

4. Deploy Llama 2 to Amazon SageMaker

To deploy meta-llama/Llama-2-13b-chat-hf to Amazon SageMaker, we create a HuggingFaceModel and define our endpoint configuration, including the hf_model_id, instance_type and so on. The minimum validated setup for this model is a g5.12xlarge instance, which has 4 NVIDIA A10G GPUs and 96GB of GPU memory. The configuration below uses the larger ml.p4d.24xlarge with 8 GPUs; if you deploy to ml.g5.12xlarge instead, set number_of_gpu to 4.

Note: Llama 2 is a gated model. Hugging Face provides a form to enable access to Llama 2 after you have been granted access by Meta. Visit the Meta website and accept the license terms and acceptable use policy before submitting the form. Requests are processed in 1-2 days.

import json
import getpass
from sagemaker.huggingface import HuggingFaceModel

def get_sagemaker_config():
    # sagemaker config
    instance_type = "ml.p4d.24xlarge"
    number_of_gpu = 8
    health_check_timeout = 300

    # Define Model and Endpoint configuration parameter
    config = {
      'HF_MODEL_ID': "meta-llama/Llama-2-13b-chat-hf", # model_id from the Hugging Face Hub
      'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
      'MAX_INPUT_LENGTH': json.dumps(2048),  # Max length of input text
      'MAX_TOTAL_TOKENS': json.dumps(4096),  # Max length of the generation (including input text)
      'MAX_BATCH_TOTAL_TOKENS': json.dumps(8192),  # Limits the number of tokens that can be processed in parallel during the generation
      'HUGGING_FACE_HUB_TOKEN': getpass.getpass("Enter your Hugging Face hub token")
      # ,'HF_MODEL_QUANTIZE': "bitsandbytes", # comment in to quantize
    }
    return instance_type, health_check_timeout, config

def create_huggingface_model(instance_type, config, role, image_uri):
    # check if token is set
    assert config['HUGGING_FACE_HUB_TOKEN'] != "", "Please set your Hugging Face Hub token"

    # create HuggingFaceModel with the image uri
    llm_model = HuggingFaceModel(
        role=role,
        image_uri=image_uri,
        env=config
    )
    return llm_model

def main():
    instance_type, health_check_timeout, config = get_sagemaker_config()
    # Set role and image_uri (both come from the earlier steps)
    role = sagemaker_execution_role
    llm_image_to_ref = llm_image
    # Declare llm model with the create_huggingface_model function
    llm_model = create_huggingface_model(instance_type, config, role, llm_image_to_ref)
    # Deploy model to the endpoint if llm_model is available
    llm = None
    if llm_model:
        llm = llm_model.deploy(
            initial_instance_count=1,
            instance_type=instance_type,
            container_startup_health_check_timeout=health_check_timeout,
        )
    return llm

if __name__ == "__main__":
    # keep the predictor at module level so the next step can call llm.predict
    llm = main()
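
The three token limits in the configuration depend on each other: an input plus its generated tokens must fit within MAX_TOTAL_TOKENS, and a batch must be able to hold at least one full request. The sketch below checks those relationships before deploying. The function name check_token_limits is hypothetical, and the exact rules enforced by the container may differ between TGI versions.

```python
import json


def check_token_limits(env):
    """Return a list of problems found in the TGI token limits (empty if none)."""
    max_input = int(env["MAX_INPUT_LENGTH"])
    max_total = int(env["MAX_TOTAL_TOKENS"])
    max_batch = int(env["MAX_BATCH_TOTAL_TOKENS"])

    problems = []
    if max_input >= max_total:
        problems.append("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    if max_total > max_batch:
        problems.append("MAX_TOTAL_TOKENS must not exceed MAX_BATCH_TOTAL_TOKENS")
    return problems


# Same values as the configuration above, serialized to strings the same way
env = {
    "MAX_INPUT_LENGTH": json.dumps(2048),
    "MAX_TOTAL_TOKENS": json.dumps(4096),
    "MAX_BATCH_TOTAL_TOKENS": json.dumps(8192),
}
print(check_token_limits(env))
```

With these values, a request that uses the full 2048 input tokens still has 4096 - 2048 = 2048 tokens left for generation.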

5. Run inference and chat with the model

After our endpoint is deployed, we can run inference on it using the predict method of the predictor. We can pass different parameters to influence the generation; they are defined in the parameters attribute of the payload. As of today, TGI supports the following parameters:

  • temperature: Controls randomness in the model. Lower values will make the model more deterministic and higher values will make the model more random. Default value is 1.0.
  • max_new_tokens: The maximum number of tokens to generate. Default value is 20, max value is 512.
  • repetition_penalty: Controls the likelihood of repetition, defaults to null.
  • seed: The seed to use for random generation, default is null.
  • stop: A list of tokens to stop the generation. The generation will stop when one of the tokens is generated.
  • top_k: The number of highest probability vocabulary tokens to keep for top-k-filtering. Default value is null, which disables top-k-filtering.
  • top_p: The cumulative probability of the highest-probability vocabulary tokens to keep for nucleus sampling. Default value is null.
  • do_sample: Whether or not to use sampling; uses greedy decoding otherwise. Default value is false.
  • best_of: Generate best_of sequences and return the one with the highest token logprobs. Default value is null.
  • details: Whether or not to return details about the generation. Default value is false.
  • return_full_text: Whether or not to return the full text or only the generated part. Default value is false.
  • truncate: Whether or not to truncate the input to the maximum length of the model. Default value is true.
  • typical_p: The typical probability of a token. Default value is null.
  • watermark: The watermark to use for the generation. Default value is false.

You can find the OpenAPI specification of TGI in the Swagger documentation.
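
A small helper can catch invalid parameter values before they reach the endpoint. The sketch below is illustrative only: build_parameters is a hypothetical name, and it enforces just the max_new_tokens limit quoted in the list above plus a positive temperature, which is an assumption about what the server accepts.

```python
def build_parameters(max_new_tokens=20, temperature=1.0, **extra):
    """Assemble the "parameters" dict of a TGI payload with basic validation."""
    if not 1 <= max_new_tokens <= 512:
        raise ValueError("max_new_tokens must be between 1 and 512")
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    parameters = {"max_new_tokens": max_new_tokens, "temperature": temperature}
    # Pass through optional settings such as top_p, top_k or stop, skipping unset ones
    parameters.update({key: value for key, value in extra.items() if value is not None})
    return parameters


payload = {
    "inputs": "What are some cool ideas to do in the summer?",
    "parameters": build_parameters(max_new_tokens=256, top_p=0.9, seed=None),
}
print(payload["parameters"])
```

Leaving unset options out of the payload lets the server apply its own defaults instead of receiving explicit nulls.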

The meta-llama/Llama-2-13b-chat-hf model is a conversational chat model, meaning we can chat with it using the following prompt format:

<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_msg_1 }} [/INST] {{ model_answer_1 }} </s><s>[INST] {{ user_msg_2 }} [/INST]

We create a small helper method build_llama2_prompt, which converts a List of "messages" into the prompt format. We also define a system_prompt which is used to start the conversation. We will use the system_prompt to ask the model about some cool ideas to do in the summer.

def build_llama2_prompt(messages):
    startPrompt = "<s>[INST] "
    endPrompt = " [/INST]"
    conversation = []
    for index, message in enumerate(messages):
        if message["role"] == "system" and index == 0:
            # the system prompt is wrapped in <<SYS>> tags
            conversation.append(f"<<SYS>>\n{message['content']}\n<</SYS>>\n\n")
        elif message["role"] == "user":
            conversation.append(message["content"].strip())
        else:
            # assistant answer: close the turn and open the next instruction
            conversation.append(f" [/INST] {message['content'].strip()} </s><s>[INST] ")

    return startPrompt + "".join(conversation) + endPrompt

messages = [
  { "role": "system","content": "You are a friendly and knowledgeable vacation planning assistant named Clara. Your goal is to have natural conversations with users to help them plan their perfect vacation. "}
]

# define question and add to messages
instruction = "What are some cool ideas to do in the summer?"
messages.append({"role": "user", "content": instruction})
prompt = build_llama2_prompt(messages)

chat = llm.predict({"inputs":prompt})

print(chat[0]["generated_text"])
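
To continue the conversation, append the model's answer as an assistant message followed by the next user message, then rebuild the prompt. The sketch below restates the prompt builder so that it runs on its own and uses a made-up assistant reply. It follows the Llama 2 chat format with <<SYS>> tags around the system prompt.

```python
def build_llama2_prompt(messages):
    """Convert a list of role/content messages into the Llama 2 chat format."""
    parts = []
    for index, message in enumerate(messages):
        content = message["content"].strip()
        if message["role"] == "system" and index == 0:
            parts.append(f"<<SYS>>\n{content}\n<</SYS>>\n\n")
        elif message["role"] == "user":
            parts.append(content)
        else:  # assistant turn: close it and open the next instruction
            parts.append(f" [/INST] {content} </s><s>[INST] ")
    return "<s>[INST] " + "".join(parts) + " [/INST]"


messages = [
    {"role": "system", "content": "You are a vacation planning assistant named Clara."},
    {"role": "user", "content": "What are some cool ideas to do in the summer?"},
    {"role": "assistant", "content": "You could go hiking or visit a lake."},  # made-up reply
    {"role": "user", "content": "Which of those works with small children?"},
]
print(build_llama2_prompt(messages))
```

Because the endpoint is stateless, the full message history has to be sent with every request, so long conversations eventually run into the input length limit.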


Now we will run inference with different parameters to influence the generation. Parameters are defined in the parameters attribute of the payload.

# hyperparameters for llm
payload = {
  "inputs":  prompt,
  "parameters": {
    "do_sample": True,
    "top_p": 0.6,
    "temperature": 0.9,
    "top_k": 50,
    "max_new_tokens": 512,
    "repetition_penalty": 1.03,
    "stop": ["</s>"]
  }
}

# send request to endpoint
response = llm.predict(payload)

print(response[0]["generated_text"])
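
The endpoint returns a list with one dictionary per generated sequence. The sketch below pulls out the text and removes the prompt if the response echoes it back. The response shape, a list of dictionaries with a generated_text key, is what TGI-based endpoints typically return; the helper name extract_answer and the sample response are made up.

```python
def extract_answer(response, prompt):
    """Return the generated text, without the prompt if it was echoed back."""
    text = response[0]["generated_text"]
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


sample_prompt = "<s>[INST] What are some cool ideas to do in the summer? [/INST]"
sample_response = [{"generated_text": sample_prompt + " Go hiking."}]  # made-up response
print(extract_answer(sample_response, sample_prompt))
```

Checking for the prompt prefix makes the helper work whether or not return_full_text is enabled.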


Step 3: Integrate with AWS Lambda

With your model deployed and the endpoint name at hand, navigate to AWS Lambda. Here, create a function that calls this endpoint. Below is the Python code for the Lambda function that invokes the SageMaker endpoint:

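import json
import os
import boto3

# Endpoint name noted in Step 2; reading it from an environment variable is an assumption
ENDPOINT_NAME = os.environ["ENDPOINT_NAME"]
runtime = boto3.client('runtime.sagemaker')

def lambda_handler(event, context):
    # the request body is forwarded to the endpoint unchanged
    data = event["body"]
    response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
                                       ContentType='application/json',
                                       Body=data,
                                       # Llama 2 endpoints deployed through JumpStart may
                                       # require accepting the EULA on each request
                                       CustomAttributes='accept_eula=true')
    response_content = response['Body'].read().decode()
    result = json.loads(response_content)
    return {
        'statusCode': 200,
        'body': json.dumps(result)
    }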
Step 4: Grant Necessary Permissions

For your Lambda function to work correctly, it requires permission to invoke the SageMaker endpoint. To grant this permission, create an inline policy in AWS Identity and Access Management (IAM) with the following JSON:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": "sagemaker:InvokeEndpoint",
      "Resource": "*"
    }
  ]
}

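The policy above allows invoking every endpoint in the account. A tighter variant restricts Resource to a single endpoint ARN. The sketch below builds such a policy. The region, account ID and endpoint name are placeholders, and build_invoke_policy is a hypothetical helper.

```python
import json


def build_invoke_policy(region, account_id, endpoint_name):
    """Build an IAM policy document that allows invoking one SageMaker endpoint."""
    endpoint_arn = f"arn:aws:sagemaker:{region}:{account_id}:endpoint/{endpoint_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sagemaker:InvokeEndpoint",
                "Resource": endpoint_arn,
            }
        ],
    }


# Placeholder values; substitute your own region, account ID and endpoint name
policy = build_invoke_policy("us-east-1", "123456789012", "my-llama2-endpoint")
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the inline policy editor in place of the wildcard version.
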
Step 5: Expose Lambda Function as an API

Once your function is ready and permissions are in place, expose this Lambda function through a URL, for example a Lambda function URL. This makes it accessible over the web, so it can be called with common web tools.
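
One way to create such a URL programmatically is a Lambda function URL. The sketch below assumes the boto3 Lambda client methods create_function_url_config and add_permission. The client is passed in so the logic can be exercised with a stand-in object, and the function name, statement ID and sample URL are made up. An AuthType of NONE makes the URL public, so AWS_IAM is used as the default here.

```python
def expose_function(lambda_client, function_name, auth_type="AWS_IAM"):
    """Create a function URL for a Lambda function and return it."""
    if auth_type == "NONE":
        # A public URL also needs a resource policy that allows anonymous invocation
        lambda_client.add_permission(
            FunctionName=function_name,
            StatementId="AllowPublicFunctionUrl",  # hypothetical statement ID
            Action="lambda:InvokeFunctionUrl",
            Principal="*",
            FunctionUrlAuthType="NONE",
        )
    response = lambda_client.create_function_url_config(
        FunctionName=function_name, AuthType=auth_type
    )
    return response["FunctionUrl"]


class FakeLambdaClient:
    """Stand-in for boto3.client("lambda") that records calls instead of making them."""

    def __init__(self):
        self.calls = []

    def add_permission(self, **kwargs):
        self.calls.append("add_permission")

    def create_function_url_config(self, **kwargs):
        self.calls.append("create_function_url_config")
        return {"FunctionUrl": "https://example.invalid/"}  # made-up URL


client = FakeLambdaClient()
print(expose_function(client, "llama2-proxy", auth_type="NONE"), client.calls)
```

In real use, pass boto3.client("lambda") instead of the stand-in.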

Step 6: Testing Your Deployment

To test the setup, use a curl command like the one below, passing your function's URL as the --location argument:

curl --location '' \
--header 'Content-Type: application/json' \
--data '{
    "inputs":"I believe the meaning of life is"
}'

This command sends a POST request to your Lambda function, which in turn invokes the SageMaker endpoint and returns the model's prediction.
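
The same request can be sent from Python with the standard library. The sketch below only builds the request object. The URL is a placeholder for your function URL, and build_request is a hypothetical helper.

```python
import json
import urllib.request


def build_request(url, prompt):
    """Build a POST request carrying the prompt as a JSON body."""
    body = json.dumps({"inputs": prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request("https://example.invalid/", "I believe the meaning of life is")
print(request.get_method(), request.data.decode("utf-8"))
```

To send it, pass the request to urllib.request.urlopen and decode the JSON response body.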

In Conclusion

Setting up a machine learning model and exposing it as an API using Amazon SageMaker JumpStart and AWS Lambda is a straightforward process. It offers scalability and ease of integration, making it suitable for a range of applications. For more detailed steps and best practices, you can refer to this official AWS blog.