Use Visual Question Answering (VQA) to get image information

Visual Question Answering (VQA) lets you provide an image to the model and ask a question about the image's contents. In response to your question, you get one or more natural language answers.

Sample VQA image, question and answers in the console
Image source (shown in Google Cloud console): Sharon Pittaway on Unsplash
Prompt question: What objects are in the image?
Answer 1: marbles
Answer 2: glass marbles

Languages supported

VQA is available in the following languages:

  • English (en)

Performance and limitations

The following limits apply when you use this model:

  • Maximum number of API requests (short-form) per minute per project: 500
  • Maximum number of tokens returned in the response (short-form): 64 tokens
  • Maximum number of tokens accepted in the request (VQA short-form only): 80 tokens

The following service latency estimates apply when you use this model. These values are meant to be illustrative and are not a promise of service:

  • API requests (short-form): 1.5 seconds

Locations

A location is a region you can specify in a request to control where data is stored at rest. For a list of available regions, see Generative AI on Vertex AI locations.

Use VQA on an image (short-form responses)

Use the following samples to ask a question and get an answer about an image.

Console

  1. In the Google Cloud console, open the Vertex AI Studio > Vision tab in the Vertex AI dashboard.

  2. In the lower menu, click Visual Q & A.

  3. Click Upload image to select the local image you want to ask a question about.

  4. In the Parameters panel, choose your Number of captions and Language.

  5. In the prompt field, enter a question about your uploaded image.

  6. Click Submit.

REST

For more information about imagetext model requests, see the imagetext model API reference.

Before using any of the request data, make the following replacements:

  • PROJECT_ID: Your Google Cloud project ID.
  • LOCATION: Your project's region. For example, us-central1, europe-west2, or asia-northeast3. For a list of available regions, see Generative AI on Vertex AI locations.
  • VQA_PROMPT: The question you want answered about your image. For example:
    • What color is this shoe?
    • What type of sleeves are on the shirt?
  • B64_IMAGE: The image you want to ask a question about. The image must be specified as a base64-encoded byte string. Size limit: 10 MB. A sketch of how to produce this value follows this list.
  • RESPONSE_COUNT: The number of answers you want to generate. Accepted integer values: 1-3.
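
To produce the B64_IMAGE value, you can base64-encode the raw image bytes. The following is a minimal sketch in Python; the filename my-image.png is a placeholder for your own local file:

import base64

# Read a local image (any supported format, up to 10 MB) and base64-encode it.
with open("my-image.png", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

# Use the value of b64_image for the B64_IMAGE placeholder in the request body.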

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/imagetext:predict

Request JSON body:

{
  "instances": [
    {
      "prompt": "VQA_PROMPT",
      "image": {
          "bytesBase64Encoded": "B64_IMAGE"
      }
    }
  ],
  "parameters": {
    "sampleCount": RESPONSE_COUNT
  }
}

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/imagetext:predict"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/imagetext:predict" | Select-Object -Expand Content

The following sample response is for a request with "sampleCount": 2 and "prompt": "What is this?". The response returns two answer strings.
{
  "predictions": [
    "cappuccino",
    "coffee"
  ]
}
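
If you prefer to call the REST endpoint from Python instead of using curl or the SDK, the following sketch uses the google-auth and requests libraries; the project ID, location, image path, question, and answer count are placeholder assumptions you should replace:

import base64

import google.auth
import google.auth.transport.requests
import requests

# Placeholder values -- replace with your own.
PROJECT_ID = "my-project"
LOCATION = "us-central1"
IMAGE_PATH = "my-image.png"
QUESTION = "What is this?"

# Get an access token from Application Default Credentials.
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(google.auth.transport.requests.Request())

endpoint = (
    f"https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}"
    f"/locations/{LOCATION}/publishers/google/models/imagetext:predict"
)

with open(IMAGE_PATH, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

body = {
    "instances": [
        {"prompt": QUESTION, "image": {"bytesBase64Encoded": image_b64}}
    ],
    "parameters": {"sampleCount": 2},
}

response = requests.post(
    endpoint,
    headers={
        "Authorization": f"Bearer {credentials.token}",
        "Content-Type": "application/json; charset=utf-8",
    },
    json=body,
)
response.raise_for_status()

# The predictions field is a list of answer strings.
for i, answer in enumerate(response.json()["predictions"], start=1):
    print(f"Answer {i}: {answer}")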

Python

Before trying this sample, follow the Python setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Python API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

In this sample you use the load_from_file method to reference a local file as the base Image to get information about. After you specify the base image, you use the ask_question method on the ImageTextModel and print the answers.


import vertexai
from vertexai.preview.vision_models import Image, ImageTextModel

# TODO(developer): Update and uncomment the lines below
# project_id = "PROJECT_ID"
# input_file = "my-input.png"
# question = "" # The question about the contents of the image.

vertexai.init(project=project_id, location="us-central1")

model = ImageTextModel.from_pretrained("imagetext@001")
source_img = Image.load_from_file(location=input_file)

answers = model.ask_question(
    image=source_img,
    question=question,
    # Optional parameters
    number_of_results=1,
)

print(answers)
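
As a variation on this sample, you can request more than one answer for the same image and print each one. The following sketch assumes the model and source_img objects created above:

# Request up to three answers for the same image and question.
answers = model.ask_question(
    image=source_img,
    question="What objects are in the image?",
    number_of_results=3,
)
for i, answer in enumerate(answers, start=1):
    print(f"Answer {i}: {answer}")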

Use parameters for VQA

When you get VQA responses, you can set several parameters depending on your use case.

Number of results

Use the number of results parameter to limit the number of responses returned for each request you send. For more information, see the imagetext (VQA) model API reference.

Seed number

A number you add to a request to make generated responses deterministic. Adding a seed number to your request is a way to help ensure you get the same predictions (responses) each time. However, the answers aren't necessarily returned in the same order. For more information, see the imagetext (VQA) model API reference.
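
For example, both parameters are set in the parameters object of the REST request body. This is a sketch only; the seed field name is taken from the imagetext (VQA) model API reference and should be confirmed there:

{
  "instances": [
    {
      "prompt": "VQA_PROMPT",
      "image": {
          "bytesBase64Encoded": "B64_IMAGE"
      }
    }
  ],
  "parameters": {
    "sampleCount": 2,
    "seed": 42
  }
}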

What's next