
How to utilize vision of LLM? #159

Closed
@mocheng

Description


Question

How do I use the vision capability of an LLM with the OpenAI Agents SDK?

The API should support specifying an image URL, a local image path, or a base64 string. The documentation seems to be missing this feature.

A code example of "describe an uploaded image" would be appreciated.

Activity

DanieleMorotti (Contributor) commented on Mar 14, 2025

Hi, I tested the following code and it works with both a URL and base64 images. As usual, you can use the OpenAI example code to encode an image to base64:

import base64

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Path to your image
image_path = "image.jpeg"

# Getting the Base64 string
base64_image = encode_image(image_path)

And then you can pass the base64 image or a URL:

from agents import Agent, Runner, ModelSettings

agent = Agent(
    name="Assistant",
    model="gpt-4o-mini",
    model_settings=ModelSettings(temperature=0.4, max_tokens=1024),
    instructions="Given an input image you will generate the description of the image in the style specified by the user."
)

result = await Runner.run(agent, input=[
    {
        "role": "user",
        "content": [
            {"type": "input_text", "text": "Describe this image with an haiku."},
            {
                "type": "input_image",
                "image_url": f"data:image/jpeg;base64,{base64_image}", # or your url "https://..."
            },
        ]
    }
])
print(result.final_output)

I don't know if this is the best available method, but I hope this may be useful.
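
One note: await Runner.run(...) has to be called from inside an async function (or you can use Runner.run_sync(...) instead). A minimal wrapper for a plain script, reusing the agent and base64_image defined above, could look like this:

import asyncio

async def main():
    result = await Runner.run(agent, input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "Describe this image with a haiku."},
                {"type": "input_image", "image_url": f"data:image/jpeg;base64,{base64_image}"},
            ]
        }
    ])
    print(result.final_output)

asyncio.run(main())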

mocheng (Author) commented on Mar 15, 2025

@DanieleMorotti It works. Thanks!

xin-ran-w commented on Apr 2, 2025

I still have a question: can an MCP server return an image as its result? I want gpt-4o to see that image. How can I do this?

DanieleMorotti (Contributor) commented on Apr 2, 2025

I first tried to adopt tool_use_behavior={"stop_at_tool_name": [...]} but it didn't work for me.

The only idea that comes to mind is to pass a function as tool_use_behavior, where you select the output of the tool and return it as the final output.

from agents import Agent, ModelSettings, Runner, ToolsToFinalOutputResult

def stop_at_tool(context, tools_resp):
    # Return the first tool's output directly as the final output of the run.
    res = tools_resp[0].output
    return ToolsToFinalOutputResult(is_final_output=True, final_output=res)

mn_agent = Agent(
    name="Image descriptor agent",
    model="gpt-4o",
    instructions="You have to describe the image requested by the user",
    model_settings=ModelSettings(temperature=0.3, max_tokens=2048),
    mcp_servers=[mcp_server],
    tool_use_behavior=stop_at_tool,
)

Then you can check the output returned by the MCP server and append a new message with the returned image to the chat history, so that the image is correctly sent to the LLM.
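
For example, a rough sketch (the names here are just illustrative, and it assumes the MCP tool returns a base64-encoded JPEG string that stop_at_tool captured as the final output):

from agents import Agent, ModelSettings, Runner

# Run the MCP-backed agent; stop_at_tool returns the raw tool output.
tool_run = await Runner.run(mn_agent, input="Get the image the user asked for")
image_b64 = tool_run.final_output

# A separate agent (no MCP server attached) that just describes the image.
describer = Agent(
    name="Image description agent",
    model="gpt-4o",
    instructions="Describe the image provided by the user.",
    model_settings=ModelSettings(temperature=0.3, max_tokens=2048),
)

described = await Runner.run(describer, input=[
    {
        "role": "user",
        "content": [
            {"type": "input_text", "text": "Describe this image."},
            {"type": "input_image", "image_url": f"data:image/jpeg;base64,{image_b64}"},
        ]
    }
])
print(described.final_output)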

abeltre1 commented on May 8, 2025

(Quoted @DanieleMorotti's example from above: the encode_image helper and the Runner.run call with an input_image item.)

Is there an API in the OpenAI SDK that facilitates seamless interaction between two different agents, specifically a vision model and a language model, with a straightforward handoff between them? How can I communicate during this handoff whether the vision model or the large language model will be used? Without such a mechanism, the input may fail when sent to the vision model if it is not appropriately formatted, particularly if I do not implement the mechanism you described earlier. @DanieleMorotti

rjawaji1 commented on May 12, 2025

The provided code gives me this error:

❯ uv run main.py
Traceback (most recent call last):
  File "/Users/redacted/Developer/Private/Side/agents/main.py", line 65, in <module>
    asyncio.run(main())
    ~~~~~~~~~~~^^^^^^^^
  File "/Users/redacted/.local/share/uv/python/cpython-3.13.3-macos-aarch64-none/lib/python3.13/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ~~~~~~~~~~^^^^^^
  File "/Users/redacted/.local/share/uv/python/cpython-3.13.3-macos-aarch64-none/lib/python3.13/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
  File "/Users/redacted/.local/share/uv/python/cpython-3.13.3-macos-aarch64-none/lib/python3.13/asyncio/base_events.py", line 719, in run_until_complete
    return future.result()
           ~~~~~~~~~~~~~^^
  File "/Users/redacted/Developer/Private/Side/agents/main.py", line 40, in main
    result = await Runner.run(
             ^^^^^^^^^^^^^^^^^
    ...<17 lines>...
    )
    ^
  File "/Users/redacted/Developer/Private/Side/agents/.venv/lib/python3.13/site-packages/agents/run.py", line 218, in run
    input_guardrail_results, turn_result = await asyncio.gather(
                                           ^^^^^^^^^^^^^^^^^^^^^
    ...<19 lines>...
    )
    ^
  File "/Users/redacted/Developer/Private/Side/agents/.venv/lib/python3.13/site-packages/agents/run.py", line 760, in _run_single_turn
    new_response = await cls._get_new_response(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<10 lines>...
    )
    ^
  File "/Users/redacted/Developer/Private/Side/agents/.venv/lib/python3.13/site-packages/agents/run.py", line 919, in _get_new_response
    new_response = await model.get_response(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<10 lines>...
    )
    ^
  File "/Users/redacted/Developer/Private/Side/agents/.venv/lib/python3.13/site-packages/agents/models/openai_chatcompletions.py", line 61, in get_response
    response = await self._fetch_response(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<9 lines>...
    )
    ^
  File "/Users/redacted/Developer/Private/Side/agents/.venv/lib/python3.13/site-packages/agents/models/openai_chatcompletions.py", line 193, in _fetch_response
    converted_messages = Converter.items_to_messages(input)
  File "/Users/redacted/Developer/Private/Side/agents/.venv/lib/python3.13/site-packages/agents/models/chatcmpl_converter.py", line 302, in items_to_messages
    "content": cls.extract_all_content(content),
               ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^
  File "/Users/redacted/Developer/Private/Side/agents/.venv/lib/python3.13/site-packages/agents/models/chatcmpl_converter.py", line 237, in extract_all_content
    "detail": casted_image_param["detail"],
              ~~~~~~~~~~~~~~~~~~^^^^^^^^^^
KeyError: 'detail'

using this code:

import asyncio

from agents import (
    Agent,
    Runner,
    set_default_openai_api,
    set_default_openai_client,
    set_tracing_disabled,
)
from openai import AsyncOpenAI

# IMAGE_URL (the image to describe) is defined elsewhere in the script.

async def main():
    base_agent = AsyncOpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    set_default_openai_client(base_agent)

    agent = Agent(
        name="Assistant",
        instructions="You are a helpful assistant.",
        model="gemma3:4b-it-qat",
    )

    result = await Runner.run(
        agent,
        input=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_text",
                        "text": "Can you tell me something about this image?",
                    },
                    {"type": "input_image", "image_url": IMAGE_URL},
                ],
            },
        ],
    )
    print(result.final_output)


if __name__ == "__main__":
    set_tracing_disabled(True)
    set_default_openai_api("chat_completions")
    asyncio.run(main())

DanieleMorotti (Contributor) commented on May 12, 2025

@therjawaji you're right, there's a bug: the code tries to access that key even when it doesn't exist. I opened a PR to fix it.
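
In the meantime, a possible workaround, assuming the converter only fails because the optional key is missing, is to pass the detail field explicitly in the image item:

# Possible workaround until the fix is released: include the optional "detail" key.
{"type": "input_image", "image_url": IMAGE_URL, "detail": "auto"}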
