Use OpenAI’s Vision API to get a description of an image
In this blog post I will show you how to use OpenAI’s Vision API to generate detailed descriptions of any image.
Introduction
During the OpenAI Dev Day keynote, Sam Altman announced that the new Vision API would be available. In this blog post we will write a simple console application in C# that sends a request to the Vision API to generate a textual description of an image.
Prerequisites
You need to create an OpenAI account on this website. The API is paid, so make sure that you add your payment information. After that, you can create an API key here for further use.
Let’s code
I am using Visual Studio to create a new .NET console application targeting .NET 7. At the moment there is no official NuGet package for communicating with the OpenAI API, so we have to call the API ourselves.
Add the Spectre.Console NuGet package to the project and create a new folder called Models where we will add the request and response models. Because we are using the Chat Completions API, we need several models. I won’t post all of them here, but you will find the needed files in the GitHub repository.
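The exact model classes are in the repository; as a rough sketch, the request and response for the Chat Completions endpoint with vision input can be modeled like this (the class and property names below are my own, the repository may use different ones):

```csharp
using System.Collections.Generic;
using System.Text.Json.Serialization;

// Request models (names are illustrative, not necessarily the repo's).
public record VisionRequest(
    [property: JsonPropertyName("model")] string Model,
    [property: JsonPropertyName("messages")] List<Message> Messages,
    [property: JsonPropertyName("max_tokens")] int MaxTokens);

public record Message(
    [property: JsonPropertyName("role")] string Role,
    [property: JsonPropertyName("content")] List<ContentPart> Content);

// A content part is either a text prompt or an image reference.
public record ContentPart(
    [property: JsonPropertyName("type")] string Type,
    [property: JsonPropertyName("text")] string? Text = null,
    [property: JsonPropertyName("image_url")] ImageUrl? ImageUrl = null);

public record ImageUrl([property: JsonPropertyName("url")] string Url);

// Response models: the description sits at choices[0].message.content.
public record VisionResponse(
    [property: JsonPropertyName("choices")] List<Choice> Choices);

public record Choice(
    [property: JsonPropertyName("message")] ResponseMessage Message);

public record ResponseMessage(
    [property: JsonPropertyName("content")] string Content);
```

When serializing the request, ignore null properties (for example via `JsonSerializerOptions.DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull`) so a text part does not carry an empty `image_url` field.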
After you’ve created all the needed models, we can open the Program.cs file to build the business logic.
First, the console application asks the user to provide the needed parameters. In this case we just need the API key and the file path of a stored image file.
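The Vision endpoint accepts either a publicly reachable image URL or the image embedded as a base64 data URL. Since we are working with a local file, a small helper can do the encoding (this helper is my own sketch, not code from the repository):

```csharp
using System;
using System.IO;

static class ImageEncoder
{
    // Reads a local image file and returns it as a base64 data URL,
    // which the Vision endpoint accepts in place of a public URL.
    public static string ToDataUrl(string filePath)
    {
        byte[] bytes = File.ReadAllBytes(filePath);
        string mimeType = Path.GetExtension(filePath).ToLowerInvariant() switch
        {
            ".png" => "image/png",
            ".gif" => "image/gif",
            ".webp" => "image/webp",
            _ => "image/jpeg", // covers .jpg/.jpeg
        };
        return $"data:{mimeType};base64,{Convert.ToBase64String(bytes)}";
    }
}
```

The prompts themselves can be collected with Spectre.Console’s `AnsiConsole.Ask<string>(...)`, which is what the package was added for.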
We post our properties to the chat/completions endpoint. The response is a JSON object containing the textual description of the image along with additional properties. The console application writes the result to the console.
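Putting it together, the call itself is a plain HttpClient POST with a Bearer token. Here is a self-contained sketch using anonymous objects instead of the repo’s model classes; the model name `gpt-4-vision-preview` and the example prompt are assumptions from the Vision preview at the time of writing, so adjust them as needed:

```csharp
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

static class VisionClient
{
    // Builds the JSON body for the chat/completions endpoint.
    public static string BuildRequestJson(string imageDataUrl) =>
        JsonSerializer.Serialize(new
        {
            model = "gpt-4-vision-preview",
            max_tokens = 300,
            messages = new object[]
            {
                new
                {
                    role = "user",
                    content = new object[]
                    {
                        new { type = "text", text = "What is in this image?" },
                        new { type = "image_url", image_url = new { url = imageDataUrl } }
                    }
                }
            }
        });

    public static async Task<string> DescribeAsync(string apiKey, string imageDataUrl)
    {
        using var http = new HttpClient();
        http.DefaultRequestHeaders.Authorization =
            new AuthenticationHeaderValue("Bearer", apiKey);

        var body = new StringContent(
            BuildRequestJson(imageDataUrl), Encoding.UTF8, "application/json");
        var response = await http.PostAsync(
            "https://api.openai.com/v1/chat/completions", body);
        response.EnsureSuccessStatusCode();

        // The description sits at choices[0].message.content.
        using var doc = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
        return doc.RootElement.GetProperty("choices")[0]
            .GetProperty("message").GetProperty("content").GetString()!;
    }
}
```

Separating `BuildRequestJson` from the HTTP call keeps the payload easy to inspect before spending any tokens on a real request.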
Sample
Let’s run the console application. You will see the header and the first prompt asking for the OpenAI API key. Now you can provide all the needed data.
I’ve used the following sample image to get a description.
Here is the textual description of the image.
The image features the Sydney Opera House, which is an iconic and distinctive building located in Sydney, New South Wales, Australia. It is famous for its unique series of white shell-like structures on its roof, often said to resemble sails or shells. In the background, you can see a part of the Sydney Harbour, including a large cruise ship docked nearby. There is also a smaller boat in the water, and some industrial or historical structures on the waterfront. A crane indicates some construction or maintenance work might be happening in the area. This is a view that captures both the cultural landmark of the Opera House and a slice of the busy maritime activity in Sydney Harbour.
Conclusion
In this blog post I’ve written a simple .NET console application that uses the newly published Vision API via the Chat Completions endpoint. The API is pretty simple to use, but keep in mind that you pay for each call against it.
You can find the source code of the console application on my GitHub profile.
I’ve also published blog posts about the Text-To-Speech API, the Speech-To-Text API, and the DALL-E 3 API here on Medium.