Using OpenAI Batch API

This tutorial demonstrates how to use the OpenAI Batch API to process many requests asynchronously. Batch requests cost 50% less than the equivalent synchronous requests, and each job is submitted with a 24-hour completion window. The service is ideal for jobs that don’t require immediate responses.

Eligible researchers can email research-programming@wharton.upenn.edu to start the process of obtaining an OpenAI API key.

Import the necessary libraries:

import os
import json
import time
import pandas as pd
from dotenv import load_dotenv
from openai import OpenAI

Load environment variables and initialize the OpenAI client:

# Load environment variables from .env file
load_dotenv()

# Read API key from environment variables
api_key = os.getenv("OPENAI_API_KEY")

# Initialize OpenAI client
client = OpenAI(api_key=api_key)
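
If the .env file is missing or the variable is misspelled, os.getenv returns None and the problem only surfaces later as an authentication error. A minimal sketch of an earlier check follows; the helper name require_env is hypothetical and not part of the OpenAI library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment."
        )
    return value
```

Under that assumption, the key could be read with api_key = require_env("OPENAI_API_KEY") before the client is created.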

Load the list of famous people from a CSV file and display the first 10 entries:

df = pd.read_csv('famous_people.csv')
df.head(10)

Output:

id prompt
0 1 Fela Kuti
1 2 Marie Curie
2 3 Albert Einstein
3 4 Nelson Mandela
4 5 Mahatma Gandhi
5 6 Frida Kahlo
6 7 Winston Churchill
7 8 Che Guevara
8 9 Bruce Lee
9 10 Serena Williams

Create tasks for the batch endpoint using Structured Outputs:

Structured Outputs ensure that each batch result follows the same JSON schema, making the results easier to parse reliably.

birthplace_schema = {
    "type": "object",
    "properties": {
        "city_or_town": {"type": "string"},
        "country": {"type": "string"}
    },
    "required": ["city_or_town", "country"],
    "additionalProperties": False
}

tasks = []

for _, row in df.iterrows():
    person_id = str(row["id"])
    person_name = row["prompt"]

    task = {
        "custom_id": person_id,
        "method": "POST",
        "url": "/v1/responses",
        "body": {
            "model": "gpt-4o-mini",
            "input": [
                {
                    "role": "system",
                    "content": "Identify factual birthplaces of well known individuals. Return only the requested structured data."
                },
                {
                    "role": "user",
                    "content": f"Identify the place of birth for {person_name}."
                }
            ],
            "text": {
                "format": {
                    "type": "json_schema",
                    "name": "birthplace",
                    "strict": True,
                    "schema": birthplace_schema
                }
            }
        }
    }

    tasks.append(task)
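
Because the tasks set "strict": True, the schema itself has to meet the strict-mode rules for Structured Outputs: every property must be listed in "required" and "additionalProperties" must be False. The sketch below checks only those top-level rules before anything is submitted; check_strict_schema is a hypothetical helper, and it does not recurse into nested objects.

```python
def check_strict_schema(schema: dict) -> list[str]:
    """Return a list of top-level problems that would conflict with strict mode."""
    problems = []
    if schema.get("type") != "object":
        problems.append("top-level type must be 'object'")

    # In strict mode every declared property must also appear in 'required'
    properties = schema.get("properties", {})
    required = set(schema.get("required", []))
    missing = sorted(set(properties) - required)
    if missing:
        problems.append(f"properties not listed in 'required': {missing}")

    if schema.get("additionalProperties") is not False:
        problems.append("'additionalProperties' must be False")
    return problems


# Same schema as in the tutorial, repeated so the example runs on its own
birthplace_schema = {
    "type": "object",
    "properties": {
        "city_or_town": {"type": "string"},
        "country": {"type": "string"},
    },
    "required": ["city_or_town", "country"],
    "additionalProperties": False,
}

print(check_strict_schema(birthplace_schema))  # prints []
```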

Inspect the first task:

tasks[0]

Save the tasks to a JSONL file:

file_name = "batch_tasks_birthplaces.jsonl"

with open(file_name, "w", encoding="utf-8") as file:
    for task in tasks:
        file.write(json.dumps(task) + "\n")
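
Before uploading, it can be worth reading the file back to confirm that every line is valid JSON, carries the four keys used above, and has a unique custom_id (results are matched back to rows by that value). This is a rough sketch using only the standard library; validate_batch_file is a hypothetical name.

```python
import json


def validate_batch_file(path: str) -> int:
    """Return the number of tasks in a JSONL file, raising ValueError if a line is malformed."""
    seen_ids = set()
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            # json.JSONDecodeError is a subclass of ValueError
            task = json.loads(line)
            for key in ("custom_id", "method", "url", "body"):
                if key not in task:
                    raise ValueError(f"line {line_number}: missing key {key!r}")
            if task["custom_id"] in seen_ids:
                raise ValueError(
                    f"line {line_number}: duplicate custom_id {task['custom_id']!r}"
                )
            seen_ids.add(task["custom_id"])
    return len(seen_ids)
```

For the file written above, validate_batch_file(file_name) should return the same number as len(tasks).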

Upload the JSONL file to the OpenAI API:

# Open the file in a context manager so the handle is closed after the upload
with open(file_name, "rb") as file:
    batch_file = client.files.create(
        file=file,
        purpose="batch"
    )

print(batch_file)

Start the batch job:

batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/responses",
    completion_window="24h"
)

Monitor the batch job until completion:

while True:
    batch_job = client.batches.retrieve(batch_job.id)

    if batch_job.status == "completed":
        print(f"job {batch_job.id} is done")
        break

    if batch_job.status in ["failed", "expired", "cancelled"]:
        raise RuntimeError(f"Batch job ended with status: {batch_job.status}")

    time.sleep(10)
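
A fixed 10-second poll works for small jobs, but a batch can take hours, so a longer-running script may prefer increasing delays and an overall time limit. The sketch below is one possible variant, not the tutorial's method: wait_for_terminal_status and its parameters are hypothetical, and it takes a get_status callable so it can be tried without calling the API. It approximates elapsed time by summing the requested sleep durations.

```python
import time
from typing import Callable

# Statuses after which a batch job will not change any further
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_terminal_status(
    get_status: Callable[[], str],
    timeout_s: float = 24 * 60 * 60,
    initial_delay_s: float = 10.0,
    max_delay_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() with doubling delays until a terminal status or the timeout."""
    delay = initial_delay_s
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"still '{status}' after about {waited:.0f} seconds")
        sleep(delay)
        waited += delay
        # Double the delay each round, capped at max_delay_s
        delay = min(delay * 2, max_delay_s)
```

With the client from this tutorial, the callable could be lambda: client.batches.retrieve(batch_job.id).status; the caller still has to decide what to do when the returned status is not "completed".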

Retrieve and process the structured results:

results_by_id = {}

if batch_job.output_file_id:
    result_bytes = client.files.content(batch_job.output_file_id).content
    result_entries = result_bytes.decode("utf-8").strip().splitlines()

    for entry in result_entries:
        res = json.loads(entry)
        custom_id = res["custom_id"]

        if res.get("error"):
            results_by_id[custom_id] = None
            continue

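        # Each result line wraps the Responses API payload under response.body
        body = res["response"]["body"]
        # Assumes the first output item is the assistant message and that its
        # first content part holds the JSON text produced by Structured Outputs
        text = body["output"][0]["content"][0]["text"]
        data = json.loads(text)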

        results_by_id[custom_id] = f'{data["city_or_town"]}, {data["country"]}'

df["Place of Birth"] = df["id"].astype(str).map(results_by_id)
df

Output:

id prompt Place of Birth
0 1 Fela Kuti Abeokuta, Nigeria
1 2 Marie Curie Warsaw, Poland
2 3 Albert Einstein Ulm, Germany
3 4 Nelson Mandela Mvezo, South Africa
4 5 Mahatma Gandhi Porbandar, India
5 6 Frida Kahlo Coyoacán, Mexico
6 7 Winston Churchill Woodstock, England
7 8 Che Guevara Rosario, Argentina
8 9 Bruce Lee San Francisco, United States
9 10 Serena Williams Saginaw, United States
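
The parsing step indexes body["output"][0]["content"][0] directly, which relies on the assistant message being the first output item. As a more defensive sketch, the helper below searches the output items by type and returns None when no text is present (for example, if the model refused). The name extract_output_text is hypothetical, and the sample body is hand-written for illustration rather than taken from a real batch result.

```python
import json
from typing import Optional


def extract_output_text(body: dict) -> Optional[str]:
    """Return the first output_text string found in a Responses API body, or None."""
    for item in body.get("output", []):
        if item.get("type") != "message":
            continue  # skip non-message items such as reasoning or tool calls
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                return part["text"]
    return None


# Hand-written sample shaped like a Responses API body (illustrative only)
sample_body = {
    "output": [
        {"type": "reasoning", "summary": []},
        {
            "type": "message",
            "role": "assistant",
            "content": [
                {
                    "type": "output_text",
                    "text": '{"city_or_town": "Warsaw", "country": "Poland"}',
                }
            ],
        },
    ]
}

text = extract_output_text(sample_body)
data = json.loads(text)
print(f'{data["city_or_town"]}, {data["country"]}')  # prints Warsaw, Poland
```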