Using the OpenAI Batch API

This tutorial demonstrates how to use the OpenAI API’s batch endpoint to process many requests asynchronously. Batch requests cost 50% less than the equivalent synchronous requests, and results are returned within a 24-hour completion window. The service is ideal for jobs that don’t require immediate responses.

Eligible researchers can email research-programming@wharton.upenn.edu to start the process of obtaining an OpenAI API key.

Import the necessary libraries:

import os
import json
import time
import pandas as pd
from dotenv import load_dotenv
from openai import OpenAI

Load environment variables and initialize the OpenAI client:

# Load environment variables from .env file
load_dotenv()

# Read API key from environment variables
api_key = os.getenv("OPENAI_API_KEY")

# Initialize OpenAI client
client = OpenAI(api_key=api_key)
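
If the key is missing from the environment, `os.getenv` returns `None` and the problem only shows up later as an authentication error. A minimal sketch of an earlier check follows; `require_env` is a hypothetical helper name, not part of the `openai` or `python-dotenv` packages.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment"
        )
    return value
```

With this helper, the key could be read as `api_key = require_env("OPENAI_API_KEY")` before the client is created.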

Load the list of famous people from a CSV file and display the first 10 entries:

df = pd.read_csv('famous_people.csv')
df.head(10)

Output:

	id	prompt
0	1	Fela Kuti
1	2	Marie Curie
2	3	Albert Einstein
3	4	Nelson Mandela
4	5	Mahatma Gandhi
5	6	Frida Kahlo
6	7	Winston Churchill
7	8	Che Guevara
8	9	Bruce Lee
9	10	Serena Williams

Create tasks for the batch endpoint to identify each person’s place of birth:

# Create an array of tasks where the prompt is to identify the person's place of birth
tasks = []

for index, row in df.iterrows():
    person_name = row['prompt']
    task = {
        "custom_id": str(index),  # custom_id must be a string
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "temperature": 0.1,
            "messages": [
                {
                    "role": "system",
                    "content": "You are a helpful assistant tasked with identifying the birthplaces of famous individuals."
                },
                {
                    "role": "user",
                    "content": f"""Identify the place of birth for {person_name}. 
                                Format the result like `<city/town>,<country>`
                                Do not include any other information"""
                }
            ],
        }
    }
    tasks.append(task)
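
Because the prompt above is an indented triple-quoted string, the indentation becomes part of the text sent to the model. One possible alternative, sketched here, builds the prompt from separate lines inside a small function. `build_task` is a hypothetical helper name; the request fields are the same ones used in the loop above.

```python
def build_task(custom_id: str, person_name: str, model: str = "gpt-4o-mini") -> dict:
    """Build one batch request asking for a person's place of birth."""
    # Adjacent string literals are concatenated, so no indentation leaks into the prompt
    prompt = (
        f"Identify the place of birth for {person_name}.\n"
        "Format the result like `<city/town>,<country>`\n"
        "Do not include any other information"
    )
    return {
        "custom_id": custom_id,  # custom_id must be a string
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "temperature": 0.1,
            "messages": [
                {
                    "role": "system",
                    "content": "You are a helpful assistant tasked with identifying the birthplaces of famous individuals.",
                },
                {"role": "user", "content": prompt},
            ],
        },
    }
```

The task list could then be built as `tasks = [build_task(str(i), row['prompt']) for i, row in df.iterrows()]`.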

Inspect the first task:

tasks[0]

Output:

{'custom_id': '0',
 'method': 'POST',
 'url': '/v1/chat/completions',
 'body': {'model': 'gpt-4o-mini',
  'temperature': 0.1,
  'messages': [{'role': 'system',
    'content': 'You are a helpful assistant tasked with identifying the birthplaces of famous individuals.'},
   {'role': 'user',
    'content': 'Identify the place of birth for Fela Kuti. \n                                Format the result like `<city/town>,<country>`\n                                Do not include any other information'}]}}

Save the tasks to a JSONL file:

# Creating the jsonl file
file_name = "batch_tasks_movies.jsonl"

with open(file_name, 'w') as file:
    for obj in tasks:
        file.write(json.dumps(obj) + '\n')
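
Before uploading, it can be worth checking the file locally, since a malformed line or a repeated `custom_id` is easier to find here than after submission. The following is a rough sketch under that assumption; `validate_batch_file` is a hypothetical helper, and the checks cover only the fields used in this tutorial.

```python
import json


def validate_batch_file(path: str) -> int:
    """Check each line is a JSON task with a unique string custom_id; return the number of tasks."""
    seen_ids = set()
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                raise ValueError(f"line {line_number} is empty")
            task = json.loads(line)  # raises json.JSONDecodeError on malformed JSON
            for key in ("custom_id", "method", "url", "body"):
                if key not in task:
                    raise ValueError(f"line {line_number} is missing '{key}'")
            custom_id = task["custom_id"]
            if not isinstance(custom_id, str):
                raise ValueError(f"line {line_number}: custom_id must be a string")
            if custom_id in seen_ids:
                raise ValueError(f"line {line_number}: duplicate custom_id {custom_id!r}")
            seen_ids.add(custom_id)
            count += 1
    return count
```

Calling `validate_batch_file(file_name)` on the file written above should return the number of rows in the DataFrame.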

Upload the JSONL file to the OpenAI API:

# Upload the JSONL file to the OpenAI Files API; the with-block closes the file handle afterwards
with open(file_name, "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")
print(batch_file)

Output:

FileObject(id='file-qD2rLGWQOQWvMIqSGc8vru7h', bytes=4896, created_at=1729607937, filename='batch_tasks_movies.jsonl', object='file', purpose='batch', status='processed', status_details=None)

Start the batch job:

# Start batch job
batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

Monitor the batch job until completion:

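# Poll every 10 seconds until the job reaches a terminal state
while True:
    batch_job = client.batches.retrieve(batch_job.id)
    if batch_job.status == "completed":
        print(f"job {batch_job.id} is done")
        break
    if batch_job.status in ("failed", "expired", "cancelled"):
        # Stop instead of looping forever on a job that will never complete
        raise RuntimeError(f"job {batch_job.id} ended with status {batch_job.status}")
    time.sleep(10)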

Output:

job batch_6717b90334108190970786094985c0de is done
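
For longer jobs it may help to separate the polling logic from the API call and add an overall timeout. The sketch below is one way to do that; `wait_for_terminal_status` and its parameters are hypothetical names, and the set of terminal statuses reflects my understanding of the Batch API rather than anything stated in this tutorial.

```python
import time
from typing import Callable

# Statuses after which a batch job will not change again (assumed)
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_terminal_status(
    get_status: Callable[[], str],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 24 * 60 * 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get_status repeatedly until it returns a terminal status or the timeout passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still '{status}' after {timeout_seconds} seconds")
        sleep(poll_seconds)
```

With the client from this tutorial, it could be called as `wait_for_terminal_status(lambda: client.batches.retrieve(batch_job.id).status)`, and the caller can then decide what to do with any status other than "completed".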

Retrieve and process the results:

# Download the output file and parse it (one JSON object per line)
result_file_id = batch_job.output_file_id
raw = client.files.content(result_file_id).content.decode('utf-8')
results_list = [json.loads(line) for line in raw.strip().split("\n")]

# Output order is not guaranteed to match input order, so sort by custom_id
results_list.sort(key=lambda res: int(res['custom_id']))

birthplaces = []
for res in results_list:
    # custom_id is the DataFrame row index assigned when the task was created
    index = int(res['custom_id'])
    answer = res['response']['body']['choices'][0]['message']['content']
    person = df.iloc[index]['prompt']
    print(f"{person}\t: {answer}")
    print("\n\n----------------------------\n\n")
    birthplaces.append(answer)

Output:

Fela Kuti	: Abeokuta,Nigeria


----------------------------


Marie Curie	: Warsaw,Poland


----------------------------


Albert Einstein	: Ulm,Germany


----------------------------


Nelson Mandela	: Mvezo, South Africa


----------------------------


Mahatma Gandhi	: Porbandar,India


----------------------------


Frida Kahlo	: Coyoacán,Mexico


----------------------------


Winston Churchill	: Woodstock, England


----------------------------


Che Guevara	: Rosario,Argentina


----------------------------


Bruce Lee	: San Francisco,United States


----------------------------


Serena Williams	: Saginaw,United States


----------------------------
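
Individual requests in a batch can fail even when the job as a whole completes, so a more defensive parser may be useful. The sketch below maps each `custom_id` to its answer, or to `None` when the request did not succeed. `extract_answers` is a hypothetical helper, and the `status_code` field reflects my understanding of the output format, so check it against a real output file.

```python
import json
from typing import Dict, Optional


def extract_answers(jsonl_text: str) -> Dict[str, Optional[str]]:
    """Map custom_id -> message content, or None if that request did not succeed."""
    answers: Dict[str, Optional[str]] = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        entry = json.loads(line)
        response = entry.get("response") or {}  # response may be null for failed requests
        if response.get("status_code") == 200:
            content = response["body"]["choices"][0]["message"]["content"]
            answers[entry["custom_id"]] = content
        else:
            answers[entry["custom_id"]] = None
    return answers
```

Looking answers up by key, for example `[answers.get(str(i)) for i in df.index]`, keeps each result attached to the right row regardless of the order of lines in the output file.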

Add the birthplaces to the DataFrame and display the updated data:

df['Place of Birth'] = birthplaces
df

Output:

	id	prompt	Place of Birth
0	1	Fela Kuti	Abeokuta,Nigeria
1	2	Marie Curie	Warsaw,Poland
2	3	Albert Einstein	Ulm,Germany
3	4	Nelson Mandela	Mvezo, South Africa
4	5	Mahatma Gandhi	Porbandar,India
5	6	Frida Kahlo	Coyoacán,Mexico
6	7	Winston Churchill	Woodstock, England
7	8	Che Guevara	Rosario,Argentina
8	9	Bruce Lee	San Francisco,United States
9	10	Serena Williams	Saginaw,United States
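
The model does not always follow the requested format exactly: some answers above have a space after the comma. A small post-processing step can normalize this by splitting the answer into separate columns. This is a sketch only; `split_birthplace` and the "City" and "Country" column names are hypothetical choices.

```python
import pandas as pd


def split_birthplace(df: pd.DataFrame, column: str = "Place of Birth") -> pd.DataFrame:
    """Return a copy of df with City and Country columns split at the first comma."""
    # partition always yields three columns: text before the comma, the comma, text after it
    parts = df[column].str.partition(",")
    out = df.copy()
    out["City"] = parts[0].str.strip()
    out["Country"] = parts[2].str.strip()
    return out
```

Rows whose answer contains no comma end up with the whole answer in "City" and an empty string in "Country", which makes them easy to spot and review by hand.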