This tutorial demonstrates how to use the OpenAI API’s batch endpoint to process many tasks in a single asynchronous job. Batch requests are billed at a 50% discount and are returned within a 24-hour completion window, which makes the service well suited to jobs that don’t require immediate responses.
Eligible researchers can email research-programming@wharton.upenn.edu to start the process of obtaining an OpenAI API key.
Import the necessary libraries:
import os
import json
import time
import pandas as pd
from dotenv import load_dotenv
from openai import OpenAI
Load environment variables and initialize the OpenAI client:
# Load environment variables from .env file
load_dotenv()
# Read API key from environment variables
api_key = os.getenv("OPENAI_API_KEY")
# Initialize OpenAI client
client = OpenAI(api_key=api_key)
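A missing or empty key is easier to diagnose if it is caught explicitly before the client is created. The sketch below shows one way to do that; `require_env` is a hypothetical helper name introduced here for illustration and is not part of the OpenAI SDK or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper: fails fast when the variable is unset or empty.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment"
        )
    return value
```

With this in place, the key could be read as `api_key = require_env("OPENAI_API_KEY")` after `load_dotenv()` has run.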
Load the list of famous people from a CSV file and display the first 10 entries:
df = pd.read_csv('famous_people.csv')
df.head(10)
Output:
| | id | prompt |
|---|---|---|
| 0 | 1 | Fela Kuti |
| 1 | 2 | Marie Curie |
| 2 | 3 | Albert Einstein |
| 3 | 4 | Nelson Mandela |
| 4 | 5 | Mahatma Gandhi |
| 5 | 6 | Frida Kahlo |
| 6 | 7 | Winston Churchill |
| 7 | 8 | Che Guevara |
| 8 | 9 | Bruce Lee |
| 9 | 10 | Serena Williams |
Create tasks for the batch endpoint to identify each person’s place of birth:
# Create an array of tasks where the prompt is to identify the person's place of birth
tasks = []
for index, row in df.iterrows():
    person_name = row['prompt']
    task = {
        "custom_id": str(index),  # custom_id must be a string
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "temperature": 0.1,
            "messages": [
                {
                    "role": "system",
                    "content": "You are a helpful assistant tasked with identifying the birthplaces of famous individuals."
                },
                {
                    "role": "user",
                    "content": f"""Identify the place of birth for {person_name}.
                    Format the result like `<city/town>,<country>`
                    Do not include any other information"""
                }
            ],
        }
    }
    tasks.append(task)
Inspect the first task:
tasks[0]
Output:
{'custom_id': '0',
'method': 'POST',
'url': '/v1/chat/completions',
'body': {'model': 'gpt-4o-mini',
'temperature': 0.1,
'messages': [{'role': 'system',
'content': 'You are a helpful assistant tasked with identifying the birthplaces of famous individuals.'},
{'role': 'user',
'content': 'Identify the place of birth for Fela Kuti. \n Format the result like `<city/town>,<country>`\n Do not include any other information'}]}}
Save the tasks to a JSONL file:
# Creating the jsonl file
file_name = "batch_tasks_movies.jsonl"
with open(file_name, 'w') as file:
    for obj in tasks:
        file.write(json.dumps(obj) + '\n')
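Before uploading, it can be worth checking that the file is well formed, since each line must be a standalone JSON object and each `custom_id` is what later ties a result back to its input row. The following is a minimal sketch of such a check, not an official validator. `validate_batch_file` and `REQUIRED_KEYS` are hypothetical names, and the `max_requests` default of 50,000 is an assumption based on the per-batch limit documented by OpenAI at the time of writing, so verify it against the current documentation.

```python
import json

# Keys that every task in this tutorial's batch file carries
REQUIRED_KEYS = {"custom_id", "method", "url", "body"}


def validate_batch_file(path: str, max_requests: int = 50_000) -> int:
    """Check a batch input file and return the number of requests it holds.

    max_requests is an assumed limit; confirm it in the current API docs.
    """
    seen_ids = set()
    with open(path, "r", encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            if not line.strip():
                raise ValueError(f"line {line_no}: blank line")
            task = json.loads(line)  # raises json.JSONDecodeError if malformed
            if not isinstance(task, dict):
                raise ValueError(f"line {line_no}: expected a JSON object")
            missing = REQUIRED_KEYS - task.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            custom_id = task["custom_id"]
            if not isinstance(custom_id, str):
                raise ValueError(f"line {line_no}: custom_id must be a string")
            if custom_id in seen_ids:
                raise ValueError(f"line {line_no}: duplicate custom_id {custom_id!r}")
            seen_ids.add(custom_id)
    if len(seen_ids) > max_requests:
        raise ValueError(f"{len(seen_ids)} requests exceed the limit of {max_requests}")
    return len(seen_ids)
```

Calling `validate_batch_file(file_name)` on the file written above should return the number of tasks, which is 10 for this dataset.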
Upload the JSONL file to the OpenAI API:
# Upload to OpenAI file API
# Open the file in a context manager so the handle is closed after the upload
with open(file_name, "rb") as f:
    batch_file = client.files.create(
        file=f,
        purpose="batch"
    )
print(batch_file)
Output:
FileObject(id='file-qD2rLGWQOQWvMIqSGc8vru7h', bytes=4896, created_at=1729607937, filename='batch_tasks_movies.jsonl', object='file', purpose='batch', status='processed', status_details=None)
Start the batch job:
# Start batch job
batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)
Monitor the batch job until completion:
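# Poll until the job reaches a terminal state. Failed, expired, or cancelled
# jobs never become "completed", so checking only for that would loop forever.
while True:
    batch_job = client.batches.retrieve(batch_job.id)
    if batch_job.status == "completed":
        print(f"job {batch_job.id} is done")
        break
    if batch_job.status in ("failed", "expired", "cancelled"):
        raise RuntimeError(f"job {batch_job.id} ended with status {batch_job.status}")
    time.sleep(10)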
Output:
job batch_6717b90334108190970786094985c0de is done
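For longer-running jobs, the polling logic can be wrapped in a reusable function with a timeout. The sketch below is one possible shape, not the SDK's own mechanism. `wait_for_batch` and `TERMINAL_STATUSES` are hypothetical names, and the function takes a `retrieve` callable rather than the client itself so that it can be exercised without network access.

```python
import time
from typing import Any, Callable

# Batch statuses after which the job will not change any further
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(
    retrieve: Callable[[], Any],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 24 * 60 * 60,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Any:
    """Call retrieve() until the returned job has a terminal status.

    retrieve must return an object with a .status attribute. Raises
    TimeoutError if no terminal status is seen within timeout_seconds.
    """
    deadline = clock() + timeout_seconds
    while True:
        job = retrieve()
        if job.status in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"batch still {job.status!r} after {timeout_seconds}s")
        sleep(poll_seconds)
```

In this tutorial it could be called as `batch_job = wait_for_batch(lambda: client.batches.retrieve(batch_job.id))`, after which the caller checks `batch_job.status` before reading the output file.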
Retrieve and process the results:
# Download the output file and parse each JSONL line
results_list = []
result_file_id = batch_job.output_file_id
result = client.files.content(result_file_id).content
result = result.decode('utf-8')
result_entries = result.strip().split("\n")
for r in result_entries:
    results_list.append(json.loads(r))
# Output lines are not guaranteed to follow the input order, so sort by
# custom_id (the DataFrame index) before collecting the answers
results_list.sort(key=lambda res: int(res['custom_id']))
birthplaces = []
for res in results_list:
    # custom_id holds the DataFrame index assigned when the task was created
    index = res['custom_id']
    result = res['response']['body']['choices'][0]['message']['content']
    row = df.iloc[int(index)]
    person = row['prompt']
    print(f"{person}\t: {result}")
    print("\n\n----------------------------\n\n")
    birthplaces.append(result)
Output:
Fela Kuti : Abeokuta,Nigeria
----------------------------
Marie Curie : Warsaw,Poland
----------------------------
Albert Einstein : Ulm,Germany
----------------------------
Nelson Mandela : Mvezo, South Africa
----------------------------
Mahatma Gandhi : Porbandar,India
----------------------------
Frida Kahlo : Coyoacán,Mexico
----------------------------
Winston Churchill : Woodstock, England
----------------------------
Che Guevara : Rosario,Argentina
----------------------------
Bruce Lee : San Francisco,United States
----------------------------
Serena Williams : Saginaw,United States
----------------------------
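Individual requests in a batch can fail even when the job as a whole completes, and output lines may arrive in a different order from the input. A more defensive way to read the output is to key every answer by its `custom_id` and collect failures separately. The sketch below assumes the documented output line shape (`custom_id`, `response` with `status_code` and `body`, and `error`); `parse_batch_output` is a hypothetical helper name.

```python
import json
from typing import Any, Dict, Tuple


def parse_batch_output(text: str) -> Tuple[Dict[str, str], Dict[str, Any]]:
    """Split batch output JSONL into answers and failures, keyed by custom_id."""
    answers: Dict[str, str] = {}
    failures: Dict[str, Any] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        entry = json.loads(line)
        custom_id = entry["custom_id"]
        response = entry.get("response") or {}
        # Treat an explicit error or a non-200 status as a failed request
        if entry.get("error") or response.get("status_code") != 200:
            failures[custom_id] = entry.get("error") or response.get("body")
            continue
        answers[custom_id] = response["body"]["choices"][0]["message"]["content"]
    return answers, failures
```

Given the decoded output text, `answers, failures = parse_batch_output(result)` yields a dictionary that can be mapped onto the DataFrame by index, for example with `df.index.map(lambda i: answers.get(str(i)))`, so rows whose request failed end up empty instead of shifting the remaining values.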
Add the birthplaces to the DataFrame and display the updated data:
df['Place of Birth'] = birthplaces
df
Output:
| | id | prompt | Place of Birth |
|---|---|---|---|
| 0 | 1 | Fela Kuti | Abeokuta,Nigeria |
| 1 | 2 | Marie Curie | Warsaw,Poland |
| 2 | 3 | Albert Einstein | Ulm,Germany |
| 3 | 4 | Nelson Mandela | Mvezo, South Africa |
| 4 | 5 | Mahatma Gandhi | Porbandar,India |
| 5 | 6 | Frida Kahlo | Coyoacán,Mexico |
| 6 | 7 | Winston Churchill | Woodstock, England |
| 7 | 8 | Che Guevara | Rosario,Argentina |
| 8 | 9 | Bruce Lee | San Francisco,United States |
| 9 | 10 | Serena Williams | Saginaw,United States |
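The model does not follow the requested `<city/town>,<country>` format perfectly: some rows have a space after the comma ("Mvezo, South Africa") and others do not ("Abeokuta,Nigeria"). If the city and country are needed separately, a small normalizing step helps. This is a minimal sketch; `split_birthplace` is a hypothetical helper, and it assumes the country is whatever follows the last comma.

```python
from typing import Optional, Tuple


def split_birthplace(value: str) -> Tuple[str, Optional[str]]:
    """Split 'City,Country' on the last comma and trim surrounding whitespace.

    Returns (value, None) when the string contains no comma.
    """
    city, sep, country = value.rpartition(",")
    if not sep:
        return value.strip(), None
    return city.strip(), country.strip()
```

Applied to the column above, one option is `df[['City', 'Country']] = pd.DataFrame(df['Place of Birth'].map(split_birthplace).tolist(), index=df.index)`, where the `City` and `Country` column names are illustrative.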
