This tutorial demonstrates how to use the OpenAI API’s batch endpoint to process many tasks at once, at a 50% cost saving, with results returned within a 24-hour completion window. The service is well suited to jobs that don’t require immediate responses.
Eligible researchers can email research-programming@wharton.upenn.edu to start the process of obtaining an OpenAI API key.
Import the necessary libraries:
import os
import json
import time

import pandas as pd
from dotenv import load_dotenv
from openai import OpenAI
Load environment variables and initialize the OpenAI client:
# Load environment variables from .env file
load_dotenv()

# Read API key from environment variables
api_key = os.getenv("OPENAI_API_KEY")

# Initialize OpenAI client
client = OpenAI(api_key=api_key)
Load the list of famous people from a CSV file and display the first 10 entries:
df = pd.read_csv('famous_people.csv')
df.head(10)
Output:
| | id | prompt |
|---|---|---|
| 0 | 1 | Fela Kuti |
| 1 | 2 | Marie Curie |
| 2 | 3 | Albert Einstein |
| 3 | 4 | Nelson Mandela |
| 4 | 5 | Mahatma Gandhi |
| 5 | 6 | Frida Kahlo |
| 6 | 7 | Winston Churchill |
| 7 | 8 | Che Guevara |
| 8 | 9 | Bruce Lee |
| 9 | 10 | Serena Williams |
Create one task per person for the batch endpoint. Each task is a complete chat completions request that asks for the person’s place of birth, and its custom_id records the DataFrame row it came from:
# Create an array of tasks where the prompt is to identify the person's place of birth
tasks = []

for index, row in df.iterrows():
    person_name = row['prompt']

    task = {
        "custom_id": str(index),  # custom_id must be a string
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "temperature": 0.1,
            "messages": [
                {
                    "role": "system",
                    "content": "You are a helpful assistant tasked with identifying the birthplaces of famous individuals."
                },
                {
                    "role": "user",
                    "content": f"""Identify the place of birth for {person_name}.
                    Format the result like `<city/town>,<country>`
                    Do not include any other information"""
                }
            ],
        }
    }

    tasks.append(task)
Inspect the first task:
tasks[0]
Output:
{'custom_id': '0',
 'method': 'POST',
 'url': '/v1/chat/completions',
 'body': {'model': 'gpt-4o-mini',
  'temperature': 0.1,
  'messages': [{'role': 'system',
    'content': 'You are a helpful assistant tasked with identifying the birthplaces of famous individuals.'},
   {'role': 'user',
    'content': 'Identify the place of birth for Fela Kuti. \n Format the result like `<city/town>,<country>`\n Do not include any other information'}]}}
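Before writing the tasks to disk, it can help to check them locally, since a malformed line is otherwise only reported after the file has been uploaded. The sketch below is one possible check, not part of the OpenAI SDK: `validate_tasks` is a hypothetical helper, and the rules it applies (string and unique `custom_id`, `POST` method, `url` matching the batch endpoint, a `model` in the body) are assumptions drawn from the task shape used above.

```python
def validate_tasks(tasks, endpoint="/v1/chat/completions"):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    seen_ids = set()
    for position, task in enumerate(tasks):
        custom_id = task.get("custom_id")
        if not isinstance(custom_id, str):
            problems.append(f"task {position}: custom_id must be a string")
        elif custom_id in seen_ids:
            problems.append(f"task {position}: duplicate custom_id {custom_id!r}")
        else:
            seen_ids.add(custom_id)
        if task.get("method") != "POST":
            problems.append(f"task {position}: method should be 'POST'")
        if task.get("url") != endpoint:
            problems.append(f"task {position}: url should be {endpoint!r}")
        if "model" not in task.get("body", {}):
            problems.append(f"task {position}: body is missing 'model'")
    return problems


# Hypothetical sample data: one well-formed task and one with two mistakes
good = {"custom_id": "0", "method": "POST", "url": "/v1/chat/completions",
        "body": {"model": "gpt-4o-mini", "messages": []}}
bad = {"custom_id": 1, "method": "POST", "url": "/v1/embeddings",
       "body": {"model": "gpt-4o-mini", "messages": []}}

for problem in validate_tasks([good, bad]):
    print(problem)
```

In the tutorial's flow, calling `validate_tasks(tasks)` on the list built above and stopping if the result is non-empty would be a natural place to use it.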
Save the tasks to a JSONL file:
# Creating the jsonl file
file_name = "batch_tasks_movies.jsonl"

with open(file_name, 'w') as file:
    for obj in tasks:
        file.write(json.dumps(obj) + '\n')
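JSONL requires exactly one JSON object per line, which works here because `json.dumps` escapes the newlines inside the prompt text. A quick round trip confirms that nothing was split or lost. This is a minimal sketch: `write_jsonl`, `read_jsonl`, the sample records, and the temporary file name are illustrative and not taken from the tutorial.

```python
import json
import os
import tempfile


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of objects, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Hypothetical records: one with an embedded newline, one with a non-ASCII character
records = [
    {"custom_id": "0", "body": {"messages": [{"content": "line one\nline two"}]}},
    {"custom_id": "1", "body": {"messages": [{"content": "Coyoacán"}]}},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "round_trip.jsonl")
    write_jsonl(path, records)
    with open(path, encoding="utf-8") as handle:
        line_count = sum(1 for _ in handle)
    restored = read_jsonl(path)

# Both checks should hold: one line per record, and identical content after reloading
print(line_count == len(records), restored == records)
```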
Upload the JSONL file to the OpenAI API:
# Upload to OpenAI file API (the with block closes the file after the upload)
with open(file_name, "rb") as f:
    batch_file = client.files.create(
        file=f,
        purpose="batch"
    )

print(batch_file)
Output:
FileObject(id='file-qD2rLGWQOQWvMIqSGc8vru7h', bytes=4896, created_at=1729607937, filename='batch_tasks_movies.jsonl', object='file', purpose='batch', status='processed', status_details=None)
Start the batch job:
# Start batch job
batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)
Monitor the batch job until completion:
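# Poll until the job reaches a terminal state. Checking only for
# "completed" would loop forever if the job failed, expired, or was cancelled.
while True:
    batch_job = client.batches.retrieve(batch_job.id)
    if batch_job.status == "completed":
        print(f"job {batch_job.id} is done")
        break
    if batch_job.status in ("failed", "expired", "cancelled"):
        raise RuntimeError(f"job {batch_job.id} ended with status {batch_job.status}")
    time.sleep(10)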
Output:
job batch_6717b90334108190970786094985c0de is done
Retrieve and process the results:
# Download the output file and parse one JSON object per line
result_file_id = batch_job.output_file_id
raw_output = client.files.content(result_file_id).content.decode('utf-8')
results_list = [json.loads(line) for line in raw_output.strip().split("\n")]

# Output lines may not follow the input order, so sort by custom_id
# to keep the answers aligned with the DataFrame rows
results_list.sort(key=lambda res: int(res['custom_id']))

birthplaces = []
for res in results_list:
    index = res['custom_id']  # custom_id holds the DataFrame row index
    answer = res['response']['body']['choices'][0]['message']['content']
    person = df.iloc[int(index)]['prompt']
    print(f"{person}\t: {answer}")
    print("\n\n----------------------------\n\n")
    birthplaces.append(answer)
Output:
Fela Kuti : Abeokuta,Nigeria

----------------------------

Marie Curie : Warsaw,Poland

----------------------------

Albert Einstein : Ulm,Germany

----------------------------

Nelson Mandela : Mvezo, South Africa

----------------------------

Mahatma Gandhi : Porbandar,India

----------------------------

Frida Kahlo : Coyoacán,Mexico

----------------------------

Winston Churchill : Woodstock, England

----------------------------

Che Guevara : Rosario,Argentina

----------------------------

Bruce Lee : San Francisco,United States

----------------------------

Serena Williams : Saginaw,United States

----------------------------
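The loop above assumes every request succeeded. For larger batches, a more defensive reading of the output may be worthwhile: key each answer by `custom_id` and set aside any line that carries an error or a non-200 status. The sketch below works on plain text so it can be tried without calling the API. `parse_batch_output` is a hypothetical helper, the sample lines are made up for illustration, and the `error` and `status_code` fields are assumptions about the output line layout beyond the `response.body.choices` path used above.

```python
import json


def parse_batch_output(jsonl_text):
    """Return (answers, failures), both keyed by custom_id."""
    answers, failures = {}, {}
    for line in jsonl_text.split("\n"):
        if not line.strip():
            continue
        entry = json.loads(line)
        custom_id = entry["custom_id"]
        response = entry.get("response") or {}
        # Assumption: a failed request has an "error" object or a non-200 status
        if entry.get("error") or response.get("status_code") != 200:
            failures[custom_id] = entry.get("error") or response
            continue
        answers[custom_id] = response["body"]["choices"][0]["message"]["content"]
    return answers, failures


def fake_line(custom_id, content):
    """Build an illustrative successful output line (not real API output)."""
    body = {"choices": [{"message": {"content": content}}]}
    return json.dumps({"custom_id": custom_id,
                       "response": {"status_code": 200, "body": body},
                       "error": None})


# Deliberately out of order, with one illustrative failure
sample_output = "\n".join([
    fake_line("1", "Warsaw,Poland"),
    fake_line("0", "Abeokuta,Nigeria"),
    json.dumps({"custom_id": "2", "response": None,
                "error": {"code": "hypothetical_error", "message": "illustrative failure"}}),
])

answers, failures = parse_batch_output(sample_output)

# Rebuild the answers in row order; rows without an answer become None
ordered = [answers.get(str(i)) for i in range(3)]
print(ordered)
print(sorted(failures))
```

Looking answers up by `custom_id` rather than by position keeps each answer attached to the right row even when some requests fail or the lines come back in a different order.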
Add the birthplaces to the DataFrame and display the updated data:
df['Place of Birth'] = birthplaces
df
Output:
| | id | prompt | Place of Birth |
|---|---|---|---|
| 0 | 1 | Fela Kuti | Abeokuta,Nigeria |
| 1 | 2 | Marie Curie | Warsaw,Poland |
| 2 | 3 | Albert Einstein | Ulm,Germany |
| 3 | 4 | Nelson Mandela | Mvezo, South Africa |
| 4 | 5 | Mahatma Gandhi | Porbandar,India |
| 5 | 6 | Frida Kahlo | Coyoacán,Mexico |
| 6 | 7 | Winston Churchill | Woodstock, England |
| 7 | 8 | Che Guevara | Rosario,Argentina |
| 8 | 9 | Bruce Lee | San Francisco,United States |
| 9 | 10 | Serena Williams | Saginaw,United States |
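The model did not follow the requested `<city/town>,<country>` format exactly: some rows have a space after the comma ("Mvezo, South Africa"). If the city and country are needed separately, a small normalizing step can help. The following is a sketch under the assumption that the last comma separates the country from the rest; `split_birthplace` is a hypothetical helper, not part of pandas or the OpenAI SDK.

```python
def split_birthplace(value):
    """Split 'city,country' on the last comma and trim surrounding whitespace.

    Returns (city, country); country is None when no comma is present.
    """
    city, separator, country = value.rpartition(",")
    if not separator:
        return value.strip(), None
    return city.strip(), country.strip()


# Values copied from the table above
for raw in ["Abeokuta,Nigeria", "Mvezo, South Africa", "San Francisco,United States"]:
    print(split_birthplace(raw))
```

With the DataFrame from this tutorial, the two parts could be stored with list comprehensions such as `df['City'] = [split_birthplace(v)[0] for v in df['Place of Birth']]`, and likewise for the country with index `1`.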