Systematic Information Gathering with AI: Beyond Manual Collection

The Promise of Structured Data Collection

Information gathering across large, distributed sources has long been challenging to systematize effectively. Whether you're mapping expertise, compiling directories, or surveying landscapes in any domain, traditional automated approaches often struggle with inconsistent data extraction, parsing errors, and poor data quality.

Recent advances in AI capabilities change this equation fundamentally. By combining two key technologies - structured outputs and web search - we can now build systematic information gathering systems that maintain data quality and consistency at scale.

The Technical Foundation

The breakthrough comes from pairing OpenAI's structured output capabilities with web search tools and robust data modeling. Structured outputs ensure every response matches a predefined schema, eliminating the parsing errors that plague traditional AI text generation. Web search provides real-time access to current information across the internet. And Pydantic - an open-source Python library - provides the schema definition and validation layer that makes reliable data extraction possible.

This combination enables reliable, large-scale data collection that would be impractical to do manually.

A Concrete Example: Faculty Mapping

To demonstrate this approach, I built a tool for mapping faculty expertise across university departments. The challenge here mirrors many information gathering tasks: finding people working in specific areas across distributed, heterogeneous sources.

Why This Matters

Traditional discovery involves:

  • Manually browsing department websites
  • Checking individual faculty pages
  • Searching publication databases
  • Relying on personal networks

This approach misses interdisciplinary researchers and those applying techniques in unexpected contexts - exactly the people you often most want to find.

Clear Data Structures with Pydantic

At the core of the system are structured data models defined with Pydantic, which let us specify exactly what information we want and automatically validate that every response matches our schema.

from typing import List
from pydantic import BaseModel

class Person(BaseModel):
    name: str        # full name
    school: str      # school within the university
    department: str  # department, center, or lab
    title: str       # current title
    citation: str    # public URL supporting the entry
    reason: str      # one-sentence justification for inclusion

class PeopleResponse(BaseModel):
    people: List[Person]

This approach ensures every researcher entry contains the same fields, making downstream analysis straightforward.
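To see what the validation layer buys us, here is a standalone sketch (using the Pydantic v2 API; the Jane Doe record is invented for illustration):

```python
from typing import List
from pydantic import BaseModel, ValidationError

class Person(BaseModel):
    name: str
    school: str
    department: str
    title: str
    citation: str
    reason: str

# A well-formed record validates cleanly.
ok = Person.model_validate({
    "name": "Jane Doe",
    "school": "School of Engineering and Applied Science",
    "department": "Computer and Information Science",
    "title": "Professor",
    "citation": "https://example.edu/jdoe",
    "reason": "Leads a lab working on generative models",
})

# A record missing required fields raises ValidationError instead of
# silently producing an incomplete row.
try:
    Person.model_validate({"name": "Jane Doe", "school": "SEAS"})
except ValidationError as exc:
    print(f"rejected: {exc.error_count()} missing fields")
```

Malformed data fails loudly at the boundary rather than surfacing later as a broken row in the output file.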

School-by-School Coverage

Rather than trying to scrape an entire university all at once, the tool works through each school or department individually. This has a few benefits:

  • The scope stays manageable
  • We avoid accidentally skipping units
  • It's easy to scale or parallelize
  • Prompts can be customized to each school's context

PENN_SCHOOLS = [
    "College of Arts and Sciences",
    "School of Engineering and Applied Science", 
    "Wharton School",
    "School of Nursing",
    "Perelman School of Medicine",
    # ... additional schools
]

for school in args.schools:
    logging.info(f"Searching {school}")
    prompt = build_prompt(school, args.max_results)
    people = run_search(client, prompt, model=args.model)
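Since each school is an independent request, the per-school loop also parallelizes cleanly with the standard library. A minimal sketch of the idea - the search_school stand-in below fakes the API call so the structure is visible; in the real tool it would wrap build_prompt and run_search:

```python
from concurrent.futures import ThreadPoolExecutor

SCHOOLS = ["College of Arts and Sciences", "Wharton School", "School of Nursing"]

def search_school(school):
    # Stand-in for the real per-school call, which would be
    # run_search(client, build_prompt(school, max_results)).
    return [{"name": f"Example researcher at {school}", "school": school}]

# Each school is queried on its own worker thread; pool.map preserves
# the input order when collecting results.
with ThreadPoolExecutor(max_workers=4) as pool:
    all_people = []
    for people in pool.map(search_school, SCHOOLS):
        all_people.extend(people)
```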

Reliable Structured Output

The tool relies on OpenAI's structured output support, which sidesteps the usual messiness of parsing JSON out of free text. Every response is guaranteed to conform to the schema - no broken parsing, no malformed fields.

def run_search(client, prompt, model="gpt-4o-mini"):
    resp = client.responses.parse(
        model=model,
        tools=[{"type": "web_search"}],
        input=prompt,
        text_format=PeopleResponse,
        max_output_tokens=16384,
    )

    # Guard against an empty parse (e.g. truncated output)
    if not resp.output_parsed:
        return []

    return [person.model_dump() for person in resp.output_parsed.people]

The responses.parse() method with text_format=PeopleResponse ensures the API returns data that exactly matches our Pydantic schema.
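One practical wrinkle: interdisciplinary faculty can come back from more than one school's query, so a deduplication pass over the combined list is worth adding before saving. A minimal sketch keyed on normalized name (the choice of key is an assumption - names can collide, and a citation-URL key may be safer):

```python
def dedupe_people(people):
    # Keep the first occurrence of each name; later duplicates from
    # other schools' queries are dropped.
    seen = set()
    unique = []
    for person in people:
        key = person["name"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(person)
    return unique

records = [
    {"name": "Ethan Mollick", "school": "Wharton School"},
    {"name": "Ethan Mollick", "school": "College of Arts and Sciences"},
    {"name": "Kevin B. Johnson", "school": "Perelman School of Medicine"},
]
deduped = dedupe_people(records)
```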

Real-World Results

When tested at the University of Pennsylvania, the tool surfaced a broad range of faculty working on generative AI, well beyond just the computer science department. Here's a sample of the results:

Name | School | Department | Title | Reason
Sudeep Bhatia | College of Arts and Sciences | Psychology | Associate Professor | Researches human behavior using AI methodologies
Andrew Zahrt | College of Arts and Sciences | Chemistry | Assistant Professor | Studies AI's impact on chemistry and health technology
Ethan Mollick | Wharton School | Management | Associate Professor; Co-Director, Generative AI Lab | Co-directs the Generative AI Lab, focusing on AI's impact on education and work
Kevin B. Johnson | Perelman School of Medicine | Biomedical Informatics | University Professor | Leads the AI-4-AI Lab, focusing on AI applications in ambulatory care
Ryan S. Baker | Graduate School of Education | Learning Analytics and AI | Professor | Leads research on big data in education and integrates AI into teaching
Marion Leary | School of Nursing | Office of Nursing Research | Director of Innovation | Leads innovation initiatives integrating AI into nursing research and education
Matias del Campo | School of Design | Weitzman School of Design | Associate Professor of Architecture | Authored books exploring AI's impact on architecture

The tool found 60+ faculty across 8 different schools, demonstrating how the systematic approach surfaces expertise well beyond obvious departments. Each entry includes specific reasoning for inclusion, providing context about why each researcher was identified.

This was accomplished using gpt-4o-mini, OpenAI's fastest and most cost-effective model. More advanced reasoning models like o1 may deliver even better results for complex categorization and edge case identification.
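Because every record shares the same fields, summarizing the results takes only a few lines. A sketch of the per-school tally - the inline records here are stand-ins for the saved output:

```python
from collections import Counter

# In practice this would be json.load(open("people.json"))["people"];
# inline sample records stand in here.
people = [
    {"name": "A", "school": "Wharton School"},
    {"name": "B", "school": "Wharton School"},
    {"name": "C", "school": "School of Nursing"},
]
per_school = Counter(p["school"] for p in people)
for school, count in per_school.most_common():
    print(f"{school}: {count}")
```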

Beyond Faculty Mapping: Broader Applications

This technical framework applies to many systematic information gathering challenges:

Market Research

  • "List companies in {industry} using {technology} for {application}"
  • Map competitive landscapes, identify potential partners, track adoption patterns

Expert Networks

  • "Find professionals with experience in {domain} at companies in {location}"
  • Build speaker pools, identify consultants, map industry expertise

Regulatory Compliance

  • "Identify organizations affected by {regulation} in {jurisdiction}"
  • Map compliance requirements, find implementation examples

Grant and Funding Discovery

  • "List research groups working on {topic} that received funding from {agency}"
  • Identify collaboration opportunities, track research directions

Vendor and Service Mapping

  • "Find companies providing {service} to {industry} in {region}"
  • Build vendor databases, identify service gaps
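The only faculty-specific piece of the pipeline is the prompt, so templates like the ones above drop straight in. A hypothetical generalization - build_entity_prompt and its parameters are illustrative, not part of the original tool:

```python
def build_entity_prompt(entity_type, criteria, scope, fields, max_results=12):
    """Build a search prompt for any systematic-gathering task.

    entity_type: what to find ("companies", "research groups", ...)
    criteria:    qualifying condition ("using generative AI", ...)
    scope:       where to look ("in the insurance industry", ...)
    fields:      schema fields the model should fill in
    """
    field_lines = "\n".join(f'  "{f}"' for f in fields)
    return (
        f"List up to {max_results} {entity_type} {criteria} {scope}.\n"
        f"For each, return an object with:\n{field_lines}"
    )

prompt = build_entity_prompt(
    "companies", "using generative AI", "in the insurance industry",
    ["name", "headquarters", "citation", "reason"],
)
```

Pair a prompt like this with a matching Pydantic model and the rest of the pipeline - per-unit iteration, structured parsing, saving - stays unchanged.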

Final Thoughts

This isn't about replacing human expertise or eliminating the need for deeper research. It's about systematically covering ground that would be impractical to survey manually, giving you a comprehensive starting point for more focused investigation.

The combination of structured outputs and web search makes previously difficult information gathering tasks routine - whether you're mapping academic expertise, competitive landscapes, or any other distributed knowledge domain.

Full Code

#!/usr/bin/env python3
"""
UPenn Generative AI Faculty Search Demo (per-school version)
============================================================

Iterates over Penn's schools, issues one web-enabled OpenAI request per school,
then saves a combined roster of people who work on generative AI.  The output
can be **JSON** or **CSV** and contains a **reason** column that briefly
explains why each person qualified.

Run examples
------------
$ python upenn_ai_search.py                       # → people.json
$ python upenn_ai_search.py --format csv \
      --output penn_ai_people.csv --max-results 8

Set OPENAI_API_KEY in your environment before running.
"""

import os
import json
import csv
import logging
import argparse
from pathlib import Path
from typing import List

from openai import OpenAI
from pydantic import BaseModel

# ---------------------------------------------------------------------------
# Pydantic models for structured output
# ---------------------------------------------------------------------------

class Person(BaseModel):
    name: str
    school: str
    department: str
    title: str
    citation: str
    reason: str

class PeopleResponse(BaseModel):
    people: List[Person]

# ---------------------------------------------------------------------------
# Constant list of Penn schools (override with --schools if desired)
# ---------------------------------------------------------------------------

PENN_SCHOOLS = [
    "College of Arts and Sciences",
    "School of Engineering and Applied Science",
    "Wharton School",
    "School of Nursing",
    "School of Design",
    "School of Law",
    "Perelman School of Medicine",
    "Graduate School of Education",
    "School of Social Policy & Practice",
    "School of Veterinary Medicine",
    "Annenberg School for Communication",
    "School of Dental Medicine",
]

# ---------------------------------------------------------------------------
# Logging helper
# ---------------------------------------------------------------------------

def setup_logging():
    """Configure simple console logging."""
    logging.basicConfig(level=logging.INFO, format="%(levelname)s | %(message)s")

# ---------------------------------------------------------------------------
# Prompt builder
# ---------------------------------------------------------------------------

def build_prompt(school, max_results):
    """Return a prompt that targets *one* Penn school.

    The model must reply with JSON containing a **reason** field for each
    person explaining why they are included (e.g. "published LLM paper",
    "leads generative-AI lab").
    """
    return f"""

Task
----
List up to {max_results} individuals at the **{school}** of the University of
Pennsylvania who actively work with generative artificial intelligence.

For each person return an object with:
  "name"        - full name
  "school"      - always "{school}"
  "department"  - department, center, or lab
  "title"       - current title
  "citation"    - public URL proving the information
  "reason"      - one short sentence on why this person was included

Return exactly one JSON document, nothing else: an object with a "people" array
containing these objects.
"""

# ---------------------------------------------------------------------------
# OpenAI call
# ---------------------------------------------------------------------------

def run_search(client, prompt, model="gpt-4o-mini"):
    """Call Responses API with structured output and web_search."""
    resp = client.responses.parse(
        model=model,
        tools=[{"type": "web_search"}],
        input=prompt,
        text_format=PeopleResponse,
        max_output_tokens=16384,
    )
    
    if not resp.output_parsed:
        logging.warning("Empty response - skipping")
        return []
    
    return [person.model_dump() for person in resp.output_parsed.people]

# ---------------------------------------------------------------------------
# Saving helpers
# ---------------------------------------------------------------------------

def save_json(people, filename):
    """Write all records to a single JSON file."""
    with open(filename, "w", encoding="utf-8") as fh:
        json.dump({"people": people}, fh, indent=2, ensure_ascii=False)
    logging.info(f"Saved {len(people)} people to {filename} (json)")


def save_csv(people, filename):
    """Append records to CSV, adding header on first write."""
    fieldnames = ["name", "school", "department", "title", "citation", "reason"]
    write_header = not Path(filename).exists()
    with open(filename, "a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        for person in people:
            writer.writerow({key: person.get(key, "") for key in fieldnames})
    logging.info(f"Appended {len(people)} rows to {filename} (csv)")

# ---------------------------------------------------------------------------
# Main routine
# ---------------------------------------------------------------------------

def main():
    parser = argparse.ArgumentParser(description="UPenn generative-AI search demo")
    parser.add_argument("--max-results", type=int, default=12,
                        help="maximum people per school")
    parser.add_argument("--format", choices=["json", "csv"], default="json",
                        help="output format (default json)")
    parser.add_argument("--output", default="people.json",
                        help="output filename")
    parser.add_argument("--model", default="gpt-4o-mini",
                        help="OpenAI model name")
    parser.add_argument("--schools", nargs="*", default=PENN_SCHOOLS,
                        help="override list of schools to search")
    args = parser.parse_args()

    setup_logging()

    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise SystemExit("OPENAI_API_KEY not found")

    client = OpenAI(api_key=api_key)

    all_people = []
    for school in args.schools:
        logging.info(f"Searching {school}")
        prompt = build_prompt(school, args.max_results)
        people = run_search(client, prompt, model=args.model)
        if people:
            all_people.extend(people)
        else:
            logging.info(f"No results for {school}")

    if not all_people:
        logging.info("No people found across requested schools")
        return

    if args.format == "json":
        save_json(all_people, args.output)
    else:
        save_csv(all_people, args.output)

    logging.info("Done - results written to %s", args.output)


if __name__ == "__main__":
    main()