Web Scraping

The Research Computing team recognizes the ever-growing need for researchers to harvest data from the web and is constantly on the lookout for the best tools for your scraping needs.  We currently partner with Mozenda to provide web scraping services for Wharton researchers.  We also have some suggested frameworks for building custom scrapers when Mozenda just won’t do.

What is Mozenda?

Mozenda (http://www.mozenda.com) is a hosted, WYSIWYG suite of tools that allows you to create and run scraping agents from Mozenda's cloud. The data is stored on their servers and can be downloaded in a variety of standard formats.

How does it work?

Using the Agent Builder (Windows 7/8/10 only!), you can quickly and easily “teach” an agent to perform certain actions on any website, then test and launch the agent to carry out those actions automatically. Agents can be scheduled to run at set intervals, and Mozenda can notify you once an agent is finished. You may then download the data in CSV, TSV, or XML format.
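Once you have exported the results, the data file can go straight into whatever analysis environment you prefer. As a minimal sketch, the snippet below loads a CSV export with the pandas library; the file name is a placeholder, and pandas is a separate, third-party Python package, not part of Mozenda.

    import pandas as pd

    # Load a CSV exported from Mozenda (the file name here is a placeholder).
    df = pd.read_csv("mozenda_export.csv")

    # Quick sanity checks on the scraped records.
    print(df.shape)   # rows and columns the agent collected
    print(df.head())  # first few records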

How do I get access?

E-mail us at research-computing@wharton.upenn.edu for access, and provide the following details:

  • What data will you be scraping? Please include URLs to all sources.
  • How many pages will you be scraping (a rough estimate is fine)?
  • How long will the project last?
  • What budget code should be charged for the pages scraped?

Cost is $1 per 1,000 pages scraped, in $1 increments, chargeable via journal to any Wharton budget code.

What if I have problems getting started?

Please contact Mozenda support.  They are extremely helpful and have successfully walked a number of Wharton researchers through various scraping scenarios.  They also have a wealth of demos and walkthroughs in their Help Center (http://www.mozenda.com/help/).

Custom Scraping Options

While Mozenda is a powerful, easy-to-use tool, there are times when your scraping needs are more complex and require custom programming.  In this case, we say good luck and godspeed! Seriously, while we can’t provide the programming for you, we can suggest tools to try once you have found the right person to work with.  Below is a brief list of open-source tools and frameworks:

  • Scrapy – open source scraping framework for Python.
  • scrape.py – Python module for scraping content from webpages.
  • ScraperWiki – platform and documentation for writing and hosting scrapers.
  • BeautifulSoup – Python library for parsing HTML and XML, a common building block for scraping scripts (see the short sketch below).
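
To give a sense of what a hand-rolled scraper looks like, here is a minimal sketch using the requests and BeautifulSoup libraries. The URL is a placeholder, and the tags you select will depend entirely on the structure of the pages you need to scrape.

    import requests
    from bs4 import BeautifulSoup

    # Placeholder URL -- replace with the page you actually need to scrape.
    url = "https://example.com/articles"

    # Fetch the page and fail loudly on HTTP errors.
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse the HTML and print every link's text and destination.
    soup = BeautifulSoup(response.text, "html.parser")
    for link in soup.find_all("a"):
        print(link.get_text(strip=True), link.get("href"))

Scrapy scales this same idea up to whole sites, adding crawling, scheduling, and export pipelines on top of the basic fetch-and-parse loop.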