I frequently sing the praises of Python as the go-to language for research programming. Python is extremely fast in terms of human development time; there are typically more CPU cores available than research assistants with specialized programming skills. On that note, Python is also extremely fast in terms of compute time, thanks to the already optimized and compiled C/C++ code in modules such as numpy and scipy.
Enter IPython (soon to be renamed Jupyter). IPython provides a number of improvements over vanilla Python. My favorites are the improved interactive shell for the terminal and the web-based “notebooks” that offer a better way of quickly developing code.
IPython also makes it easy to scale code up in parallel. When working with IPython, one can start a “cluster” of “engines,” which are managed by a “controller.” Typically, one would launch one engine per CPU core on a system. The engines then accept work from the “client” IPython shell. On a personal system, this is all automated with the ipcluster command (a short example follows below). But how do we take advantage of this feature in an HPC environment that has many, many more CPUs behind a job scheduler? When submitting non-interactive jobs to a job queue, we cannot predict when a job will run or which compute nodes it will be given. Furthermore, we cannot have different jobs stomping on the setup and tear-down of each other’s IPython engines.
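On a personal machine, the whole round trip is only a few lines. As a minimal sketch (assuming IPython and its zmq dependency are installed, and a local cluster has already been started with "ipcluster start -n 4"), the client side looks like this:

from IPython.parallel import Client

rc = Client()        # connect to the running local cluster (default profile)
print rc.ids         # one id per engine, e.g. [0, 1, 2, 3]
dview = rc[:]        # a DirectView spanning all engines
print dview.map_sync(lambda x: x**2, range(8))

That is easy on a workstation we control; behind a shared job scheduler, the same start, connect, and stop steps have to happen inside a batch job on nodes we did not pick.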
So, we need a scriptable setup that allows for dynamic resources and simultaneous, non-conflicting clusters of engines. IPython does provide “profile” support for PBS and SGE job schedulers. I have adapted an SGE profile setup and a virtualenv to facilitate a single-script solution on the Wharton HPCC system.
#!/usr/bin/env python
### a single script solution to running parallel IPython with Univa Grid Engine (SGE)
### 2015-06-29 Gavin Burris, Wharton Research Computing
Let’s first configure the profile and virtualenv to make this possible, with the following terminal commands.
### create a virtualenv:
# qlogin
# virtualenv ~/projectA
# cd ~/projectA && source ~/projectA/bin/activate
# pip install ipython zmq virtualenvwrapper --upgrade
# ipython profile create sge --parallel
# echo "c.IPEngineApp.wait_for_url_file = 60" >>~/.ipython/profile_sge/ipengine_config.py
# echo "c.EngineFactory.timeout = 20" >>~/.ipython/profile_sge/ipengine_config.py
# echo "c.HubFactory.ip = '*'" >>~/.ipython/profile_sge/ipcontroller_config.py
# echo "c.HubFactory.engine_ip = '*'" >>~/.ipython/profile_sge/ipcontroller_config.py
# echo "c.IPClusterStart.controller_launcher_class = 'SGE'" >>~/.ipython/profile_sge/ipcluster_config.py
# echo "c.IPClusterEngines.engine_launcher_class = 'SGE'" >>~/.ipython/profile_sge/ipcluster_config.py
# exit
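As an optional sanity check (just one way to confirm the edits landed), grep the profile for the launcher settings that were appended above:

# grep launcher_class ~/.ipython/profile_sge/ipcluster_config.py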
Once we have set up the profile and virtualenv in our HPCC account, we get to the meat of the Python script. Most of this code is a wrapper for starting and stopping the IPython cluster engines in a fail-safe way.
### required modules
import subprocess
import os
import sys
import time
from IPython.parallel import Client

### start IPython cluster
def waitforp(mesg):
    """Block until the ipcluster start process (p) prints mesg on its stderr."""
    while True:
        line = p.stderr.readline()
        #print(line.strip())
        if mesg in line:
            break

# name the cluster after the grid engine job, so concurrent jobs do not conflict
jobname = os.environ['JOB_NAME']
jobid = str(os.environ['JOB_ID'])
cluster = jobname + jobid
engines = sys.argv[1]

pcmd = "ipcluster", "start", "--profile=sge", "--cluster-id=" + cluster, "-n", engines
print "Starting IPython cluster", cluster, "with", engines, "engines:", ' '.join(pcmd)
p = subprocess.Popen(pcmd, env=os.environ.copy(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
waitforp("Engines appear to have started successfully")

# wait for the controller to accept client connections
rc = None
while rc is None:
    try:
        rc = Client(profile='sge', cluster_id=cluster)
    except Exception:
        print "Waiting for ipcontroller..."
        time.sleep(10)

# wait until all requested engines have registered with the controller
while len(rc.ids) < int(engines):
    print "Waiting for ipengines..."
    time.sleep(10)
#print rc.ids
dview = rc[:]  # use all engines

### do some work ala http://ipython.org/ipython-doc/stable/parallel/parallel_multiengine.html#quick-and-easy-parallelism
serial_result = map(lambda x: x**10, range(32))
parallel_result = dview.map_sync(lambda x: x**10, range(32))
print serial_result
print serial_result == parallel_result

### stop IPython cluster
qcmd = "ipcluster", "stop", "--profile=sge", "--cluster-id=" + cluster
print "Stopping IPython cluster", cluster + ":", ' '.join(qcmd)
q = subprocess.Popen(qcmd, env=os.environ.copy(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
waitforp("Removing pid file")
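One optional refinement, not part of the original script but worth considering: wrap the work in try/finally so that the ipcluster stop command runs even if the work itself raises an exception. A minimal sketch, reusing the cluster, dview, and waitforp names defined above:

try:
    parallel_result = dview.map_sync(lambda x: x**10, range(32))
    print parallel_result
finally:
    # stop the IPython cluster even on failure, so engines do not linger in the queue
    qcmd = "ipcluster", "stop", "--profile=sge", "--cluster-id=" + cluster
    q = subprocess.Popen(qcmd, env=os.environ.copy(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    waitforp("Removing pid file")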
Once the above is saved to the ipcode.py script, we can submit it to Grid Engine. This script is a template for whatever work is required in the “do some work” section above. In this example, we simply compare the results of a serial versus a parallel run of a lambda function that raises x to the power of ten. Please see the IPython Parallel Multiengine documentation for other methods of dividing work across engines, including farming out Python function calls; a brief sketch follows.
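For example, a named function (rather than a lambda) can be farmed out with the DirectView used above, or with a load-balanced view. The slow_square function below is just a stand-in for real work, and this is a sketch of patterns from the documentation rather than part of the demo script:

def slow_square(x):
    # a stand-in for a real per-task computation; import inside so engines have the module
    import time
    time.sleep(1)
    return x * x

results = dview.map_sync(slow_square, range(32))   # split the list across all engines

lview = rc.load_balanced_view()                    # hand tasks to whichever engine is free
amr = lview.map(slow_square, range(32))            # returns immediately with an AsyncMapResult
print amr.get()                                    # block until every task has finished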
This single script can then be submitted to the job queue with qsub. Note the “8” on the command line, which tells our script to run eight engines. One ipcode client script, one ipcontroller, and eight ipengines add up to ten slots in the queue.
### submit job with:
# echo "source ~/projectA/bin/activate && ipython ipcode.py 8" |qsub -j y -N ipcode
# OR
# qsub ipcode.sh 8
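The ipcode.sh wrapper itself is not shown in this post; if you prefer the second form, a minimal version might look like the following, with the #$ directives standing in for the -j y -N ipcode flags used above (the file contents here are an assumption, not the demo code):

#!/bin/bash
### hypothetical ipcode.sh wrapper; mirrors the echo|qsub one-liner above
#$ -j y
#$ -N ipcode
source ~/projectA/bin/activate
ipython ipcode.py $1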
Please see /usr/local/demo/Python/IPython/ on HPCC for the demo code from this post.