One-Time Environment Setup
Log on to HPCC via MobaXterm to set up your environment. These steps only need to be done once, not once per run.
$ qlogin
$ mkvirtualenv --no-site-packages GDELT
$ pip install -U pip setuptools
$ pip install -U -r /data/GDELT/code2/requirements.txt
$ exit
Per-Run Setup
Log on to HPCC via MobaXterm to set up and do a run. These steps need to be done for each run. We’ll call this example run ‘runA’. Copy the code into a new ‘runA’ directory and change into it:
$ rsync -av /data/GDELT/code2/ runA
$ cd runA
Let’s take a look at the files in the new run directory, and get ready for a run!
Preparing a Run
INFILE.csv
INFILE.csv holds the information unique to each point that you want to run queries on.
Edit (or upload) your INFILE.csv in the runA directory. Any file name is fine; INFILE.csv is the default, and the name can be changed in config.py (see pointfile, below). INFILE.csv has the following format; two example Unique IDs for this tutorial are included:
UniqueId | SaveQueryResults | SaveMatrices | SaveEdges | CountryCode | Latitude | Longitude | Distance | In_Miles | ActionGeo_Type | StartDate | EndDate | Frequency
Swat_College | True | False | True | | 39.9062176 | -75.3557416 | 1 | False | 3,4 | 20050101 | 20091231 | annually
Vance_Hall | True | False | True | | 39.9483068 | -75.1953933 | 1 | False | 3,4 | 20050101 | 20091231 | annually
Note that CountryCode is left blank for both example points.
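If you prefer to generate the point file with a script rather than editing it by hand, a minimal sketch using Python’s csv module follows (assuming the column order above and that the file begins with the header row shown; the literal values are the Swat_College example):

import csv

# Column order as documented above (CountryCode left blank for this point).
FIELDS = ['UniqueId', 'SaveQueryResults', 'SaveMatrices', 'SaveEdges',
          'CountryCode', 'Latitude', 'Longitude', 'Distance', 'In_Miles',
          'ActionGeo_Type', 'StartDate', 'EndDate', 'Frequency']

points = [
    {'UniqueId': 'Swat_College', 'SaveQueryResults': 'True', 'SaveMatrices': 'False',
     'SaveEdges': 'True', 'CountryCode': '', 'Latitude': '39.9062176',
     'Longitude': '-75.3557416', 'Distance': '1', 'In_Miles': 'False',
     'ActionGeo_Type': '3,4', 'StartDate': '20050101', 'EndDate': '20091231',
     'Frequency': 'annually'},
]

with open('INFILE.csv', 'w') as fh:
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(points)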
PERMUTATIONS.csv
PERMUTATIONS.csv has one line for each ‘permutation’ that you want to run for each Unique ID’s dataset.
Edit (or upload) a new PERMUTATIONS.csv in the runA directory. PERMUTATIONS.csv has the following format; included are a few example permutations that each Unique ID’s events will be filtered against before stats are run:
Actor1 | Actor1TypeCode | Actor1Type1Code | Actor1Type2Code | Actor1Type3Code | Actor2 | Actor2TypeCode | Actor2Type1Code | Actor2Type2Code | Actor2Type3Code | QuadClass
 | | | | | | | | | |
BUS | 1 | | | | | | | | |
BUS | 2 | | | | | | | | |
BUS,MNC | | | | | | | | | |
COP,GOV,JUD,MIL,SPY,LEG | 1 | | | | | | | | |
Note the entirely blank row at the top. That will get you stats for all matching points with no additional filtering.
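To make the filtering idea concrete, here is a rough sketch of how one permutation row could be applied to an event. This is illustrative only: the project’s actual filter code, the event field names, and the “comma-separated values mean any of these codes” convention are assumptions, not taken from the GDELT worker source.

def matches_permutation(event, perm):
    """Return True if an event passes every non-blank filter in a permutation row.

    `event` and `perm` are dicts keyed by the column names shown above.
    A blank cell in the permutation means "no constraint on this column",
    so the all-blank row matches every event.
    """
    for column, wanted in perm.items():
        if not wanted:                                  # blank cell: no constraint
            continue
        allowed = {code.strip() for code in wanted.split(',')}
        if event.get(column, '') not in allowed:
            return False
    return True

# Hypothetical column placement for the BUS,MNC row:
perm = {'Actor1Type1Code': 'BUS,MNC'}
event = {'Actor1Type1Code': 'MNC'}
print(matches_permutation(event, perm))                 # -> True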
config.py
The config.py file, just like it sounds, is the configuration for the run. Generally, the defaults are pretty good:
import socket
import redis
import os

# redis server setup
redishost = socket.gethostname()
redisport = 6379
redispass = 'THISISAPASSWORD-MAKEYOUROWN'
r = redis.Redis(host=redishost, port=redisport, db=0, password=redispass)

# master proc setup
sleep_time = 30
flush_data = True
mailto = os.getenv('USER') + '@wharton.upenn.edu'
myNULL = 'NaN'
output_stats_file = 'output_stats_file.csv'
pointfile = 'INFILE.csv'

# worker proc setup
worker_name = 'G_WORKER'
worker_queue = 'short.q'
max_workers = 16
max_worker_RAM = '2G'
outdir = 'EdgeData'
eventdir = 'EventData'
overwrite_events = True

# debug
DEBUG = False
In the ‘# redis server setup’ section, you’ll want to change ‘redispass’ so that others can’t access, delete, or modify your data and results. Everything else in that section is static and shouldn’t be changed.
In the ‘# master proc setup’ section, you’ll generally not need to change anything, but:
- sleep_time: generally static. Useful to extend if there are problems with the master starting too many workers
- flush_data: when Redis starts, if there’s an existing redis.db file in the redis directory, Redis will load it. Setting flush_data to ‘True’ will ‘flush’ (clear) the DB so the run starts with ‘new work’ (see the sketch after this list). Useful for debugging and reruns, but it’s generally easier to just create a new ‘run’ directory
- mailto: generally static. This is where ‘something is wrong’ e-mails go, and defaults to your username@wharton.upenn.edu
- myNULL: the value used to fill empty cells in the output file (NaN is ‘Pythonic’)
- output_stats_file: the main output stats file
- pointfile: the INPUT file (see INFILE.csv, above)
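For reference, the flush behaviour amounts to roughly the following sketch, using the ‘r’ connection object and ‘flush_data’ flag defined in config.py (the exact call site inside gdelt_master.py may differ):

import config

# If flush_data is True, clear any redis.db state that was loaded at startup
# so the run begins with an empty database.
if config.flush_data:
    config.r.flushdb()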
In the ‘# worker proc setup’ section, you might need to change a few things, depending on how ‘big’ the number of events near a point might be (radius & location are the important bits):
- worker_name = ‘G_WORKER’ : no need to change this
- worker_queue = ‘short.q’ : if these are ‘big’ queries, change to ‘all.q’, so workers can run for more than 4 hours
- max_workers = 16 : the max for the ‘short.q’ is 256 (I don’t recommend more than 128, though), and the max for the ‘all.q’ is 64. The master will only start workers if there are Unique IDs still to be processed; in other words, if max_workers = 100 but there are only 2 Unique IDs pending (as in our example INFILE.csv), the master will only start 2 workers (see the sketch after these notes)
- max_worker_RAM = ‘2G’ : again, if these are very big queries, you might up this to ’10G’ (or re-think your queries, because they’re going to be huge!)
- outdir = ‘EdgeData’ : the directory where the edge data will be written
- eventdir = ‘EventData’ : the directory where the raw query CSVs will be written
- overwrite_events = True : should I overwrite the raw query CSVs if I’m re-running a job?
- DEBUG: if set to True, the master process will write a bit more output. Useful if you need to … debug! 🙂
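The “only as many workers as there are pending Unique IDs” rule mentioned above amounts to something like this sketch. The function and variable names are illustrative, not taken from gdelt_master.py, but the numbers line up with the ‘workers: max: 16 running: 0 needed: 2’ line in the master log shown later:

def workers_needed(pending_points, running_workers, max_workers):
    """Never start more workers than there are Unique IDs left to process."""
    free_slots = max(max_workers - running_workers, 0)
    return min(pending_points, free_slots)

# With max_workers = 16 and the two-point example INFILE.csv:
print(workers_needed(pending_points=2, running_workers=0, max_workers=16))  # -> 2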
gdelt_master.sh
#!/bin/bash
#$ -N G_MASTER
##$ -l m_mem_free=20G
#$ -o G_OUTPUT
#$ -j y
#$ -m e

workon GDELT
python gdelt_master.py
Note that ‘-l m_mem_free=20G’ is commented out with the two pound signs. It’s there as a reminder: if it’s a really big job (hundreds of points), uncomment that line (remove one of the pound signs) so the master can use more RAM during the run.
If you don’t want to receive an e-mail after job completion, comment out or remove the ‘#$ -m e’ line.
Starting a Run
You’re ready to go! Starting the run is as easy as:
$ qsub gdelt_master.sh
Monitoring a Run
To see that a run has properly started, use the ‘qstat’ command. You’ll initially see the ‘G_MASTER’ job, and then once that’s up and Redis is loaded, the master will start the requested / required workers, in this case 2 (one for each UniqueID).
$ qstat
job-ID  prior    name      user     state  submit/start at      queue                           jclass  slots  ja-task-ID
------------------------------------------------------------------------------------------------------------------------------------------------
 777335  0.55637  G_MASTER  hughmac  r      06/13/2017 10:37:28  all.q@hpcc004.wharton.upenn.ed          1
 777336  0.55637  G_WORKER  hughmac  r      06/13/2017 10:37:34  short.q@hpcc015.wharton.upenn.          1      1
 777336  0.55637  G_WORKER  hughmac  r      06/13/2017 10:37:34  short.q@hpcc015.wharton.upenn.          1      2
You can also look in the log files in G_OUTPUT. A complete run should have output like the following:
$ cat G_OUTPUT/G_MASTER.o777335
got 5 total permutations
Starting redis at 2017-06-13-10:37:33
redis up at 2017-06-13-10:37:34
2 points loaded to redis on hpcc004.wharton.upenn.edu (6379)
5 recs loaded to redis on hpcc004.wharton.upenn.edu (6379)
saving redis DB (initial state)
points to run: 2 total will be: 2
workers: max: 16 running: 0 needed: 2
all jobs done, saving redis
writing output
Stopping redis at 2017-06-13-10:38:04
done
A complete worker log should look like:
$ cat G_OUTPUT/G_WORKER.o777336.1
SEARCH: UID: Swat_College, Cntry: , Lat: 39.9062176, Lon: -75.3557416, Distance: 1.0 (miles? False), Dates: 20050101 to 20091231 (annually)
got 68 records from MySQL
reduced to 68 records with haversine
AllActors len: 47, AllStartDates len: 5
writing Undirected Graph to EdgeData/Swat_College_Graph.csv
writing Directed DiGraph to EdgeData/Swat_College_DiGraph.csv
Swat_College done job
no more jobs in queue
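The ‘reduced to 68 records with haversine’ line refers to a great-circle distance filter applied to the events MySQL returned. A minimal stand-alone version of that calculation looks like the sketch below; the function name and the radius constants are illustrative, not necessarily what the worker code uses:

from math import radians, sin, cos, asin, sqrt

def haversine(lat1, lon1, lat2, lon2, miles=False):
    """Great-circle distance between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    radius = 3956 if miles else 6371      # Earth radius in miles or kilometres
    return 2 * radius * asin(sqrt(a))

# Keep only events within the point's Distance (1 unit here; In_Miles == False -> km):
# events = [e for e in events
#           if haversine(39.9062176, -75.3557416, e['lat'], e['lon']) <= 1.0]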
The workers will actually keep ‘popping’ UniqueIDs from Redis until either there are no more, or the worker is out of time (they recycle after 1 hour to help the queue stay ‘fresh’, and in case of memory leaks).
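Conceptually, each worker’s main loop looks something like the sketch below. The Redis key name ‘points’, the process_point placeholder, and the exact lifetime check are assumptions for illustration; the real gdelt_worker code will differ in detail.

import time
import config

def process_point(uid):
    """Placeholder for the real work: query events, haversine-filter, write edge files."""
    pass

MAX_LIFETIME = 60 * 60                      # recycle after roughly one hour
started = time.time()

while time.time() - started < MAX_LIFETIME:
    uid = config.r.lpop('points')           # 'points' is a hypothetical queue key name
    if uid is None:
        print('no more jobs in queue')
        break
    uid = uid.decode()                      # lpop returns bytes under Python 3
    process_point(uid)
    print('%s done job' % uid)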
Gephi Output
After job completion, if you would like to create output appropriate for Gephi input:
$ mv util/* .
$ qsub start_redis.sh
Your job 777349 ("redis") has been submitted
$ cat G_OUTPUT/redis.o777349
Starting redis at 2017-06-13-10:55:03
redis up at 2017-06-13-10:55:04
$ qrsh -now no 'workon GDELT; python export_gephi_setup.py'
Vance_Hall
Swat_College
$ qsub -t 1-2 export_gephi.sh
$ qstat
job-ID  prior    name      user     state  submit/start at      queue                           jclass  slots  ja-task-ID
------------------------------------------------------------------------------------------------------------------------------------------------
 777349  0.55382  redis     hughmac  r      06/13/2017 10:55:00  all.q@hpcc029.wharton.upenn.ed          1
 777358  0.55382  ex_gephi  hughmac  r      06/13/2017 10:59:17  all.q@hpcc023.wharton.upenn.ed          1      1
 777358  0.55382  ex_gephi  hughmac  r      06/13/2017 10:59:17  all.q@hpcc023.wharton.upenn.ed          1      2
$ qstat
job-ID  prior    name      user     state  submit/start at      queue                           jclass  slots  ja-task-ID
------------------------------------------------------------------------------------------------------------------------------------------------
 777349  0.56047  redis     hughmac  r      06/13/2017 10:55:00  all.q@hpcc029.wharton.upenn.ed          1
$ qdel 777349
hughmac has registered the job 777349 for deletion
Note that I waited until the redis log (in the G_OUTPUT directory) showed ‘redis up’ before starting the rest of the processes. Output will be in EdgeData (or whatever you set outdir to in config.py), as UniqueID_(Un)Directed_gephi.csv.