Scaling GDELT with Redis Master

One-Time Environment Setup

Log on to HPCC via MobaXterm to set up your environment. These steps only need to be done once, not per run.

$ qlogin
$ mkvirtualenv --no-site-packages GDELT
$ pip install -U pip setuptools
$ pip install -U -r /data/GDELT/code2/requirements.txt
$ exit

Per-Run Setup

Log on to HPCC via MobaXterm to set up and do a run. These steps need to be done for each run. We’ll call this example run ‘runA’. Copy the code into a new ‘runA’ directory and change into it:

$ rsync -av /data/GDELT/code2/ runA
$ cd runA

Let’s take a look at the files in the new run directory, and get ready for a run!

Preparing a Run

INFILE.csv

INFILE.csv holds the information unique to each point that you want to run queries on.

Edit or upload your INFILE.csv in (or to) the runA directory. Any file name is fine; INFILE.csv is the default, and the name can be changed via the ‘pointfile’ setting in config.py. INFILE.csv has the following format, and two example Unique IDs are included for this tutorial (the CountryCode field is left blank in both):

UniqueId SaveQueryResults SaveMatrices SaveEdges CountryCode Latitude Longitude Distance In_Miles ActionGeo_Type StartDate EndDate Frequency
Swat_College True False True 39.9062176 -75.3557416 1 False 3,4 20050101 20091231 annually
Vance_Hall True False True 39.9483068 -75.1953933 1 False 3,4 20050101 20091231 annually
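
If you want to sanity-check INFILE.csv before submitting a run, a minimal sketch along these lines can help. It is not part of the GDELT code: it assumes pandas is installed in the GDELT virtualenv, lets pandas sniff the delimiter, and the check_infile helper and column list are just for illustration.

# check_infile.py -- quick sanity check of INFILE.csv before a run.
# Hypothetical helper, not part of the GDELT code; assumes pandas is
# available in the GDELT virtualenv and lets pandas sniff the delimiter.
import pandas as pd

EXPECTED = ['UniqueId', 'SaveQueryResults', 'SaveMatrices', 'SaveEdges',
            'CountryCode', 'Latitude', 'Longitude', 'Distance', 'In_Miles',
            'ActionGeo_Type', 'StartDate', 'EndDate', 'Frequency']

def check_infile(path='INFILE.csv'):
    df = pd.read_csv(path, sep=None, engine='python')  # sniff the delimiter
    missing = [c for c in EXPECTED if c not in df.columns]
    if missing:
        raise ValueError('missing columns: %s' % missing)
    print('%d point(s) found: %s' % (len(df), ', '.join(df['UniqueId'])))

if __name__ == '__main__':
    check_infile()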

PERMUTATIONS.csv

PERMUTATIONS.csv has one line for each ‘permutation’ that you want to run for each Unique ID’s dataset.

Edit or upload a new PERMUTATIONS.csv in (or to) the runA directory. PERMUTATIONS.csv has the following format; included are a few example permutations that each Unique ID’s events will be filtered against before stats are run:

Actor1 Actor1TypeCode Actor1Type1Code Actor1Type2Code Actor1Type3Code Actor2 Actor2TypeCode Actor2Type1Code Actor2Type2Code Actor2Type3Code QuadClass

BUS 1
BUS 2
BUS,MNC
COP,GOV,JUD,MIL,SPY,LEG 1

Note the entirely blank row at the top (right after the header). That blank permutation gets you stats for all matching points with no additional filtering.
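
If you want to double-check how many permutations the master will pick up (the blank row counts as one), a tiny sketch like the following mirrors the ‘got N total permutations’ line you’ll later see in the master log. It is a rough line count for illustration only, not the master’s actual parser.

# count_permutations.py -- rough count of the permutation rows (header excluded).
# Hypothetical helper; the real master does its own parsing.
def count_permutations(path='PERMUTATIONS.csv'):
    with open(path) as fh:
        lines = fh.read().splitlines()
    return len(lines) - 1  # everything after the header, blank row included

if __name__ == '__main__':
    print('%d total permutations' % count_permutations())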

config.py

The config.py file, just like it sounds, is the configuration for the run. Generally, the defaults are pretty good:

import socket
import redis
import os

# redis server setup
redishost = socket.gethostname()
redisport = 6379
redispass = 'THISISAPASSWORD-MAKEYOUROWN'
r = redis.Redis(host=redishost, port=redisport, db=0, password=redispass)

# master proc setup
sleep_time = 30
flush_data = True
mailto = os.getenv('USER') + '@wharton.upenn.edu'
myNULL = 'NaN'
output_stats_file = 'output_stats_file.csv'
pointfile = 'INFILE.csv'

# worker proc setup
worker_name = 'G_WORKER'
worker_queue = 'short.q'
max_workers = 16
max_worker_RAM = '2G'
outdir = 'EdgeData'
eventdir = 'EventData'
overwrite_events = True

# debug
DEBUG = False

In the ‘# redis server setup’ section, you’ll want to change the ‘redispass’ (so others can’t access / delete / modify your data / results). All the rest are static, and shouldn’t be changed.
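
A quick way to generate a random string to paste in as ‘redispass’ (a minimal sketch using only the standard library; any long random string will do):

# gen_redispass.py -- print a random hex string to paste into config.py as
# redispass. Uses only the standard library; works on Python 2 and 3.
import binascii
import os

print(binascii.hexlify(os.urandom(24)).decode())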

In the ‘# master proc setup’ section, you’ll generally not need to change anything, but:

  • sleep_time: generally static. Useful to extend if there are problems with the master starting too many workers
  • flush_data: when Redis starts, if there’s an existing redis.db file in the redis directory, redis will load it. Setting flush_data to ‘True’ will ‘flush’ (clear) the DB for ‘new work’. Useful for debug and reruns, but generally easier to just create a new ‘run’ directory
  • mailto: generally static. This is where ‘something is wrong’ e-mails go, and defaults to your username@wharton.upenn.edu
  • myNULL: what we’ll load empty output file values with (NaN is ‘Pythonic’)
  • output_stats_file: the main output stats file
  • pointfile: the INPUT file (see INFILE.csv, above)

In the ‘# worker proc setup’ section, you might need to change a few things, depending on how ‘big’ the number of events near a point might be (radius & location are the important bits):

  • worker_name = ‘G_WORKER’ : no need to change this
  • worker_queue = ‘short.q’ : if these are ‘big’ queries, change to ‘all.q’, so workers can run for more than 4 hours
  • max_workers = 16 : the max for the ‘short.q’ is 256 (I don’t recommend more than 128, though), and the max for the ‘all.q’ is 64. The master will only start workers if there are Unique IDs still to be processed … in other words, if max_workers = 100 but there are only 2 Unique IDs pending (as in our example INFILE.csv), the master will only start 2 workers (see the sketch after this list)
  • max_worker_RAM = ‘2G’ : again, if these are very big queries, you might up this to ’10G’ (or re-think your queries, because they’re going to be huge!)
  • outdir = ‘EdgeData’ : the directory where the edge data will be written
  • eventdir = ‘EventData’ : the directory where the raw query CSVs will be written
  • overwrite_events = True : should I overwrite the raw query CSVs if I’m re-running a job?
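
To make the max_workers behavior concrete, the master’s decision boils down to something like the sketch below. It is illustrative only; the function name and logic are not taken from gdelt_master.py.

# Illustrative only: mirrors the "workers: max: ... running: ... needed: ..."
# line in the master log, not the actual gdelt_master.py logic.
def workers_to_start(max_workers, running, pending_unique_ids):
    # never exceed max_workers, and never start more workers than there
    # are Unique IDs waiting to be processed
    return max(0, min(max_workers - running, pending_unique_ids))

print(workers_to_start(16, 0, 2))    # -> 2, as in the example INFILE.csv
print(workers_to_start(100, 0, 2))   # -> 2, even with a much larger max_workers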

DEBUG: if set to True, the master process will write a bit more output. Useful if you need to … debug! 🙂

gdelt_master.sh

#!/bin/bash
#$ -N G_MASTER
##$ -l m_mem_free=20G
#$ -o G_OUTPUT
#$ -j y
#$ -m e
workon GDELT
python gdelt_master.py

Note that ‘-l m_mem_free=20G’ is commented out with the two pound signs (a single ‘#$’ is an active scheduler directive; ‘##$’ is ignored). It’s there as a reminder: if it’s a really big job (hundreds of points), uncomment that line by removing one of the pound signs to allow the master to use more RAM during the run.

If you don’t want to receive an e-mail after job completion, comment out or remove the ‘#$ -m e’ line.

Starting a Run

You’re ready to go! Starting the run is as easy as:

$ qsub gdelt_master.sh

Monitoring a Run

To see that a run has properly started, use the ‘qstat’ command. You’ll initially see the ‘G_MASTER’ job, and then once that’s up and Redis is loaded, the master will start the requested / required workers, in this case 2 (one for each UniqueID).

$ qstat
job-ID     prior   name       user         state submit/start at     queue                          jclass                         slots ja-task-ID
------------------------------------------------------------------------------------------------------------------------------------------------
    777335 0.55637 G_MASTER   hughmac      r     06/13/2017 10:37:28 all.q@hpcc004.wharton.upenn.ed                                    1
    777336 0.55637 G_WORKER   hughmac      r     06/13/2017 10:37:34 short.q@hpcc015.wharton.upenn.                                    1 1
    777336 0.55637 G_WORKER   hughmac      r     06/13/2017 10:37:34 short.q@hpcc015.wharton.upenn.                                    1 2

You can also look in the log files in G_OUTPUT. A complete run should have output like the following:

$ cat G_OUTPUT/G_MASTER.o777335
got 5 total permutations
Starting redis at 2017-06-13-10:37:33
redis up at 2017-06-13-10:37:34
2 points loaded to redis on hpcc004.wharton.upenn.edu (6379)
5 recs loaded to redis on hpcc004.wharton.upenn.edu (6379)
saving redis DB (initial state)
points to run: 2 total will be: 2
workers: max: 16 running: 0 needed: 2
all jobs done, saving redis
writing output
Stopping redis at 2017-06-13-10:38:04
done

A complete worker log should look like:

$ cat G_OUTPUT/G_WORKER.o777336.1
SEARCH: UID: Swat_College, Cntry: , Lat: 39.9062176, Lon: -75.3557416, Distance: 1.0 (miles? False), Dates: 20050101 to 20091231 (annually)
got 68 records from MySQL
reduced to 68 records with haversine
AllActors len: 47, AllStartDates len: 5
writing Undirected Graph to EdgeData/Swat_College_Graph.csv
writing Directed DiGraph to EdgeData/Swat_College_DiGraph.csv
Swat_College done job
no more jobs in queue

The workers will actually keep ‘popping’ UniqueIDs from Redis until either there are no more, or the worker is out of time (they recycle after 1 hour to help the queue stay ‘fresh’, and in case of memory leaks).
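
Conceptually, that worker loop looks something like the sketch below. The queue key name, the Redis list operations, and the process_point stub are stand-ins for illustration, not the actual worker code.

# Conceptual sketch of the worker loop. The queue key name, list operations,
# and process_point stub are stand-ins, not the actual gdelt_worker code.
import time

from config import r  # the redis.Redis connection defined in config.py

MAX_RUNTIME = 60 * 60  # workers recycle after roughly an hour

def process_point(unique_id):
    # stand-in for the real work: query events, filter, write edge data
    print('%s done job' % unique_id)

start = time.time()
while time.time() - start < MAX_RUNTIME:
    unique_id = r.lpop('points_queue')  # hypothetical queue key
    if unique_id is None:
        print('no more jobs in queue')
        break
    if isinstance(unique_id, bytes):
        unique_id = unique_id.decode()
    process_point(unique_id)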

Gephi Output

After job completion, if you would like to create output appropriate for Gephi input:

$ mv util/* .
$ qsub start_redis.sh
Your job 777349 ("redis") has been submitted
$ cat G_OUTPUT/redis.o777349
Starting redis at 2017-06-13-10:55:03
redis up at 2017-06-13-10:55:04
$ qrsh -now no 'workon GDELT; python export_gephi_setup.py'
Vance_Hall
Swat_College
$ qsub -t 1-2 export_gephi.sh
$ qstat
job-ID     prior   name       user         state submit/start at     queue                          jclass                         slots ja-task-ID
------------------------------------------------------------------------------------------------------------------------------------------------
    777349 0.55382 redis      hughmac      r     06/13/2017 10:55:00 all.q@hpcc029.wharton.upenn.ed                                    1
    777358 0.55382 ex_gephi   hughmac      r     06/13/2017 10:59:17 all.q@hpcc023.wharton.upenn.ed                                    1 1
    777358 0.55382 ex_gephi   hughmac      r     06/13/2017 10:59:17 all.q@hpcc023.wharton.upenn.ed                                    1 2
$ qstat
job-ID     prior   name       user         state submit/start at     queue                          jclass                         slots ja-task-ID
------------------------------------------------------------------------------------------------------------------------------------------------
    777349 0.56047 redis      hughmac      r     06/13/2017 10:55:00 all.q@hpcc029.wharton.upenn.ed    
$ qdel 777349
hughmac has registered the job 777349 for deletion

Note that I waited until the redis log (in the G_OUTPUT dir) showed ‘redis up’ before starting the rest of the processes. Output will be in EdgeData (or whatever you set outdir to in config.py), as UniqueID_(Un)Directed_gephi.csv.
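
If you’d like a quick look at one of those files before pulling it into Gephi, a minimal sketch like this works. It assumes pandas, guesses the exact filename from the UniqueID_(Un)Directed_gephi.csv pattern above, and treats the first two columns as the edge endpoints, which is an assumption about the layout.

# peek_gephi.py -- quick preview of one Gephi export. The filename and the
# assumption that the first two columns are the edge endpoints are guesses.
import pandas as pd

df = pd.read_csv('EdgeData/Swat_College_Undirected_gephi.csv')
print(df.columns.tolist())
print(df.head())
print('%d edges between %d distinct endpoints'
      % (len(df), pd.concat([df.iloc[:, 0], df.iloc[:, 1]]).nunique()))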