Scaling GDELT with Redis Master

One-Time Environment Setup

Log on to HPCC via MobaXterm to set up your environment. These steps only need to be done once, not once per run.

Per-Run Setup

Log on to HPCC via MobaXterm to set up and do a run. These steps need to be done for each run. We’ll call this example run ‘runA’. Copy the code into a new ‘runA’ directory, and list the files:
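
For example (the source path for the code is hypothetical here; use wherever the GDELT/Redis code actually lives on your system):

    mkdir runA
    cp /path/to/gdelt_redis_code/* runA/    # hypothetical source path for the code
    cd runA
    ls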

Let’s take a look at the files in the new run directory, and get ready for a run!

Preparing a Run

INFILE.csv

INFILE.csv holds the information unique to each point that you want to run queries on.

Edit your INFILE.csv in, or upload it to, the runA directory. Any file name is fine; INFILE.csv is just the default, and the name can be set in config.py. INFILE.csv has the following format, and two example Unique IDs are included for this tutorial:

UniqueId | SaveQueryResults | SaveMatrices | SaveEdges | CountryCode | Latitude | Longitude | Distance | In_Miles | ActionGeo_Type | StartDate | EndDate | Frequency
Swat_College | True | False | True | | 39.9062176 | -75.3557416 | 1 | False | 3,4 | 20050101 | 20091231 | annually
Vance_Hall | True | False | True | | 39.9483068 | -75.1953933 | 1 | False | 3,4 | 20050101 | 20091231 | annually
(CountryCode is left blank in both example rows.)

PERMUTATIONS.csv

PERMUTATIONS.csv has one line for each ‘permutation’ that you want to run for each Unique ID’s dataset.

Edit a new PERMUTATIONS.csv in, or upload it to, the runA directory. PERMUTATIONS.csv has the following format; a few example permutations are included, and each Unique ID’s events will be filtered against each of them before stats are run:

Actor1 Actor1TypeCode Actor1Type1Code Actor1Type2Code Actor1Type3Code Actor2 Actor2TypeCode Actor2Type1Code Actor2Type2Code Actor2Type3Code QuadClass
BUS 1
BUS 2
BUS,MNC
COP,GOV,JUD,MIL,SPY,LEG 1

Note the entirely blank row at the top (blank in every column). That permutation will get you stats for all matching points with no additional filtering.

config.py

The config.py file, just like it sounds, is the configuration for the run. Generally, the defaults are pretty good:

In the ‘# redis server setup’ section, you’ll want to change the ‘redispass’ (so others can’t access / delete / modify your data / results). All the rest are static, and shouldn’t be changed.
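
The only line you should need to touch there looks something like this (the password value is just a placeholder):

    # redis server setup
    redispass = 'pick_something_long_and_unique'   # change this; leave the rest of the section alone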

In the ‘# master proc setup’ section, you’ll generally not need to change anything (the section is sketched out after this list), but:

  • sleep_time: generally static. Useful to extend if there are problems with the master starting too many workers
  • flush_data: when Redis starts, if there’s an existing redis.db file in the redis directory, redis will load it. Setting flush_data to ‘True’ will ‘flush’ (clear) the DB for ‘new work’. Useful for debug and reruns, but generally easier to just create a new ‘run’ directory
  • mailto: generally static. This is where ‘something is wrong’ e-mails go, and defaults to your username@wharton.upenn.edu
  • myNULL: what we’ll load empty output file values with (NaN is ‘Pythonic’)
  • output_stats_file: the main output stats file
  • pointfile: the INPUT file (see INFILE.csv, above)
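
As a rough sketch of that section (the values shown are illustrative, not necessarily the shipped defaults; check your own config.py):

    # master proc setup (illustrative values)
    sleep_time        = 30                              # seconds between master checks; extend if too many workers start
    flush_data        = False                           # True clears an existing redis.db so the run starts fresh
    mailto            = 'username@wharton.upenn.edu'    # where 'something is wrong' e-mails go
    myNULL            = 'NaN'                           # value written for empty output fields
    output_stats_file = 'STATS.csv'                     # main output stats file (name illustrative)
    pointfile         = 'INFILE.csv'                    # the input points file described above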

In the ‘# worker proc setup’ section, you might need to change a few things, depending on how ‘big’ the number of events near a point might be (radius & location are the important bits); the whole section is sketched out after this list:

  • worker_name = ‘G_WORKER’ : no need to change this
  • worker_queue = ‘short.q’ : if these are ‘big’ queries, change to ‘all.q’, so workers can run for more than 4 hours
  • max_workers = 16 : the max for the ‘short.q’ is 256 (I don’t recommend more than 128, tho), and the max for the ‘all.q’ is 64. The master will only start workers if there are Unique IDs still to be processed … in other words, if max_workers = 100, but there are only 2 Unique IDs pending (as in our example INFILE.csv), the master will only start 2 workers
  • max_worker_RAM = ‘2G’ : again, if these are very big queries, you might up this to ’10G’ (or re-think your queries, because they’re going to be huge!)
  • outdir = ‘EdgeData’ : the directory where the edge data will be written
  • eventdir = ‘EventData’ : the directory where the raw query CSVs will be written
  • overwrite_events = True : should I overwrite the raw query CSVs if I’m re-running a job?

DEBUG: if set to True, the master process will write a bit more output. Useful if you need to … debug! 🙂
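
Put together, the worker section of config.py looks roughly like this, using the defaults described above (the DEBUG default shown here is an assumption):

    # worker proc setup
    worker_name      = 'G_WORKER'
    worker_queue     = 'short.q'     # switch to 'all.q' if workers need more than 4 hours
    max_workers      = 16            # short.q allows up to 256 (128 recommended max); all.q up to 64
    max_worker_RAM   = '2G'          # raise toward '10G' only for very big queries
    outdir           = 'EdgeData'    # edge data output directory
    eventdir         = 'EventData'   # raw query CSV output directory
    overwrite_events = True          # overwrite raw query CSVs when re-running

    DEBUG = False                    # assumed default; True makes the master write a bit more output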

gdelt_master.sh

Note that ‘-l m_mem_free=20G’ is commented out, with the two pound signs. It’s there as a reminder: if it’s a really big job (hundreds of points), you will need to uncomment that line to allow the master to use more RAM during the run.

If you don’t want to receive an e-mail after job completion, comment out or remove the ‘#$ -m e’ line.
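
As an illustrative excerpt (the real script has more in it, and the exact directive lines may differ slightly):

    #!/bin/bash
    ##$ -l m_mem_free=20G   # remove one '#' (leaving '#$ -l ...') for really big jobs, so the master gets more RAM
    #$ -m e                 # e-mail when the job ends; comment out or remove if you don't want it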

Starting a Run

You’re ready to go! Starting the run is as easy as:
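
From inside the run directory, that means submitting the master script to the scheduler with a standard SGE qsub:

    cd runA                 # if you aren't already there
    qsub gdelt_master.sh    # the G_MASTER job starts Redis, loads the data, and spawns the workers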

Monitoring a Run

To see that a run has properly started, use the ‘qstat’ command. You’ll initially see the ‘G_MASTER’ job, and then once that’s up and Redis is loaded, the master will start the requested / required workers, in this case 2 (one for each UniqueID).
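
For example, limiting the listing to your own jobs:

    qstat -u $USER    # look for the G_MASTER job first, then the G_WORKER jobs once Redis is loaded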

You can also look in the log files in G_OUTPUT. A complete run should have output like the following:

A complete worker log should look like:

The workers will actually keep ‘popping’ UniqueIDs from Redis until either there are no more, or the worker is out of time (they recycle after 1 hour to help the queue stay ‘fresh’, and in case of memory leaks).
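
That pop-until-done loop looks roughly like the sketch below (the connection details, queue key, and process_point function are all hypothetical stand-ins, not the actual worker code):

    import time
    import redis

    def process_point(unique_id):
        """Hypothetical stand-in for the real per-UniqueID query/stats work."""
        print('processing', unique_id)

    # assumed connection details; the real worker presumably reads these from config.py
    r = redis.Redis(host='localhost', port=6379, password='redispass')

    deadline = time.time() + 60 * 60               # recycle after roughly an hour
    while time.time() < deadline:
        unique_id = r.lpop('pending_unique_ids')   # hypothetical Redis list of pending UniqueIDs
        if unique_id is None:                      # queue is empty, nothing left to do
            break
        process_point(unique_id.decode())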

Gephi Output

After job completion, if you would like to create output appropriate for Gephi input:

Note that I waited until the redis log (in the G_OUTPUT dir) showed ‘redis up’ before starting the rest of the processes. Output will be in EdgeData (or whatever you set outdir to in config.py), as UniqueID_(Un)Directed_gephi.csv.
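
For what it's worth, that sequence looks roughly like the sketch below; every script name here is hypothetical, so check the run directory for the real ones:

    qsub gdelt_redis.sh                 # hypothetical: bring Redis back up with the run's data
    grep 'redis up' G_OUTPUT/*.log      # repeat until the redis log reports 'redis up'
    qsub gdelt_gephi.sh                 # hypothetical: write UniqueID_(Un)Directed_gephi.csv into EdgeData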