WRDS Data Directly from Python, R, and MATLAB

Using Wharton Research Computing’s new SAS/SHARE service, you can query WRDS datasets directly from your favorite research software, including Python, R, and MATLAB.

Getting data from the entire suite of WRDS data products and into the software you normally use for your research can be challenging. Many users who are unfamiliar with the SAS programming language spend a lot of time struggling with this unfamiliar software to simply gather a subset of a WRDS dataset, which they then save to their home directory space (wasting disk space resources), and then import the data into their software platform of choice for analysis.

There is a better way!

With guidance from the WRDS team, Wharton Research Computing now has our own SAS/SHARE server, which allows direct query of WRDS data via standard database queries. So instead of the multi-step and multi-software package process, you can work with WRDS data as objects in your language of choice.

Please note: due to its size, the NYSE TAQ dataset is not currently available through R at WRDS.

Prerequisite:

Encrypted Wharton Password

To work securely with SAS/SHARE in any language, you’ll want to use an encrypted version of your Wharton password instead of your normal text-based password. It is a violation of university policy to store your ‘normal’ Wharton text-based password in a file, so this step is required if you wish to use the SAS/SHARE service in a script file. Fortunately, it’s easy!

Log onto the HPCC login node at hpcc.wharton.upenn.edu and type:

qrsh sas -nodms

This will launch the SAS software on a compute node, in interactive mode. After SAS loads, at the `1? ` prompt, enter:

PROC PWENCODE in='thisIsNotMyRealPassword'; run; [ENTER]

Replace my fake password with your real password. Obviously, make sure you are in a private setting when you type this so that no one will be able to observe your real password.

You will be returned a SAS encrypted password, similar to:

{SAS002}1E21C9321B1E016F1D4507E31B005BFC9A1BCFF5B348C073F02EA6ADF72A4139F2

COPY that entire line, ‘{SAS002}’ and all, and save it to a secure location. NOTE: your password may appear on more than one line. If it does, when you save it somewhere remove the line break. When you are done, press Ctrl-c on your keyboard to ‘break out of’ and close the running SAS software.

Set Up Your Research Software

You’re now ready to set up your research software: Python, R, or MATLAB. Each one is a bit different, so I will describe them separately. Please feel free to skip to the section for the software of your choice!

I will also include links to the WRDS documentation for each product, with the caveat that their systems and setups (particularly system names and file paths) are often different than those on the Wharton Research Computing systems.

Setting Up and Using WRDS with Python

This is a one time setup, unless your Wharton password changes.

qlogin
mkvirtualenv WRDS
pip install -U numpy
pip install -U iPython # <-- Optional ... if you like to use it
pip install -U pandas
pip install -U jaydebeapi==0.2.0

cat >> .virtualenvs/WRDS/bin/activate
export PYTHONPATH=$PYTHONPATH:/usr/local/demo/Python/WRDS [ENTER]
Ctrl-d

Set Up Your Login Credentials

Using your Wharton username and the SAS encrypted password you created above, make a new file in the root of your home directory (~), called .wrdsauthrc, substituting in (copy/paste) your encrypted password:

cat > ~/.wrdsauthrc
[credentials]
username=my_username
password={SAS002}DBCC5712369DE1C65B19864C1564FB850F398DCF [ENTER]
# type the following to exit the file write:
Ctrl-d

Using the Python Setup

Testing and Using Interactively
qlogin
workon WRDS
python
>>> import wrds
>>> result = wrds.sql('select * from CRSP.DSF (obs=10)')
>>> print(result.to_string())
      CUSIP  PERMNO  PERMCO  ISSUNO  HEXCD  HSICCD        DATE  BIDLO  ASKHI     PRC    VOL       RET   BID   ASK  SHROUT  CFACPR  CFACSHR OPENPRC NUMTRD      RETX
0  68391610   10000    7952   10396      3    3990  1986-01-07  2.375  2.750 -2.5625   1000       NaN  None  None    3680       1        1    None   None       NaN
1  68391610   10000    7952   10396      3    3990  1986-01-08  2.375  2.625 -2.5000  12800 -0.024390  None  None    3680       1        1    None   None -0.024390
2  68391610   10000    7952   10396      3    3990  1986-01-09  2.375  2.625 -2.5000   1400  0.000000  None  None    3680       1        1    None   None  0.000000
3  68391610   10000    7952   10396      3    3990  1986-01-10  2.375  2.625 -2.5000   8500  0.000000  None  None    3680       1        1    None   None  0.000000
4  68391610   10000    7952   10396      3    3990  1986-01-13  2.500  2.750 -2.6250   5450  0.050000  None  None    3680       1        1    None   None  0.050000
5  68391610   10000    7952   10396      3    3990  1986-01-14  2.625  2.875 -2.7500   2075  0.047619  None  None    3680       1        1    None   None  0.047619
6  68391610   10000    7952   10396      3    3990  1986-01-15  2.750  3.000 -2.8750  22490  0.045455  None  None    3680       1        1    None   None  0.045455
7  68391610   10000    7952   10396      3    3990  1986-01-16  2.875  3.125 -3.0000  10900  0.043478  None  None    3680       1        1    None   None  0.043478
8  68391610   10000    7952   10396      3    3990  1986-01-17  2.875  3.125 -3.0000   8470  0.000000  None  None    3680       1        1    None   None  0.000000
9  68391610   10000    7952   10396      3    3990  1986-01-20  2.875  3.125 -3.0000   1000  0.000000  None  None    3680       1        1    None   None  0.000000
In a qsub Python Script

Create a Python script (test_Py.py, in this example), something like:

import wrds
result = wrds.sql('select * from CRSP.DSF (obs=10)')
print(result.to_string())

Run with something like:

qsub -N wrds_Py -j y -b y 'workon WRDS; python test_Py.py'

Check your output file:

# cat wrds_Py.o479969
 CUSIP PERMNO PERMCO ISSUNO HEXCD HSICCD DATE BIDLO ASKHI PRC VOL RET BID ASK SHROUT CFACPR CFACSHR OPENPRC NUMTRD RETX
0 68391610 10000 7952 10396 3 3990 1986-01-07 2.375 2.750 -2.5625 1000 NaN None None 3680 1 1 None None NaN
1 68391610 10000 7952 10396 3 3990 1986-01-08 2.375 2.625 -2.5000 12800 -0.024390 None None 3680 1 1 None None -0.024390
2 68391610 10000 7952 10396 3 3990 1986-01-09 2.375 2.625 -2.5000 1400 0.000000 None None 3680 1 1 None None 0.000000
3 68391610 10000 7952 10396 3 3990 1986-01-10 2.375 2.625 -2.5000 8500 0.000000 None None 3680 1 1 None None 0.000000
4 68391610 10000 7952 10396 3 3990 1986-01-13 2.500 2.750 -2.6250 5450 0.050000 None None 3680 1 1 None None 0.050000
5 68391610 10000 7952 10396 3 3990 1986-01-14 2.625 2.875 -2.7500 2075 0.047619 None None 3680 1 1 None None 0.047619
6 68391610 10000 7952 10396 3 3990 1986-01-15 2.750 3.000 -2.8750 22490 0.045455 None None 3680 1 1 None None 0.045455
7 68391610 10000 7952 10396 3 3990 1986-01-16 2.875 3.125 -3.0000 10900 0.043478 None None 3680 1 1 None None 0.043478
8 68391610 10000 7952 10396 3 3990 1986-01-17 2.875 3.125 -3.0000 8470 0.000000 None None 3680 1 1 None None 0.000000
9 68391610 10000 7952 10396 3 3990 1986-01-20 2.875 3.125 -3.0000 1000 0.000000 None None 3680 1 1 None None 0.000000

Further Reading: WRDS and Python Documentation from the WRDS Team

Setting Up and Using WRDS with R

This is a one time setup, unless your Wharton password changes.

cat /usr/local/demo/R/WRDS/Rprofile >> ~/.Rprofile
cat /usr/local/demo/R/WRDS/Renviron >> ~/.Renviron

Set Up Your Login Credentials

Using your Wharton username and the SAS encrypted password you created above, edit (using the editor of your choice … in this case I used ‘nano’) the .Rprofile file we just created in the root of your home directory (~), substituting your username and (copy/paste) your encrypted password:

nano ~/.Rprofile
# substitute your username for 'my_username' and your password from above in the 'pass <- ' variable

Using WRDS With R

Testing and Using Interactively
qlogin
R --no-save
> res <- dbSendQuery(wrds,"select date,dji from DJONES.DJDAILY (obs=10)")
> data <- fetch(res, n = -1)
> data 
  date dji
1 1896-05-26 40.94
2 1896-05-27 40.58
3 1896-05-28 40.20
4 1896-05-29 40.63
5 1896-06-01 40.60
6 1896-06-02 40.04
7 1896-06-03 39.77
8 1896-06-04 39.94
9 1896-06-05 40.32
10 1896-06-08 39.81
> q()
In a qsub R Script

Create an R script (test_R.R, in this example), something like:

res <- dbSendQuery(wrds,"select date,dji from DJONES.DJDAILY (obs=10)")
data <- fetch(res, n = -1)
data

Run with something like:

qsub -N wrds_R -j y -b y 'R --no-save < test_R.R'

Check your output file:

# cat wrds_R.o480044
...
Loading required package: DBI
> res <- dbSendQuery(wrds,"select date,dji from DJONES.DJDAILY (obs=10)")
> data <- fetch(res, n = -1)
> data
 date dji
1 1896-05-26 40.94
2 1896-05-27 40.58
3 1896-05-28 40.20
4 1896-05-29 40.63
5 1896-06-01 40.60
6 1896-06-02 40.04
7 1896-06-03 39.77
8 1896-06-04 39.94
9 1896-06-05 40.32
10 1896-06-08 39.81
>

Further Reading: WRDS and R Documentation from the WRDS Team

Setting Up and Using WRDS with MATLAB

NOTE: MATLAB version r2015b (the current latest version) has a bug which prevents DB connections from working. Until r2016b comes out you will need to use r2014b version (also installed in the cluster).

This is a one time setup, unless your Wharton password changes. Copy the WRDS_Connect.m function to someplace in your home directory, so you can edit the file and add credentials:

cp /usr/local/demo/Matlab/WRDS/WRDS_Connect.m ~/

Set Up Your Login Credentials

Using your Wharton username and the SAS encrypted password you created above, edit (using the editor of your choice … in this case I used ‘nano’) the WRDS_Connect.m file we just created in the root of your home directory (~), substituting your username and (copy/paste) your encrypted password:

nano ~/WRDS_Connect.m
# substitute your username for 'my_username' and your password from above for the '{SAS002}...' placeholder password

Using WRDS With MATLAB

Testing and Using Interactively
qlogin
matlab -nodisplay
>> WRDS_Connect
# view all WRDS DataSets
>> schemas(WRDS)
# do a query
>> query = exec(WRDS, ['SELECT * FROM CRSP.DSF (obs=10)']);
>> result = fetch(query);
>> result.Data

ans =

 Columns 1 through 8

 '68391610' [10000] [7952] [10396] [3] [3990] '1986-01-07' [2.3750]
 '68391610' [10000] [7952] [10396] [3] [3990] '1986-01-08' [2.3750]
 '68391610' [10000] [7952] [10396] [3] [3990] '1986-01-09' [2.3750]
 '68391610' [10000] [7952] [10396] [3] [3990] '1986-01-10' [2.3750]
 '68391610' [10000] [7952] [10396] [3] [3990] '1986-01-13' [2.5000]
 '68391610' [10000] [7952] [10396] [3] [3990] '1986-01-14' [2.6250]
 '68391610' [10000] [7952] [10396] [3] [3990] '1986-01-15' [2.7500]
 '68391610' [10000] [7952] [10396] [3] [3990] '1986-01-16' [2.8750]
 '68391610' [10000] [7952] [10396] [3] [3990] '1986-01-17' [2.8750]
 '68391610' [10000] [7952] [10396] [3] [3990] '1986-01-20' [2.8750]

 Columns 9 through 17

 [2.7500] [-2.5625] [ 1000] [ NaN] [NaN] [NaN] [3680] [1] [1]
 [2.6250] [-2.5000] [12800] [-0.0244] [NaN] [NaN] [3680] [1] [1]
 [2.6250] [-2.5000] [ 1400] [ 0] [NaN] [NaN] [3680] [1] [1]
 [2.6250] [-2.5000] [ 8500] [ 0] [NaN] [NaN] [3680] [1] [1]
 [2.7500] [-2.6250] [ 5450] [ 0.0500] [NaN] [NaN] [3680] [1] [1]
 [2.8750] [-2.7500] [ 2075] [ 0.0476] [NaN] [NaN] [3680] [1] [1]
 [ 3] [-2.8750] [22490] [ 0.0455] [NaN] [NaN] [3680] [1] [1]
 [3.1250] [ -3] [10900] [ 0.0435] [NaN] [NaN] [3680] [1] [1]
 [3.1250] [ -3] [ 8470] [ 0] [NaN] [NaN] [3680] [1] [1]
 [3.1250] [ -3] [ 1000] [ 0] [NaN] [NaN] [3680] [1] [1]

 Columns 18 through 20

 [NaN] [NaN] [ NaN]
 [NaN] [NaN] [-0.0244]
 [NaN] [NaN] [ 0]
 [NaN] [NaN] [ 0]
 [NaN] [NaN] [ 0.0500]
 [NaN] [NaN] [ 0.0476]
 [NaN] [NaN] [ 0.0455]
 [NaN] [NaN] [ 0.0435]
 [NaN] [NaN] [ 0]
 [NaN] [NaN] [ 0]
In a qsub MATLAB Script

Create a MATLAB script (test_MATLAB.m, in this example), something like:

WRDS_Connect
query = exec(WRDS, ['SELECT * FROM CRSP.DSF (obs=10)']);
result = fetch(query);
result.Data

Run with something like:

qsub -N wrds_MATLAB -j y -b y 'matlab -nodisplay < test_MATLAB.m'

Check your output file:

#  cat wrds-MATLAB.o480051

...

>> >> >>
ans =

 Columns 1 through 7

 '68391610' [10000] [7952] [10396] [3] [3990] '1986-01-07'
 '68391610' [10000] [7952] [10396] [3] [3990] '1986-01-08'
 '68391610' [10000] [7952] [10396] [3] [3990] '1986-01-09'
 '68391610' [10000] [7952] [10396] [3] [3990] '1986-01-10'
 '68391610' [10000] [7952] [10396] [3] [3990] '1986-01-13'
 '68391610' [10000] [7952] [10396] [3] [3990] '1986-01-14'
 '68391610' [10000] [7952] [10396] [3] [3990] '1986-01-15'
 '68391610' [10000] [7952] [10396] [3] [3990] '1986-01-16'
 '68391610' [10000] [7952] [10396] [3] [3990] '1986-01-17'
 '68391610' [10000] [7952] [10396] [3] [3990] '1986-01-20'

 Columns 8 through 14

 [2.3750] [2.7500] [-2.5625] [ 1000] [ NaN] [NaN] [NaN]
 [2.3750] [2.6250] [-2.5000] [12800] [-0.0244] [NaN] [NaN]
 [2.3750] [2.6250] [-2.5000] [ 1400] [ 0] [NaN] [NaN]
 [2.3750] [2.6250] [-2.5000] [ 8500] [ 0] [NaN] [NaN]
 [2.5000] [2.7500] [-2.6250] [ 5450] [ 0.0500] [NaN] [NaN]
 [2.6250] [2.8750] [-2.7500] [ 2075] [ 0.0476] [NaN] [NaN]
 [2.7500] [ 3] [-2.8750] [22490] [ 0.0455] [NaN] [NaN]
 [2.8750] [3.1250] [ -3] [10900] [ 0.0435] [NaN] [NaN]
 [2.8750] [3.1250] [ -3] [ 8470] [ 0] [NaN] [NaN]
 [2.8750] [3.1250] [ -3] [ 1000] [ 0] [NaN] [NaN]

 Columns 15 through 20

 [3680] [1] [1] [NaN] [NaN] [ NaN]
 [3680] [1] [1] [NaN] [NaN] [-0.0244]
 [3680] [1] [1] [NaN] [NaN] [ 0]
 [3680] [1] [1] [NaN] [NaN] [ 0]
 [3680] [1] [1] [NaN] [NaN] [ 0.0500]
 [3680] [1] [1] [NaN] [NaN] [ 0.0476]
 [3680] [1] [1] [NaN] [NaN] [ 0.0455]
 [3680] [1] [1] [NaN] [NaN] [ 0.0435]
 [3680] [1] [1] [NaN] [NaN] [ 0]
 [3680] [1] [1] [NaN] [NaN] [ 0]

Further Reading: WRDS and MATLAB Documentation from the WRDS Team

Final Thoughts

I hope you find this new resource useful. While the SAS/SHARE service and all of the code and details are powerful and (hopefully) useful, they are also new! If you discover any problems, have questions, or think documentation could be clearer, please don’t hesitate to e-mail research-computing@wharton.upenn.edu.

 

With two decades of experience supporting research and more than a decade at The Wharton School, Hugh enjoys the challenges and rewards of working with world-class researchers doing Amazing Things with research computing. Robust and scalable computational solutions (both on premise and in The Cloud), custom research programming solutions (clever ideas, simple code), and holistic, results-focused approaches to projects are the places where Hugh lives these days. On weekends you're likely to find him running through the woods with a topo map and compass, orienteering.