Dataset of Historical Tweets

As of November 2022 collection has been halted due to changes in the streaming API. Currently, access to this historical dataset is limited to Wharton and Annenberg affiliated research.

In partnership with The Annenberg School, we compiled a dataset of historical Tweets, which is now part of the iWRDS Data Catalog. Collection began in April 2012 and ended in November 2022. The total dataset represents about 1% of the total Twitter volume. Access to the historical dataset can be requested here (PennKey required). The database uses Amazon Web Services’s Athena offering and is queryable using SQL. Queries can be made using the AWS console or using an ODBC connection on the HPCC. See “Using Amazon Athena To Access Data” for detailed information about how to access Wharton-maintained datasets through Amazon Athena once you have been granted access.

Sample Queries

To search for the phrase `wharton school` (case insensitive) within the `extended_full_text` of all tweets between 09/02/2013 and 09/14/2013:

SELECT extended_full_text
FROM "whartweet"."tweets"
WHERE lower(extended_full_text) LIKE '%wharton school%'
        AND CONCAT(year, month, day)
    BETWEEN '20130902'
        AND '20130914'

returns:

 	extended_full_text
1	Wharton school of business at upenn women in business convention? Sign me up! #phillylife

To search for tweets by a list of `user_screen_name` you could do something like:

SELECT user_screen_name,extended_full_text
FROM "whartweet"."tweets"
WHERE user_screen_name in ('UPenn_MedEthics', 'LibrarianUPenn', 'FOCUS_UPenn', 'UPennGCB', 'Upenn_REES', 'CPHIatUPenn')
        AND CONCAT(year, month, day)
    BETWEEN '20180101'
        AND '20191019'

To retrieve the most recent Tweet in English in the database (as of July 29th, 2019):

SELECT created_at,
         extended_full_text,
         id,
         timestamp_ms
FROM "whartweet"."tweets"
WHERE year = cast(YEAR(current_date) AS VARCHAR)
        AND month = lpad(cast(month(now()) AS VARCHAR), 2, '0')
        AND day = lpad(cast(day(now()) AS VARCHAR), 2, '0')
        AND hour = lpad(cast(hour(now()) AS VARCHAR), 2, '0')
        AND lang = 'en'
ORDER BY  timestamp_ms DESC limit 1

You can learn more about lpad, dates and formatting here.

 	created_at	extended_full_text	id	timestamp_ms
1	Mon Jul 29 19:24:24 +0000 2019	@amerix @RobertAlai Nice info,researchers say you're to cut 100% tobacco from your intake, the others you should regulate since it's the amount of carcinogens taken in the body that slowly corrode the mucosal linings of surfaces such as throat, stomach,colon causing #Cancer to develop #JoyceLaboso	1155922038655242240	1564428264665

To chart the use of the term `tariff` since 1/20/2016, you could write a query like:

SELECT year,
         month,
         day,
         CONCAT(month, '/', day, '/', year) AS date, 
         count(*) count
FROM "whartweet"."tweets"
WHERE lower(extended_full_text) LIKE '%tariff%'
        AND CONCAT(year, month, day) >= '20170120'
GROUP BY  year, month, day
ORDER BY  year, month, day

SELECT year,
         month,
         day,
         CONCAT(month, '/', day, '/', year) AS date, 
         count(*) AS count
FROM "whartweet"."tweets"
WHERE lower(extended_full_text) LIKE '%wimbledon%'
        AND CONCAT(year, '-', month, '-', day) >= '2013-01-01'
GROUP BY  year,month,day
ORDER BY  year, month, day

You can select tweets between two date/times like:

SELECT year,
         month,
         day,
         hour,
         CONCAT(month, '/', day, '/', year, ' ' , hour, ':00') AS date, 
         count(*) AS count
FROM "whartweet"."tweets"
WHERE rt_id = 1160256105399967744
        AND CONCAT(year, month, day, hour)
    BETWEEN '2019081016'
        AND '2019081113'
GROUP BY  year, month, day, hour
ORDER BY  year, month, day, hour

Ordering Dates

Currently the `created_at` field is a string, so date ordering must be done on the `timestamp_ms` field. You can limit your search space using the following WHERE statement:

WHERE CONCAT(year, '-', month, '-', day) >= '2016-01-20'

You can convert a Unix Epoch timestamp to a human readable format and visa versa here.

Schema

All data is collected at tweet created_at time, and not all fields are available for all years. Fewer fields are available for tweets created before April 23, 2019.

created_at (string): A datetime string of the tweets posting formatted like `Mon Jul 15 17:08:32 +0000 2019`
timestamp_ms (string): A string representation of a bigint of the number of seconds since the Unix Epoch
id (bigint): A integer representation of the Tweet ID
id_str (string): A string representation of the Tweet ID
extended_full_text (string): The body of the full tweet text (280 characters)
user_id (string): A string representation of the Tweet’s user ID
user_name (string): A string representation of the Tweet’s user name
user_screen_name (string): A string representation of the Tweet’s user screen name
user_followers_count (int): An integer representation of the Tweet’s user followers count
user_location (string): A string representation of the Tweet’s user set location
user_created_at (string): A datetime string representation of the Tweet’s user account creation formatted like `Mon Jul 15 17:08:32 +0000 2019`
user_friends_count (int): An integer representation of the Tweet’s user friends count
user_statuses_count (int): An integer representation of the Tweet’s user total statuses
user_verified (boolean): Set to true if the Tweet’s user account has been verified
source (string): A string representation of the utility used to post the Tweet. (e.g. `<a href=”http://twitter.com/download/android” rel=”nofollow”>Twitter for Android</a>`)
truncated (boolean): Indicates whether the value of the text parameter was truncated
in_reply_to_status_id (bigint): If the represented Tweet is a reply, this field will contain the integer representation of the original Tweet’s ID
in_reply_to_status_id_str (string): If the represented Tweet is a reply, this field will contain the string representation of the original Tweet’s ID
in_reply_to_user_id_str (string): If the represented Tweet is a reply, this field will contain the string representation of the original Tweet’s user ID
in_reply_to_screen_name (string): If the represented Tweet is a reply, this field will contain the screen name of the original Tweet’s author
is_quote_status (boolean): Indicates whether this is a Quoted Tweet
lang (string): Tweet’s language using bcp47
possibly_sensitive (boolean): This field only surfaces when a Tweet contains a link. The meaning of the field doesn’t pertain to the Tweet content itself, but instead it is an indicator that the URL contained in the Tweet may contain content or media identified as sensitive content.
withheld_copyright (boolean):When present and set to “true”, it indicates that this piece of content has been withheld due to a DMCA complaint
place_name (string): Short human-readable representation of the place’s name
text (string): The possibly truncated text of the tweet. Best to use extended_full_text
long (float): The longitude of the Tweet’s location
lat (float): The latitude of the Tweet’s location
quoted_status_id_str (string): This field only surfaces when the Tweet is a quote Tweet. This is the string representation Tweet ID of the quoted Tweet
place_country_code (string): Shortened country code representing the country containing this place
rt_* Any column prefixed with `rt_` will contain information of the parent Tweet. Only present for Retweets.
year (string): A partition field for the Tweet’s created_at year.
month (string): A partition field for the Tweet’s created_at month.
day (string): A partition field for the Tweet’s created_at day.
hour (string): A partition field for the Tweet’s created_at hour.

Stats and Info

3,500,000 tweets collected daily
Over 10TB of total data and growing
Corpus language is not limited to English
~43% of total Tweets are in English
Stored in Parquet format for faster and more efficient queries
Data is partitioned by year/month/day/hour. When these are used in the `where` clause of a query, search space is greatly reduced.
Data is available to be queried (freshness) within ~300 seconds of Tweet publishing
Unicode/Emoji supports since mid 2018.
There is a nominal cost associated with each query and a payment solution must be worked out on a case by case basis.