Dataset of Historical Tweets

As of November 2022 collection has been halted due to changes in the streaming API. Currently, access to this historical dataset is limited to Wharton and Annenberg affiliated research.

graph
Daily volume (as of 7/28/19)

In partnership with The Annenberg School, we compiled a dataset of historical Tweets, which is now part of the iWRDS Data Catalog. Collection began in April 2012 and ended in November 2022. The total dataset represents about 1% of the total Twitter volume. Access to the historical dataset can be requested here (PennKey required). The database uses Amazon Web Services’s Athena offering and is queryable using SQL. Queries can be made using the AWS console or using an ODBC connection on the HPCC. See “Using Amazon Athena To Access Data” for detailed information about how to access Wharton-maintained datasets through Amazon Athena once you have been granted access.

Sample Queries

To search for the phrase `wharton school` (case insensitive) within the `extended_full_text` of all tweets between 09/02/2013 and 09/14/2013:

SELECT extended_full_text
FROM "whartweet"."tweets"
WHERE lower(extended_full_text) LIKE '%wharton school%'
        AND CONCAT(year, month, day)
    BETWEEN '20130902'
        AND '20130914'

returns:

 	extended_full_text
1	Wharton school of business at upenn women in business convention? Sign me up! #phillylife

To search for tweets by a list of `user_screen_name` you could do something like:

SELECT user_screen_name,extended_full_text
FROM "whartweet"."tweets"
WHERE user_screen_name in ('UPenn_MedEthics', 'LibrarianUPenn', 'FOCUS_UPenn', 'UPennGCB', 'Upenn_REES', 'CPHIatUPenn')
        AND CONCAT(year, month, day)
    BETWEEN '20180101'
        AND '20191019'

To retrieve the most recent Tweet in English in the database  (as of July 29th, 2019):

SELECT created_at,
         extended_full_text,
         id,
         timestamp_ms
FROM "whartweet"."tweets"
WHERE year = cast(YEAR(current_date) AS VARCHAR)
        AND month = lpad(cast(month(now()) AS VARCHAR), 2, '0')
        AND day = lpad(cast(day(now()) AS VARCHAR), 2, '0')
        AND hour = lpad(cast(hour(now()) AS VARCHAR), 2, '0')
        AND lang = 'en'
ORDER BY  timestamp_ms DESC limit 1

You can learn more about lpad, dates and formatting here.

 	created_at	extended_full_text	id	timestamp_ms
1	Mon Jul 29 19:24:24 +0000 2019	@amerix @RobertAlai Nice info,researchers say you're to cut 100% tobacco from your intake, the others you should regulate since it's the amount of carcinogens taken in the body that slowly corrode the mucosal linings of surfaces such as throat, stomach,colon causing #Cancer to develop #JoyceLaboso	1155922038655242240	1564428264665

To chart the use of the term `tariff` since  1/20/2016, you could write a query like:

SELECT year,
         month,
         day,
         CONCAT(month, '/', day, '/', year) AS date, 
         count(*) count
FROM "whartweet"."tweets"
WHERE lower(extended_full_text) LIKE '%tariff%'
        AND CONCAT(year, month, day) >= '20170120'
GROUP BY  year, month, day
ORDER BY  year, month, day

chart

SELECT year,
         month,
         day,
         CONCAT(month, '/', day, '/', year) AS date, 
         count(*) AS count
FROM "whartweet"."tweets"
WHERE lower(extended_full_text) LIKE '%wimbledon%'
        AND CONCAT(year, '-', month, '-', day) >= '2013-01-01'
GROUP BY  year,month,day
ORDER BY  year, month, day

tweet-chart

You can select tweets between two date/times like:

SELECT year,
         month,
         day,
         hour,
         CONCAT(month, '/', day, '/', year, ' ' , hour, ':00') AS date, 
         count(*) AS count
FROM "whartweet"."tweets"
WHERE rt_id = 1160256105399967744
        AND CONCAT(year, month, day, hour)
    BETWEEN '2019081016'
        AND '2019081113'
GROUP BY  year, month, day, hour
ORDER BY  year, month, day, hour

chart

Ordering Dates

Currently the `created_at` field is a string, so date ordering must be done on the `timestamp_ms` field. You can limit your search space using the following WHERE statement:

WHERE CONCAT(year, '-', month, '-', day) >= '2016-01-20'

You can convert a Unix Epoch timestamp to a human readable format and visa versa here.

Schema

All data is collected at tweet created_at time, and not all fields are available for all years. Fewer fields are available for tweets created before April 23, 2019.

  1. created_at  (string): A datetime string of the tweets posting formatted like `Mon Jul 15 17:08:32 +0000 2019`
  2. timestamp_ms (string): A string representation of a bigint of the number of seconds since the Unix Epoch
  3. id (bigint): A integer representation of the Tweet ID
  4. id_str (string): A string representation of the Tweet ID
  5. extended_full_text (string): The body of the full tweet text (280 characters)
  6. user_id (string): A string representation of the Tweet’s user ID
  7. user_name (string): A string representation of the Tweet’s user name
  8. user_screen_name (string): A string representation of the Tweet’s user screen name
  9. user_followers_count (int): An integer representation of the Tweet’s user followers count
  10. user_location (string): A string representation of the Tweet’s user set location
  11. user_created_at (string): A datetime string representation of the Tweet’s user account creation formatted like `Mon Jul 15 17:08:32 +0000 2019`
  12. user_friends_count (int): An integer representation of the Tweet’s user friends count
  13. user_statuses_count (int): An integer representation of the Tweet’s user total statuses
  14. user_verified (boolean): Set to true if the Tweet’s user account has been verified
  15. source (string): A string representation of the utility used to post the Tweet. (e.g. `<a href=”http://twitter.com/download/android” rel=”nofollow”>Twitter for Android</a>`)
  16. truncated (boolean): Indicates whether the value of the text parameter was truncated
  17. in_reply_to_status_id (bigint): If the represented Tweet is a reply, this field will contain the integer representation of the original Tweet’s ID
  18. in_reply_to_status_id_str (string): If the represented Tweet is a reply, this field will contain the string representation of the original Tweet’s ID
  19. in_reply_to_user_id_str (string): If the represented Tweet is a reply, this field will contain the string representation of the original Tweet’s user ID
  20. in_reply_to_screen_name (string): If the represented Tweet is a reply, this field will contain the screen name of the original Tweet’s author
  21. is_quote_status (boolean): Indicates whether this is a Quoted Tweet
  22. lang (string): Tweet’s language using bcp47
  23. possibly_sensitive  (boolean): This field only surfaces when a Tweet contains a link. The meaning of the field doesn’t pertain to the Tweet content itself, but instead it is an indicator that the URL contained in the Tweet may contain content or media identified as sensitive content.
  24. withheld_copyright (boolean):When present and set to “true”, it indicates that this piece of content has been withheld due to a DMCA complaint
  25. place_name  (string): Short human-readable representation of the place’s name
  26. text (string): The possibly truncated text of the tweet. Best to use extended_full_text
  27. long (float): The longitude of the Tweet’s location
  28. lat (float): The  latitude of the Tweet’s location
  29. quoted_status_id_str  (string): This field only surfaces when the Tweet is a quote Tweet. This is the string representation Tweet ID of the quoted Tweet
  30. place_country_code  (string): Shortened country code representing the country containing this place
  31. rt_* Any column prefixed with `rt_` will contain information of the parent Tweet. Only present for Retweets.
  32. year  (string): A partition field for the Tweet’s created_at year.
  33. month  (string): A partition field for the Tweet’s created_at month.
  34. day  (string): A partition field for the Tweet’s created_at day.
  35. hour  (string): A partition field for the Tweet’s created_at hour.

Stats and Info

  • 3,500,000 tweets collected daily
  • Over 10TB of total data and growing
  • Corpus language is not limited to English
  • ~43% of total Tweets are in English
  • Stored in Parquet format for faster and more efficient queries
  • Data is partitioned by year/month/day/hour. When these are used in the `where` clause of a query, search space is greatly reduced.
  • Data is available to be queried (freshness) within ~300 seconds of Tweet publishing
  • Unicode/Emoji supports since mid 2018.
  • There is a nominal cost associated with each query and a payment solution must be worked out on a case by case basis.