Dataset of Historical Tweets

Currently limited to Wharton and Annenberg affiliated research.

graph
Daily volume (as of 7/28/19)

In partnership with The Annenberg School, we compiled a dataset of historical Tweets. Collection began in April 2012 and is ongoing. The total dataset represents about 1% of the total Twitter volume. Please email us to request access. The database uses Amazon Web Services’ Athena offering and is queryable using SQL. Queries can be made using the AWS console or using an ODBC connection on the HPCC.

Sample Queries

To search for the phrase wharton school (case insensitive) within the extended_full_text of all tweets between 09/02/2013 and 09/14/2013:

returns:

To search for tweets by a list of user_screen_name you could do something like:

To retrieve the most recent Tweet in English in the database  (as of July 29th, 2019):

You can learn more about lpad, dates and formatting here and here.

To chart the use of the term tariff since  1/20/2016, you could write a query like:

chart

tweet-chart

You can select tweets between two date/times like:

chart

Ordering Dates

Currently the created_at field is a string, so date ordering must be done on the timestamp_ms field. You can limit your search space using the following WHERE statement:

You can convert a Unix Epoch timestamp to a human readable format and visa versa here.

Schema

You can find Twitter’s Tweet object documentation here. Not all fields are available for all years. (All data is collected at tweet created_at time)

  1. created_at  (string): A datetime string of the tweets posting formatted like Mon Jul 15 17:08:32 +0000 2019
  2. timestamp_ms (string): A string representation of a bigint of the number of seconds since the Unix Epoch
  3. id (bigint): A integer representation of the Tweet ID.
  4. id_str (string): A string representation of the Tweet ID.
  5. extended_full_text (string): The body of the full tweet text (280 characters)
  6. user_id (string): A string representation of the Tweet’s user ID
  7. user_name (string): A string representation or the Tweet’s user name.
  8. user_screen_name (string): A string representation of the Tweet’s user screen name.
  9. user_followers_count (int): An integer representation of the Tweet’s user followers count.
  10. user_location (string): A string representation of the Tweet’s user set location.
  11. user_created_at (string): A datetime string representation of the Tweet’s user account creation formatted like Mon Jul 15 17:08:32 +0000 2019
  12. user_friends_count (int): An integer representation of the Tweet’s user friends count
  13. user_statuses_count (int): An integer representation of the Tweet’s user total statuses
  14. user_verified (boolean): Set to true if the Tweet’s user account has been verified
  15. source (string): A string representation of the utility used to post the Tweet. (e.g. <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>)
  16. truncated (boolean): Indicates whether the value of the text parameter was truncated
  17. in_reply_to_status_id  (bigint): If the represented Tweet is a reply, this field will contain the integer representation of the original Tweet’s ID
  18. in_reply_to_status_id  (String): If the represented Tweet is a reply, this field will contain the integer representation of the original Tweet’s ID
  19. in_reply_to_screen_name  (String): If the represented Tweet is a reply, this field will contain the screen name of the original Tweet’s author
  20. is_quote_status  (boolean): Indicates whether this is a Quoted Tweet
  21. lang  (string): Tweet’s language using bcp47
  22. possibly_sensitive  (boolean): This field only surfaces when a Tweet contains a link. The meaning of the field doesn’t pertain to the Tweet content itself, but instead it is an indicator that the URL contained in the Tweet may contain content or media identified as sensitive content.
  23. withheld_copyright (boolean):When present and set to “true”, it indicates that this piece of content has been withheld due to a DMCA complaint
  24. place_name  (string): Short human-readable representation of the place’s name
  25. text (string): The possibly truncated text of the tweet. Best to use extended_full_text
  26. long (float): The longitude of the Tweet’s location
  27. lat (float): The  latitude of the Tweet’s location
  28. quoted_status_id_str  (string): This field only surfaces when the Tweet is a quote Tweet. This is the string representation Tweet ID of the quoted Tweet
  29. place_country_code  (string): Shortened country code representing the country containing this place
  30. rt_* Any column prefixed with rt_ will contain information of the parent Tweet. Only present for Retweets.
  31. year  (string): A partition field for the Tweet’s created_at year.
  32. month  (string): A partition field for the Tweet’s created_at month.
  33. day  (string): A partition field for the Tweet’s created_at day.
  34. hour  (string): A partition field for the Tweet’s created_at hour.

Stats and Info

  • 3,500,000 tweets collected daily
  • Over 8TB of total data and growing
  • Corpus language is not limited to English
  • ~43% of total Tweets are in English
  • Stored in Parquet format for faster and more efficient queries
  • Data is partitioned by year/month/day/hour. When these are used in the where clause of a query, search space is greatly reduced.
  • Data is available to be queried (freshness) within ~300 seconds of Tweet publishing
  • Unicode/Emoji supports since mid 2018.
  • There is a nominal cost associated with each query and a payment solution must be worked out on a case by case basis.