I’ve just read a mildly interesting article, “The Twitpocalypse is Near: Will Your Twitter Client Survive?”, which explains that any Twitter client using a signed 32-bit integer to store Twitter status_ids is going to break in the next few days, when the number of tweets surpasses the maximum possible value of a signed 32-bit integer: 2,147,483,647.
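To make that limit concrete, here is a minimal Python sketch of what happens when a status_id one past the limit is squeezed into a signed 32-bit field. The helper name `store_as_int32` is made up for illustration, and the wrap-around shown is how C-style 32-bit storage behaves; a real client might instead throw an error or truncate differently.

```python
import ctypes

# Largest value a signed 32-bit integer can hold.
INT32_MAX = 2**31 - 1


def store_as_int32(status_id: int) -> int:
    """Mimic a (hypothetical) client that keeps status_ids in a signed 32-bit field."""
    # ctypes does no overflow checking, so out-of-range values wrap around.
    return ctypes.c_int32(status_id).value


print(INT32_MAX)                      # 2147483647
print(store_as_int32(INT32_MAX))      # 2147483647
print(store_as_int32(INT32_MAX + 1))  # -2147483648
```

The first tweet past the limit comes back as a large negative number, which is the sort of thing that breaks sorting and lookups in a client.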
That got me thinking about how much space Twitter must be using to store all of those status updates. If we disregard storage of who posted a tweet and where it was posted from, then we can assume (that’s the first of many assumptions in this blog post) that Twitter’s table of tweets looks something like this:
CREATE TABLE statuses (
    status_id BIGINT,
    status    VARCHAR(140),
    post_dt   TIMESTAMP
);
According to the MySQL documentation, BIGINTs take up 8 bytes and TIMESTAMPs take up 4 bytes. The space used by a VARCHAR depends on both the length of the value being stored and the character set being used. If we assume the character set is UTF-8, and also assume that each character takes up 2 bytes (not a safe assumption, since UTF-8 characters can occupy anywhere from 1 to 4 bytes), then the maximum space used by a particular status will be 281 bytes (140*2, plus 1 byte to record the length). Strictly speaking, MySQL uses a 2-byte length prefix once a column can hold more than 255 bytes, but a single byte makes no real difference to numbers this rough. Hence, the maximum possible length of a record in this table is 293 bytes.
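The row-size arithmetic above can be sketched in a few lines of Python. The constant names and the `row_bytes` helper are invented for illustration, and the 2-bytes-per-character and 1-byte-length-prefix figures are this post’s simplifying assumptions rather than what MySQL would necessarily use.

```python
BIGINT_BYTES = 8          # status_id
TIMESTAMP_BYTES = 4       # post_dt
BYTES_PER_CHAR = 2        # assumption: every character takes 2 bytes
LENGTH_PREFIX_BYTES = 1   # assumption: 1 byte to record the VARCHAR length


def row_bytes(tweet_chars: int) -> int:
    """Estimated bytes for one row holding a tweet of the given length."""
    status_bytes = tweet_chars * BYTES_PER_CHAR + LENGTH_PREFIX_BYTES
    return BIGINT_BYTES + status_bytes + TIMESTAMP_BYTES


print(row_bytes(140))  # 293
```

Changing `BYTES_PER_CHAR` or `LENGTH_PREFIX_BYTES` is an easy way to see how sensitive the estimate is to those guesses.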
Of course, very few tweets are 140 characters in length, so we need to know what the average tweet length is. I’ve examined the underlying data for my Tweetpoll application* and roughly calculated that the average tweet length is about 92 characters. Thus, the average length of a tweet’s row in the table can be calculated as:
| status_id | + | status | + | post_dt |
| 8 | + | (92 * 2) + 1 | + | 4 |
That’s 197 bytes in total per row. So, when Twitter hits that maximum possible value of a 32-bit signed integer sometime in the next 7 days, I estimate that this table is going to be occupying 197 * 2,147,483,648 = 423,054,278,656 bytes (rounding the row count up to a neat 2^31). Or, in numbers we can understand, 394GB.
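For anyone who wants to check the sums, here is the same estimate as a short Python sketch. The variable names are made up for illustration, the row count is rounded up to 2^31, and “GB” is taken to mean 2^30 bytes, which is what makes the figure come out at 394.

```python
# Average row: 8-byte id + (92 chars * 2 bytes + 1 length byte) + 4-byte timestamp.
AVG_ROW_BYTES = 8 + (92 * 2 + 1) + 4
ROWS = 2**31  # row count rounded up to 2^31

total_bytes = AVG_ROW_BYTES * ROWS

print(AVG_ROW_BYTES)        # 197
print(total_bytes)          # 423054278656
print(total_bytes / 2**30)  # 394.0
```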
By the way, don’t fret that Twitter itself is going to break any time soon; they use 64-bit unsigned integers to store status_ids, so the Twitter service itself is going to be OK for a while. Storing status_ids in 64-bit unsigned integers means the theoretical maximum number of tweets is 18,446,744,073,709,551,615 or, as ProgrammableWeb point out, 2.7 billion tweets for every person on the planet. When that limit is (theoretically) reached in the year [fill in arbitrarily chosen year here], Twitter are going to need something in the region of 3.2 million petabytes of disk space to store them all (i.e. 3,634,008,582,520,781,668,155 bytes). To put that into perspective, that’s about the same as 68 billion Blu-ray discs completely filled up with tweets.
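The 64-bit figures can be reproduced with the rough Python sketch below. Three things in it are assumptions rather than anything stated above: a petabyte is taken as 2^50 bytes, a Blu-ray disc as a 50GB dual-layer disc (50 * 2^30 bytes), and the world population as roughly 6.8 billion.

```python
UINT64_MAX = 2**64 - 1   # largest unsigned 64-bit integer
AVG_ROW_BYTES = 197      # average row size estimated earlier in the post

total_bytes = UINT64_MAX * AVG_ROW_BYTES
pebibytes = total_bytes / 2**50

BLU_RAY_BYTES = 50 * 2**30     # assumption: 50GB dual-layer disc
WORLD_POPULATION = 6.8e9       # assumption: rough head count

discs = total_bytes / BLU_RAY_BYTES
tweets_per_person = UINT64_MAX / WORLD_POPULATION

print(total_bytes)                                        # 3634008582520781668155
print(f"{pebibytes / 1e6:.1f} million petabytes")         # 3.2 million petabytes
print(f"{discs / 1e9:.0f} billion discs")                 # 68 billion discs
print(f"{tweets_per_person / 1e9:.1f} billion per head")  # 2.7 billion per head
```

Swapping in a 25GB single-layer disc would roughly double the disc count, so the Blu-ray figure in particular depends heavily on which disc you pick.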
Now, back to some real work….
-Jamie
Disclaimer: I’m sure I don’t need to point out that this blog post is purely based on assumptions and my, usually rather hopeless, mathematical abilities so the numbers are completely bogus and anyone else attempting this calculation would probably come up with totally different ones!
*At the time of writing, Tweetpoll is viewable because I don’t have to pay for it. If it is not available by the time you come to read this, then it’s because Windows Azure has reached general availability, in which case you can read more about Tweetpoll at “Tweetpoll – My first Windows Azure application is live”.