Twitter research is something I thought I had overcome; something of the past, a relic of times and thoughts less well processed than those I like to think I have now. Apparently not.
What is unfortunate about social media (at least from my research perspective) is the amount of easy-to-access, contextually ‘fixed’ data, which lowers the number of variables required for analysis. What is fortunate is, well, exactly what I just said.
To manage what I’m told is a ridiculous corpus, I built a CouchDB. Last time, I took Tweets, reordered their JSON data so _id replaced the id field, then dumped everything except the user data into the database. This was a HUGE amount of data. And I mean huge. It took me three months to load 12.5% of all tweets from May 2010 into a database, and the views consumed hundreds of GB of disk. It was nasty.
This time, I stripped everything except the IDs (user + status), the targets (user + status), text, timezone (text form), and UTC offset (which is more important, but the timezone is pretty). I didn’t keep the retweet field, as it’s missing from some tweets and not within the scope of what I care about. The result is 531,159,273 tweets loaded in 3 days. The views will take days, if not weeks (if not MONTHS) to build, but I’m limiting what I need, and starting with stale views to get preliminary findings.
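The stripping step looks roughly like this. This is a minimal sketch, not my actual loader: the field names (`in_reply_to_status_id`, `user.time_zone`, and so on) are assumptions based on the Twitter v1 JSON layout.

```python
def strip_tweet(raw):
    """Reduce a full tweet dict to only the fields worth keeping."""
    user = raw.get("user", {})
    return {
        "_id": str(raw["id"]),                              # status ID doubles as the CouchDB doc ID
        "user": str(user.get("id", "")),                    # author's user ID
        "target_status": raw.get("in_reply_to_status_id"),  # target status, if any
        "target_user": raw.get("in_reply_to_user_id"),      # target user, if any
        "text": raw.get("text", ""),
        "time_zone": user.get("time_zone"),                 # text form: pretty, not canonical
        "utc_offset": user.get("utc_offset"),               # seconds from UTC: the useful one
    }

# Example with a made-up tweet; the retweet field is simply dropped.
raw = {
    "id": 14090452000,
    "text": "@someone morning!",
    "in_reply_to_status_id": 14090400000,
    "in_reply_to_user_id": 12345,
    "user": {"id": 6789, "time_zone": "Wellington", "utc_offset": 43200},
    "retweeted_status": {"id": 1},  # out of scope, and missing from some tweets anyway
}
doc = strip_tweet(raw)
```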
Why CouchDB? Indexes, JSON-to-JSON views, and the capability to not flip out on me when I do insane things.
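To give a flavour of what I mean by JSON-to-JSON: views live in a design document as JavaScript map/reduce pairs, and CouchDB maintains the index incrementally. The view name and layout below are illustrative, not my actual design doc; a count of tweets per UTC offset is the kind of thing I start with.

```python
import json

# Illustrative design document: one view keyed by UTC offset.
design_doc = {
    "_id": "_design/tweets",
    "language": "javascript",
    "views": {
        "by_utc_offset": {
            # emit one row per tweet, keyed by the author's UTC offset
            "map": "function(doc) { if (doc.utc_offset !== null) emit(doc.utc_offset, 1); }",
            # built-in reduce: sum the 1s into per-offset counts
            "reduce": "_sum",
        }
    },
}

payload = json.dumps(design_doc)  # PUT this to the database as _design/tweets
```

Querying with `?stale=ok` returns whatever has been indexed so far instead of triggering a rebuild, which is how the preliminary findings come out while the views grind away.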
The disk array is nothing flash: 4x2TB 5400rpm SATA-II disks running in RAID 10, attached over dual gigabit NICs (PCI-E) in a round-robin configuration. This is managed by a Core2Duo 4400 (2GHz per core) with 4GB of RAM. The box holding the disks does nothing but serve NFS to the box running CouchDB.
The box running CouchDB is where I really cheat and move away from “commodity”: a Sun Fire X4600 M2 (so 8x dual-core Opterons at 2.8GHz per core, with 32GB of RAM, and RAID1 10k rpm SAS drives). Since moving on from CouchDB 0.6, I haven’t come close to using more than 400% CPU (4 cores, for those less familiar with Linux), and the disk I/O is holding up; Python appears to be my slow point.
The real problem is that the insertions happen so quickly that the views freak out, and CouchDB goes down with RAM exhaustion if I try to update views while I insert … but other than that, it’s perfect.
As usual, my loading script is a multiprocessing Python hackjob. I’d move it to C/++, but by the time I’d rewritten a socket-based process-to-process linking system and a master+slave daemon configuration, the Python code would be done … and it’s not like I’m doing anything too far outside of what is already C modules for Python.