In my last semester as a student, I had the chance to work for an awesome company (Acquia) on a very interesting project. It all started with a post over at Dries Buytaert's blog. He is the CTO and co-founder of Acquia (and Mollom), the inventor of Drupal, an open-source celebrity, and he also knows a thing or two about FPGA-aware garbage collection :).
I had a great time with a lot of brilliant people and learned tons about data structures, scalability, NoSQL, Ruby, and asynchronous I/O.
This thesis documents my experiences trying to handle over 100 million sets of data while keeping them searchable, all while collecting and analyzing about 100 new domains per second. It covers topics ranging from the different Ruby VMs (JRuby, Rubinius, YARV, MRI) to different storage backends (Riak, Cassandra, MongoDB, Redis, CouchDB, Tokyo Cabinet, MySQL, Postgres, ...) and the data structures they use behind the scenes.
Long story short, here we go: [PDF]
P.S. Acquia is hiring ;)