Why did you choose *that* for persistence?

“We use LevelDB because it’s fast” (followed by incoherent rambling where “fast” is never actually defined)
 — #CowboyDeveloper
If you’re doing any kind of development, you’ll eventually run into a situation where you need to persist data. At this point, you’ll probably end up doing one of three things:
1. Files: Easy one, right? Just write your data to a file, and it’s there when you need it. That is, until you need to find stuff, which is when you discover
2. SQLite: Which gives you all sorts of fun query capabilities, and is great till you suddenly realize that you need to “scale”, or store “documents” (scare quotes because you usually don’t, but whatever). That’s when you discover
3. MongoDB: Ok, that’s a joke, and a bad joke at that. The thing is, you end up writing data to some sort of database.
For most people, that seems to be the point at which all rational cogitation ends. For whatever reason, the choice ends up being made based on “what people out there use”, and not “what the problem needs”. Tradeoffs are rarely, if ever, thought of (RUM conjecture? WTF is that?).
Mind you, what they should be doing is asking themselves questions about access methods and space usage, and making decisions based on the answers (•).
Mark Callaghan has a succinct writeup on the different types of index structures used in databases, and their relative tradeoffs, that can really help drive these decisions. To summarize
  • b-tree: Efficient reads and costly writes; you can end up writing a page for each modified row. Space efficiency is “meh”, neither good nor bad.
    (InnoDB (MariaDB/MySQL), Postgres)
  • Log Structured Merge: Efficient writes at the price of more costly reads. Space efficiency is great for leveled compaction, not so much for tiered.
    (Cassandra, LevelDB, RocksDB)
  • index+log: Efficient writes without necessarily sacrificing reads, as long as the entire index is in RAM; otherwise garbage collection can start causing chaos. Space efficiency is in the “depends” category: the more efficient the writes, the more space you need (because GC gets slower, causing the log to get larger).
    (BitCask, ForestDB)
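To make the index+log tradeoff concrete, here is a minimal sketch in Python. It is a toy, not how BitCask or ForestDB actually lay out their data: the class name `TinyLogStore`, its methods, and the `demo` helper are all made up for illustration, and it assumes a single process with the whole index held in a plain dict. It does show the shape of the thing, though: a write is one append, a read is one lookup plus one seek, and overwritten values pile up in the log until garbage collection (`compact`) rewrites it.

```python
import os
import tempfile


class TinyLogStore:
    """Toy index+log store: an append-only data file plus an in-RAM dict index."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> (offset, length) of the latest value in the log
        # "a+b" creates the file if missing; every write lands at the end.
        self.log = open(path, "a+b")

    def put(self, key, value):
        # A write is a single sequential append, which is why writes are cheap.
        self.log.seek(0, os.SEEK_END)
        offset = self.log.tell()
        self.log.write(value)
        self.log.flush()
        self.index[key] = (offset, len(value))

    def get(self, key):
        # A read is one dict lookup plus one seek, as long as the index fits in RAM.
        offset, length = self.index[key]
        self.log.seek(offset)
        return self.log.read(length)

    def log_size(self):
        self.log.seek(0, os.SEEK_END)
        return self.log.tell()

    def compact(self):
        # Garbage collection: keep only the live values and rewrite the log.
        live = {key: self.get(key) for key in self.index}
        self.log.close()
        os.remove(self.path)
        self.log = open(self.path, "a+b")
        self.index = {}
        for key, value in live.items():
            self.put(key, value)

    def close(self):
        self.log.close()


def demo():
    path = os.path.join(tempfile.mkdtemp(), "data.log")
    store = TinyLogStore(path)
    for i in range(100):
        store.put("counter", b"%04d" % i)  # overwrite the same key 100 times
    before = store.log_size()  # 100 stale-or-live 4-byte values: 400 bytes
    store.compact()
    after = store.log_size()   # only the live value remains: 4 bytes
    value = store.get("counter")
    store.close()
    return before, after, value


if __name__ == "__main__":
    print(demo())  # (400, 4, b'0099')
```

One key, one hundred writes, and the log is 100x larger than the live data until compaction runs. That gap is the space you pay for cheap writes, and it only grows if GC can't keep up.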
So yeah, don’t just pick LevelDB “because Google uses it”, use it because it’s the right choice.
(And in most other cases, just use Postgres)
(•) Note that this is just a start. There is a whole other bunch of questions around ease-of-use, licensing, APIs, costs, and so forth that you’ll need to get to…
