Should you be using Arrow?

Apache Arrow is many things, but the really important bit is that it specifies a language-independent columnar memory format for data, set up for flat or hierarchical data, and has a really nifty set of accompanying libraries for zero-copy streaming, IPC, analytics, and the like. The question then is, why would you even begin to worry about storage formats? Well, you probably don’t, or won’t, until you bump up against the reality of large-scale data processing and of moving this data across multiple databases, which is when life starts getting hairy. The thing is, when you actually look at how databases store data, they keep all sorts of additional stuff alongside the data itself. Fun stuff like references to log records (your recovery manager needs to be able to process the logs, y’know?), MVCC timestamps, index information, and lord knows what else. And given that different databases have different mechanisms for doing logging, concurrency, indexes, a...
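
To make "columnar" and "zero-copy" a bit more concrete, here is a minimal sketch using only the Python standard library. It is not Arrow's API and not Arrow's actual memory layout; the `array` module and `memoryview` are stand-ins chosen for illustration, and the names `rows`, `ids`, and `scores` are made up for the example.

```python
from array import array

# Row-oriented: each record keeps its fields together.
rows = [(1, 9.5), (2, 7.25), (3, 8.0)]

# Column-oriented: each field lives in its own contiguous, typed buffer.
ids = array("q", (r[0] for r in rows))     # signed 64-bit integers
scores = array("d", (r[1] for r in rows))  # 64-bit floats

# A memoryview exposes the column's buffer without copying it,
# and slicing a memoryview does not copy either.
view = memoryview(scores)
tail = view[1:]

# Writing through the original column is visible through the slice,
# which shows that both names refer to the same underlying memory.
scores[2] = 10.0

# Scanning one column touches only that column's bytes.
total = sum(scores)
```

The point of the sketch is the shape of the idea: a column is a flat, typed buffer with no per-row bookkeeping mixed in, so another consumer can read it in place instead of converting it first.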