Should you be using Arrow?

Apache Arrow is many things, but the really important bit is that it specifies a columnar memory format for data that is language-independent, set up for flat or hierarchical data, and has a really nifty set of accompanying libraries for zero-copy streaming, IPC, analytics, and the like.
The question then is, why would you even begin to worry about storage formats? Well, you probably don’t, or won’t, until you bump up against the reality of large-scale data processing, and moving data across multiple databases, which is when life starts getting hairy.
The thing is, when you actually look at how databases store data, they have all sorts of additional stuff stored alongside the data. Fun stuff like references to log records (your recovery manager needs to be able to process the logs, y’know?), MVCC timestamps, index information, and lord knows what else.
And given that different databases have different mechanisms for doing logging, concurrency, indexes, and whatnot, moving data means translating across all these domains. 
(And yeah, the way you do this is to serialize / deserialize across some intermediary format — a necessary, and unavoidable PITA)
The good news here is that the situation is a-changing, thanks to a couple of key developments in the field. As Daniel Abadi puts it:
First, people stopped believing that one size fits all for database systems, and different systems started being used for different workloads. Most notably, systems that specialized in data analysis …tend to either be read-only or read-mostly systems, and therefore generally have far simpler concurrency control and recovery logic.
Second, as the price of memory has rapidly declined, a larger percentage of database applications fit entirely in main memory [and] simpler recovery and buffer manager logic…
Finally, … open source database systems [have highly] modular design and clean interfaces between the components of the system.
(I’d also add the advent of NVRAM, which is changing the nature of logging.)
And this gets us right back to Arrow. Engines that use Arrow for main-memory data processing can avoid the entire serialization/deserialization cycle mentioned above, and reap the benefits in both performance and memory efficiency.
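To make the contrast concrete, here’s a toy sketch in plain Python (stdlib only — this is illustrative, not Arrow’s actual API): a copy-based hand-off that serializes and re-parses the data, versus a zero-copy hand-off that just reinterprets the same contiguous bytes in place.

```python
import array
import pickle

# A column of int64 values in one contiguous buffer (stdlib stand-ins for an
# Arrow buffer; a toy illustration, not Arrow's actual API).
col = array.array("q", [10, 20, 30, 40])
buf = col.tobytes()                    # the raw contiguous bytes

# Copy-based hand-off: serialize, ship, deserialize into freshly built objects.
wire = pickle.dumps(col)
copied = pickle.loads(wire)            # a new allocation plus a parse step

# Zero-copy hand-off: the consumer reinterprets the very same bytes in place.
view = memoryview(buf).cast("q")       # no copy, no re-parse
assert view[2] == 30
```

Because the in-memory layout *is* the interchange layout, the second hand-off has no translation step at all — which is exactly the trick Arrow pulls between engines.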
In a recent post, “An analysis of the strengths and weaknesses of Apache Arrow”, Daniel Abadi explores the nature of these benefits, and in particular the impact of Arrow’s storage-format choices on them. Three items of note:
  1. Columnar Storage: Being a columnar store, Arrow stores data for each entity by attribute. In effect, it stores the first attribute for all entities contiguously, then the second attribute, and so on.
    The result is that if you want to process all the attributes of a single entity, you have to jump all over the place, making OLTP-style workloads a pain. On the other hand, if you’re doing the kind of analytics where you process one attribute across all entities (“find the average height”, etc.), well, this is exactly what you need!
    (On a side note, this makes data compression, SIMD, etc. much, much easier too!)
  2. Fixed-Width Data and Nulls: Null elements in Arrow arrays take up as much space as non-nulls. This allows for finding the n-th element of an array by simply multiplying the fixed-width size of the element by n. The downside, of course, is that the array is now larger (all those nulls!).
    It should be mentioned that there is a separate “null bitmap” stored alongside each array, indicating which elements of the array are null.
  3. Variable-Width Data and Separators: Well, Arrow doesn’t use separators; it just slams all the data in one after the other. A separate integer array alongside this data stores the offset of the first byte of each element.
    The upside is more generality. The downside is that, depending on the access pattern, you might need to hit the integer array over and over again to work on the data.
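The two layouts in items 2 and 3 can be sketched in a few lines of plain Python (a toy illustration of the idea, not pyarrow’s API or Arrow’s real buffer encoding):

```python
# 1) Fixed-width with nulls: every slot takes full width; a bitmap marks validity.
values = [7, 0, 42]          # slot 1 is null, but still occupies a full slot
validity = 0b101             # bit i set => element i is non-null

def get_int(i):
    # O(1) lookup: the slot is found by direct indexing (offset = i * width);
    # the bitmap is consulted to decide whether the slot is meaningful.
    return values[i] if (validity >> i) & 1 else None

assert get_int(1) is None
assert get_int(2) == 42

# 2) Variable-width: no separators, just concatenated bytes plus an offsets array.
data = b"foobazarrow"
offsets = [0, 3, 6, 11]      # element i lives at data[offsets[i]:offsets[i+1]]

def get_str(i):
    return data[offsets[i]:offsets[i + 1]].decode()

assert get_str(1) == "baz"
assert get_str(2) == "arrow"
```

Note how the variable-width lookup touches the offsets array on every access — that’s the “hit the integer array over and over” cost mentioned above, traded for a dense, separator-free data buffer.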
The point here is that Arrow can be awesome, but it really, really depends on the type of processing that you are doing. Go read the whole thing to get a better picture…
