Posts

Showing posts from December, 2017

Dec 31, 1995 - The last Calvin & Hobbes...


Distributed Systems, and Coding for Failure

The world of Distributed Systems is as much about philosophy as it is about implementation: the specific context, architecture, tooling, and developers all come together to make your way different from everybody else’s. And yet, there are overarching themes and patterns that hold true, that you ignore at your own peril. The most important of these is the necessity to architect for failure; everything springs from this. (°) Complexity grows geometrically with the number of components, and no amount of fakes, strict contracts, mocks, and the like can reduce this. If anything, poor practices will only increase this complexity. The key is to accept this, and align the risk associated with a given service/component with the risk the business is willing to take. In short, “make the component reliable, but no more reliable than necessary”. As a developer, what this means is that you must
• Understand the operational semantics of the system as a whole
• Internalize t

Lossless Data Compression, and… #DeepLearning?

Way back, at the dawn of information theory, Claude Shannon showed that there was a lower bound on the size to which you could (losslessly) compress data. Since then, there have been any number of “encoding” systems dreamt up to hit this bound. These systems all (roughly!!) boil down to
° Find patterns in the data
° Associate a symbol with these patterns
° Transmit the symbol instead of the pattern
The trick is in identifying patterns. For example, if you and I have the same book, I can just say “page 37, line 6” to specify any line, but, if I don’t know which book you have… Enter #MachineLearning, Recurrent Neural Networks specifically, which are particularly well suited to identifying long-term dependencies in the data. They are also capable of dealing with the “complexity explosion” as the number of symbols increases. In this paper (https://goo.gl/5SPwJB) Kedar Tatwawadi works through this approach, to excellent results, usually beating the pants off of “old school” arit
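To get a feel for Shannon’s bound, here is a minimal sketch that estimates the per-byte (order-0) entropy of a message and compares the implied size with what zlib actually produces. The sample message and the function names are made up for illustration. Note that the order-0 figure only bounds coders that treat every byte independently; anything that spots longer patterns (which is the whole point of the above) can go well below it.

```python
import math
import zlib
from collections import Counter


def entropy_bits_per_symbol(data: bytes) -> float:
    """Order-0 Shannon entropy: H = -sum(p * log2(p)) over byte frequencies."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def order0_bound_bytes(data: bytes) -> float:
    """Smallest size (in bytes) for a coder that models each byte independently."""
    return entropy_bits_per_symbol(data) * len(data) / 8


if __name__ == "__main__":
    # Hypothetical, highly repetitive message.
    message = b"page 37, line 6; " * 200
    print(f"original      : {len(message)} bytes")
    print(f"order-0 bound : {order0_bound_bytes(message):.0f} bytes")
    print(f"zlib          : {len(zlib.compress(message, 9))} bytes")
```

On repetitive input like this, zlib should land far under the order-0 figure, precisely because it finds the repeated pattern rather than looking at bytes one at a time.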

Everything you wanted to know about Load-Balancing, but were afraid to ask (°)

Herewith an excellent overview of the field, as well as a bit of a deep dive to get you acquainted with what’s going on these days. The future (and present!) is getting way fascinating, with L4 load-balancers moving towards distributed consistent hashing solutions, and L7 ones becoming increasingly popular thanks to the growth of microservices, sidecars, and service-meshes.
“Global load balancing and a split between the control plane and the data plane is the future of load balancing and where the majority of future innovation and commercial opportunities will be found.” - https://goo.gl/7m9sAu
The above is particularly relevant given that the industry seems to have (finally!) embraced Observability, and in particular, the robust and protocol/system-specific Observability needed by distributed systems.
(°) Or, frankly, didn’t know it even existed in the first place. Especially if you haven’t been paying attention for the last year or two.
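If “consistent hashing” is new to you, the sketch below shows the basic idea: backends and keys are hashed onto the same ring, and a key goes to the first backend found clockwise from it. This is a generic textbook version, not what any particular L4 load-balancer ships; the class name, the backend names, and the choice of SHA-256 with 100 virtual nodes are all assumptions for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative, not production code)."""

    def __init__(self, backends, vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (hash, backend) points on the ring
        for backend in backends:
            self.add(backend)

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def add(self, backend: str) -> None:
        # Each backend gets several points so load spreads more evenly.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{backend}#{i}"), backend))

    def remove(self, backend: str) -> None:
        self._ring = [point for point in self._ring if point[1] != backend]

    def lookup(self, key: str) -> str:
        """Return the first backend clockwise from the key's position."""
        if not self._ring:
            raise LookupError("no backends in the ring")
        idx = bisect.bisect_left(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


if __name__ == "__main__":
    ring = HashRing(["lb-a", "lb-b", "lb-c"])  # hypothetical backend names
    print(ring.lookup("client-42"))
```

The property that matters: when a backend goes away, only the keys that were mapped to it move; everything else stays put, which is what you want when connections are long-lived.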

Seems about correct to me...


Mortgage Risk, and … #DeepLearning?

If you have been in a poker game for a while, and you still don’t know who the patsy is, you’re the patsy. Now, for poker, substitute Finance, and in this case, Mortgage risk. If you’re still doing it by hand, or using complex curve-fitting, etc., well, you’re the patsy. Kay Giesecke from Stanford has a fascinating paper where he collected 294 parameters per loan across 120 million loans (from 1995 to 2014), and used this to predict the loan performance via #MachineLearning. The entirely unsurprising result: it beat the pants off of the “traditional methods” (logistic regression, curve fitting, etc. And note, this isn’t simple stuff, it is basically what institutions have been throwing Quants at for years now). The thing is, this is exactly the kind of problem that is really hard to model (hence the quants): there are too many parameters, internal and external, and the differences in performance tend to be more about making the correct guesses about samples when

The Importance of Soft Skills

The thing is, these days, any non-trivial system is but a cog in a much larger system which includes humans as elements, and the complexities around communication and collaboration are the dominant factor in success or failure. #TechBros and #CowboyDevelopers, however, are concerned about the increasing importance of these “soft skills” (•), usually because
• Imposter Syndrome: “I’m not good at that stuff”
• Diminishment: “What does that mean wrt my mad golang skillz?”
• Kubler-Ross: “I am the smartest architect, therefore I am also the best manager. Also, everybody loves me”
Does your manager get this? Worse, does your manager respect this? Remember, the only thing worse than a “Professional manager” is the “#TechBro turned manager who hates management”. You know the kind: they suffered through the former, and when given a chance at control, either turn into dictators (they know best!) or devolve into chaos (let a thousand flowers bloom!) Soft skillz a

Chaos Monkey meets Kubernetes

For when you *really* need to be sure, herewith #chaoskube, which periodically kills random pods in your #Kubernetes cluster (°) - https://goo.gl/SgdD4u
(°) Note, you probably want to mess around with the label selector, to make sure you don’t nuke any of the “critical” or “kube-system” pods. Unless you actually want to do that, of course.
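To make the footnote concrete, here is a toy sketch of the selection logic: filter pods by label and namespace, then pick one at random. This is not chaoskube’s code or its API, and it never talks to a cluster; the pod dictionaries, the `chaos: enabled` label, and the function names are all hypothetical.

```python
import random

# Assumption: namespaces we never want to touch.
PROTECTED_NAMESPACES = {"kube-system"}


def matches(pod: dict, selector: dict) -> bool:
    """True if the pod carries every label in the selector."""
    labels = pod.get("labels", {})
    return all(labels.get(key) == value for key, value in selector.items())


def pick_victim(pods, selector, rng=random):
    """Pick one random pod that matches the selector and is not protected."""
    candidates = [
        pod for pod in pods
        if pod["namespace"] not in PROTECTED_NAMESPACES and matches(pod, selector)
    ]
    return rng.choice(candidates) if candidates else None


if __name__ == "__main__":
    pods = [  # made-up inventory; a real tool would list pods from the API server
        {"name": "web-1", "namespace": "default", "labels": {"chaos": "enabled"}},
        {"name": "db-1", "namespace": "default", "labels": {"tier": "critical"}},
        {"name": "dns", "namespace": "kube-system", "labels": {"chaos": "enabled"}},
    ]
    print(pick_victim(pods, {"chaos": "enabled"}))
```

Opting pods in via a label (rather than opting the critical ones out) is the more conservative choice: a pod nobody labelled is a pod nobody kills.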

In Praise of Boring Tech (and Erlang)

Let’s get something out of the way first
Boring =/= Bad
Of course there is any amount of Boring and Bad (°) out there, but here the reference is to stuff that #JustWorks, whose implementation surface is well defined, whose failure modes are well understood, and whose edge cases tend to qualify as “Good Problems To Have”. #Postgres is Boring. #Varnish is Boring. #AWS is Boring. #Erlang is Boring. Wait, what? Erlang? Yes indeed, it is, in many ways, the quintessential Boring technology. Remember, your job, as a technologist, is to keep the company in business. Everything else (reducing costs, increasing revenue, using the best tool, “having fun”, etc.) is subordinate to keeping the company in business (°°). And, keeping a system working reliably is waaay more expensive than building the damn thing in the first place. And, to date, there are few, if any, environments other than Erlang to build reliable systems. Oh yes, you can do

One of these IS like the other

So, you’re using #MachineLearning to do image classification (of course you are). The issue at hand is, How do you differentiate the important stuff from the background? It’s a bit of a trick question, because, well
• What is “important”? That it’s a Cat (next to piglets)?
• What is “background”? (You’re looking for piglets!)
And so on. In #DeepLearning, we call these “core” features vs “orthogonal” (or “style”) features, and differentiating between the two can be, well, hard. In a paper by Heinze-Deml and Meinshausen (°), the authors come up with a neat trick to deal with this, one that requires a lot less effort than good old #SupervisedLearning: they assume that somewhere in the image is the uniquely identifiable thing you are looking for. For example, “These images have my piglets in them”. That’s pretty much it. Now that the model “knows” that my piglets are in there somewhere, it’ll happily ignore all the extraneous stuff including image qua

Differential Privacy — Under the Hood

Apple released a little more information about #DifferentialPrivacy with a paper and an excellent blog post (°). A quick refresher - Under differential privacy, you
• Collect users’ data (locally, or on the server)
• Add noise to mask the data, but in such a way that when you
• Average out the data, the noise averages out, and meaningful information emerges
The trick, of course, is in the details. At the very least, you need to be careful in
• calibrating the noise so that it averages out (algorithms matter!)
• collecting and transmitting information in a manner that protects privacy (secure storage! encryption!)
• keeping the information under privacy exposure thresholds (don’t combine data across different use cases! restrict the number of use cases!)
etc. It's a fascinating read, and, I suspect, the very beginning of the field. I expect signal processing (Huffman coding, etc) to come into play any day now.
(°) Learning with Privacy at Scale - the paper - h
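The “add noise, then average it out” step can be sketched with the textbook Laplace mechanism. To be clear, this is not Apple’s algorithm (the paper describes its own mechanisms); it is just the simplest thing that shows the effect. The epsilon value, the 30% “true” rate, and the function names are assumptions for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sample, built as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def privatize(value: float, epsilon: float, rng: random.Random) -> float:
    """Noise a single value on the 'device' before it is sent anywhere."""
    sensitivity = 1.0  # values are clamped to [0, 1], so one user moves a report by at most 1
    clamped = min(1.0, max(0.0, value))
    return clamped + laplace_noise(sensitivity / epsilon, rng)


def estimate_mean(reports) -> float:
    """Server side: the zero-mean noise cancels out as reports pile up."""
    return sum(reports) / len(reports)


if __name__ == "__main__":
    rng = random.Random(2017)
    # Hypothetical population: 30% of users have the property (value 1.0).
    truth = [1.0] * 15_000 + [0.0] * 35_000
    reports = [privatize(v, epsilon=1.0, rng=rng) for v in truth]
    print(f"one noisy report : {reports[0]:.2f}")
    print(f"estimated rate   : {estimate_mean(reports):.3f}")
```

Any single report is close to useless on its own, which is the point; the aggregate is where the signal lives. A smaller epsilon means more noise per report, and more reports needed before the average settles.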

DB/2 has HOW MANY code bases?

(aka: “Mechanical Sympathy is, very much, a thing”) Turns out that DB/2 has FOUR separate code bases for
• AS/400
• MVS
• VM
• RS/6000 (originally OS/2)
They’re all basically distinct lines (°), with wildly different hardware/OS models. For example, the AS/400 line has a single level store where memory and disk addresses are indistinguishable and objects can transparently move between disk and memory. It’s a capability-based system where pointers, whether to disk or memory, include the security permissions needed to access the object referenced. Why would you not take advantage of this for your features, let alone for optimization purposes? So yeah, in this light, it does make sense, especially when you add in the dictum that market share is even more important than engineering efficiency, via http://perspectives.mvdirona.com/2017/12/1187/
(°) Yeah, VM/CMS is a bit of a special case, it’s the productization of the System R research system, and is pretty wil

Clustered vs Non-Clustered Indexes

(aka: “Phonebooks vs Book Indexes”) A clustered index is best thought of as a phonebook (°): you index by a column, and all the data for that row is right there, next to that key. A non-clustered index, OTOH, is like the index in the back of a book: topics listed in alphabetical order, and pointers to every page where that topic is covered. The key difference is that in non-clustered indexes, getting at the data is a two-step process (find the topic, then find the page). And yeah, it’s not quite this straightforward, but it should get you started…
(°) “Note to future SQL developers in 2030 who have no idea what a phonebook is: it’s something that stores the names, addresses, and landline phone numbers in your area. What’s a landline? Oh boy…” - https://goo.gl/VXCPG3
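The phonebook/book-index analogy can be mimicked with plain Python lists. This is only a toy model of the access pattern (real databases use B-trees and pages, and the details vary by engine); the rows, names, and function names are made up.

```python
import bisect

# "Clustered": the rows themselves are stored sorted by the key (last name).
phonebook = [
    ("Adams", "Alice", "555-0101"),
    ("Baker", "Bob", "555-0102"),
    ("Clark", "Carol", "555-0103"),
]


def clustered_lookup(last_name):
    """One step: binary-search the rows; the data is right there at the key."""
    i = bisect.bisect_left(phonebook, (last_name,))
    if i < len(phonebook) and phonebook[i][0] == last_name:
        return phonebook[i]
    return None


# "Non-clustered": rows sit in arbitrary order, with a separate sorted
# structure of (key, row position) pairs on the side.
heap = [
    ("Clark", "Carol", "555-0103"),
    ("Adams", "Alice", "555-0101"),
    ("Baker", "Bob", "555-0102"),
]
first_name_index = sorted((row[1], pos) for pos, row in enumerate(heap))


def nonclustered_lookup(first_name):
    """Two steps: find the key in the index, then follow the pointer to the row."""
    i = bisect.bisect_left(first_name_index, (first_name,))
    if i < len(first_name_index) and first_name_index[i][0] == first_name:
        _, pos = first_name_index[i]
        return heap[pos]
    return None


if __name__ == "__main__":
    print(clustered_lookup("Baker"))
    print(nonclustered_lookup("Carol"))
```

The second hop in `nonclustered_lookup` is the “then find the page” step; it is also why a table can have many non-clustered indexes but the rows can only be physically ordered one way.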

User Interfaces, Design, and ... #DeepLearning?

Good #UI design isn’t just about being “user-friendly”, it’s about reifying deep principles about the world, making them the working conditions in which users live and create, making them explicit, yet comfortable and familiar, becoming part of the user’s patterns of thought. #MachineLearning can be valuable here, removing vast swathes of cognitive burden to expose these deeper truths. Consider Font Design for example: bolding a font isn’t just about “making it thicker” (); there are a bajillion heuristics that type designers have figured out over the ages, things like “preserve the enclosed negative space”, “move bars lower”, “lag interior strokes to the exterior”, and hundreds more (Thousands? Even more? Quite likely…) The benefit of using #AI to help is that you can train it on these heuristics, using available fonts, and it will automatically generate the bolded versions for you. Oh, it won’t be perfect the first time around, but use your corrections as inputs, and,