Canary Releases and Experimentation
Canary Release (also called phased or incremental deployment) is when you deploy your new release to a subset of servers, check it out there against real live production traffic, and take it from there: rolling it out or rolling it back, depending on what happened. You’re basically doing the Canary In A Coal Mine thing, using a few servers and live traffic to ensure that at worst, you only affect a subset of your users.
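To make that concrete, here’s a minimal sketch of the traffic-splitting part. The server names and the 5% split are made up for illustration; the only real point is that routing is stable per user, so the same person doesn’t bounce between releases on every request.

```python
# A minimal sketch of canary routing: send a small, stable slice of traffic
# to the new release and the rest to the current one. Server names and the
# 5% split are hypothetical.
import hashlib

STABLE_POOL = ["app-01", "app-02", "app-03"]   # hypothetical current-release servers
CANARY_POOL = ["app-canary-01"]                # hypothetical new-release server
CANARY_PERCENT = 5                             # start small; widen as confidence grows

def pick_server(user_id: str) -> str:
    """Hash the user id so the same user always lands on the same pool."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    pool = CANARY_POOL if bucket < CANARY_PERCENT else STABLE_POOL
    return pool[bucket % len(pool)]

if __name__ == "__main__":
    for uid in ("alice", "bob", "carol", "dave"):
        print(uid, "->", pick_server(uid))
```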
It’s not a bad approach at all, and depending on how you do this, can be quite efficient resource-wise (you don’t need an entire second environment à la Blue-Green releases). Mind you, the flip side to this is that you need to be really careful about compatibility. You’ve got multiple releases running at the same time, so things like data versioning, persistence formats, process flows, transaction recovery, etc. need to either be forward/backward compatible, or very (!!!) carefully partitioned/rollback-able.
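Here’s a toy illustration of that compatibility headache, assuming a hypothetical record format where the new release adds a field. The version numbers and field names are invented; the point is just that readers have to tolerate records written by either release while both are live.

```python
# A sketch of forward/backward compatibility: two releases are writing
# records at the same time, so the reader has to cope with both shapes.
# Field names and version numbers are invented for illustration.
import json

def write_record_v1(amount_cents):
    # Format written by the old release.
    return json.dumps({"version": 1, "amount_cents": amount_cents})

def write_record_v2(amount_cents, currency="USD"):
    # The new release adds a currency field; it must not break v1 readers.
    return json.dumps({"version": 2, "amount_cents": amount_cents, "currency": currency})

def read_record(raw):
    """Reader that tolerates records written by either release."""
    rec = json.loads(raw)
    if rec.get("version", 1) == 1:
        rec["currency"] = "USD"   # sensible default for old records
    return rec

if __name__ == "__main__":
    print(read_record(write_record_v1(499)))
    print(read_record(write_record_v2(499, "EUR")))
```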
The tricky part here though is when you do this, but also run A/B tests on your customers at the same time. You need to be very very sure that your instrumentation is up to par! After all, you want to be able to differentiate clearly between “People don’t like the new feature because it’s blue”, and “People don’t like the new feature because something barfed in the release”, right?
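One way to keep those two stories apart is to tag every event with both the release that served it and the experiment arm the user was in. A rough sketch, with a made-up build identifier and a stand-in emit() function, might look like this:

```python
# A sketch of the instrumentation point: every event carries BOTH the release
# the server is running and the experiment arm the user is in, so "the new
# release broke something" and "users dislike the blue button" can be
# separated later. The tag names and the emit() sink are hypothetical.
import json
import time

RELEASE = "2024.06.1-canary"   # hypothetical build identifier baked into the deploy

def emit(event_name, user_id, experiment_arm, **fields):
    """Stand-in for whatever metrics pipeline you actually use."""
    record = {
        "ts": time.time(),
        "event": event_name,
        "user": user_id,
        "release": RELEASE,                 # which code served this request
        "experiment_arm": experiment_arm,   # which variant the user saw
        **fields,
    }
    print(json.dumps(record))               # real code would ship this downstream

if __name__ == "__main__":
    emit("checkout_clicked", "alice", experiment_arm="blue_button", latency_ms=42)
    emit("checkout_clicked", "bob", experiment_arm="control", latency_ms=40)
```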
There are any number of ways that you can deal with this, with most of them boiling down to some combination of “make the releases completely orthogonal to the experiments” and “treat the releases as experiments too”.
Personally, I distrust the latter. Treating releases as experiments requires that you be able to clearly identify the dependencies and impacts of each release, and be able to do point-wise rollbacks across multiple different strains of code.
Which is great if you can do that, but you probably can’t (unless you’re Google, of course!). Just keeping track of the dependency graphs in the existing code is bad enough, let alone tracking how it evolves over time…
The easy out here is to let your releases “bake” before running experiments. So, for example, you only turn on feature flags once they’ve been out there for, oh, a week. Mind you, there is a lot of water being carried in that statement: you need to know that once the release is out there, user traffic hasn’t been affected, there are no new alerts in the system, CPU/memory usage is unaffected, etc. And then once you enable the feature flag, you have to be sure that the data you gather is clearly and unambiguously not affected by anything other than the feature being tested.
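As a sketch, that “bake, then flip” gate might look something like the following. The thresholds, metric names, and get_metric() helper are all assumptions; plug in whatever your monitoring actually exposes.

```python
# A sketch of the "let it bake" gate: only enable the feature flag once the
# release has been out for a while with no regressions. Thresholds, metric
# names, and get_metric() are all assumptions for illustration.
from datetime import datetime, timedelta

BAKE_PERIOD = timedelta(days=7)

def get_metric(name):
    """Placeholder for a real metrics query (error rate, CPU, alert count...)."""
    fake = {"error_rate_delta": 0.0002, "cpu_delta_pct": 1.5, "new_alerts": 0}
    return fake[name]

def safe_to_enable_flag(release_time: datetime) -> bool:
    if datetime.now() - release_time < BAKE_PERIOD:
        return False                      # hasn't baked long enough
    if get_metric("new_alerts") > 0:
        return False                      # something fired since the release
    if get_metric("error_rate_delta") > 0.001:
        return False                      # error rate moved
    if get_metric("cpu_delta_pct") > 5:
        return False                      # resource usage moved
    return True

if __name__ == "__main__":
    print(safe_to_enable_flag(datetime.now() - timedelta(days=10)))
```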
The bottom line — if you can’t make the above guarantees, treat your results as suspect. Oh, there might be some information buried in there — potentially useful information too — but, well, be careful!