Erlang Deep Bugs (At least one of which has been fixed recently...)

Eons ago, in the paleolithic era (2009), when R13 was fresh and new and neat and The Greatest Thing Ever, we ran into a little tiny problem.  Ok, it wasn't a little tiny problem, it was a great big honking problem, viz., Shit Stopped Working.

(The cry of "Elaborate Please" rings through the streets of New York)

Fine, fine.
Y'see we run a VoIP service in the Cloud (yeah, yeah. don't snicker) - a fairly standard thing nowadays with phones, calling, users, GUIs, etc.
The thing about phone services is that people expect to be able to make phone calls.
And most calls stopped going through.  What's worse, most people couldn't log in to complain that their calls weren't going through, and most people couldn't call us to tell us that they couldn't call us, and so on and so forth.

Notice the use of "most"? It's intentional.  Some of the calls - apparently random ones - worked, but by and large, it would take up to three minutes for your call to go through (we automatically hung up on you after three minutes of FIAL anyhow, and that part worked just fine.)

This, of course, at 10:00AM on a Wednesday. Panic in the streets, fire-alarms going off, the works.

By the way, literally fire-alarms.  The building was evacuated, while I sat in the conference room frantically trying to keep things up with one hand, while debugging like mad with the other.  My CEO, Doug Strahan, in an epically brave and kind gesture, hung out w/ me in the conference room to - as he puts it - "rescue me from myself if things got dire".

What's the point behind all this?
I'm glad you asked.
It turned out that we had a Process Problem.
Y'see, every call, every user, every login (basically everything) was a process. And, given that our system was very distributed (remember that?), we registered these processes globally. And when you wanted to do something (like make a call), you'd basically end up running

gen_server:call({global, SomeProcess}, {make_a_call, With, These, Parameters}).

So?
Nothing wrong with that, eh?
Ah, I'm glad you asked.  It turns out that when you wade through the code for gen_server.erl, you end up with the following:
  • gen_server:call ends up in gen:call
  • gen:call actually uses where/1 to identify the node that the process is on
  • where/1 is actually a shout out to global:safe_whereis_name/1
  • global:safe_whereis_name/1 is actually...
Wait, what?  WTF is global:safe_whereis_name/1?  Clearly not in the docs, is it?

Turns out it's a hidden function that runs the following piece of charming code:

gen_server:call(global_name_server, {whereis, Name}, infinity)

This was roughly the point at which I started cursing softly but oh so profoundly.  (In case you didn't get it, all requests to gen_server:call({global, SomeProcess}, ...) are serialized through one gen_server!!!!!!!)

Which may be fine for small systems, but when you have a bajillion phone calls and a kajillion web-requests hitting your Highly Distributed And Theoretically Scalable Cluster, well, this is about as useful as a lockpick made from shoelaces, and is just about as painful as a kick in the nuts. Maybe more.
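To make the bottleneck concrete, here is a toy sketch. It is emphatically not the OTP source: the module name lookup_sketch, the table name, and every function in it are made up for illustration. It contrasts a lookup that goes through a single gen_server's mailbox (the shape of the safe_whereis_name call above) with a lookup that reads an ETS table directly from the caller's own process.

```erlang
%% Toy illustration only. All names here are hypothetical, not OTP's.
-module(lookup_sketch).
-behaviour(gen_server).

-export([start_link/0, put_name/2, whereis_serialized/1, whereis_direct/1]).
-export([init/1, handle_call/3, handle_cast/2]).

-define(TABLE, lookup_sketch_names).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Writes go through the owning server, which is fine: they are rare.
put_name(Name, Pid) ->
    gen_server:call(?MODULE, {put_name, Name, Pid}).

%% Every caller on the node queues up in ONE mailbox, and the server
%% answers them one at a time. This is the pattern that hurt.
whereis_serialized(Name) ->
    gen_server:call(?MODULE, {whereis, Name}, infinity).

%% The lookup runs in the caller's process, so concurrent callers
%% do not wait on each other.
whereis_direct(Name) ->
    case ets:lookup(?TABLE, Name) of
        [{Name, Pid}] -> Pid;
        []            -> undefined
    end.

init([]) ->
    %% protected: only this server writes, any process may read.
    ?TABLE = ets:new(?TABLE, [named_table, protected,
                              {read_concurrency, true}]),
    {ok, nostate}.

handle_call({put_name, Name, Pid}, _From, State) ->
    true = ets:insert(?TABLE, {Name, Pid}),
    {reply, ok, State};
handle_call({whereis, Name}, _From, State) ->
    {reply, whereis_direct(Name), State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

Both lookup functions return the same answer. The difference only shows up under load, when thousands of callers pile up behind whereis_serialized/1 while whereis_direct/1 keeps answering.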

The solution was really simple - we basically replaced

gen_server:call({global, SomeProcess}, {make_a_call, With, These, Parameters}).

with

Pid = global:whereis_name(SomeProcess),
gen_server:call(Pid, {make_a_call, With, These, Parameters}).

uploaded stuff, and Boom, everything started working like a charm.

I know, I know, what if something happens between the two statements, right? Don't worry about it - the above is a simplified version of what we did, but it should get the point across.  And you get no points for being a wise-ass.
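For the wise-asses anyway, here is a minimal sketch of one way to cover that gap. It is not the code we actually shipped: the module name global_call, the 5000 ms default timeout, and the {ok, Reply} / {error, Reason} return shape are all assumptions made for illustration. It relies on two documented behaviours: global:whereis_name/1 returns undefined when the name is not registered, and gen_server:call/3 exits the caller when the target is dead, its node is down, or the call times out.

```erlang
%% Hypothetical helper, not the original fix. It handles the two ways
%% the lookup-then-call pair can go wrong.
-module(global_call).
-export([call/2, call/3]).

call(Name, Request) ->
    call(Name, Request, 5000).   %% assumed default timeout, in ms

call(Name, Request, Timeout) ->
    case global:whereis_name(Name) of
        undefined ->
            %% Nothing is registered under Name right now.
            {error, not_registered};
        Pid when is_pid(Pid) ->
            try
                {ok, gen_server:call(Pid, Request, Timeout)}
            catch
                %% The process died between the lookup and the call.
                exit:{noproc, _}            -> {error, noproc};
                %% The node hosting the process went away.
                exit:{{nodedown, _Node}, _} -> {error, nodedown};
                %% The server did not answer within Timeout.
                exit:{timeout, _}           -> {error, timeout}
                %% Any other exit reason propagates to the caller.
            end
    end.
```

A caller would then write something like global_call:call(SomeProcess, {make_a_call, With, These, Parameters}) and decide for itself whether {error, noproc} deserves a retry.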

Moral of the story?  I'm really not quite sure.  It was a righteous PITA to track down, but once we (Juan Jose - Erlang-er extraordinaire - and myself) had a rough idea of what was going on, we found it right quick through the simple expedient of wading through the Erlang distro source code.  Which, BTW, is highly readable, and fairly easy to grasp.

Coda 1:
   Geoff Cant (@archaelus) ran into the same thing back in 3/2011. After some back and forth, it turned out that there was a "deep bug" that required things to be the way they were.

Coda 2:
   As of R15B01, the bug is gone.  In gen.erl, where/1 is now a shoutout to global:whereis_name/1, and global:safe_whereis_name/1 has ceased to be...  

From the release notes:
Calls to global:whereis_name/1 have been substituted for calls to global:safe_whereis_name/1 since the latter is not safe at all.
The reason for not doing this earlier is that setting a global lock masked out a bug concerning the restart of supervised children. The bug has now been fixed by a modification of global:whereis_name/1. (Thanks to Ulf Wiger for code contribution.)
The undocumented function global:safe_whereis_name/1 has been removed. 

Life is definitely looking up. I just wish it had looked up earlier... 
