I’ve been doing a lot of throughput optimizations on Dynomite lately. I got the unsaturated latency down to a point where I was happy with it, so the next step was to optimize the throughput of a fully saturated node. Generally when I find something interesting or at least strike something off the list as being a culprit for throughput issues I’ll fire off a cryptic message into the aether of twitter. This morning I posted this:

And it piqued a lot more interest than I thought it would.
Deferring the response for a gen_server call is a popular performance optimization in Erlang applications. For some folks, it’s the default behavior they use handle_call callbacks that do not modify the state of the underlying server. The logic of doing so seems sound. After all, spinning up new processes is cheap and a gen_server call only handle so much throughput, so spawning off a process to handle each call should be a good strategy to eliminate bottlenecks. Generally, this strategy looks something like this:
handle_call(do_something, From, State) ->
spawn_link(fun() ->
Result = do_a_bunch_of_shit(State),
gen_server:reply(From, Result)
end),
{noreply, State};
So in trying to reduce the bottlenecks in the dynomite code base, I tried doing this to the membership server. The membership server is in the critical path for all operations: get, put, has, and remove. The mediator will call membership:servers_for_key which hashes the key to its partition and will do a lookup in its routing table to figure out which nodes are currently serving that partition. I wrote a minibenchmark test to see what would happen.
{ok, _} = membership:start_link(a, [a,b,c,d,e,f]),
{Keys, _} = lib_misc:fast_acc(fun({List, Str}) ->
Mod = lib_misc:succ(Str),
{[Mod|List], Mod}
end, {[], "aaaaaaaa"}, 10000),
Start = lib_misc:now_float(),
lists:foreach(fun(Str) ->
membership:servers_for_key(Str)
end, Keys),
End = lib_misc:now_float(),
?debugFmt("membership can do ~p reqs/s", [10000/(End-Start)])
The original version of the servers_for_key callback:
handle_call({servers_for_key, Key}, From, State) ->
Config = configuration:get_config(),
Part = int_partition_for_key(Key, State, Config),
Nodes = int_nodes_for_partition(Part, State),
MapFun = fun(Node) -> {list_to_atom(lists:concat([storage_, Part])), Node} end,
{reply, lists:map(MapFun, Nodes), State};
And the deferred response version:
handle_call({servers_for_key, Key}, From, State) ->
Config = configuration:get_config(),
spawn_link(fun() ->
Nodes = int_nodes_for_key(Key, State, Config),
Part = int_partition_for_key(Key, State, Config),
MapFun = fun(Node) -> {list_to_atom(lists:concat([storage_, Part])), Node} end,
gen_server:reply(From, lists:map(MapFun, Nodes))
end),
{noreply, State};
The difference in throughput between the two is pretty astonishing. The first version can do somewhere in the neighborhood of 85000 reqs/s. The second version can only manage about 12000 reqs/s. And in this case I think the difference actually makes sense because it’s essentially doing a CPU bound operation. So by spawning off we are essentially asking the CPU to do a lot more work to arrive at the same answer. There is the setup overhead for a new process. Then the overhead of copying the closure from the fun into that new process. Not to mention that once we’re dealing with multiple copies of that same membership data you can pretty much count out getting many cache hits at the CPU.
Figuring this out has challenged some of my assumptions about the Erlang VM. It’s also given me some ideas for other areas to look at in optimizing Dynomite performance. In particular, I am going to revisit the code around the pmap being used to distribute operations to the different nodes.
And the whole “closures and scope” thing was a reference to this hilarious exchange.