Background:
Currently all of our applications are deployed with StatefulSets rather than Deployments. The update strategy of our StatefulSets is currently RollingUpdate. Here is an explanation of what it does and the other option we have:
Issue:
The combination of rolling updates and libcluster means we can never add new services to the libcluster/Horde registry, because of the following sequence:
- 3 pods are running, all on, let's say, version 1
- Version 1 runs libcluster/Horde, but only Genserver_A is registered with it
- We then trigger an update of this environment to version 2
- Version 2 adds a Genserver_B that is also registered with the libcluster/Horde registry
- The rolling update starts with pod 2 (of pods 0, 1, 2), and it will not update the other pods to the new version until pod 2 is up and running
- However, when pod 2 starts, libcluster detects pods 0 and 1 using the Kubernetes labels and IPs, and it tries to register Genserver_B
- But pods 0 and 1 do not have the code for Genserver_B yet, so pod 2 crashes because it cannot start Genserver_B on the pod it is trying to start it on
- And the rolling update never proceeds to the other pods, because pod 2 never comes up successfully
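The sequence above can be captured in a small toy model. Everything in it is hypothetical: the `RolloutSim` module, the pod names, the release contents (`:genserver_a` and `:genserver_b` standing in for Genserver_A and Genserver_B), and the simplified rule that the new pod only becomes ready when every version 2 worker can start wherever it lands. It is not Horde or Kubernetes code; it only illustrates why any placement on an old pod stalls the rollout.

```elixir
defmodule RolloutSim do
  @moduledoc """
  Toy model of the rollout deadlock. Pod names, release contents and the
  readiness rule are hypothetical; this is not Horde or Kubernetes code.
  """

  # Worker modules shipped by each release (stand-ins for Genserver_A / Genserver_B).
  @releases %{
    v1: MapSet.new([:genserver_a]),
    v2: MapSet.new([:genserver_a, :genserver_b])
  }

  # Pods whose running release does not ship `child`: if the cluster-wide
  # supervisor places the child on one of these, start_link/1 is undefined there.
  def failing_pods(child, pods) do
    for {pod, version} <- Enum.sort(pods),
        not MapSet.member?(@releases[version], child),
        do: pod
  end

  # Model assumption: the new pod only becomes ready if every v2 child can
  # start wherever it is placed. A StatefulSet RollingUpdate waits on that pod
  # (highest ordinal first), so any failing placement blocks pods 1 and 0.
  def rollout_can_proceed?(pods) do
    Enum.all?(@releases.v2, &(failing_pods(&1, pods) == []))
  end
end

mid_rollout = %{"pod-0" => :v1, "pod-1" => :v1, "pod-2" => :v2}
RolloutSim.failing_pods(:genserver_b, mid_rollout)
# → ["pod-0", "pod-1"]
RolloutSim.rollout_can_proceed?(mid_rollout)
# → false
```

Once every pod runs version 2, `failing_pods/2` is empty for both workers and the model reports that the rollout can proceed; the deadlock only exists in the mixed-version window.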
Or at least, that is what I think is happening here. I am fairly sure I have the issue mostly right, and the error message on the crashing pod is:
```
** (EXIT) an exception was raised:
** (UndefinedFunctionError) function Cogynt.Servers.Workers.CustomFields.start_link/1 is undefined or private
(cogynt 0.1.0) Cogynt.Servers.Workers.CustomFields.start_link([name: {:via, Horde.Registry, {Cogynt.Horde.HordeRegistry, Cogynt.Servers.Workers.CustomFields}}])
(horde 0.8.7) lib/horde/processes_supervisor.ex:766: Horde.ProcessesSupervisor.start_child/3
(horde 0.8.7) lib/horde/processes_supervisor.ex:752: Horde.ProcessesSupervisor.handle_start_child/2
(stdlib 3.17) gen_server.erl:721: :gen_server.try_handle_call/4
(stdlib 3.17) gen_server.erl:750: :gen_server.handle_msg/6
(stdlib 3.17) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
```
I know the version on that pod has the code for Cogynt.Servers.Workers.CustomFields.start_link, so the error must be referring to one of the other 2 pods, which have not received the new version yet.
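One way to check that theory from a remote shell on the crashing pod is to ask every connected node whether it can load the module and whether it exports `start_link/1`. The `ClusterCheck` module below is a hypothetical helper, not part of Horde or libcluster; it only relies on `:rpc.call/4`, `Code.ensure_loaded?/1`, `:erlang.function_exported/3` and `Node.list/0`.

```elixir
defmodule ClusterCheck do
  @moduledoc """
  Hypothetical diagnostic helper (not part of Horde or libcluster): reports
  which connected nodes can actually run a given function.
  """

  # True only if `node` can load `module` and that module exports `fun/arity`.
  # Any {:badrpc, reason} result (e.g. an unreachable node) counts as false.
  @spec exported_on?(node(), module(), atom(), non_neg_integer()) :: boolean()
  def exported_on?(node, module, fun, arity) do
    with true <- :rpc.call(node, Code, :ensure_loaded?, [module]),
         true <- :rpc.call(node, :erlang, :function_exported, [module, fun, arity]) do
      true
    else
      _ -> false
    end
  end

  # Map of node => boolean for the local node plus every node it is connected to.
  @spec report(module(), atom(), non_neg_integer()) :: %{node() => boolean()}
  def report(module, fun, arity) do
    Map.new([node() | Node.list()], fn n -> {n, exported_on?(n, module, fun, arity)} end)
  end
end
```

For example, running `ClusterCheck.report(Cogynt.Servers.Workers.CustomFields, :start_link, 1)` on pod 2 would, if the theory above holds, show `true` for pod 2 and `false` for the two pods still on the old version.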
Has anyone else ever run into this problem?