
Issue with StatefulSet Rolling Update Strategy #180

@amacciola

Description


Precursor:

Currently all of our applications are deployed as StatefulSets rather than Deployments. The current updateStrategy of our StatefulSets is RollingUpdate. Here is an explanation of what it does and of the other option we have:
[Screenshot: explanation of the StatefulSet update strategies]
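
For reference, this is roughly where the update strategy lives in a StatefulSet manifest. It is only a sketch: the names, labels and image below are hypothetical placeholders, not our actual config. With `RollingUpdate`, the controller replaces pods one at a time from the highest ordinal to the lowest and waits for each to be Running and Ready before moving on. With `OnDelete`, pods only pick up the new spec when they are deleted manually.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-app              # hypothetical name
spec:
  serviceName: example-app       # hypothetical headless Service
  replicas: 3
  updateStrategy:
    type: RollingUpdate          # the other option is OnDelete
    rollingUpdate:
      partition: 0               # default; only pods with ordinal >= partition are updated
  selector:
    matchLabels:
      app: example-app           # hypothetical label
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: example/app:2   # hypothetical image tag
```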

Issue:

The combination of Rolling Updates and libcluster is making it so that we can never add new services to the libcluster/Horde registry. The sequence goes like this:

  1. 3 pods are running, all with, let's say, version 1.
  2. Version 1 has libcluster/Horde running, but only Genserver_A is registered to it.
  3. We then trigger an update of this env to version 2.
  4. Version 2 adds a Genserver_B that should be registered to the libcluster/Horde registry.
  5. The Rolling Update starts with pod 2 (out of 0, 1, 2), and it will not update the other pods to the new version until pod 2 is up and running.
  6. However, when pod 2 starts, libcluster detects pods 0 and 1 using the k8s labels and IPs, and tries to register Genserver_B.
  7. But pods 0 and 1 do not have the code for Genserver_B yet, so pod 2 crashes because it cannot start Genserver_B on the pod it is trying to start it on.
  8. The Rolling Update never proceeds to the other pods because pod 2 never comes up healthy.
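
To make the sequence above easier to reason about, here is a small toy model of it in Python. It is not Kubernetes, libcluster or Horde code. It just encodes my assumptions: pods are updated from the highest ordinal down, the rollout waits for each pod to become ready, and a freshly started pod crashes if any live peer lacks the code for a worker it wants to start. `Pod`, `WORKERS_BY_VERSION`, `boots_cleanly` and `rolling_update` are all made-up names for illustration.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    ordinal: int
    version: int
    ready: bool = True


# Assumption: the cluster-wide workers each release ships.
WORKERS_BY_VERSION = {
    1: {"Genserver_A"},
    2: {"Genserver_A", "Genserver_B"},
}


def boots_cleanly(new_version, peers):
    """Model a restarted pod as crashing if a live peer lacks a worker's code.

    This is pessimistic: it assumes the worker may be placed on any peer.
    """
    wanted = WORKERS_BY_VERSION[new_version]
    return all(wanted <= WORKERS_BY_VERSION[p.version] for p in peers if p.ready)


def rolling_update(pods, new_version):
    """Update pods from highest ordinal to lowest, stopping at the first
    pod that does not become ready."""
    for pod in sorted(pods, key=lambda p: p.ordinal, reverse=True):
        peers = [p for p in pods if p is not pod]
        pod.version = new_version
        pod.ready = boots_cleanly(new_version, peers)
        if not pod.ready:
            return f"stalled at pod {pod.ordinal}"
    return "complete"


pods = [Pod(0, 1), Pod(1, 1), Pod(2, 1)]
print(rolling_update(pods, 2))  # stalled at pod 2
```

Under these assumptions, pods 0 and 1 stay on version 1 forever, which matches what we are seeing.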

Or at least that is what I think is happening here. For the most part I think I have the issue right, and the error message on the crashing pod is:

        ** (EXIT) an exception was raised:
            ** (UndefinedFunctionError) function Cogynt.Servers.Workers.CustomFields.start_link/1 is undefined or private
                (cogynt 0.1.0) Cogynt.Servers.Workers.CustomFields.start_link([name: {:via, Horde.Registry, {Cogynt.Horde.HordeRegistry, Cogynt.Servers.Workers.CustomFields}}])
                (horde 0.8.7) lib/horde/processes_supervisor.ex:766: Horde.ProcessesSupervisor.start_child/3
                (horde 0.8.7) lib/horde/processes_supervisor.ex:752: Horde.ProcessesSupervisor.handle_start_child/2
                (stdlib 3.17) gen_server.erl:721: :gen_server.try_handle_call/4
                (stdlib 3.17) gen_server.erl:750: :gen_server.handle_msg/6
                (stdlib 3.17) proc_lib.erl:226: :proc_lib.init_p_do_apply/3

I know the version on that pod has the code for Cogynt.Servers.Workers.CustomFields.start_link, so the error must be referring to one of the other 2 pods that have not received the new version yet.

Has anyone else ever run into this problem?
