> How does a director recover from a failure? From my understanding it would require fetching all job IDs from jobs that are in an active state via the job transactions table (which sounds expensive) and then loading the associated meta-data from the jobs table? Is that correct?
This is correct: if a Director crashes for any reason, it needs to scan the database on boot. That scan is a more expensive operation, which is part of the reason we try to cap the total number of entries in a given database.
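A minimal sketch of that boot-time scan, assuming a hypothetical schema in which a `job_transactions` table records state changes and a `jobs` table holds the metadata (the table and column names here are illustrative, not Centrifuge's actual schema):

```python
import sqlite3

# Illustrative schema only; the real Centrifuge schema is not shown in the post.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jobs (id TEXT PRIMARY KEY, payload TEXT);
CREATE TABLE job_transactions (job_id TEXT, state TEXT, ts INTEGER);
INSERT INTO jobs VALUES ('a', 'meta-a'), ('b', 'meta-b');
INSERT INTO job_transactions VALUES
  ('a', 'pending', 1), ('a', 'completed', 2),  -- job a reached a terminal state
  ('b', 'pending', 1);                          -- job b is still active
""")

def recover_active_jobs(conn):
    # Find each job's latest transaction, keep the ones not in a terminal
    # state, then join back to the jobs table for the metadata. This is a
    # full scan of the transaction history, which is why capping the number
    # of entries per database matters.
    return conn.execute("""
        SELECT j.id, j.payload FROM jobs j
        JOIN (SELECT job_id, state,
                     ROW_NUMBER() OVER (PARTITION BY job_id ORDER BY ts DESC) rn
              FROM job_transactions) t
          ON t.job_id = j.id AND t.rn = 1
        WHERE t.state NOT IN ('completed', 'discarded')
    """).fetchall()

print(recover_active_jobs(conn))  # only job 'b' is still active
```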
> Do you assume that the director will have archived all non-completed jobs when deleting the database? Do you try to gracefully shut down the director first then? From my understanding it seems you perform a "drop table" statement on a given database and then regenerate the tables, but this would require being sure that all the jobs have been processed or archived.
Great question, we glossed over this aspect a bit in the post itself.
Before a given database is transitioned to the 'spare' state and its tables dropped, a single Drainer process is responsible for moving any non-completed jobs from that database to another active Director. The Drainer will not exit successfully and transition the database to 'spare' until it is certain it has processed all of the non-completed jobs; we never drop tables that contain non-terminal jobs. Like the Directors, the Drainer acquires a lock in Consul to ensure only a single process is draining at a time.
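A rough sketch of that drain-then-drop invariant, using a deliberately simplified single-table schema and hypothetical names (`drain`, `TERMINAL_STATES`); the real Drainer also coordinates through Consul and Centrifuge's two-phase commit, both of which this omits:

```python
import sqlite3

TERMINAL_STATES = ('completed', 'discarded')  # assumed terminal states

def drain(src, dst):
    # Move every non-terminal job to another active database, then verify
    # none remain. Refuse to drop anything while a non-terminal job exists,
    # mirroring the invariant that only fully drained databases go 'spare'.
    jobs = src.execute(
        "SELECT id, payload, state FROM jobs WHERE state NOT IN (?, ?)",
        TERMINAL_STATES).fetchall()
    for row in jobs:
        dst.execute("INSERT INTO jobs VALUES (?, ?, ?)", row)
        src.execute("DELETE FROM jobs WHERE id = ?", (row[0],))
    remaining = src.execute(
        "SELECT COUNT(*) FROM jobs WHERE state NOT IN (?, ?)",
        TERMINAL_STATES).fetchone()[0]
    if remaining:
        raise RuntimeError("non-terminal jobs remain; refusing to drop tables")
    src.execute("DROP TABLE jobs")  # database can now be marked 'spare'

src, dst = sqlite3.connect(':memory:'), sqlite3.connect(':memory:')
for c in (src, dst):
    c.execute("CREATE TABLE jobs (id TEXT PRIMARY KEY, payload TEXT, state TEXT)")
src.executemany("INSERT INTO jobs VALUES (?, ?, ?)",
                [('a', 'meta-a', 'completed'), ('b', 'meta-b', 'pending')])
drain(src, dst)
print(dst.execute("SELECT id FROM jobs").fetchall())  # the pending job moved
```

In the real system the "move" itself needs the two-phase commit semantics mentioned below, since the two databases can't share a single transaction.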
We're hoping to go into more depth on how the Drainers work and how these jobs move around in an upcoming post on Centrifuge's two-phase commit semantics. Ensuring that your data has actually moved to another system requires fairly complex transactional semantics, so it deserves a post of its own.