Support for a cluster setup in which the master handles training and checkpointing, while a separate eval node continuously evaluates the latest checkpoint. A few details likely still need to be worked out, at some point.
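As a rough illustration of the eval-node side, here is a minimal polling sketch, not a settled design. Everything in it is an assumption made for the example: the `ckpt-<step>` file naming, the helper names `latest_checkpoint` and `continuous_eval`, the caller-supplied `eval_fn`, and a checkpoint directory shared between master and eval node. It also assumes the master publishes checkpoints atomically (for instance, write to a temporary name, then rename), so the eval node never reads a half-written file.

```python
import re
import time
from pathlib import Path
from typing import Callable, List, Optional

# Hypothetical naming scheme: the master writes files called "ckpt-<step>".
_CKPT_PATTERN = re.compile(r"^ckpt-(\d+)$")


def latest_checkpoint(ckpt_dir: Path) -> Optional[Path]:
    """Return the checkpoint with the highest step number, or None if there is none."""
    best_step, best_path = -1, None
    for path in ckpt_dir.iterdir():
        match = _CKPT_PATTERN.match(path.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


def continuous_eval(
    ckpt_dir: Path,
    eval_fn: Callable[[Path], None],
    poll_seconds: float = 30.0,
    max_polls: Optional[int] = None,
) -> List[Path]:
    """Poll ckpt_dir and call eval_fn once for each newly seen latest checkpoint.

    Runs forever when max_polls is None; otherwise stops after max_polls polls
    and returns the checkpoints that were evaluated, in order.
    """
    evaluated: List[Path] = []
    last_seen: Optional[Path] = None
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        current = latest_checkpoint(ckpt_dir)
        if current is not None and current != last_seen:
            eval_fn(current)  # evaluate only the most recent checkpoint
            evaluated.append(current)
            last_seen = current
        else:
            time.sleep(poll_seconds)  # nothing new yet; wait before polling again
    return evaluated
```

Because only the latest checkpoint is considered on each poll, intermediate checkpoints are skipped whenever evaluation takes longer than the master's checkpoint interval. Whether that is acceptable is one of the details left open.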