Make it clear what a training job is composed of #575

Closed
wangkuiyi opened this issue Jan 31, 2018 · 1 comment
Comments

@wangkuiyi

A training job is composed of:

  • a Kubernetes Job of trainers
  • a Kubernetes ReplicaSet of parameter servers
  • a Kubernetes ReplicaSet of the master; we use the replica set to (1) make sure that there is one and only one master process in a job, and (2) recover it when it fails.

We might need a data type to represent this fact.
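
A minimal sketch of what such a data type could look like in Go. The type name `TrainingJob`, its fields, and the `Validate` method are hypothetical and do not come from the project's code; the "at least one trainer" check is also an assumption, while the "exactly one master" check follows the list above.

```go
package main

import (
	"errors"
	"fmt"
)

// TrainingJob is a hypothetical type describing the three parts of a job.
// The field names are assumptions made for illustration.
type TrainingJob struct {
	Name string
	// Trainers is the number of trainer pods, run as a Kubernetes Job.
	Trainers int
	// ParameterServers is the number of parameter server pods,
	// run as a Kubernetes ReplicaSet.
	ParameterServers int
	// Masters is the number of master pods, run as a Kubernetes ReplicaSet.
	// The issue expects exactly one.
	Masters int
}

// Validate checks the invariants described in the issue.
func (j TrainingJob) Validate() error {
	if j.Trainers < 1 {
		// Assumption: a job without trainers is not meaningful.
		return errors.New("a training job needs at least one trainer")
	}
	if j.ParameterServers < 0 {
		return errors.New("parameter server count cannot be negative")
	}
	if j.Masters != 1 {
		return errors.New("a training job must have exactly one master")
	}
	return nil
}

func main() {
	job := TrainingJob{Name: "example", Trainers: 4, ParameterServers: 2, Masters: 1}
	if err := job.Validate(); err != nil {
		fmt.Println("invalid job:", err)
		return
	}
	fmt.Println("job is valid")
}
```

Keeping the three parts in one type gives a single place to state the "one and only one master" rule, rather than leaving it implicit in three separate Kubernetes objects.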

@wangkuiyi wangkuiyi self-assigned this Jan 31, 2018
@gongweibao
Collaborator

> we use the replica set to (1) make sure that there is one and only one master process in a job

We use etcd to (1) make sure that there is one and only one master process in a job, and (2) recover it when it fails.
