Created
May 7, 2019 02:20
Last edited by Todd Kaufmann 3 years ago
= Proposed job scheduler change to support smaller chunksize, greater resilience =
== Idea: instead of looping over instances, loop over the set of jobs ==
For example, 1000 commands, 10 instances --> 100 jobs (0..99) of 10 commands each, where job N is commands N*10 .. N*10+9.
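The chunking in this example can be sketched as follows. This is only an illustration: `chunk_commands` and the placeholder command strings are hypothetical names, not part of the existing scheduler.

```python
from typing import List


def chunk_commands(commands: List[str], chunk_size: int) -> List[List[str]]:
    """Split the command list into jobs of at most chunk_size commands.

    Job N holds commands N*chunk_size .. N*chunk_size + chunk_size - 1.
    """
    return [commands[i:i + chunk_size] for i in range(0, len(commands), chunk_size)]


# Placeholder commands standing in for the real command lines.
commands = [f"cmd-{i}" for i in range(1000)]
chunks = chunk_commands(commands, 10)
```

With 1000 commands and a chunk size of 10 this gives 100 jobs, and job 3 covers `cmd-30` through `cmd-39`.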
Jobs have three states:
  done
  running (the job then also has an associated instance)
  waiting (to run)
At start, set:
  list of instances available (duplicated by number of cores, or named inst#1, inst#2, etc.)
  all jobs are waiting:
    job.state = waiting
    job.cmds = N*10 .. N*10+9
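A minimal sketch of this starting state, assuming a Python implementation. The names `State`, `Job`, `instance_slots` and the instance names `inst-a` / `inst-b` are illustrative assumptions, not taken from the current scheduler.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class State(Enum):
    WAITING = "waiting"
    RUNNING = "running"
    DONE = "done"


@dataclass
class Job:
    cmds: List[str]
    state: State = State.WAITING
    instance: Optional[str] = None  # only set while the job is RUNNING
    tries: int = 0


def instance_slots(cores_by_instance: Dict[str, int]) -> List[str]:
    """List each instance once per core, named inst#1, inst#2, ..."""
    return [
        f"{name}#{k}"
        for name, cores in cores_by_instance.items()
        for k in range(1, cores + 1)
    ]


# Hypothetical fleet: one 2-core instance and one 1-core instance.
available = instance_slots({"inst-a": 2, "inst-b": 1})

# 100 jobs, all waiting; job N holds commands N*10 .. N*10+9.
jobs = [Job(cmds=[f"cmd-{n * 10 + k}" for k in range(10)]) for n in range(100)]
```

Listing an instance once per core lets the scheduler treat every slot the same way: taking a slot off the list reserves one core, and putting it back frees it.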
General pseudocode for the idea:
Loop:
  if an instance is available and a job is waiting
    get next waiting job
    start:
      job.state = running, job.instance = inst_id
      remove inst_id from list
  else
    for each job that is running
      if finished, then
        if cmd was successful   # if there is a way to tell outputs are correct etc
          job.state = done
          job.instance goes back on list
        else
          job.state = waiting
          job.instance goes back on list
          job.cmd = job.cmd + " try-one-more-time "   # or
          job.tries += 1
          # ie run it again
      else
        # Additional feature, to support restart of jobs
        check for response/still running etc (count++)
        if the instance is not responding:
          remove instance from list / restart or terminate ?
          other jobs (if any) running on this instance then also need to be killed
            ie, each running job with job.instance == inst is killed
            killed jobs go back to job.state = waiting
  if no jobs finished this time / no instances available,
    print "d done, r running, w waiting"
    then wait a while (sleep), continue loop
until no job exists where job.state = waiting or running
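The loop above could look roughly like this in Python. It is a sketch under stated assumptions, not the scheduler's actual code: `run_scheduler`, the `start` and `poll` callbacks, and the `max_tries` cap are all hypothetical, and the handling of unresponsive instances is left out.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Job:
    cmds: List[str]
    state: str = "waiting"  # "waiting", "running", or "done"
    instance: Optional[str] = None
    tries: int = 0


def run_scheduler(
    jobs: List[Job],
    instances: List[str],
    start: Callable[[Job, str], None],      # assumed: launches job on an instance
    poll: Callable[[Job], Optional[bool]],  # assumed: None = still running, else success flag
    max_tries: int = 3,                     # assumed retry cap, not in the proposal
    sleep_s: float = 1.0,
) -> None:
    if not instances:
        raise ValueError("need at least one instance")
    available = list(instances)
    while any(j.state in ("waiting", "running") for j in jobs):
        progressed = False
        waiting = [j for j in jobs if j.state == "waiting"]
        if available and waiting:
            # Hand the next waiting job to a free instance.
            job, inst = waiting[0], available.pop(0)
            job.state, job.instance = "running", inst
            start(job, inst)
            progressed = True
        else:
            for job in [j for j in jobs if j.state == "running"]:
                ok = poll(job)
                if ok is None:
                    continue  # still running; a health check would go here
                available.append(job.instance)  # instance goes back on the list
                job.instance = None
                progressed = True
                if ok:
                    job.state = "done"
                else:
                    job.tries += 1
                    if job.tries >= max_tries:
                        raise RuntimeError(f"job failed {job.tries} times: {job.cmds[0]}")
                    job.state = "waiting"  # ie run it again
        if not progressed:
            counts = {s: sum(j.state == s for j in jobs) for s in ("done", "running", "waiting")}
            print(f"{counts['done']} done, {counts['running']} running, {counts['waiting']} waiting")
            time.sleep(sleep_s)


# Toy backend: every job finishes on its first poll, except that the job
# starting with cmd-20 fails once before succeeding.
def fake_start(job: Job, inst: str) -> None:
    pass


def fake_poll(job: Job) -> Optional[bool]:
    return not (job.cmds[0] == "cmd-20" and job.tries == 0)


jobs = [Job(cmds=[f"cmd-{n * 10 + k}" for k in range(10)]) for n in range(5)]
run_scheduler(jobs, ["inst#1", "inst#2"], fake_start, fake_poll, sleep_s=0.0)
```

Because the loop walks the job list rather than the instance list, a failed job goes back to waiting and is picked up by whichever instance frees up next. The retry cap is there only so that a job that always fails cannot keep the loop running forever.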