the general algo goes like so:
    for chunk in corpus:
        e-step
        m-step

gensim hacks in multiple passes:
    for pass_ in passes:
        for chunk in corpus:
            e-step
            m-step

what we've been doing (only works for batch):
    for pass_ in passes:
        for bound_iter in iters:
            for chunk in corpus:
                e-step
                m-step
            break if done

for online updates, would it make more sense to:
    for chunk in corpus:
        for bound_iter in iters:
            e-step
            m-step
            break if done

this would give us something that works the same for batch (via chunksize=len(corpus) and bound_iters > 1)
but also something that works for online mode (via chunksize<len(corpus) and bound_iters > 1).
I mean, the only reason I can see to use example 4 (chunks on the outside) is if the updates are "actually" online. If chunks are only being used because of memory constraints but we're really running in batch mode, then having chunks on the inside makes more sense and follows the algorithm in the paper more closely.
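
To make the chunks-outside ordering concrete, here is a rough Python sketch of it. This is not gensim's code: `chunked`, `run_chunks_outside`, `trace_run`, and the `e_step` / `m_step` / `converged` callables are all hypothetical placeholders that only record the order of calls, so the two settings (chunksize=len(corpus) vs. chunksize<len(corpus)) can be compared side by side.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Chunk = Tuple[str, ...]


def chunked(corpus: Sequence[str], chunksize: int) -> Iterator[Chunk]:
    """Yield successive chunks of at most `chunksize` documents."""
    for start in range(0, len(corpus), chunksize):
        yield tuple(corpus[start:start + chunksize])


def run_chunks_outside(
    corpus: Sequence[str],
    chunksize: int,
    bound_iters: int,
    e_step: Callable[[Chunk], Chunk],
    m_step: Callable[[Chunk], None],
    converged: Callable[[], bool],
) -> None:
    """Chunks on the outside, bound iterations on the inside."""
    for chunk in chunked(corpus, chunksize):
        for _ in range(bound_iters):
            stats = e_step(chunk)  # placeholder for the real E-step
            m_step(stats)          # placeholder for the real M-step
            if converged():        # "break if done"
                break


def trace_run(corpus: Sequence[str], chunksize: int,
              bound_iters: int) -> List[Tuple[str, Chunk]]:
    """Run the loop with dummy steps and return the sequence of calls."""
    calls: List[Tuple[str, Chunk]] = []

    def e_step(chunk: Chunk) -> Chunk:
        calls.append(("e", chunk))
        return chunk  # stand-in for sufficient statistics

    def m_step(stats: Chunk) -> None:
        calls.append(("m", stats))

    # Never report convergence, so every bound iteration runs.
    run_chunks_outside(corpus, chunksize, bound_iters, e_step, m_step,
                       lambda: False)
    return calls


docs = ["d0", "d1", "d2", "d3"]
# "Batch": one chunk covering the whole corpus, iterated bound_iters times.
batch_calls = trace_run(docs, chunksize=len(docs), bound_iters=3)
# "Online": each chunk is iterated bound_iters times before moving on.
online_calls = trace_run(docs, chunksize=2, bound_iters=2)
print(batch_calls)
print(online_calls)
```

With chunksize=len(corpus) every E-step sees the whole corpus, which is the batch behaviour; with a smaller chunksize each chunk is iterated on before the next one is read, which only makes sense if the updates really are online.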