
RendView Manual -- The Taskmanager

The taskmanager is the core of RendView; it is basically an event-driven state machine which takes several actions depending on arriving events and the current state. (If you are looking for it: The code is in taskmanager.cpp.) The taskmanager is responsible for launching jobs, terminating RendView, handling arriving signals, etc.

Basic operation

Normally, when RendView gets started, you pass all the information it needs on the command line. It then runs until it has done all the work you want done or it thinks it cannot go on any more.
While RendView is running, you can send signals to it.
When it has done all the work reported by the task source or it thinks it cannot go on any more, RendView acts in different ways depending on the operation mode. Normal RendView and the LDR server exit in this case, while the LDR client just recovers, which means that it disconnects from the server and gets back into a state which is similar to the one just after starting it: It simply waits idle for a new server connection. When recovery is triggered (e.g. by a sudden server disconnect), all the currently running jobs are killed (using SIGTERM and then SIGKILL, see signals below and -ld-term-kill-delay) and the assigned tasks are deleted. This means that there is no way to benefit from the work which was done for tasks which were not yet reported back to the server. However, this case only occurs in severe circumstances, e.g. if you SIGKILL (not SIGTERM) the server, if the network goes down or if there is a bug in RendView.
Normally, the LDR client gives back tasks (even unfinished frames if it killed the render process) and the LDR server waits for all clients to disconnect before exiting. (There are timeouts, however, see -Ld-rtimeout, -L-kamult.)

A word about LDR

You may have guessed that already: LDR works in the following way: You have several computers to have the work done and you start one LDR client on each of these boxes (rendview -opmode=ldrclient, see also -opmode). Then, whenever you want some frames to be rendered, you simply call the LDR server (rendview -opmode=ldrserver). The server will then connect to all the clients (in fact you pass their address and port using -Ld-clients) and give them the tasks. This means that all the required files are downloaded to the client (but see also -L-transfer, -l-r-files), even unfinished frames (resume operation, only) and the success info as well as the rendered/filtered frame are uploaded to the server again. Additional files (-l-r-files, -l-f-files) are normally only downloaded if needed (i.e. either the client does not have that file or it was modified) to save network bandwidth and CPU.
While the LDR server is running, it stays connected to all the LDR clients (using one TCP connection per client). A client can be connected to no more than one authenticated server at any time and there is no possibility for client disconnect during work. (This is necessary because both the client and the server must be able to send data to each other at any time while working.)

The path of the tasks

The taskmanager manages the internal path all the tasks take. It has three task queues, todo, proc and done and the number of tasks in these queues can be read in several verbose messages.
When a task is obtained from the task source (local: because taskmanager asked the task source to supply one more task; LDR: because the server sends one more task to the client), then that task is put into the todo queue and waits there to be processed.
There are normally a couple of tasks in the todo queue so that the taskmanager can serve them for processing at any time without having to wait for the task source. Avoiding times of inactivity (no renderer running) was one of the major goals in early development. Especially for LDR it is important that the LDR client always has some tasks around so that we do not have to wait idle until a new task and all the required files are downloaded over the network.
Do not be surprised if the task manager seems to always report or request a couple of tasks in sequence and then does not do any request or termination report (back to the task source) for a longer time. The todo and done queues trigger the talking to the task source using a threshold model (if you want to tweak around, see -ld-todo-thresh-low, -ld-todo-thresh-high, -ld-done-thresh-high, the same with -Ld instead of -ld as well as -Ld-max-client-task-thresh).

Once the taskmanager launches a job, the task is put from todo into proc queue where it stays until the job terminates or the LDR client reports the task back (as done/partly/not processed). The task is then put into done queue or back into todo queue: If the job failed or is completely processed, it is put into done queue, if it is not completely processed (i.e. rendered but not filtered) or not processed at all (i.e. LDR client gives it back unprocessed), it is put into todo.

The done queue is explained very easily: it just accumulates some tasks before they are reported back to the task source.
Hence, the tasks originate from the task source and they finally end up at the task source. The task manager only knows about a couple of tasks at a time, which is all it needs to know.
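The path described above can be modelled in a few lines. The following is only an illustrative Python sketch and not code from taskmanager.cpp; the class and method names (TaskQueues, launch, job_finished, report_done) are made up for this example, and putting a partly processed task at the end of the todo queue is an assumption.

```python
from collections import deque


class TaskQueues:
    """Toy model of the todo -> proc -> done path (all names hypothetical)."""

    def __init__(self):
        self.todo = deque()
        self.proc = []
        self.done = []

    def add_from_source(self, task):
        # A task obtained from the task source waits in todo.
        self.todo.append(task)

    def launch(self):
        # Launching a job moves the first waiting task into proc.
        task = self.todo.popleft()
        self.proc.append(task)
        return task

    def job_finished(self, task, failed, completely_processed):
        # Failed or completely processed tasks go to done; tasks which are
        # only partly processed (or not processed at all) go back into todo.
        self.proc.remove(task)
        if failed or completely_processed:
            self.done.append(task)
        else:
            self.todo.append(task)  # position in todo is an assumption

    def report_done(self):
        # Everything accumulated in done is handed back to the task source.
        reported, self.done = self.done, []
        return reported
```

A frame which was rendered but not yet filtered would pass through launch() and job_finished() twice before it shows up in report_done().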

Failures and info

You get quite a lot of information about what is being done (unless you switch that off using the -verbose option). Needless to say, you also get informed about errors and failures.
There is a simple protection against RendView launching lots of failing tasks in sequence (which may happen if the path to the renderer is wrong, an additional file is missing or whatever): In case several (normally 3) tasks fail in sequence, the task manager thinks it makes no sense to continue work and schedules quit (or recovery in case of LDR client). See -max-failed-in-seq if you want to tune that.
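The protection amounts to a counter which is reset by every successful task. A minimal Python sketch of that idea, with a hypothetical class name (FailureGuard) and not taken from the RendView sources:

```python
class FailureGuard:
    """Counts tasks failing directly after each other (hypothetical name)."""

    def __init__(self, max_failed_in_seq=3):
        # 3 is the default of -max-failed-in-seq; -1 switches the check off.
        self.max_failed_in_seq = max_failed_in_seq
        self.failed_in_seq = 0

    def task_finished(self, success):
        """Return True if work should stop (quit or recovery)."""
        if success:
            self.failed_in_seq = 0  # a success interrupts the sequence
            return False
        self.failed_in_seq += 1
        if self.max_failed_in_seq < 0:
            return False
        return self.failed_in_seq >= self.max_failed_in_seq
```

With the default of 3, two failures followed by one success followed by two more failures do not trigger a quit; only the third failure in a row does.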

Fit strategy for LDR

The "fit strategy" means the way how to assign jobs to clients in order to achieve good performance. This task is extremely difficult especially if the used computers are not equally fast.
RendView implements the following simple strategy which should work quite well (especially when using equally-fast clients with equal capabilities):
As long as the task source did not report the last frame, there is no pressure on the algorithm and it assigns tasks to clients in the following way: Take the first task in the todo queue and give it to the best client. If there is none, go to the next task in the queue, and so on. The best client is the one which has the least number of assigned tasks relative to the number of simultaneous jobs (aka CPUs) and which can process the task completely. If no client can process it completely, a client which can do it partly is chosen. If two or more clients are equally qualified, a round robin method is used.
If the task source reports the last frame, the "tight condition" comes. The best client is now the one with the least number of assigned tasks relative to its number of jobs, (nearly) irrespective of whether the client can do the task completely or partly. In particular, no client gets more tasks than its njobs value (aka number of CPUs) and in case some clients have more tasks while others need some, the clients are requested to give back all tasks which they are currently not processing (for immediate redistribution to clients which need tasks). [Such a give-back is normally necessary only once.]
Note that this strategy may not work very well if the clients' abilities are asymmetric, e.g. there are 10 render clients and one render and filter client, and that it will not smartly handle the case of differently-fast computers (as it will never request a client to kill some already running job).
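The choice of the "best client" can be illustrated as follows. This is a simplified Python sketch under several assumptions and not the real implementation: the Client fields and the function name best_client are invented, every task is assumed to need both rendering and filtering, and round robin is reduced to rotating the client list by a start index.

```python
from dataclasses import dataclass


@dataclass
class Client:
    name: str
    njobs: int            # simultaneous jobs (aka CPUs)
    assigned: int = 0     # tasks currently assigned to this client
    can_render: bool = True
    can_filter: bool = True


def best_client(clients, tight=False, rr_start=0):
    """Pick a client for a task which needs rendering and filtering."""
    def load(c):
        # Assigned tasks relative to the number of simultaneous jobs.
        return c.assigned / c.njobs

    # Rotate the list; min() returns the first minimum, so equally
    # qualified clients are picked in turn when rr_start advances.
    order = clients[rr_start:] + clients[:rr_start]
    complete = [c for c in order if c.can_render and c.can_filter]
    partly = [c for c in order if c.can_render != c.can_filter]

    if tight:
        # Tight condition: ability hardly matters any more and no client
        # gets more tasks than its njobs value.
        pool = [c for c in order
                if (c.can_render or c.can_filter) and c.assigned < c.njobs]
        return min(pool, key=load, default=None)
    if complete:
        return min(complete, key=load)
    return min(partly, key=load, default=None)
```

A return value of None corresponds to "there is none, go to the next task in the queue".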

An important note about render/filter shell scripts

Of course, you may use a shell script instead of a renderer or filter program (by specifying the shell script as the binpath in the corresponding render/filter desc, see component data base). There are many reasons why one would like to do so. However, there are some important points you must keep in mind when doing that.

Do not start background jobs. RendView's taskmanager needs one process which gets started (the shell script in this case) and which terminates just when the job is done. If you start a background job, the shell script may terminate before the job is done, leading to corrupt frames and/or errors. (Well, you may start background jobs but then make sure that the script waits for them to complete before exiting.)

Handle signals properly. This is very important. RendView expects a normal process which can be stopped using SIGTSTP and continued using SIGCONT as well as killed using SIGTERM. Make sure you react quickly to SIGTERM or set the -ld-term-kill-delay large enough.
The signal handling is normally accomplished using the trap call in shell scripts. Consult your shell manual for more information.

Use a proper return/exit code. RendView normally expects processes to return 0 on success and some non-zero value on failure.
(Unfortunately, POVRay is bugged in a way that it returns success even if parsing failed and RendView's POVRay driver has some file existence and time stamp logic to deal with that quite well.)
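The rules above apply to any wrapper, not only to shell scripts. As a minimal sketch (written in Python rather than shell, and not part of RendView), a wrapper could wait for the real program, forward SIGTERM to it and hand on its exit code. The function name run_job is made up here, and SIGTSTP/SIGCONT handling is left out for brevity, so this is not a complete wrapper.

```python
import signal
import subprocess


def run_job(cmd):
    """Run cmd in the foreground; pass on SIGTERM and the exit code."""
    child = subprocess.Popen(cmd)

    def forward_term(signum, frame):
        # React quickly to SIGTERM: hand it to the real renderer/filter.
        child.send_signal(signal.SIGTERM)

    old_handler = signal.signal(signal.SIGTERM, forward_term)
    try:
        # Do not return before the job is done (no background jobs).
        status = child.wait()
    finally:
        signal.signal(signal.SIGTERM, old_handler)
    # A negative status means the child was killed by a signal;
    # report that as a plain failure.
    return status if status >= 0 else 1
```

A script built around this would end with something like sys.exit(run_job(sys.argv[1:])), so that RendView sees 0 on success and a non-zero value on failure.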

Signal handling

RendView understands the following signals with their corresponding action:

SIGINT   (Terminal: often ^C)
Upon catching the first interrupt signal, the task manager will give back all processed and not yet processed tasks to the task source. The tasks currently running continue execution and RendView waits for them to finish. (That means: Frames which are being rendered will also be filtered before RendView exits.) When all running tasks have exited, RendView will quit gracefully (i.e. disconnect from the task driver interface and the task source).
For LDR: The LDR server will request the clients to give back all not yet processed tasks and all done tasks when receiving that signal, so that you only have to wait for tasks which are currently being processed by an LDR client.

When catching the second SIGINT, the task manager kills all the currently running processes (using SIGTERM and, if they do not terminate within some time, see -ld-term-kill-delay, finally using SIGKILL). Renderers can catch the SIGTERM and terminate gracefully, leaving an unfinished frame which can be resumed later on (see -l-cont and -l-rcont). When all the jobs have terminated (because they were just killed), RendView exits cleanly.

When catching the third interrupt signal, RendView instantly aborts execution (using abort(3) maybe dumping core). Do not provoke that unless it is necessary.
SIGTERM
Catching a termination signal is exactly the same as catching two SIGINTs. This means that if you are running RendView as a batch job and the computer shuts down, sending all processes the TERM signal, then RendView exits cleanly as fast as possible (normally leaving unfinished frames which can be resumed, see SIGINT above).
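The escalation for interrupt and termination signals can be summarised in a small function. This is a Python sketch for illustration only; the function name and the returned strings are invented, and treating a SIGTERM which arrives after SIGINTs as "at least level two" is an assumption.

```python
def reaction(n_sigint=0, got_sigterm=False):
    """Map the signals caught so far to the reaction described above."""
    # One SIGTERM counts like two SIGINTs (assumption: the levels do not
    # add up when both kinds of signal were caught).
    level = max(n_sigint, 2 if got_sigterm else 0)
    if level == 0:
        return "keep running"
    if level == 1:
        return "give back tasks, wait for running jobs, then quit"
    if level == 2:
        return "kill running jobs (SIGTERM, then SIGKILL), then quit"
    return "abort immediately"
```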
SIGTSTP   (Terminal: often ^Z)
Upon catching a terminal stop signal, RendView can act in two ways: If the task source is a local one (i.e. normal RendView and LDR server), RendView stops all currently running tasks (using SIGTSTP) and then stops itself by sending a SIGSTOP to itself. In case of an LDR server, a control command is sent to the clients demanding to stop all processes (RendView is now in mode "stopping"). When a confirmation response has been received from all clients, the LDR server finally stops itself (SIGSTOP, mode now "stopped").
This means that when pressing ^Z, RendView and all tasks are stopped and you get the shell prompt back.

When using the LDR task source (LDR client) things get more complicated as seen above: After receiving the control command to stop all processes, stopping all processes and sending confirmation to the server, the client goes in "stop" mode which means that it will not launch more jobs, will not talk to the task source and will disable the server keepalive timeout (i.e. will not consider the connection to the server to be broken after some time of inactivity). The client does not stop itself because that would render it completely useless (it could not continue upon request, see SIGCONT below).
Note that all other timeouts (including the client response timeout) stay active. This means that in case you set a timeout on a render or filter job (-l-r-timeout or similar) or in case a control command was not yet answered by the client, things are likely to fail at the time you continue. [I will fix the client response timeout in a future version if it turns out to be a problem. LDR works fine as long as no non-answered client control command is pending.]
SIGCONT   (Terminal: often fg, bg)
When receiving SIGCONT, RendView will enter "continuing" mode and send SIGCONT to all processes or send a continuation control request to all LDR clients. When all processes are running again (i.e. confirmation was received from the clients), it enters normal "running" mode again. Note that RendView will do so even if it was not in "stopping" or "stopped" mode, which means that you can use it to continue jobs launched by RendView which were stopped by some other means. (It also means that the routines to decide whether to give back/get new tasks and whether to launch a task are re-examined, which may be interesting for bug hunting.)
The LDR client basically un-does all the things it did when receiving the stop control command.
Note that "continuing" and "running" mode are much the same: RendView will launch new jobs or talk to the task source in both modes. In "stopping" and "stopped" mode, these actions are not taken.
SIGUSR1
If you send a user 1 signal to RendView, the task manager will dump the state of all internal state variables to the terminal (stderr). This is mainly useful in debugging (e.g. if RendView simply does nothing but waiting or spins busily without good reason).
SIGUSR2
Sending a SIGUSR2 to RendView will make the taskmanager dump a complete list of all tasks in the todo, proc and done lists. This is also mainly useful in debugging but can be used to see what is just being done.
NOTE: You will see nothing unless the TDR verbose stream is enabled.
SIGKILL, SIGSTOP
These are signals which cannot be handled by a user process. Consequently, RendView cannot deal with them gracefully. Always use SIGTERM instead of SIGKILL and SIGTSTP instead of SIGSTOP unless it is absolutely necessary.

Parameters for the taskmanager and driver interface

I've been talking a lot about the "taskmanager" above. As you know from the quick start, this is a little simplification because the taskmanager does not do all that alone but uses a task driver interface (which can be of type "local" or "LDR"). The task driver interface is the virtualisation of the different ways tasks can be launched (either locally or via LDR). Consequently, the task driver interfaces also take specific options/parameters.

Parameters for the taskmanager

The taskmanager itself knows quite a few parameters:

-opmode=MODE
This switch selects the basic operation mode. Valid values for MODE are rendview (default), ldrserver and ldrclient.
Normal RendView mode selects the local task source and the local task driver interface.
LDR server operation mode selects the local task source and the LDR task driver interface, while LDR client mode uses the LDR task source and the local task driver interface.
Instead of specifying -opmode, you can also rename (or symlink) the RendView binary. If RendView is called ldrserver or ldrclient, the operation mode will default to ldrserver or ldrclient respectively. If you call it rendview or something completely different, it will default to normal rendview opmode.
You can use -opmode to override the operation mode set by the binary name.
-daemon
When used, RendView will detach from the terminal and go into background when starting to work. This is especially useful for LDR clients. The following table lists all possibilities:

Argument         Background  Closed streams         Alternative
-daemon=no       no          (none)                 simply do not specify -daemon
-daemon=yes      yes         stdin                  simply use -daemon
-daemon=close    yes         stdin, stdout, stderr
-daemon=noclose  yes         (none)

You will normally want daemons to be quiet. The most convenient way may be to use -daemon which closes stdin (especially required for ssh connections) and then direct the output streams to some log file:
./rendview -daemon [...] >log 2>&1
-max-failed-in-seq=NUM   (also: -mfis)
When at least this number of tasks failed in sequence (i.e. directly following each other without successful tasks in between), RendView will give up, not start any more jobs and schedule quit (i.e. wait for all tasks to finish / clients to quit, and then quit (local) or recover (LDR)).
You may set a value of -1 to switch off this feature, which is not recommended.
The default value is 3.
-etimeout=DATE
This sets a limit on how long RendView may run. This is useful if you may use several boxes for rendering during the night but you have to stop that at e.g. seven o'clock in the morning.
DATE can be specified using either an absolute or a relative time:

Absolute time has the format "[DD.MM.[YYYY]] HH:MM[:SS]" which means that if you want to stop at 19:00 today, you can use -etimeout=19:00, for 19:00 on Mar 21st, use -etimeout="21.3. 19:00".

Relative time has the format "now + {DD | [[HH:]MM]:SS}", so if you want RendView to run no longer than 7 hours, use -etimeout=now+7:0:0; if you want to limit execution to 7 days, use -etimeout=now+7 (without ":"); for a limit of 30 minutes, use -etimeout=now+30:0.

For testing, you may set -l-nframes=0, then launch RendView and check the line "Execution timeout:" in verbose output.
See also -etimeout-sig
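To make the relative time format more tangible, here is a small Python sketch which turns a "now + ..." spec into a duration. It only illustrates the format given above; the function name is invented and RendView's own parser may accept or reject more than this does.

```python
from datetime import timedelta


def parse_relative_timeout(spec):
    """Parse "now + {DD | [[HH:]MM]:SS}" into a timedelta (sketch only)."""
    head, sep, rest = spec.partition("+")
    if head.strip() != "now" or not sep:
        raise ValueError("not a relative time spec: %r" % spec)
    rest = rest.strip()
    if ":" not in rest:
        # A plain number without ":" means days.
        return timedelta(days=int(rest))
    # With ":", the last field is seconds, then minutes, then hours.
    fields = [int(f) if f else 0 for f in rest.split(":")]
    if len(fields) > 3:
        raise ValueError("too many fields: %r" % spec)
    fields = [0] * (3 - len(fields)) + fields
    hours, minutes, seconds = fields
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)
```

For the three examples from the text, parse_relative_timeout("now+7:0:0") gives 7 hours, "now+7" gives 7 days and "now+30:0" gives 30 minutes.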
-etimeout-sig=SPEC
When the execution timeout (as specified with -etimeout) has passed, RendView should stop working in some way. Using this option, you can specify how RendView reacts. Possible values for SPEC are:
int: behave like catching one SIGINT.
term: behave like catching one SIGTERM.
abort: Immediately abort. Do not use if avoidable.
See above for RendView's reaction to signals.
This is the run cycle idle timeout which only affects the active task sources, i.e. the LDR client. If the client is idle (meaning not connected to an authenticated LDR server) for more than SEC seconds, then it will terminate (more precisely: behave like catching one SIGINT).
This can be useful if you want clients to terminate automatically when you do not give them jobs for some time.
-load-max, -load-min=VAL
This is the "load control": When specified, RendView will not start jobs when the load is greater or equal -load-max but instead wait until the load is lower than -load-min again. The value VAL is the desired load value multiplied with 100 specified as an integer (i.e. 150 for load 1.5).
This option is probably not too useful. You cannot use it to regulate the number of launched jobs (try it out if you do not believe me). However, if you get told that your rendering may only start jobs if the machine you are sharing with others has a load of below 1 (or so), then this can be used.
See also -load-poll-msec
-load-poll-msec=MSEC
If the load value is so high that no job may be started, RendView has to check the load continuously to see when it is down again. The checking is done in intervals of length MSEC milliseconds.
See also -load-max.
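The load control is a simple hysteresis: jobs are blocked once the load reaches -load-max and released only when it has dropped below -load-min. A Python sketch of that logic (class and method names are hypothetical, not taken from RendView):

```python
class LoadControl:
    """Hysteresis between -load-min and -load-max (sketch only)."""

    def __init__(self, load_max, load_min):
        # Both values are the load multiplied by 100 (150 means load 1.5).
        self.load_max = load_max
        self.load_min = load_min
        self.blocked = False

    def may_start_job(self, load):
        """load is the current load value as a float, e.g. 1.5."""
        val = int(round(load * 100))
        if self.blocked:
            # Once blocked, wait until the load is below load_min again.
            if val < self.load_min:
                self.blocked = False
        elif val >= self.load_max:
            self.blocked = True
        return not self.blocked
```

While blocked, a caller would ask again every -load-poll-msec milliseconds; on Unix-like systems the current load could be read with os.getloadavg()[0].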
This is mainly useful in debugging. When re-scheduling is necessary, instruct the taskmanager to not do that immediately but wait MSEC milliseconds before scheduling.
Of course, this defaults to 0 and you should not use it.
It can be used as a crude fix in case the taskmanager spins idle wasting CPU. But better report such a case as a bug to the author.
-dumptask=SPEC
You get informed on the terminal via verbose output about what happens to a task. Using this switch, you can specify when you want to get informed. The syntax for SPEC is +/-VAL... where VAL consists of one or several letters with the following meaning:
a: dump info on task arrival (LDR)
q: dump info when task is being queued in todo queue
b: dump info when reporting task as done
d: dump info ??? task was and given back/destroyed
r: dump info when rendering is done
f: dump info when filtering is done
+Z: turn all info on
-z: turn all info off
Capital letters mean that you get long info (i.e. complete task dump) while small letters only lead to a one-line short info.
Examples: -dumptask=+QDarf-d (default) or -dumptask=-Rf+QD-r (where the -r will cancel the previous +R). If unsure, try out to see the effect...

Parameters for the local task driver interface

The local task driver interface is used whenever a job (renderer or filter) has to be executed on the local machine (thus in normal RendView and in LDR client operation mode). It knows the following parameters.

-ld-njobs=NUM
Specify the number of simultaneous jobs to start. RendView will always try to have NUM processes running at a time; it may be fewer but never more.
If RendView can detect the number of CPUs in the computer, NUM defaults to that value. Otherwise, the default is 1.
-ld-term-kill-delay=MSEC
In case RendView decides that a job has to be killed, it will first send it a SIGTERM. However, if the job does not terminate within MSEC milliseconds, it will finally kill it using SIGKILL. The default is 1000 msec (1 second).
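The SIGTERM-then-SIGKILL sequence is a common pattern. The following Python sketch shows it for a single child process; it blocks while waiting, which the event-driven taskmanager presumably does not do, and the function name is made up for this example.

```python
import subprocess


def terminate_job(proc, delay_msec=1000):
    """Send SIGTERM, wait up to delay_msec, then fall back to SIGKILL."""
    proc.terminate()                      # SIGTERM on POSIX systems
    try:
        proc.wait(timeout=delay_msec / 1000.0)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        proc.kill()                       # SIGKILL on POSIX systems
        proc.wait()
        return "SIGKILL"
```

The return value tells which signal finally ended the job. A renderer which needs a while to write its unfinished frame on SIGTERM is the reason to choose a larger delay.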
-ld-todo-thresh-low, -ld-todo-thresh-high, -ld-done-thresh-high=NUM
These are the todo and done queue thresholds.
The task driver will start requesting new tasks from the task source if there are less than todo-thresh-low tasks in the todo queue. This does not apply to the LDR client which gets tasks assigned by the LDR server. The client cannot demand for new tasks; the LDR server has to take care that the clients have enough tasks.
The taskmanager will never store more than todo-thresh-high tasks in the todo queue, i.e. it will stop asking the task source for more tasks when these many tasks are in the todo queue. This also does not apply to the LDR client.
done-thresh-high is the number of tasks which have to accumulate in the done queue before reporting (all the tasks in the queue) back to the task source as "done". Use of 1 for the LDR client is recommended but not mandatory. (It is safest if the LDR client gives back info about successful frames as quickly as possible. High values may prevent the LDR server from giving new tasks to the LDR client because it thinks there are already enough tasks assigned to the client which were not yet reported back.)
Defaults should be reasonable.
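The threshold model can be sketched like this in Python. The function names are invented, and requesting enough tasks to fill the todo queue up to the high threshold in one go is an assumption based on the description above, not a statement about the actual code.

```python
def tasks_to_request(todo_len, thresh_low, thresh_high):
    """How many tasks to ask the task source for (0 means: do not ask)."""
    if todo_len < thresh_low:
        # Below the low threshold: fill up to the high threshold.
        return thresh_high - todo_len
    return 0


def should_report_done(done_len, done_thresh_high):
    """True if the accumulated done tasks should be reported back."""
    return done_len >= done_thresh_high
```

With hypothetical thresholds low=2 and high=5, a todo queue holding one task leads to a request for four tasks, after which no request happens until the queue has drained below two again.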
-ld-r-mute, -ld-r-quiet
Direct render output to /dev/null so that it does not clutter your terminal. Using -ld-r-mute will only tie stdout to /dev/null while -ld-r-quiet does it for both stdout and stderr.
Default: Both switched off. Switching -ld-r-quiet on is recommended.
-ld-r-nice, -ld-f-nice=NVAL
Start render/filter processes with the specified nice value NVAL. Values of 10 to 20 are probably good if other people or processes also want to run on the box.
See also -ld-r-nice-jitter below.
Default: No nice value.
-ld-r-nice-jitter, -ld-f-nice-jitter
When used, vary nice values randomly by adding or subtracting 1 to prevent the render/filter processes from terminating simultaneously. May not have the desired effect, though. Use -no-ld-r-nice-jitter to switch off.
-ld-r-jobs-max, -ld-f-jobs-max=NUM
Limit the number of simultaneous render/filter processes, respectively. Note that -ld-njobs is the overall limit which cannot be exceeded. However, you may find that the filters run so fast that it is sufficient to run one at a time, which has the advantage that filtered frame files will be less fragmented on the hard drive. There may also be other reasons for using this.
If you specify a limit of 0, then no rendering/filtering will be done. This is unwise for normal RendView and LDR server operation because the frames to be filtered will get stuck in todo queue and finally nothing goes on any more.
On the LDR client side, you may use a limit of 0 to make sure that this client does not get frames to be rendered. This works because the client will then report no render/filter descs to the server.
Both values default to -ld-njobs.
-ld-r-timeout, -ld-f-timeout=SEC
Specify a timeout in seconds for the render/filter process. The timeout specifies the maximum time between launching the render/filter process and its termination. In case the timeout is passed, the normal SIGTERM, SIGKILL sequence is sent to the process (see -ld-term-kill-delay). Use a value of -1 to disable.
-ld-r-detach-term, -ld-f-detach-term
If you disable these (using e.g. -no-ld-r-detach-term), then you allow the terminal to keep control over the render process. This is not recommended (because of SIGINT, SIGTSTP signal handling).
Default: enabled

Parameters for the LDR task driver interface

The LDR task driver is used by the LDR server. It effectively handles all the LDR server stuff, including all the network and transfer issues. It understands the following options.

-Ld-clients
The most important option; it specifies a list of LDR clients to use. The syntax is a space-separated list of client specs where each client spec looks like one of "HOST", "HOST/PORT", "HOST/PORT/PASSWORD", "HOST//PASSWORD".
HOST is either an IPv4 address of the host the client is running on, or a domain name which gets resolved via the standard resolve library.
PORT is the TCP port the client listens to. It defaults to the value specified with -Ld-port (see below). PASSWORD is a password for this client. It defaults to the value specified with -Ld-password (see below).
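The client spec syntax is easy to take apart. The Python sketch below only illustrates the four forms listed above; the function names are invented, 3104 is the default LDR port documented for -Ld-port, and how RendView itself treats malformed specs is not covered.

```python
def parse_client_spec(spec, default_port=3104, default_password=None):
    """Split "HOST[/PORT[/PASSWORD]]" or "HOST//PASSWORD" into its parts."""
    host, _, rest = spec.partition("/")
    port_str, _, password = rest.partition("/")
    # Empty fields fall back to the defaults (-Ld-port, -Ld-password).
    port = int(port_str) if port_str else default_port
    return (host, port, password if password else default_password)


def parse_client_list(arg, **defaults):
    """Parse a space-separated list of client specs."""
    return [parse_client_spec(s, **defaults) for s in arg.split()]
```

For example, parse_client_list("alpha beta/4000 delta//pw") yields one (host, port, password) tuple per client, with the default port filled in for alpha and delta.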
-Ld-port
Specify the (default) LDR client TCP port. The default LDR port is 3104.
-Ld-password
Specify the (default) client password. See the LDR client description in the tasksource section for more info about the authentication.
Apart from a password string, you may use the following special values:
none: no password (insecure). This is also the case if you do not specify one.
prompt: prompt you for the password (using getpass(3)).
file:PATH: read password from file PATH
(No more than 128 bytes will be read; falls back to prompt if an error occurs or the file is empty.)
Specifying the password on the command line is insecure; using prompt or file: is better, because it will then not show up using ps(1) or top(1) and will not be left in your shell history file. You may also consider passing the password spec using the environment var RENDVIEWARGS (see component data base) but don't pass the literal password there because it may be possible to access the environment as well (Linux users: have a closer look at /proc).
The connection timeout in milliseconds; that is the maximum time allowed to pass between initiating a connection to the client and completing the authentication handshake.
The default is 15 seconds.
Note that this timeout as well as -Ld-rtimeout does not have millisecond precision; values below 1000 (1 second) do not make sense.
Re-connect interval. In case the LDR server could not connect to a client or got disconnected during operation for whatever reason, you may want it to retry the connection from time to time. The rcinterval option specifies this interval in seconds.
Note that due to internal scheduling, the actual interval time may be up to twice as large (which is not really a problem).
A value of -1 switches off this feature.
Default is 5 minutes.
Send the ping control command to all clients every SEC seconds. This makes sure that they are still up and working because there is a timeout on the response time to all control commands (see -Ld-rtimeout below).
Use a value of -1 to switch that off, which is not recommended because the ping is used to detect unreachable clients (for whatever reason: network failure, client computer reboot, etc.).
The default is 30 seconds.
Note that no keepalive ping requests are sent if the connection is busy due to other operation (down/uploading tasks,...) because in these cases the server knows that the client is still there.
-Ld-rtimeout=MSEC
Maximum time it may take a client to respond to a control command (like stop/cont/kill tasks, ping, disconnect). The client is considered dead if the response does not arrive within MSEC milliseconds.
Use -1 to disable this feature which is not recommended.
The default is 5 seconds.
Note: You may need to increase this default when large files are transferred. The reason is that the connection between LDR client and server is one TCP connection used in full duplex mode. That means the server can send a control request to the client while the client uploads a file to the server. However, the client cannot send the response before the file is uploaded completely. If the whole -Ld-rtimeout timeout passes while uploading the file, the client will get kicked although it should not. There is no easy way to solve that because we must not get trapped by stalled file uploads.
-Ld-todo-thresh-low, -Ld-todo-thresh-high, -Ld-done-thresh-high=NUM
This works just like the corresponding options of the local task driver interface, see above.
-Ld-max-jobs-per-client, -Ld-max-client-task-thresh=NUM
These are protective parameters. When the server connects to the client, the client reports (to the server) its -ld-njobs (see above) value (number of parallel tasks to start) as well as the "high task threshold" which is the number of tasks which the client would like to have assigned at any time (which is higher than -ld-njobs so that the client always has some tasks around to be able to quickly start new jobs whenever some running jobs terminate -- without having to wait for the LDR server to supply new tasks).
-Ld-max-jobs-per-client specifies the maximum -ld-njobs value accepted from the clients. Higher values will get decreased to -Ld-max-jobs-per-client.
-Ld-max-client-task-thresh is the limit for the "high task thresh" reported by the client and hence limits the number of jobs which are assigned to a client at any time.
Both features can be switched off using a value of -1 (if you trust your clients).
The defaults are 24 and 36, respectively.
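In effect, the server clamps the two values reported by a client. A short Python sketch of that behaviour (function names are hypothetical; the defaults 24 and 36 are the ones stated above):

```python
def clamp(reported, limit):
    """Apply a server-side limit; a limit of -1 switches the check off."""
    if limit < 0:
        return reported
    return min(reported, limit)


def accept_client_values(njobs, high_thresh,
                         max_jobs_per_client=24,
                         max_client_task_thresh=36):
    """Return the (njobs, high task thresh) pair the server will use."""
    return (clamp(njobs, max_jobs_per_client),
            clamp(high_thresh, max_client_task_thresh))
```

A client reporting 64 jobs and a high task threshold of 100 would thus be treated as (24, 36) with the defaults, and as (64, 100) if both limits are set to -1.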
-Ld-r-timeout, -Ld-f-timeout=SEC
This specifies the task driver's timeout for render/filter jobs. This is normally not needed as you can set a timeout on the server side using the (local) task source (-l-r-timeout) and on the client side using the local task driver (-ld-r-timeout).
When you use it, the task sent to the client will contain the shorter of the two timeouts (i.e. -l-r-timeout and -Ld-r-timeout).
A value of -1 disables the timeout (default).
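Choosing the shorter of the two timeouts could look as follows in Python. That a disabled timeout (-1) simply does not take part in the comparison is an assumption made for this sketch, and the function name is invented.

```python
def effective_timeout(source_timeout, driver_timeout):
    """Shorter of the two timeouts in seconds; -1 means no timeout."""
    # Assumption: a disabled timeout (-1) is ignored in the comparison.
    active = [t for t in (source_timeout, driver_timeout) if t >= 0]
    return min(active) if active else -1
```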

The task drivers

Task drivers actually launch the tasks. Currently, there is the POVRay render task driver which supports several versions of POVRay (at least 3.1g and 3.5) as well as a generic filter driver supporting any filter which reads the input image from stdin and writes it to stdout.

Unfortunately, POVRay is a bit bugged when it comes to its exit status. It returns 0 (success) even if parsing failed and no output was actually generated. Hence, RendView applies some tricks to work around this: It checks if the output file exists and also checks the time stamp: if the modification time is older than the launch time of POVRay, the file obviously did not get touched and rendering is considered as failed even if POVRay returns "success". You can read about "spurious success" in the output in such a case.

The filter driver also checks for the output frame existence but does not apply any time stamp checks (because a filter may decide to actually not touch an image for whatever reason).
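The two checks can be written down in a few lines of Python. This is a sketch of the idea only, not the driver code; the function names are made up, and launch_time is assumed to be a time.time() value taken just before the job was started.

```python
import os


def render_succeeded(exit_status, output_path, launch_time):
    """Exit status plus file existence and time stamp check (sketch)."""
    if exit_status != 0:
        return False
    try:
        mtime = os.stat(output_path).st_mtime
    except FileNotFoundError:
        return False  # no output file at all
    # Output older than the launch time was not touched by the renderer:
    # that is the "spurious success" case.
    return mtime >= launch_time


def filter_succeeded(exit_status, output_path):
    # The filter driver only checks that the output frame exists.
    return exit_status == 0 and os.path.exists(output_path)
```

On file systems with coarse time stamps, a real implementation may need some tolerance in the comparison; the sketch ignores that.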

Last modified: 2008-02-11 21:06:42 Copyright © 2003 Wolfgang Wieser