Dataflows Components: Task and Workflow
=======================================

A *Task* is the basic runnable component of *Pydra* and is described by the
class ``TaskBase``. A *Task* has named inputs and outputs, thus allowing
construction of dataflows. It can be hashed and executes in a specific working
directory. Any *Pydra* *Task* can be used as a function in a script, thus
allowing dual use in *Pydra* *Workflows* and in standalone scripts. There are
several classes that inherit from ``TaskBase`` and each has a different
application:

Function Tasks
--------------

``FunctionTask`` is a *Task* that executes Python functions. Most Python
functions declared in an existing library, package, or interactively in a
terminal can be converted to a ``FunctionTask`` by using *Pydra*'s decorator,
``mark.task``.

.. code-block:: python

    import numpy as np
    from pydra import mark

    # annotate the input and return types, then wrap the function as a task
    fft = mark.annotate({'a': np.ndarray, 'return': float})(np.fft.fft)
    fft_task = mark.task(fft)()
    result = fft_task(a=np.random.rand(512))

``fft_task`` is now a *Pydra* *Task* and ``result`` will contain a *Pydra*
``Result`` object. In addition, the user can specify the output with Python's
function annotations or with another *Pydra* decorator, ``mark.annotate``. In
the following example, we decorate an arbitrary Python function to create
named outputs:

.. code-block:: python

    from pydra import mark

    @mark.task
    @mark.annotate(
        {"return": {"mean": float, "std": float}}
    )
    def mean_dev(my_data):
        import statistics as st
        return st.mean(my_data), st.stdev(my_data)

    result = mean_dev(my_data=[...])()

When the *Task* is executed, ``result.output`` will contain two attributes:
``mean`` and ``std``. Named attributes facilitate passing different outputs to
different downstream nodes in a dataflow.

.. _shell_command_task:

Shell Command Tasks
-------------------

``ShellCommandTask`` is a *Task* used to run shell commands and executables.
It can be used with a simple command without any arguments, or with a specific
set of arguments and flags, e.g.:

.. code-block:: python

    ShellCommandTask(executable="pwd")
    ShellCommandTask(executable="ls", args="my_dir")

The *Task* can accommodate more complex shell commands by allowing the user to
customize inputs and outputs of the commands. One can generate an input
specification to specify names of inputs, positions in the command, types of
the inputs, and other metadata. As a specific example, FSL's BET command
(Brain Extraction Tool) can be called on the command line as:

.. code-block:: bash

    bet input_file output_file -m

Each command argument can be treated as a named input to the
``ShellCommandTask`` and can be included in the input specification. As shown
next, even an output is specified by constructing the *out_file* field from a
template:

.. code-block:: python

    # import paths as in Pydra 0.x; they may differ in other versions
    from pydra import ShellCommandTask
    from pydra.engine.specs import File, ShellSpec, SpecInfo

    bet_input_spec = SpecInfo(
        name="Input",
        fields=[
            ("in_file", File,
             {"help_string": "input file ...",
              "position": 1,
              "mandatory": True}),
            ("out_file", str,
             {"help_string": "name of output ...",
              "position": 2,
              "output_file_template": "{in_file}_br"}),
            ("mask", bool,
             {"help_string": "create binary mask",
              "argstr": "-m"}),
        ],
        bases=(ShellSpec,),
    )

    ShellCommandTask(executable="bet", input_spec=bet_input_spec)

More details are in the :ref:`Input Specification section`.

Container Tasks
---------------

The ``ContainerTask`` class is a child class of ``ShellCommandTask`` and serves
as a parent class for ``DockerTask`` and ``SingularityTask``. Both *Container
Tasks* run shell commands or executables within containers with specific
user-defined environments, using Docker_ and Singularity_ software
respectively. This is useful for users and projects that require environment
encapsulation and sharing. Using container technologies helps improve the
reproducibility of scientific workflows, one of the key concepts behind *Pydra*.
These *Container Tasks* can be defined by using the ``DockerTask`` and
``SingularityTask`` classes directly, or can be created automatically from
``ShellCommandTask`` when the optional argument ``container_info`` is passed
while creating a *Shell Task*. The following two types of syntax are
equivalent:

.. code-block:: python

    DockerTask(executable="pwd", image="busybox")
    ShellCommandTask(executable="pwd", container_info=("docker", "busybox"))

Workflows
---------

``Workflow`` is a subclass of *Task* that provides support for creating *Pydra*
dataflows. As a subclass, a *Workflow* acts like a *Task*: it has inputs and
outputs, is hashable, and is treated as a single unit. Unlike *Tasks*,
*Workflows* embed a directed acyclic graph. Each node of the graph contains a
*Task* of any type, including another *Workflow*, and can be added to the
*Workflow* simply by calling the ``add`` method. The connections between
*Tasks* are defined by using so-called *Lazy Inputs* or *Lazy Outputs*. These
are special attributes that allow assignment of values when a *Workflow* is
executed rather than at the point of assignment. The following example creates
a *Workflow* from two *Pydra* *Tasks*.

.. code-block:: python

    # mult and add2 are assumed to be Pydra Tasks defined elsewhere

    # create a workflow with two input fields
    wf = Workflow(name="wf", input_spec=["x", "y"])
    # add a task and connect its inputs to the workflow inputs
    wf.add(mult(name="mlt", x=wf.lzin.x, y=wf.lzin.y))
    # add another task and connect its input to the output of the "mlt" task
    wf.add(add2(name="add", x=wf.mlt.lzout.out))
    # set the workflow output
    wf.set_output([("out", wf.add.lzout.out)])

Task's State
------------

All *Tasks*, including *Workflows*, can have an optional attribute
representing an instance of the ``State`` class. This attribute controls the
execution of a *Task* over different input parameter sets. This class is at the
heart of *Pydra*'s Map-Reduce feature, which operates over arbitrary inputs of
nested dataflows. The ``State`` class formalizes how users can specify
arbitrary combinations.
Its functionality is used to create and track different combinations of input
parameters, and optionally allow limited or complete recombinations. In order
to specify how the inputs should be split into parameter sets, and optionally
combined after the *Task* execution, the user can set the ``splitter`` and
``combiner`` attributes of the ``State`` class.

.. code-block:: python

    task_with_state = add2().split(x=[1, 5]).combine("x")

In this example, the ``State`` class is responsible for creating a list of two
separate inputs, ``[{"x": 1}, {"x": 5}]``, and each run of the *Task* gets one
element from the list. Note that in this case the value for ``x`` is set in the
``split()`` method, not at the task's initialisation. The ``combine()`` method
specifies that the results are grouped back together when the result is
returned from the *Task*.

While this example illustrates mapping and grouping of results over a single
parameter, *Pydra* extends this to arbitrary combinations of input fields and
downstream grouping over nested dataflows. Details of how splitters and
combiners power *Pydra*'s scalable dataflows are described in the next section.

.. _Docker: https://www.docker.com/
.. _Singularity: https://www.singularity.lbl.gov/
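
The split-and-combine behaviour described for ``add2().split(x=[1, 5]).combine("x")`` can be illustrated with a minimal plain-Python sketch. This is not *Pydra*'s implementation and does not use the ``State`` class: ``expand_states`` and ``run_over_states`` are hypothetical helpers, ``add2`` is assumed to add 2 to its input, and taking every combination of values when several fields are split is only one possible choice made here for simplicity.

```python
from itertools import product


def add2(x):
    # Stand-in for the add2 task; assumed here to add 2 to its input.
    return x + 2


def expand_states(**split_fields):
    """Hypothetical helper: build one input dict per combination of split values."""
    names = list(split_fields)
    combos = product(*(split_fields[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def run_over_states(func, **split_fields):
    """Hypothetical helper: run func once per input set and group the results."""
    states = expand_states(**split_fields)
    # "split": each run receives exactly one element of the list of input sets
    results = [func(**inputs) for inputs in states]
    # "combine": the per-run results are returned together as a single list
    return states, results


states, results = run_over_states(add2, x=[1, 5])
print(states)   # [{'x': 1}, {'x': 5}]
print(results)  # [3, 7]
```

The sketch only mirrors the bookkeeping: a list of input sets is created, the function runs once per set, and the results are grouped back in the same order.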