Native Code Packaging

When your model is written in a language for which no specific OpenMOLE task exists, or when it relies on an assembly of tools, libraries, binaries, etc., you have to package it as a single piece of executable code so that it can be sent to any machine (with potentially varying OS installations). This page explains how to do that with the CARE packaging tool.

The usefulness of the CARE tool for reproducible science is covered in the following paper:

Jonathan Passerat-Palmbach, Romain Reuillon, Mathieu Leclaire, Antonios Makropoulos, Emma C. Robinson, Sarah Parisot and Daniel Rueckert, Reproducible Large-Scale Neuroimaging Studies with the OpenMOLE Workflow Management System, published in Frontiers in Neuroinformatics Vol 11, 2017.

Packaging with CARE

In OpenMOLE, a generic task named CARETask runs external applications packaged with CARE. The CARE site (which currently offers an outdated version of CARE, but excellent documentation) can be found here. CARE makes it possible to package your application from any Linux computer, and then re-execute it on any other Linux computer. The CARE / OpenMOLE pair is a very efficient way to distribute your application at very large scale with very little effort. Please note that this packaging step is only necessary if you plan to distribute your workflow to a heterogeneous computing environment such as the EGI grid. If you target local clusters running the same operating system and sharing a network file system, you can jump directly to the SystemExecTask.

You should first install CARE:

download the CARE tool from here
make it executable (chmod +x care)
add the path to the executable to your PATH variable (export PATH=/path/to/the/care/folder:$PATH)
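The last two steps above can be scripted once the binary has been downloaded. The sketch below is only an illustration: install_care is a hypothetical helper name, and the folder passed to it is assumed to be wherever you saved the care binary.

```shell
# install_care DIR
# Hypothetical helper: makes DIR/care executable and puts DIR first in PATH.
# Assumes the care binary has already been downloaded into DIR.
install_care() {
  care_dir=$1
  if [ ! -f "$care_dir/care" ]; then
    echo "no care binary found in $care_dir" >&2
    return 1
  fi
  chmod +x "$care_dir/care"   # make the binary executable
  PATH="$care_dir:$PATH"      # prepend its folder to PATH
  export PATH
}

# Example (the path is a placeholder):
#   install_care /path/to/the/care/folder && command -v care
```

The function has to be defined and called in your current shell for the PATH change to be visible there. To keep it across sessions, the export line can be added to your shell startup file.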

The CARETask was designed to embed native applications, such as programs compiled from C, C++ or Fortran, or scripts run by interpreters such as Python, R, Scilab... Embedding an application in a CARETask happens in 2 steps:

First you should package your application using the CARE binary you just installed, so that it executes on any Linux environment. This usually consists in prepending your command line with:
care -o /path/to/myarchive.tgz.bin -r ~ -p /path/to/mydata1 -p /path/to/mydata2 mycommand myparam1 myparam2
Before going any further, here are a few notes about the options accepted by CARE:

-o indicates where to store the archive. At the moment, OpenMOLE prefers to work with archives stored in .tgz.bin so please don't toy with the extension ;-)
-r ~ is optional, but it has proved necessary in some cases. As a rule of thumb, if you encounter problems when packaging your application, try adding or removing it.
-p /path asks CARE not to archive /path. This is particularly useful for input data that will change with your parameters. You probably do not want to embed this data in the archive, and we'll see further down how to inject the necessary input data in the archive from OpenMOLE.

Second, just provide the resulting package along with some other information to OpenMOLE. Et voilà! If you encounter any problem packaging your application, please refer to the corresponding entry in the FAQ.

One very important aspect of CARE is that you only need to package your application once. As long as the execution you use to package your application makes use of all the dependencies (libraries, packages, ...), you should not have any problem re-executing this archive with other parameters.

Advanced Options

Return value

The CARETask can be customised to fit the needs of a specific application. For instance, some applications disregarding standards might not return the expected 0 value upon completion. The return value of the application is used by OpenMOLE to determine whether the task has been successfully executed, or needs to be re-executed. Setting the boolean flag errorOnReturnValue to false will prevent OpenMOLE from re-scheduling a CARETask that has reported a return code different from 0. You can also get the return code in a variable using the returnValue setting.
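Before deciding whether errorOnReturnValue needs to be set to false, it can help to check what your application actually returns when run by hand. The following is a minimal sketch; report_exit_status is a hypothetical helper name, not part of CARE or OpenMOLE.

```shell
# report_exit_status CMD [ARGS...]
# Hypothetical helper: runs the given command and prints its exit status.
# Always returns 0 itself so that it can be used in scripts running with "set -e".
report_exit_status() {
  "$@"
  status=$?
  echo "exit status: $status"
  return 0
}

# Example (the command is a placeholder for your own application):
#   report_exit_status mycommand myparam1 myparam2
```

If the printed status is not 0 after a run you consider successful, the application is one of those that needs errorOnReturnValue set to false.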

Standard and error outputs

Another default behaviour is to print the standard and error outputs of each task in the OpenMOLE console. Such raw prints might not be suitable when a very large number of tasks is involved or when further processing is to be performed on the outputs. A CARETask's standard and error outputs can be assigned to OpenMOLE variables, and thus injected in the dataflow, by using the stdOut and stdErr settings of the task, respectively.

Environment variables

Like any other process, the applications contained in OpenMOLE's native tasks accept environment variables to influence their behaviour. Variables from the dataflow can be injected as environment variables using the environmentVariable += (variable, "variableName") field. If no name is specified, the environment variable is named after the OpenMOLE variable. Environment variables injected from the dataflow are added to the pre-existing set of environment variables from the execution host. This proves particularly useful to preserve the behaviour of some toolkits when executed on local environments (ssh, clusters, ...) where users control their work environment.

The following snippet creates a task that employs the features described in this section:

// Declare the variable
val output = Val[String]
val error  = Val[String]
val value = Val[Int]

// Any task
val pythonTask =
  CARETask("hello.tgz.bin", "python hello.py") set (
    stdOut := output,
    stdErr := error,
    returnValue := value,
    environmentVariable += (value, "I_AM_AN_ENV_VAR")
  )

You will note that options holding a single value are set using the := operator. Also, the OpenMOLE variables containing the standard and error outputs are automatically marked as outputs of the task, and must not be added to the outputs list.

Native API

You can configure the execution of the CARETask using the set operator on a freshly defined task.

val out = Val[Int]

val careTask = CARETask("care.tgz.bin", "executable arg1 arg2 /path/to/my/file /virtual/path arg4") set (
  hostFiles += ("/path/to/my/file"),
  customWorkDirectory := "/tmp",
  returnValue := out
)

The available options are described hereafter:

hostFiles: takes the path of a file on the execution host and binds it to the same path in the CARE filesystem. Optionally you can provide a second argument to specify the path explicitly. Example: hostFiles += ("/etc/hosts") or with a specific path hostFiles += ("/etc/bash.bashrc", "/home/foo/.bashrc")
environmentVariables: sets the value of an environment variable within the context of the execution. Example: environmentVariables += ("VARIABLE1", "42"). Multiple entries can be used within the same set block.
workDirectory: sets the directory within the archive where to start the execution from. Example: workDirectory := "/tmp"
returnValue: captures the return code of the execution in an OpenMOLE Val[Int] variable. Example: returnValue := out
errorOnReturnValue: when set to false, tells OpenMOLE to ignore a return code different from 0. The task won't be resubmitted. Example: errorOnReturnValue := false
stdOut: captures the standard output of the execution in an OpenMOLE Val[String] variable. Example: stdOut := output
stdErr: captures the error output of the execution in an OpenMOLE Val[String] variable. Example: stdErr := error

Using local resources

To access data present on the execution node (outside the CARE filesystem) you should use a dedicated option of the CARETask: hostFiles. This option takes the path of a file on the execution host and binds it to the same path in the CARE filesystem. Optionally you can provide a second argument to specify the path explicitly. For instance:

val careTask = CARETask("care.tgz.bin", "executable arg1 arg2 /path/to/my/file /virtual/path arg4") set (
  hostFiles += ("/path/to/my/file"),
  hostFiles += ("/path/to/another/file", "/virtual/path")
)

This CARETask will thus have access to /path/to/my/file and /virtual/path.

Using local executable

The CARETask was designed to be portable from one machine to another. However, some use-cases require executing specific commands installed on a given cluster. To achieve that you should use another task called SystemExecTask. This task is made to launch native commands on the execution host. There are two modes for using this task:

Calling a command that is assumed to be available on any execution node of the environment. The command will be looked for in the system as it would from a traditional command line: searching in the default PATH or an absolute location.
Copying a local script not installed on the remote environment. Applications and scripts can be copied to the task's work directory using the resources field. Please note that, contrary to the CARETask, there is no guarantee that an application passed as a resource to a SystemExecTask will re-execute successfully on a remote environment.

The SystemExecTask accepts an arbitrary number of commands. These commands will be executed sequentially on the same execution node where the task is instantiated. In other words, it is not possible to split the execution of multiple commands grouped in the same SystemExecTask.

The following example first copies and runs a bash script on the remote host, before calling the remote host's /bin/hostname. The standard outputs of both commands are concatenated into one OpenMOLE variable, and their error outputs into another, using the stdOut and stdErr settings respectively:

// Declare the variable
val output = Val[String]
val error  = Val[String]

// Any task
val scriptTask =
  SystemExecTask("bash script.sh", "hostname") set (
    resources += workDirectory / "script.sh",
    stdOut := output,
    stdErr := error
  )

scriptTask hook ToStringHook()

In this case the bash script might depend on applications installed on the remote host. Similarly, we assume the presence of /bin/hostname on the execution node. Therefore this task cannot be considered as portable.

Note that each execution is isolated in a separate folder on the execution host and that the task execution is considered as failed if the script returns a value different from 0. If you need another behaviour you can use the same advanced options as the CARETask regarding the return code.

CARE Troubleshooting

You should always try to re-execute your application outside of OpenMOLE first. This allows you to ensure the packaging process with CARE was successful. If something goes wrong at this stage, you should check the official CARE documentation or the archives of the CARE mailing list.
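Such a re-execution outside of OpenMOLE could be scripted as sketched below. This is not an official procedure: it assumes, as described in the CARE documentation, that a .tgz.bin archive is self-extracting and unpacks into a folder containing a re-execute.sh script. The helper name reexecute_archive is hypothetical.

```shell
# reexecute_archive ARCHIVE [COMMAND...]
# Hypothetical helper: extracts a self-extracting CARE archive into a scratch
# folder, then runs the re-execute.sh script found in the extracted folder.
# Any extra arguments are passed to re-execute.sh unchanged.
# The scratch folder is left in place so that it can be inspected afterwards.
reexecute_archive() {
  archive=$1
  shift
  if [ ! -f "$archive" ]; then
    echo "archive not found: $archive" >&2
    return 1
  fi
  # Resolve an absolute path, since the extraction happens in another folder.
  archive_dir=$(cd "$(dirname "$archive")" && pwd)
  archive_path="$archive_dir/$(basename "$archive")"
  scratch=$(mktemp -d) || return 1
  (
    cd "$scratch" || exit 1
    "$archive_path" >/dev/null || exit 1   # self-extraction (assumed behaviour)
    for script in */re-execute.sh; do
      [ -x "$script" ] || continue
      "./$script" "$@"
      exit $?
    done
    echo "no re-execute.sh found after extraction" >&2
    exit 1
  )
}

# Example (the archive path is a placeholder):
#   reexecute_archive /path/to/myarchive.tgz.bin
```

Called without extra arguments, the sketch relies on re-execute.sh replaying the command that was originally packaged. Any arguments given after the archive are assumed to be interpreted by re-execute.sh as an alternate command line.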

If the packaged application re-executes as you'd expect, but you still struggle to embed it in OpenMOLE, then get in touch with our user community via the OpenMOLE user forum or chat.