Part 1: Hello World¶
A "Hello, World!" is a minimalist example that is meant to demonstrate the basic syntax and structure of a programming language or software framework. The example typically consists of printing the phrase "Hello, World!" to the output device, such as the console or terminal, or writing it to a file.
0. Warmup: Run Hello World directly¶
Let's demonstrate this with a simple command that we run directly in the terminal, to show what it does before we wrap it in Nextflow.
1. Make the terminal say hello¶
2. Now make it write the text output to a file¶
3. Verify that the output file is there using the ls command¶
4. Show the file contents¶
Tip
In the Gitpod environment, you can also find the output file in the file explorer, and view its contents by clicking on it. Alternatively, you can use the code command to open the file for viewing.
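The four warmup steps above can be sketched as the following terminal session. The file name output.txt is an assumption here; it matches the name used later in this tutorial.

```shell
# 1. Make the terminal say hello
echo 'Hello World!'

# 2. Now make it write the text output to a file
echo 'Hello World!' > output.txt

# 3. Verify that the output file is there
ls output.txt

# 4. Show the file contents
cat output.txt
```

The only difference between steps 1 and 2 is the > redirection, which sends the text to a file instead of the terminal.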
Takeaway¶
You now know how to run a simple command in the terminal that outputs some text, and optionally, how to make it write the output to a file.
What's next?¶
Learn how to turn that into a step in a Nextflow workflow.
1. Very first Nextflow run¶
Now we're going to run a script (named hello-world.nf) that does the same thing as before (write 'Hello World!' to a file) but with Nextflow.
Note
We're intentionally not looking at the script yet. Seeing what the result is before we look under the hood will help us understand what each part does.
1. Run the workflow¶
Your console should look something like this:
N E X T F L O W ~ version 23.10.1
Launching `hello-world.nf` [mighty_murdock] DSL2 - revision: 80e92a677c
executor > local (1)
[4e/6ba912] process > sayHello [100%] 1 of 1 ✔
Congratulations, you ran your first Nextflow workflow!
The most important output here is the last line (line 4), which reports that the sayHello process was successfully executed once.
When a Nextflow workflow is run, it creates a work directory that stores various files.
Each task uses a unique directory within the work directory, based on its hash (the log shows a shortened form, e.g., 4e/6ba912).
When a task is created, Nextflow stages the task input files, script, and other helper files into the task directory. The task writes any output files to this directory during its execution, and Nextflow uses these output files for downstream tasks and/or publishing.
Warning
Your work directory won't necessarily have the same hash as the one shown above.
Browse the work directory in the file explorer to find the log files and any outputs created by the task. You should find the following files:
- .command.begin: Metadata related to the beginning of the execution of the process task
- .command.err: Error messages (stderr) emitted by the process task
- .command.log: Complete log output emitted by the process task
- .command.out: Regular output (stdout) by the process task
- .command.sh: The command that was run by the process task call
- .exitcode: The exit code resulting from the command
In this case, look for your output in the .command.out file.
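If you prefer the terminal to the file explorer, the same files can be inspected with a sketch like the one below. It assumes you are in the directory where you launched the workflow, and the task_prefix value is only the example hash from the log above, so replace it with your own. The log shows a shortened hash, so a * wildcard is used to complete the directory name.

```shell
# Assumption: this prefix comes from the example log; use the one from your run
task_prefix='work/4e/6ba912'

for task_dir in "$task_prefix"*/; do
    # If nothing matches (for example, a different hash), skip quietly
    [ -d "$task_dir" ] || continue

    # The log files start with a dot, so they are hidden unless you pass -a
    ls -a "$task_dir"

    # Show what the task printed to stdout
    cat "$task_dir/.command.out"
done
```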
Tip
Some of the specifics will be different in your log output. For example, here [mighty_murdock] (the run name) and [4e/6ba912] (the start of the task hash) are generated anew for each run, so those will be different every time.
Takeaway¶
You know how to run a simple Nextflow script and navigate the work directory.
What's next?¶
Learn how to interpret the Nextflow code.
2. Interpret the Hello World script¶
A Nextflow script is built up of multiple parts.
A process is the basic processing primitive to execute a user script.
The process definition starts with the keyword process, followed by the process name and finally the process body delimited by curly braces. The process body must contain a script block, which represents the command or, more generally, the script that the process executes.
A process may contain any of the following definition blocks: directives, inputs, outputs, when clauses, and of course, the script.
A workflow is a composition of processes and dataflow logic.
The workflow definition starts with the keyword workflow, followed by an optional name, and finally the workflow body delimited by curly braces.
Processes are connected through asynchronous first-in, first-out (FIFO) queues, called channels. The interaction between processes, and ultimately the workflow execution flow itself, are defined by the process input and output declarations.
Let's open the hello-world.nf script and look at how it's structured.
1. Double click on the file in the file explorer to open it in the editor pane¶
Tip
The file is in the current directory. Optionally, you can type ls in the terminal and Ctrl+Click on the file to open it. If you're on macOS, you can use Cmd+Click.
The first block of code describes a process called sayHello that writes its output to stdout.
The second block of code describes the workflow itself, which consists of one call to the sayHello process.
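For orientation, a script with the two blocks described above might look like the following sketch. It is not necessarily identical to the training's own file; it is written to a scratch file (hello-world-sketch.nf, a hypothetical name) so that your copy of hello-world.nf is left untouched.

```shell
# Write a minimal two-block Nextflow script to a scratch file.
# Quoting 'EOF' stops the shell from altering anything inside the script.
cat > hello-world-sketch.nf <<'EOF'
process sayHello {

    output:
        stdout

    script:
    """
    echo 'Hello World!'
    """
}

workflow {
    sayHello()
}
EOF

# Print the sketch so you can compare it with the file open in your editor
cat hello-world-sketch.nf
```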
2. Add a comment block above the process to document what it does in plain English¶
3. Add an in-line comment above the process call¶
Takeaway¶
You know how to interpret the simplest possible Nextflow script and add comments to document it.
What's next?¶
Learn how to make it output a named file.
3. Send the output to a file¶
Instead of printing "Hello World!" to the standard output, we can save it to a file (the same thing we did when running in the terminal earlier).
In a real-world workflow, this is like having a command that specifies an output file as part of its normal syntax. We'll see examples of that later.
Both the script and the output definition blocks need to be updated.
Note
Inputs and outputs in the process blocks typically require a qualifier and a variable name. The qualifier defines the type of data to be received. Nextflow uses this information to apply the semantic rules associated with each qualifier and handle the data properly. Common qualifiers include val and path.
1. Change the process command to output a named file¶
Before:
After:
2. Change the output declaration in the process¶
Before:
After:
3. Run the workflow again¶
The log output should be very similar to the first time you ran the workflow:
N E X T F L O W ~ version 23.10.1
Launching `scripts/hello-world.nf` [disturbed_cajal] DSL2 - revision: 9512241567
executor > local (1)
[ab/c61321] process > sayHello [100%] 1 of 1 ✔
Like you did before, find the work directory in the file explorer. Find the output.txt output file and click on it to open it and verify that it contains the greeting as expected.
Warning
This example is brittle because we hardcoded the output filename in two separate places (the script and the output blocks). If we change one but not the other, the script will break.
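The changes from steps 1 and 2 could be sketched as follows. This is an illustration written to a hypothetical scratch file (hello-file-sketch.nf), not the training's own script. Note how the filename appears once in the output block and once in the command, which is exactly the duplication the warning above refers to.

```shell
cat > hello-file-sketch.nf <<'EOF'
process sayHello {

    output:
        path 'output.txt'    // must match the name used in the command below

    script:
    """
    echo 'Hello World!' > output.txt
    """
}

workflow {
    sayHello()
}
EOF

# Count the lines that mention the hardcoded filename
grep -c 'output.txt' hello-file-sketch.nf
# prints 2
```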
4. Add a publishDir directive to the process¶
The output is buried in a working directory several layers deep. Nextflow is in control of this directory and we are not supposed to interact with it. To make the output file more accessible, we can use the publishDir directive. When this directive is specified, Nextflow will automatically copy the output file to a designated output directory. This allows us to leave the working directory alone, while still having easy access to the desired output file.
Before:
After:
5. Run the workflow again¶
The log output should start looking very familiar:
N E X T F L O W ~ version 23.10.1
Launching `hello-world.nf` [evil_bose] DSL2 - revision: 6907ac9da2
executor > local (1)
[46/e4ff05] process > sayHello [100%] 1 of 1 ✔
Nextflow will have created a folder called results/. In this folder is our output.txt file. If you check the contents, it should match our existing output. This is how we move results files outside of the working directories.
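A process with a publishDir directive might look like the sketch below, written to a hypothetical scratch file (hello-publish-sketch.nf). The mode: 'copy' setting is an assumption made here so that real copies end up in results/; without it, publishDir creates symbolic links by default.

```shell
cat > hello-publish-sketch.nf <<'EOF'
process sayHello {

    // Assumption: publish into ./results and copy the file
    // (the default mode creates symbolic links instead)
    publishDir 'results', mode: 'copy'

    output:
        path 'output.txt'

    script:
    """
    echo 'Hello World!' > output.txt
    """
}

workflow {
    sayHello()
}
EOF

# Show where the directive sits in the sketch
grep -n 'publishDir' hello-publish-sketch.nf
```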
Takeaway¶
You know how to send outputs to a specific named file and use the publishDir directive to move files outside of the Nextflow working directory.
Note
publishDir has been the primary method of outputting files in Nextflow for a considerable amount of time. However, there is a new syntax that allows you to determine what files should be published at the workflow level, documented here. We expect the new method to largely replace publishDir in pipelines, but here we teach you to use publishDir as a convenient, low-effort way to retrieve outputs in the context of this tutorial and during a pipeline development process. This will also ensure that you can read and understand the large number of pipelines that have already been written with publishDir.
What's next?¶
Learn how to use resume to re-use the cached results.
4. Use the Nextflow resume feature¶
Nextflow has an option called -resume that allows you to re-run a pipeline you've already run, in a special mode that skips any processes that have already been run with the exact same code, settings and inputs. Using this mode means Nextflow will only run processes that are either new, have been modified or are being provided new settings or inputs.
There are two key advantages to doing this:
- If you're in the middle of developing your pipeline, you can iterate more rapidly since you only effectively have to run the process(es) you're working on to test your changes.
- If you're running a pipeline in production and something goes wrong, in many cases you can fix the issue and relaunch the pipeline, and it will resume running from the point of failure, which can save you a lot of time and compute.
1. Run the workflow again with -resume¶
The console output should look similar.
N E X T F L O W ~ version 23.10.1
Launching `hello-world.nf` [tiny_elion] DSL2 - revision: 7ad1cd6bfe
executor > local (1)
[8b/1f9ded] process > sayHello [100%] 1 of 1 ✔, cached: 1 ✔
Notice the additional cached entry at the end of the last line. Nextflow has cached the result of the process and re-used it. It will also not replace the output file at results/output.txt.
Takeaway¶
You know how to relaunch a pipeline without repeating steps that were already executed in an identical way.
What's next?¶
Learn how to add in variable inputs.
5. Add in variable inputs¶
So far, we've been emitting a greeting hardcoded into the process command. Now we're going to add some flexibility by introducing channels.
Nextflow uses channels to feed inputs to processes and ferry data between chained processes. For now, we're just going to use the simplest possible channel, a single value.
1. Create an input channel (with a bonus in-line comment)¶
Before:
After:
2. Add the channel as input to the process call¶
Before:
After:
3. Add an input definition to the process block¶
Before:
After:
4. Edit the process command to use the input variable¶
Before:
After:
5. Run the workflow command again¶
If you made all four edits correctly, you should get another successful execution:
N E X T F L O W ~ version 23.10.1
Launching `hello-world.nf` [maniac_euler] DSL2 - revision: 73bfbe197f
executor > local (1)
[57/aee130] process > sayHello (1) [100%] 1 of 1 ✔
The result is still the same as previously; so far we're just progressively tweaking the internal plumbing to increase the flexibility of our workflow while achieving the same end result.
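Taken together, the four edits in this section could look like the sketch below, written to a hypothetical scratch file (hello-channel-sketch.nf). The names greeting and greeting_ch, and the use of Channel.of() to build a single-value channel, are assumptions chosen for illustration.

```shell
# 'EOF' is quoted so the shell leaves $greeting alone;
# it is Nextflow that fills it in when the task runs.
cat > hello-channel-sketch.nf <<'EOF'
process sayHello {

    publishDir 'results', mode: 'copy'

    input:
        val greeting

    output:
        path 'output.txt'

    script:
    """
    echo '$greeting' > output.txt
    """
}

workflow {
    // create a channel for inputs
    greeting_ch = Channel.of('Hello World!')

    // emit a greeting
    sayHello(greeting_ch)
}
EOF

cat hello-channel-sketch.nf
```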
Takeaway¶
You know how to use a simple channel to provide an input to a process.
What's next?¶
Learn how to pass inputs from the command line.
6. Use params for inputs¶
We want to be able to specify the input from the command line because that is the piece that will almost always be different in subsequent runs of the workflow. Good news: we can use params.
1. Edit the input channel declaration to use a parameter¶
Before:
After:
2. Run the workflow again with the --greeting parameter¶
In case you're wondering, yes it's normal to have dreams where the Nextflow log output scrolls endlessly in front of you after running through a training session... Or is that just me?
N E X T F L O W ~ version 23.10.1
Launching `hello-world.nf` [hopeful_laplace] DSL2 - revision: a8ed9a6202
executor > local (1)
[83/dfbbbc] process > sayHello (1) [100%] 1 of 1 ✔
Be sure to open up the output file to check that you now have the new version of the greeting. Voilà!
Note
A double hyphen (--) is used to set a params item while a single hyphen (-) is used to modify a Nextflow setting, e.g. the -resume feature we used earlier.
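To make the distinction concrete, here is a sketch of a command line that combines both kinds of option. The greeting text is a made-up example value. The guard around the call is only there so the snippet does nothing harmful if Nextflow or the script is not available where you paste it.

```shell
# Hypothetical value for the greeting parameter
greeting='Bonjour le monde!'

if command -v nextflow >/dev/null 2>&1 && [ -f hello-world.nf ]; then
    # -resume    : single hyphen, a Nextflow option
    # --greeting : double hyphen, sets params.greeting inside the workflow
    nextflow run hello-world.nf -resume --greeting "$greeting"
else
    echo "Would run: nextflow run hello-world.nf -resume --greeting '$greeting'"
fi
```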
3. Set a default value for a command line parameter¶
In many cases, it makes sense to supply a default value for a given parameter so that you don't have to specify it for every run.
Let's initialize the greeting parameter with a default value.
4. Add the parameter declaration at the top of the script (with a comment block as a free bonus)¶
5. Run the workflow again without specifying the parameter¶
The console output is expected to look the same...
N E X T F L O W ~ version 23.10.1
Launching `hello-world.nf` [tiny_elion] DSL2 - revision: 7ad1cd6bfe
executor > local (1)
[8b/1f9ded] process > sayHello [100%] 1 of 1 ✔
Check the output in the results directory, and... Tadaa! It works! Nextflow used the default value as the greeting. But wait, what happens now if we provide the parameter in the command line?
6. Run the workflow again with the --greeting parameter on the command line using a different greeting¶
Nextflow's not complaining, that's a good sign:
N E X T F L O W ~ version 23.10.1
Launching `hello-world.nf` [exotic_lichterman] DSL2 - revision: 7ad1cd6bfe
executor > local (1)
[36/47354a] process > sayHello [100%] 1 of 1 ✔
Check the results directory and look at the contents of output.txt. Tadaa again!
The value of the parameter we passed on the command line overrode the value we gave the variable in the script. In fact, parameters can be set in several different ways; if the same parameter is set in multiple places, its value is determined based on the order of precedence that is described here.
Tip
You can put the parameter declaration inside the workflow block if you prefer. Whatever you choose, try to group similar things in the same place so you don't end up with declarations all over the place.
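As a sketch of how the pieces of this section fit together, the script below declares a default at the top and feeds params.greeting into the channel. The file name (hello-params-sketch.nf) and the default text are hypothetical.

```shell
cat > hello-params-sketch.nf <<'EOF'
/*
 * Pipeline parameters
 */
params.greeting = 'Bonjour le monde!'   // default, used when --greeting is not given

process sayHello {

    publishDir 'results', mode: 'copy'

    input:
        val greeting

    output:
        path 'output.txt'

    script:
    """
    echo '$greeting' > output.txt
    """
}

workflow {
    // create a channel from the parameter value
    greeting_ch = Channel.of(params.greeting)

    sayHello(greeting_ch)
}
EOF

# Show every place where the parameter is declared or used
grep -n 'params.greeting' hello-params-sketch.nf
```

With a script like this, passing --greeting on the command line replaces the default, and leaving it out falls back to the value declared at the top.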
Takeaway¶
You know how to set up an input variable for a process and supply a value in the command line.
What's next?¶
Learn how to add in a second process and chain them together.
7. Add a second step to the workflow¶
Most real-world workflows involve more than one step. Here we introduce a second process that converts the text to uppercase (all-caps), using the classic UNIX one-liner:
We're going to run the command by itself in the terminal first to verify that it works as expected without any of the workflow code getting in the way of clarity, just like we did at the start with echo 'Hello World'. Then we'll write a process that does the same thing, and finally we'll connect the two processes so the output of the first serves as input to the second.
1. Run the command in the terminal by itself¶
The output is simply the uppercase version of the text string:
2. Make the command take a file as input and write the output to a file¶
Now the HELLO WORLD output is in the new output file, UPPER-output.txt.
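The two terminal steps above can be sketched as follows. The use of tr for the uppercase conversion is an assumption about which one-liner is meant, and output.txt is recreated first so that the snippet works on its own.

```shell
# 1. Run the command in the terminal by itself
echo 'Hello World' | tr '[a-z]' '[A-Z]'
# prints: HELLO WORLD

# Recreate the input file so this sketch is self-contained
echo 'Hello World' > output.txt

# 2. Make the command take a file as input and write the output to a file
cat output.txt | tr '[a-z]' '[A-Z]' > UPPER-output.txt

# Check the result
cat UPPER-output.txt
```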
3. Turn that into a process definition (documented with a comment block)¶
4. Add a call to the new process in the workflow block¶
5. Pass the output of the first process to the second process¶
6. Run the same workflow command as before¶
Oh, how exciting! There is now an extra line in the log output, which corresponds to the second process we've added:
N E X T F L O W ~ version 23.10.1
Launching `hello-world.nf` [kickass_pasteur] DSL2 - revision: d15b2c482c
executor > local (2)
[da/8d9221] process > sayHello (1) [100%] 1 of 1 ✔
[01/2b32ee] process > convertToUpper (1) [100%] 1 of 1 ✔
This time the workflow produced two work directories; one per process instance (task). Check out the work directory of the task from the second process, where you should find two different output files listed. If you look carefully, you'll notice one of them (the output of the first process) has a little arrow icon on the right; that signifies it's a symbolic link. It points to the location where that file lives in the work directory of the first process.
Note how all we did was connect the output of sayHello to the input of convertToUpper and the two processes could be run in series. Nextflow did the hard work of handling input and output files and passing them between the two commands for us. This is the power of channels in Nextflow, doing the laborious work of connecting our pipeline steps up together.
Note
As a little bonus, we composed the second output filename based on the first one. Very important to remember: you have to use double quotes around the filename expression (NOT single quotes) or it will fail.
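A two-step version of the workflow might look like the sketch below, written to a hypothetical scratch file (hello-two-steps-sketch.nf). The variable name input_file and the tr command are assumptions. Note the double quotes around the output filename in convertToUpper: they are what allows ${input_file} to be filled in.

```shell
cat > hello-two-steps-sketch.nf <<'EOF'
params.greeting = 'Hello World!'

process sayHello {

    input:
        val greeting

    output:
        path 'output.txt'

    script:
    """
    echo '$greeting' > output.txt
    """
}

/*
 * Use a text replace utility to convert the greeting to uppercase
 */
process convertToUpper {

    input:
        path input_file

    output:
        path "UPPER-${input_file}"   // double quotes, so the name is interpolated

    script:
    """
    cat '$input_file' | tr '[a-z]' '[A-Z]' > 'UPPER-${input_file}'
    """
}

workflow {
    greeting_ch = Channel.of(params.greeting)

    sayHello(greeting_ch)

    // pass the output of the first process to the second
    convertToUpper(sayHello.out)
}
EOF

grep -n 'convertToUpper' hello-two-steps-sketch.nf
```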
Takeaway¶
You know how to add a second step that takes the output of the first as input.
What's next?¶
Learn how to make the workflow run on many values for the same input.
8. Modify the workflow to run on many values for the same input¶
Workflows typically run on batches of inputs that we want to process in bulk. Here we upgrade the workflow to accept an input with multiple values. For simplicity, we go back to hardcoding the greetings instead of using a parameter for the input.
1. Modify the channel to contain multiple greetings (hardcoded for now)¶
Before:
After:
2. Modify the first process to generate dynamic filenames so the final filenames will be unique¶
Before:
After:
Note
In practice, naming files based on the data input itself is almost always impractical; the better way to generate dynamic filenames is to use a samplesheet and create a map of metadata (aka metamap) from which we can grab an appropriate identifier to generate the filenames. We'll show how to do that later in this training.
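The two modifications above could be sketched like this, in a hypothetical scratch file (hello-many-sketch.nf) that shows only the first process for brevity. The three greetings are made-up example values, and the filename pattern is one possible way to keep the outputs unique.

```shell
cat > hello-many-sketch.nf <<'EOF'
process sayHello {

    input:
        val greeting

    output:
        path "${greeting}-output.txt"   // one distinct file per greeting

    script:
    """
    echo '$greeting' > '${greeting}-output.txt'
    """
}

workflow {
    // a channel with several elements; the process runs once per element
    greeting_ch = Channel.of('Hello', 'Bonjour', 'Holà')

    sayHello(greeting_ch)
}
EOF

cat hello-many-sketch.nf
```

Without the dynamic filename, every task would write a file called output.txt, and the published copies would overwrite each other.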
3. Run the command and look at the log output¶
How many log lines do you expect to see in the terminal? And how many do you actually see?
N E X T F L O W ~ version 23.10.1
Launching `hello-world.nf` [cranky_hypatia] DSL2 - revision: 719dae218c
executor > local (6)
[6c/91aa50] process > sayHello (3) [100%] 3 of 3 ✔
[90/80111c] process > convertToUpper (3) [100%] 3 of 3 ✔
Something's wrong! The log lines seem to indicate each process was executed three times (corresponding to the three input elements we provided) but we're only seeing two work directories instead of six.
This is because by default, the ANSI logging system writes the logging from multiple calls to the same process on the same line. Fortunately, we can disable that behavior.
4. Run the command again with the -ansi-log false option¶
This time we see all six work directories listed in the terminal:
N E X T F L O W ~ version 23.10.1
Launching `hello-world.nf` [disturbed_panini] DSL2 - revision: 719dae218c
[8c/77b534] Submitted process > sayHello (1)
[b5/f0bf7e] Submitted process > sayHello (2)
[a8/457f9b] Submitted process > sayHello (3)
[3d/1bb4e6] Submitted process > convertToUpper (2)
[fa/58fbb1] Submitted process > convertToUpper (1)
[90/e88919] Submitted process > convertToUpper (3)
That's much better; at least for this number of processes. For a complex workflow, or a large number of inputs, having the full list output to the terminal might get a bit overwhelming.
Tip
Another way to show that all six calls are happening is to delete all the work directories before you run again. Then you'll see the six new ones pop up.
Takeaway¶
You know how to feed an input with multiple elements through a channel.
What's next?¶
Learn how to make the workflow take a file that contains multiple values for an input.
9. Modify the workflow to run on a file that contains an input with multiple values¶
In most cases, when we run on multiple inputs, the input values are contained in a file. Here we're going to use a file where each value is on a new line.
1. Modify the channel declaration to take an input file (through a parameter) instead of a single parameter¶
Before:
After:
This is quite involved so let's run through the changes we have made:
- We used the Channel.fromPath() channel factory to create a channel containing any file located at the path specified.
- We used the splitText() operator to read in the contents of that file line by line.
- We used a closure { it.trim() } to strip surrounding whitespace, including the trailing newline, from each line.
2. Modify the default parameter to point to an input file¶
Before:
After:
3. Run the workflow with the -ansi-log false option and an --input_file parameter¶
Once again we see each process get executed three times:
N E X T F L O W ~ version 23.10.1
Launching `hello-world.nf` [small_albattani] DSL2 - revision: 5cea973c3c
[45/18d159] Submitted process > sayHello (1)
[cf/094ea1] Submitted process > sayHello (3)
[27/e3ea5b] Submitted process > sayHello (2)
[7d/63672f] Submitted process > convertToUpper (1)
[62/3184ed] Submitted process > convertToUpper (2)
[02/f0ff38] Submitted process > convertToUpper (3)
Looking at the outputs, we see each greeting was correctly extracted and processed through the workflow. We've achieved the same result as the previous step, but now we have a lot more flexibility to add more elements to the channel of greetings we want to process.
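To see the file-based input end to end, here is a sketch that creates a small input file and a script that reads it. The path data/greetings.txt, its three lines, and the scratch file name (hello-input-file-sketch.nf) are all hypothetical, and only the first process is shown for brevity.

```shell
# Create an input file with one greeting per line
mkdir -p data
printf 'Hello\nBonjour\nHolà\n' > data/greetings.txt

cat > hello-input-file-sketch.nf <<'EOF'
params.input_file = 'data/greetings.txt'

process sayHello {

    input:
        val greeting

    output:
        path "${greeting}-output.txt"

    script:
    """
    echo '$greeting' > '${greeting}-output.txt'
    """
}

workflow {
    // open the file, emit one element per line, and tidy up each line
    greeting_ch = Channel.fromPath(params.input_file)
                         .splitText() { it.trim() }

    sayHello(greeting_ch)
}
EOF

# Three lines in the file means three elements in the channel
wc -l < data/greetings.txt
```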
Tip
Don't worry if the channel types, operators, closures, etc. feel like a lot to grapple with the first time you encounter them. The key learning point is that we can create a channel from a file and then read in the contents of that file. You'll get more opportunities to practice using these components in various settings in later training modules.
Takeaway¶
You know how to provide inputs in a file.
What's next?¶
Celebrate your success and take a break!
When you are ready, move on to Part 2 of this training to learn how to apply what you've learned to a more realistic data analysis use case.