The HWRF system can be run through the Rocoto workflow manager. The full end-to-end operational system, and many developmental variants, can all be run through the run_hwrf.py script (see: run_hwrf_main), which is a wrapper around Rocoto's rocotorun command. The Rocoto workflow manager submits jobs, monitors running jobs, and tracks file and time dependencies. A suite of XML files in the rocoto/ subdirectory describes all of these dependencies and tells Rocoto how to run the scripts (see Scripting Layer).
Note that the multi-storm HWRF has its own special page with details on running HWRF via Rocoto. Make sure you read that page too before attempting the multi-storm configuration.
Starting HWRF Using Rocoto
The rocotorun program is the main program for Rocoto, but it requires an XML file to run. The run_hwrf.py wrapper script (see run_hwrf_main) creates this XML file, finds rocotorun, and executes it. The run_hwrf.py program is executed like so:
me@cluster> ./run_hwrf.py -w isaac.xml -d isaac.db 2012 09L HISTORY config.EXPT=myexpt
The myexpt value should be the name of the directory in which HWRF was installed (the parent of the ush, rocoto, exec, and other directories). For example, if you install HWRF here:
/lfs3/projects/hwrfv3/Samuel.Trahan/trunk-r3557
Then your config.EXPT should be trunk-r3557:
me@cluster> ./run_hwrf.py -w isaac.xml -d isaac.db 2012 09L HISTORY config.EXPT=trunk-r3557
If all goes according to plan, you should see an isaac.xml file appear, filled with Rocoto XML workflow information, and an isaac.db SQLite3 database file, filled with Rocoto's internal information about tasks and cycles. The first job in the workflow, scripts.exhwrf_launch, will be running for the first cycle.
After launching the first job, you must continue running run_hwrf.py every 5-20 minutes or so. Every time it is run, Rocoto will check dependencies and submit new jobs.
me@cluster> ./run_hwrf.py -f -w isaac.xml -d isaac.db 2012 09L HISTORY config.EXPT=trunk-r3557
Note the added -f argument. This tells run_hwrf.py that the isaac.xml and isaac.db files are already present, and that you approve of the script replacing them. If you omit the -f, run_hwrf.py will ask whether you want to replace them.
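Instead of re-running the command by hand, one common approach is to let cron re-drive the workflow on a schedule. Below is an illustrative crontab entry; the installation path and log file name are placeholders you must adapt to your own experiment:

```
# Illustrative crontab entry: re-drive the Rocoto workflow every ten minutes.
# Replace /path/to/hwrf/rocoto and the experiment arguments with your own.
*/10 * * * * cd /path/to/hwrf/rocoto && ./run_hwrf.py -f -w isaac.xml -d isaac.db 2012 09L HISTORY config.EXPT=trunk-r3557 >> cron_hwrf.log 2>&1
```

The -f flag matters here: a non-interactive cron job cannot answer the replace-files prompt.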
Sanity Checks and Other Paranoia
Several things can go wrong if your installation is not quite right. Either the HWRF sanity checks or Rocoto will warn you about them. HWRF will print a message about sanity checks failing and suggestions about how to fix the problem. Rocoto will print messages about timeouts and hangups, or XML syntax errors. The most common problems are:
- Wrong config.EXPT — The config.EXPT argument must be set to the name of the directory above ush/; otherwise HWRF will use the wrong files, or not find its files at all.
- Fix files missing — Make sure you have the fix files. There should be a fix/ directory or symbolic link at the top level of the installation. See the README.fix file for information on where the fix files reside.
- Executables missing — Remember to compile and install the executables. If you are using data assimilation, you need the exec/hwrf_gsi too.
- Batch system is down or very slow — The batch system is the server on the supercomputer that keeps track of what jobs are running, and decides which compute nodes will run new jobs. Sometimes that batch system will go down, or will be unusably slow. This is especially a problem on WCOSS, where the batch system commands will sometimes take 1-2 minutes to respond. If you see error messages from Rocoto about hangups or timeouts, this is likely a batch system problem.
- Huge flood of syntax errors about XML problems — If you modify the *.xml or *.ent files manually, and do it wrong, LibXML2 will report numerous syntax errors.
- Jobs fail randomly when running WRF, prep_hybrid, or the post — The HWRF system needs a large stack. Make sure your stack limit is at least 2 GB.
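You can check the stack limit before submitting jobs. The following is a small sketch (the 2 GB threshold comes from the requirement above; the messages are illustrative):

```shell
# Check whether the soft stack limit meets HWRF's 2 GB minimum.
# "ulimit -s" reports kilobytes, or the word "unlimited".
limit=$(ulimit -S -s)
if [ "$limit" = "unlimited" ] || [ "$limit" -ge 2097152 ]; then
    echo "Stack limit OK: $limit"
else
    echo "Stack limit is only ${limit} kB; raise it with: ulimit -S -s 2097152"
fi
```

Note that batch jobs get their own limits, so any fix must also reach the job environment, not just your login shell.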
Alternative Configurations
In the run_hwrf.py command, you can provide configuration information after the config.EXPT= option. Additional arguments are configuration files to read, or specific options to override. You can find some configuration files in the parm directory. For example, this would run the 3km version of HWRF:
me@cluster> ./run_hwrf.py -f -w isaac.xml -d isaac.db 2012 09L HISTORY config.EXPT=trunk-r3557 ../parm/hwrf_3km.conf
You can run a GEFS ensemble at 3km with 43 levels, but then you must also provide the list of ensemble members to run at the beginning of the run_hwrf.py arguments, plus some additional config files. Let's say we want to run all twenty-one members (the control and 20 perturbed members):
me@cluster> ./run_hwrf.py -f -w isaac.xml -d isaac.db 00-20 2012 09L HISTORY config.EXPT=trunk-r3557 ../parm/hwrf_43lev.conf ../parm/hwrf_3km.conf ../parm/hwrf_ensemble.conf
The changes we added to the command are:
- 00-20 — The list of ensemble members.
- ../parm/hwrf_43lev.conf — Request forty-three vertical levels with a 50 mbar top.
- ../parm/hwrf_3km.conf — Request the 3km configuration.
- ../parm/hwrf_ensemble.conf — Request a GEFS-based HWRF ensemble.
See HWRF Experiment Configuration for more detail about the available alternate configurations of HWRF.
Monitoring the Rocoto-Based HWRF System
There are a few commands in Rocoto 1.2 and later that let you monitor a running HWRF system and request re-submission of failed jobs.
Automatic Re-Submission
Rocoto will automatically resubmit failed jobs until a maximum-retries threshold is reached. This automatic resubmission is critical when running on NOAA R&D machines, since the hardware and software are not 100% reliable. HWRF's Rocoto XML workflow generally sets this limit to 3 for most jobs, and 7 for jobs that access the network. After that limit is reached, the user must intervene.
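In a Rocoto workflow definition, this retry limit is expressed with the maxtries attribute on each task. The fragment below is an illustrative sketch, not HWRF's exact XML; the task name and body are abbreviated:

```
<!-- Illustrative Rocoto task: Rocoto resubmits it up to 3 times before
     marking it DEAD and requiring manual intervention. -->
<task name="post" maxtries="3">
  <command>...</command>
  <!-- jobname, resources, and dependencies omitted -->
</task>
```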
The rocotostat Command
The first such command is rocotostat, which prints a list of all jobs for cycles Rocoto has started. For each job, rocotostat prints the job status, the number of times Rocoto resubmitted it, and other information:
me@cluster> rocotostat -w isaac.xml -d isaac.db -c ALL
The output will look like this:
CYCLE TASK JOBID STATE EXIT STATUS TRIES DURATION
==================================================================================================================
201208201200 ensda_pre 43320461 SUCCEEDED 0 1 46.0
201208201200 launch_E99 43320156 SUCCEEDED 0 1 23.0
201208201200 input_E99 43320283 SUCCEEDED 0 1 214.0
201208201200 bdy_E99 43321045 SUCCEEDED 0 1 1906.0
201208201200 init_GFS_0_E99 43320462 SUCCEEDED 0 1 770.0
201208201200 init_GDAS1_3_E99 43320463 SUCCEEDED 0 1 1319.0
201208201200 init_GDAS1_6_E99 43320464 SUCCEEDED 0 1 1281.0
201208201200 init_GDAS1_9_E99 43320465 SUCCEEDED 0 1 1295.0
201208201200 ocean_init_E99 43320466 SUCCEEDED 0 1 883.0
201208201200 relocate_GFS_0_E99 43321046 SUCCEEDED 0 1 772.0
201208201200 relocate_GDAS1_3_E99 43321341 SUCCEEDED 0 1 771.0
201208201200 relocate_GDAS1_6_E99 43321342 SUCCEEDED 0 1 944.0
201208201200 relocate_GDAS1_9_E99 43321343 SUCCEEDED 0 1 1020.0
201208201200 bufrprep_E99 43320467 SUCCEEDED 0 1 16.0
201208201200 gsi_d02_E99 43321974 SUCCEEDED 0 1 2009.0
201208201200 gsi_d03_E99 43321975 SUCCEEDED 0 1 1059.0
201208201200 gsi_post_E99 43332079 SUCCEEDED 0 1 1702.0
201208201200 merge_E99 43332080 SUCCEEDED 0 1 525.0
201208201200 check_init_E99 43332318 SUCCEEDED 0 1 15.0
201208201200 coupled_forecast_E99 43332438 SUCCEEDED 0 1 10693.0
201208201200 uncoupled_forecast_E99 - - - - -
201208201200 unpost_E99 43335424 SUCCEEDED 0 1 20.0
201208201200 post_E99 43335547 DEAD 0 3 10112.0
201208201200 post_helper_E99 43335548 DEAD 0 3 10105.0
201208201200 products_E99 43336031 DEAD 0 3 10905.0
201208201200 output_E99 - - - - -
Some of the jobs are "DEAD," and two jobs were never even submitted (the jobs with hyphens). A job is marked DEAD once it has failed too many times; at that point, you must resubmit it yourself. There are two ways to convince Rocoto to resubmit something.
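On workflows with many cycles, it can help to filter the rocotostat listing down to failed tasks. The sketch below uses a here-document standing in for live rocotostat output; in practice you would pipe `rocotostat -w isaac.xml -d isaac.db -c ALL` into the same awk filter:

```shell
# Print the cycle, task name, and retry count of every DEAD task.
# Field 4 is the STATE column and field 6 is TRIES in rocotostat output.
awk '$4 == "DEAD" { print $1, $2, "tries="$6 }' <<'EOF'
201208201200 merge_E99 43332080 SUCCEEDED 0 1 525.0
201208201200 post_E99 43335547 DEAD 0 3 10112.0
201208201200 products_E99 43336031 DEAD 0 3 10905.0
EOF
# Prints:
# 201208201200 post_E99 tries=3
# 201208201200 products_E99 tries=3
```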
The rocotorewind Command
This is the preferred method of resubmitting failed jobs. The rocotorewind command tells Rocoto to "undo" the effects of a job and resubmit it once its dependencies are met. This command would tell Rocoto to rewind the three failed jobs:
me@cluster> rocotorewind -w isaac.xml -d isaac.db -c 201208201200 -t post_E99 -t post_helper_E99 -t products_E99
me@cluster> ./run_hwrf.py -f 00-20 -w isaac.xml -d isaac.db 2012 09L HISTORY ... other arguments ...
Later runs of run_hwrf.py will submit jobs once the dependencies are met. In this case, the post and post_helper jobs will be resubmitted as soon as you run run_hwrf.py again. The products job will be started once the post job is running or completed.
- Todo:
- At present, it is not possible to use rocotorewind if the cycle is entirely complete (that is, when the special "completion" job is done). This is a limitation of Rocoto that we hope to fix in the next release.
The rocotoboot Command
The alternative is to run rocotoboot, which tells Rocoto to run a job immediately, even if the retry limit has been reached. The problem with rocotoboot is that it ignores dependencies, so you must make sure the dependencies are met on your own. We know the post's dependencies are met, so we can submit that job:
me@cluster> rocotoboot -w isaac.xml -d isaac.db -c 201208201200 -t post_E99
me@cluster> ./run_hwrf.py -f 00-20 -w isaac.xml -d isaac.db 2012 09L HISTORY ... other arguments ...
- Warning
- Beware: if you rocotoboot a job that is not DEAD and whose dependencies are met, Rocoto will submit the same job twice. It is better to use rocotorewind instead of rocotoboot. The rocotorewind command ensures jobs are submitted only when it is safe to do so, and only one at a time.
The rocotocheck Command
The rocotocheck command reports everything Rocoto knows about one task in one cycle, including why the task has or has not been submitted:
me@cluster> rocotocheck -w *xml -d *db -c 201208201800 -t uncoupled_forecast_E99
The output will look something like this:
Task: uncoupled_forecast_E99
command: /lfs3/projects/hwrfv3/Samuel.Trahan/trunk-r3557/ush/rocoto_pre_job.sh /lfs3/projects/hwrfv3/Samuel.Trahan/trunk-r3557/scripts/exhwrf_forecast.py
jobname: hwrf_atm_forecast_09L_2012082018_E99
name: uncoupled_forecast_E99
nodes: 30:ppn=12+8:ppn=11+2:ppn=8
stderr: /pan2/projects/hwrfv3/Samuel.Trahan/pytmp/trunk-r3557/2012082018/09L/hwrf_atm_forecast.err
stdout: /pan2/projects/hwrfv3/Samuel.Trahan/pytmp/trunk-r3557/2012082018/09L/hwrf_atm_forecast.out
CONFhwrf ==> /pan2/projects/hwrfv3/Samuel.Trahan/pytmp/trunk-r3557/com/2012082018/09L/storm1.conf
HOMEhwrf ==> /lfs3/projects/hwrfv3/Samuel.Trahan/trunk-r3557
HWRF_FORCE_TMPDIR ==> /pan2/projects/hwrfv3/Samuel.Trahan/pytmp/trunk-r3557/2012082018/09L/tmpdir
PYTHONPATH ==> /lfs3/projects/hwrfv3/Samuel.Trahan/trunk-r3557/ush
USHhwrf ==> /lfs3/projects/hwrfv3/Samuel.Trahan/trunk-r3557/ush
WORKhwrf ==> /pan2/projects/hwrfv3/Samuel.Trahan/pytmp/trunk-r3557/2012082018/09L
jlogfile ==> /pan2/projects/hwrfv3/Samuel.Trahan/pytmp/trunk-r3557/log/jlogfile
/bin/sh -c grep -v 'RUN_COUPLED=YES' /pan2/projects/hwrfv3/Samuel.Trahan/pytmp/trunk-r3557/com/2012082018/09L/storm1.ocean_status (Mon Aug 20 18:00:00 UTC 2012)
check_init_E99 of cycle 201208201800 is SUCCEEDED
grep -v 'RUN_COUPLED=YES' /pan2/projec... returned false
Valid for this task: YES
Activated: Tue Jun 02 20:09:53 UTC 2015
Job: This task has not been submitted for this cycle
Task can not be submitted because:
Dependencies are not satisfied
Note the last bit: "Task can not be submitted because: Dependencies are not satisfied." That tells you why the task is not submitted, but why are the dependencies not satisfied? Look at this part:
check_init_E99 of cycle 201208201800 is SUCCEEDED
grep -v 'RUN_COUPLED=YES' /pan2/projec... returned false
That is the Rocoto dependency decision tree, which tells you why the dependencies are not met. In this case, the upstream check_init_E99 job has succeeded, so the OR condition is the problematic part. It contains two conditions, only one of which must be met. The first is a string comparison: "YES" must not equal "YES". Clearly that condition will never be met. The other condition is a real one:
grep -v 'RUN_COUPLED=YES' /pan2/projec... returned false
It appears that the ocean init decided to run coupled. Hence the uncoupled forecast was never run, and the coupled forecast was run. This is not an error, so we can be satisfied that Rocoto is working as requested.
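For reference, a dependency of this shape is written in Rocoto XML roughly as follows. This is a simplified, illustrative sketch, not HWRF's actual workflow XML; the entity names (&RUN_OCEAN;, &COMhwrf;) and file path are placeholders:

```
<dependency>
  <and>
    <!-- the upstream initialization check must have completed -->
    <taskdep task="check_init"/>
    <or>
      <!-- string comparison: true only when coupling is disabled -->
      <strneq><left>&RUN_OCEAN;</left><right>YES</right></strneq>
      <!-- shell test: true if the ocean status file does not request coupling -->
      <sh>grep -v 'RUN_COUPLED=YES' &COMhwrf;/storm1.ocean_status</sh>
    </or>
  </and>
</dependency>
```

The rocotocheck decision tree is simply this XML with the entities expanded and each node annotated with its truth value.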
More Information About Rocoto
The Rocoto project is hosted on GitHub:
https://github.com/christopherwharrop/rocoto
Documentation is available in the Rocoto wiki:
https://github.com/christopherwharrop/rocoto/wiki/Documentation