I upgraded the system from version 2 to 3 yesterday, and it seems like a great improvement! Landsat is downloading again and I like the new options very much, so thanks for making this happen.
However, both custom jobs and scheduled L3B jobs stay stuck at 0/18, without errors. After the upgrade the system initiated jobs for all sites, which overloaded it and left it low on disk space. From previous forum topics I suspect this has to do with SLURM and/or the sen2agri-services.
Orchestrator log shows:
● sen2agri-orchestrator.service - Orchestrator for Sen2Agri
Loaded: loaded (/usr/lib/systemd/system/sen2agri-orchestrator.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2022-01-06 09:09:18 CET; 15min ago
Main PID: 3313 (sen2agri-orches)
Tasks: 5
Memory: 5.4M
CGroup: /system.slice/sen2agri-orchestrator.service
└─3313 /usr/bin/sen2agri-orchestrator
Jan 06 09:14:39 localhost.localdomain sen2agri-orchestrator[3313]: Processing job cancelled event with job id 5163
Jan 06 09:14:39 localhost.localdomain sen2agri-orchestrator[3313]: Processing job cancelled event with job id 5184
Jan 06 09:14:39 localhost.localdomain sen2agri-orchestrator[3313]: Processing job paused event with job id 5158
Jan 06 09:14:39 localhost.localdomain sen2agri-orchestrator[3313]: Processing job paused event with job id 5159
Jan 06 09:14:49 localhost.localdomain sen2agri-orchestrator[3313]: Processing job paused event with job id 5157
Jan 06 09:14:49 localhost.localdomain sen2agri-orchestrator[3313]: Processing job paused event with job id 5156
Jan 06 09:14:49 localhost.localdomain sen2agri-orchestrator[3313]: Processing job paused event with job id 5155
Jan 06 09:18:39 localhost.localdomain sen2agri-orchestrator[3313]: Processing product available event with product id 5668
Jan 06 09:25:09 localhost.localdomain sen2agri-orchestrator[3313]: Processing job submitted event with job id 5185 and processor id 3
Jan 06 09:25:10 localhost.localdomain sen2agri-orchestrator[3313]: Processing task runnable event with processor id 3 task id 83574 and job id 5185
The issue led me to this topic:
Going into the services worked, but afterwards I got the following error:
Last login: Thu Jan 6 00:54:29 CET 2022 on pts/2
[sen2agri-service@localhost ~]$ srun ls -al
bash: srun: command not found…
Trying the following commands from that post yields this:
Some extra content, after trying to install Slurm only:
Thanks for using MariaDB!
spawn mysql -u root -p -e create database slurm_acct_db;create user slurm@localhost;
set password for slurm@localhost = password('sen2agri');grant usage on *.* to slurm;grant all privileges on slurm_acct_db.* to slurm;flush privileges;
Enter password:
ERROR 1007 (HY000) at line 1: Can't create database 'slurm_acct_db'; database exists
Failed to start slurmdbd.service: Unit not found.
Failed to execute operation: No such file or directory
Unit slurmdbd.service could not be found.
SLURM DB SERVICE:
./install_slurm_only.sh: line 146: sacctmgr: command not found
mkdir: cannot create directory ‘/var/spool/slurm’: File exists
mkdir: cannot create directory ‘/var/log/slurm’: File exists
Failed to start slurmctld.service: Unit not found.
Failed to execute operation: No such file or directory
Unit slurmctld.service could not be found.
SLURM CTL SERVICE:
Failed to start slurmd.service: Unit not found.
Failed to execute operation: No such file or directory
Unit slurmd.service could not be found.
SLURM NODE SERVICE:
Failed to start slurm.service: Unit not found.
Failed to execute operation: No such file or directory
Unit slurm.service could not be found.
SLURM SERVICE:
./install_slurm_only.sh: line 223: sacctmgr: command not found
./install_slurm_only.sh: line 226: sacctmgr: command not found
./install_slurm_only.sh: line 237: sacctmgr: command not found
./install_slurm_only.sh: line 238: sacctmgr: command not found
./install_slurm_only.sh: line 237: sacctmgr: command not found
./install_slurm_only.sh: line 238: sacctmgr: command not found
./install_slurm_only.sh: line 237: sacctmgr: command not found
./install_slurm_only.sh: line 238: sacctmgr: command not found
./install_slurm_only.sh: line 237: sacctmgr: command not found
./install_slurm_only.sh: line 238: sacctmgr: command not found
./install_slurm_only.sh: line 237: sacctmgr: command not found
./install_slurm_only.sh: line 238: sacctmgr: command not found
./install_slurm_only.sh: line 237: sacctmgr: command not found
./install_slurm_only.sh: line 238: sacctmgr: command not found
./install_slurm_only.sh: line 237: sacctmgr: command not found
./install_slurm_only.sh: line 238: sacctmgr: command not found
./install_slurm_only.sh: line 237: sacctmgr: command not found
./install_slurm_only.sh: line 238: sacctmgr: command not found
./install_slurm_only.sh: line 237: sacctmgr: command not found
./install_slurm_only.sh: line 238: sacctmgr: command not found
./install_slurm_only.sh: line 237: sacctmgr: command not found
./install_slurm_only.sh: line 238: sacctmgr: command not found
./install_slurm_only.sh: line 240: sacctmgr: command not found
CLUSTER,USERS,QOS INFO:
./install_slurm_only.sh: line 244: sacctmgr: command not found
QOS INFO:
./install_slurm_only.sh: line 247: sacctmgr: command not found
Partition INFO:
./install_slurm_only.sh: line 250: scontrol: command not found
Nodes INFO:
./install_slurm_only.sh: line 253: scontrol: command not found
It seems to me that SLURM is not installed. Could you try:
sudo yum -y install slurm slurm-slurmctld slurm-slurmd slurm-devel slurm-pam_slurm slurm-perlapi slurm-slurmdbd slurm-torque slurm-libs
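To see which of the SLURM commands are actually visible to the user running them, a quick check along these lines may help. It is only a sketch: the function name `check_slurm_tools` is made up for illustration, and the list of tools is simply the set of commands that were reported as "command not found" above.

```shell
#!/usr/bin/env bash
# Report which SLURM commands are visible on the current user's PATH.
# The tool list is taken from the commands that failed earlier in this thread.
check_slurm_tools() {
    local missing=0 tool path
    for tool in srun sacctmgr scontrol; do
        if path=$(command -v "$tool"); then
            echo "found:   $tool -> $path"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

check_slurm_tools || echo "At least one SLURM command is not on PATH for user $(id -un)"
```

If the packages are installed but a tool still shows as missing, comparing the result for root and for the sen2agri-service user would show whether it is a PATH problem rather than a missing package.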
I’m also struggling with the L3B processor after updating to 3.0. The orchestrator is encountering a network error. Do you have any idea where this could come from? In the monitoring tab the job stays as “submitted” and never starts to run.
● sen2agri-orchestrator.service - Orchestrator for Sen2Agri
Loaded: loaded (/usr/lib/systemd/system/sen2agri-orchestrator.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2022-01-13 13:47:25 UTC; 25min ago
Main PID: 4492 (sen2agri-orches)
Tasks: 5
Memory: 4.4M
CGroup: /system.slice/sen2agri-orchestrator.service
└─4492 /usr/bin/sen2agri-orchestrator
Jan 13 13:47:25 sen4cap-v2-25062021.novalocal sen2agri-orchestrator[4492]: Reading settings from /etc/sen2agri/sen2agri.conf
Jan 13 13:47:25 sen4cap-v2-25062021.novalocal sen2agri-orchestrator[4492]: Reading settings from /etc/sen2agri/sen2agri.conf
Jan 13 13:47:25 sen4cap-v2-25062021.novalocal sen2agri-orchestrator[4492]: Invalid processor configuration found in database: l2-s1, igoring it as no handler is available for it!
Jan 13 13:47:25 sen4cap-v2-25062021.novalocal sen2agri-orchestrator[4492]: Invalid processor configuration found in database: lpis, igoring it as no handler is available for it!
Jan 13 13:47:26 sen4cap-v2-25062021.novalocal sen2agri-orchestrator[4492]: GetNewEvents took 304 ms
Jan 13 13:58:15 sen4cap-v2-25062021.novalocal sen2agri-orchestrator[4492]: Processing job submitted event with job id 12885 and processor id 3
Jan 13 13:58:16 sen4cap-v2-25062021.novalocal sen2agri-orchestrator[4492]: Reading settings from /etc/sen2agri/sen2agri.conf
Jan 13 13:58:17 sen4cap-v2-25062021.novalocal sen2agri-orchestrator[4492]: Processing task runnable event with processor id 3 task id 51748 and job id 12885
Jan 13 13:58:17 sen4cap-v2-25062021.novalocal sen2agri-orchestrator[4492]: Network error while invoking executor function 1
Jan 13 13:58:17 sen4cap-v2-25062021.novalocal sen2agri-orchestrator[4492]: Network error while invoking executor function 1
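Since the message mentions the executor, one thing that may be worth ruling out is whether the executor service is running at all. This is an assumption about a possible cause, not a confirmed diagnosis. A minimal sketch, where `inactive_services` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Print every listed service that systemd does not report as active.
inactive_services() {
    local svc
    for svc in "$@"; do
        if ! systemctl is-active --quiet "$svc"; then
            echo "$svc"
        fi
    done
}

# Usage on the Sen2Agri host:
#   inactive_services sen2agri-executor sen2agri-orchestrator sen2agri-scheduler
```

An empty result means all three services are active, in which case the cause has to be looked for elsewhere.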
Hi Cosmin, thank you for your suggestion. I ran that command, but everything seems to be installed already.
Loading mirror speeds from cached hostfile
* base: mirrors.supportex.net
* epel: mirror.hostnet.nl
* extras: mirrors.supportex.net
* updates: mirror.nforce.com
Package slurm-20.11.8-2.el7.x86_64 already installed and latest version
Package slurm-slurmctld-20.11.8-2.el7.x86_64 already installed and latest version
Package slurm-slurmd-20.11.8-2.el7.x86_64 already installed and latest version
Package slurm-devel-20.11.8-2.el7.x86_64 already installed and latest version
Package slurm-pam_slurm-20.11.8-2.el7.x86_64 already installed and latest version
Package slurm-perlapi-20.11.8-2.el7.x86_64 already installed and latest version
Package slurm-slurmdbd-20.11.8-2.el7.x86_64 already installed and latest version
Package slurm-torque-20.11.8-2.el7.x86_64 already installed and latest version
Package slurm-libs-20.11.8-2.el7.x86_64 already installed and latest version
Nothing to do
Also, after restarting the processors, I get the following SLURM status output:
$ systemctl status slurm{dbd,ctld,d}
● slurmdbd.service - Slurm DBD accounting daemon
Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2022-01-13 14:56:04 CET; 1 weeks 0 days ago
Main PID: 1971 (code=exited, status=1/FAILURE)
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2022-01-13 14:55:59 CET; 1 weeks 0 days ago
Main PID: 2006 (code=exited, status=1/FAILURE)
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
● slurmd.service - Slurm node daemon
Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2022-01-13 14:56:01 CET; 1 weeks 0 days ago
Main PID: 3466 (slurmd)
Tasks: 2
Memory: 5.9M
CGroup: /system.slice/slurmd.service
└─3466 /usr/sbin/slurmd -D
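The status output above can be reduced to just the failed units with a small filter like the one below. It is a rough sketch: `failed_units` is a made-up name, and it only parses the text format shown here.

```shell
#!/usr/bin/env bash
# Read `systemctl status` output on stdin and print the units whose
# "Active:" line reports a failure.
failed_units() {
    awk '
        / - / && /\.service/ {
            # Header line, e.g. "● slurmdbd.service - Slurm DBD accounting daemon"
            for (i = 1; i <= NF; i++) if ($i ~ /\.service$/) { unit = $i; break }
        }
        /Active: failed/ && unit != "" { print unit }
    '
}

# Usage on the host (journalctl may show nothing if the journal was rotated,
# in which case the files under /var/log/slurm are the next place to look):
#   systemctl status slurm{dbd,ctld,d} | failed_units
#   sudo journalctl -u slurmdbd -u slurmctld --no-pager -n 50
```

For the output above this would list slurmdbd.service and slurmctld.service, the two daemons whose startup errors still need to be found.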
Now, after restarting the services as you outlined in the other reply, I get this:
[@localhost ~]$ sudo systemctl restart sen2agri-executor sen2agri-orchestrator sen2agri-scheduler
Warning: sen2agri-executor.service changed on disk. Run 'systemctl daemon-reload' to reload units.
Warning: sen2agri-orchestrator.service changed on disk. Run 'systemctl daemon-reload' to reload units.
Warning: sen2agri-scheduler.service changed on disk. Run 'systemctl daemon-reload' to reload units.
[@localhost ~]$ sudo systemctl deamon-reload
Unknown operation 'deamon-reload'.
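The "Unknown operation" message is caused by the spelling: the warnings above name the operation as `daemon-reload`, not `deamon-reload`. A small sketch of the intended sequence follows. The `run` and `reload_and_restart` helpers and the `DRY_RUN` variable are illustrative names, added only so the commands can be previewed before running them.

```shell
#!/usr/bin/env bash
# Reload systemd unit files, then restart the three Sen2Agri services.
# With DRY_RUN=1 the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

reload_and_restart() {
    run sudo systemctl daemon-reload   # note the spelling: daemon, not deamon
    run sudo systemctl restart sen2agri-executor sen2agri-orchestrator sen2agri-scheduler
}

DRY_RUN=1 reload_and_restart   # preview only; call without DRY_RUN=1 to execute
```

Reloading first matters here because the unit files changed on disk during the upgrade, so the restart should pick up the new definitions.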
Have you managed to get SLURM execution working? On our side we were not able to reproduce the issue, but we assume it is related to the MariaDB installation.
What happens if you run:
You can also try reinstalling MariaDB. Please note that MariaDB is used only by SLURM and not by the system itself. The system uses a PostgreSQL database, which is completely independent of MariaDB.
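Before reinstalling, it may be worth confirming that MariaDB is running and that the SLURM accounting database is still there. The following is only a sketch: `slurm_db_exists` is a hypothetical helper, the database name `slurm_acct_db` is taken from the installer output earlier in this thread, and the unit name `mariadb` is an assumption.

```shell
#!/usr/bin/env bash
# Succeed if the SLURM accounting database is visible to the MariaDB root user.
# `mysql -p` prompts for the root password; -N suppresses the column header.
slurm_db_exists() {
    mysql -u root -p -N -e "SHOW DATABASES LIKE 'slurm_acct_db'" | grep -qx 'slurm_acct_db'
}

# Usage on the host:
#   systemctl is-active mariadb        # assumes the unit is named "mariadb"
#   slurm_db_exists && echo "slurm_acct_db is present"
```

If the database is present and MariaDB is active, the slurmdbd log is probably a better place to look than a reinstall.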
Also, even if you deleted the entries in the system database by mistake, you would not lose the real data, which is stored on disk (unless you also explicitly remove the products from disk).
If you do not succeed in solving the issue, could you give us access to the machine so we can have a look at what is going on there (you can write me an email)?