L3B vegetation status not working V3.0

Hi All,

I upgraded the system from version 2 to version 3 yesterday, and it seems like a great improvement! Landsat is downloading again and I like the new options very much, so thanks for making this happen.

However, both custom jobs and scheduled L3B jobs stay stuck at 0/18, without errors. After the upgrade the system initiated jobs for all sites, which overloaded it and left it low on disk space. From previous forum topics, I suspect it has to do with SLURM and/or the sen2agri-services.

Orchestrator log shows:

● sen2agri-orchestrator.service - Orchestrator for Sen2Agri
   Loaded: loaded (/usr/lib/systemd/system/sen2agri-orchestrator.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-01-06 09:09:18 CET; 15min ago
 Main PID: 3313 (sen2agri-orches)
    Tasks: 5
   Memory: 5.4M
   CGroup: /system.slice/sen2agri-orchestrator.service
           └─3313 /usr/bin/sen2agri-orchestrator

Jan 06 09:14:39 localhost.localdomain sen2agri-orchestrator[3313]: Processing job cancelled event with job id 5163
Jan 06 09:14:39 localhost.localdomain sen2agri-orchestrator[3313]: Processing job cancelled event with job id 5184
Jan 06 09:14:39 localhost.localdomain sen2agri-orchestrator[3313]: Processing job paused event with job id 5158
Jan 06 09:14:39 localhost.localdomain sen2agri-orchestrator[3313]: Processing job paused event with job id 5159
Jan 06 09:14:49 localhost.localdomain sen2agri-orchestrator[3313]: Processing job paused event with job id 5157
Jan 06 09:14:49 localhost.localdomain sen2agri-orchestrator[3313]: Processing job paused event with job id 5156
Jan 06 09:14:49 localhost.localdomain sen2agri-orchestrator[3313]: Processing job paused event with job id 5155
Jan 06 09:18:39 localhost.localdomain sen2agri-orchestrator[3313]: Processing product available event with product id 5668
Jan 06 09:25:09 localhost.localdomain sen2agri-orchestrator[3313]: Processing job submitted event with job id 5185 and processor id 3
Jan 06 09:25:10 localhost.localdomain sen2agri-orchestrator[3313]: Processing task runnable event with processor id 3 task id 83574 and job id 5185

The issue led me to this topic:

Logging in as the sen2agri-service user worked, but afterwards I got the following error:

Last login: Thu Jan 6 00:54:29 CET 2022 on pts/2
[sen2agri-service@localhost ~]$ srun ls -al
bash: srun: command not found…

Trying the following commands from that post yields this:

[sen2agri-service@localhost ~]$ sudo -u sen2agri-service scontrol update NodeName=localhost State=RESUME
sudo: scontrol: command not found
[sen2agri-service@localhost ~]$ sudo systemctl restart slurmd slurmdbd slurmctld mariadb

We trust you have received the usual lecture from the local System
Administrator. It usually boils down to these three things:

#1) Respect the privacy of others.
#2) Think before you type.
#3) With great power comes great responsibility.

[sudo] password for sen2agri-service:

Does anyone have any advice?
And, if that last step is required, what is the password of sen2agri-service?

I hope this post makes the problem as clear as possible :)
Any help would be much appreciated!

Best regards,

Niek

Some extra output, after trying to run the Slurm-only install script:

Thanks for using MariaDB!
spawn mysql -u root -p -e create database slurm_acct_db;create user slurm@localhost;
set password for slurm@localhost = password('sen2agri');grant usage on *.* to slurm;grant all privileges on slurm_acct_db.* to slurm;flush privileges;
Enter password:
ERROR 1007 (HY000) at line 1: Can't create database 'slurm_acct_db'; database exists
Failed to start slurmdbd.service: Unit not found.
Failed to execute operation: No such file or directory
Unit slurmdbd.service could not be found.
SLURM DB SERVICE:
./install_slurm_only.sh: line 146: sacctmgr: command not found
mkdir: cannot create directory ‘/var/spool/slurm’: File exists
mkdir: cannot create directory ‘/var/log/slurm’: File exists
Failed to start slurmctld.service: Unit not found.
Failed to execute operation: No such file or directory
Unit slurmctld.service could not be found.
SLURM CTL SERVICE:
Failed to start slurmd.service: Unit not found.
Failed to execute operation: No such file or directory
Unit slurmd.service could not be found.
SLURM NODE SERVICE:
Failed to start slurm.service: Unit not found.
Failed to execute operation: No such file or directory
Unit slurm.service could not be found.
SLURM SERVICE:
./install_slurm_only.sh: line 223: sacctmgr: command not found
./install_slurm_only.sh: line 226: sacctmgr: command not found
./install_slurm_only.sh: line 237: sacctmgr: command not found
./install_slurm_only.sh: line 238: sacctmgr: command not found
./install_slurm_only.sh: line 237: sacctmgr: command not found
./install_slurm_only.sh: line 238: sacctmgr: command not found
./install_slurm_only.sh: line 237: sacctmgr: command not found
./install_slurm_only.sh: line 238: sacctmgr: command not found
./install_slurm_only.sh: line 237: sacctmgr: command not found
./install_slurm_only.sh: line 238: sacctmgr: command not found
./install_slurm_only.sh: line 237: sacctmgr: command not found
./install_slurm_only.sh: line 238: sacctmgr: command not found
./install_slurm_only.sh: line 237: sacctmgr: command not found
./install_slurm_only.sh: line 238: sacctmgr: command not found
./install_slurm_only.sh: line 237: sacctmgr: command not found
./install_slurm_only.sh: line 238: sacctmgr: command not found
./install_slurm_only.sh: line 237: sacctmgr: command not found
./install_slurm_only.sh: line 238: sacctmgr: command not found
./install_slurm_only.sh: line 237: sacctmgr: command not found
./install_slurm_only.sh: line 238: sacctmgr: command not found
./install_slurm_only.sh: line 237: sacctmgr: command not found
./install_slurm_only.sh: line 238: sacctmgr: command not found
./install_slurm_only.sh: line 240: sacctmgr: command not found
CLUSTER,USERS,QOS INFO:
./install_slurm_only.sh: line 244: sacctmgr: command not found
QOS INFO:
./install_slurm_only.sh: line 247: sacctmgr: command not found
Partition INFO:
./install_slurm_only.sh: line 250: scontrol: command not found
Nodes INFO:
./install_slurm_only.sh: line 253: scontrol: command not found

Hope this might give a lead to the cause:)

Dear Niek,

It seems to me that SLURM is not installed. Could you try:
sudo yum -y install slurm slurm-slurmctld slurm-slurmd slurm-devel slurm-pam_slurm slurm-perlapi slurm-slurmdbd slurm-torque slurm-libs
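
Before or after reinstalling, it may help to confirm whether the Slurm client tools are reachable from the account that runs the jobs. The following is only a sketch for a POSIX shell; the helper name missing_cmds is made up for illustration and is not part of Sen2Agri or Slurm.

```shell
#!/bin/sh
# Print the name of every given command that cannot be found on PATH.
# (missing_cmds is a hypothetical helper, not a Slurm or Sen2Agri tool.)
missing_cmds() {
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
        fi
    done
}

# The Slurm client tools that were reported as "command not found" in this thread.
missing=$(missing_cmds srun scontrol sacctmgr)
if [ -n "$missing" ]; then
    echo "Not on PATH:" $missing
else
    echo "All Slurm client commands found"
fi
```

If the packages are installed but the commands are still reported as missing, comparing the output of "rpm -ql slurm" with the PATH of the sen2agri-service user might show where the binaries actually live.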

Best regards,
Cosmin

Dear both,

I'm also struggling with the L3B processor after updating to 3.0. The orchestrator is encountering a network error. Do you have any idea where this could come from? In the monitoring tab the job stays as "submitted" and never starts to run.

● sen2agri-orchestrator.service - Orchestrator for Sen2Agri
   Loaded: loaded (/usr/lib/systemd/system/sen2agri-orchestrator.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-01-13 13:47:25 UTC; 25min ago
 Main PID: 4492 (sen2agri-orches)
    Tasks: 5
   Memory: 4.4M
   CGroup: /system.slice/sen2agri-orchestrator.service
           └─4492 /usr/bin/sen2agri-orchestrator

Jan 13 13:47:25 sen4cap-v2-25062021.novalocal sen2agri-orchestrator[4492]: Reading settings from /etc/sen2agri/sen2agri.conf
Jan 13 13:47:25 sen4cap-v2-25062021.novalocal sen2agri-orchestrator[4492]: Reading settings from /etc/sen2agri/sen2agri.conf
Jan 13 13:47:25 sen4cap-v2-25062021.novalocal sen2agri-orchestrator[4492]: Invalid processor configuration found in database: l2-s1, igoring it as no handler is available for it!
Jan 13 13:47:25 sen4cap-v2-25062021.novalocal sen2agri-orchestrator[4492]: Invalid processor configuration found in database: lpis, igoring it as no handler is available for it!
Jan 13 13:47:26 sen4cap-v2-25062021.novalocal sen2agri-orchestrator[4492]: GetNewEvents took 304 ms
Jan 13 13:58:15 sen4cap-v2-25062021.novalocal sen2agri-orchestrator[4492]: Processing job submitted event with job id 12885 and processor id 3
Jan 13 13:58:16 sen4cap-v2-25062021.novalocal sen2agri-orchestrator[4492]: Reading settings from /etc/sen2agri/sen2agri.conf
Jan 13 13:58:17 sen4cap-v2-25062021.novalocal sen2agri-orchestrator[4492]: Processing task runnable event with processor id 3 task id 51748 and job id 12885
Jan 13 13:58:17 sen4cap-v2-25062021.novalocal sen2agri-orchestrator[4492]: Network error while invoking executor function 1
Jan 13 13:58:17 sen4cap-v2-25062021.novalocal sen2agri-orchestrator[4492]: Network error while invoking executor function 1

Thanks for your help,
best regards,
Martin

Dear Martin,

You could try restarting the sen2agri-executor. You can run:

sudo systemctl restart sen2agri-executor sen2agri-orchestrator sen2agri-scheduler

Then try executing your job again (alternatively, you can go into System Monitoring → Current job, then Pause and Resume the existing job).

Hope this helps.

Best regards,
Cosmin

Hi Cosmin, thank you for your suggestion. I ran that command, but everything already seems to be installed.

Loading mirror speeds from cached hostfile
 * base: mirrors.supportex.net
 * epel: mirror.hostnet.nl
 * extras: mirrors.supportex.net
 * updates: mirror.nforce.com
Package slurm-20.11.8-2.el7.x86_64 already installed and latest version
Package slurm-slurmctld-20.11.8-2.el7.x86_64 already installed and latest version
Package slurm-slurmd-20.11.8-2.el7.x86_64 already installed and latest version
Package slurm-devel-20.11.8-2.el7.x86_64 already installed and latest version
Package slurm-pam_slurm-20.11.8-2.el7.x86_64 already installed and latest version
Package slurm-perlapi-20.11.8-2.el7.x86_64 already installed and latest version
Package slurm-slurmdbd-20.11.8-2.el7.x86_64 already installed and latest version
Package slurm-torque-20.11.8-2.el7.x86_64 already installed and latest version
Package slurm-libs-20.11.8-2.el7.x86_64 already installed and latest version
Nothing to do

Also, after restarting the processors, I get the following Slurm status:

]$ systemctl status slurm{dbd,ctld,d}
● slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2022-01-13 14:56:04 CET; 1 weeks 0 days ago
 Main PID: 1971 (code=exited, status=1/FAILURE)

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2022-01-13 14:55:59 CET; 1 weeks 0 days ago
 Main PID: 2006 (code=exited, status=1/FAILURE)

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-01-13 14:56:01 CET; 1 weeks 0 days ago
 Main PID: 3466 (slurmd)
    Tasks: 2
   Memory: 5.9M
   CGroup: /system.slice/slurmd.service
           └─3466 /usr/sbin/slurmd -D

And now, after restarting the services as you outlined in the other reply, I get this:

[@localhost ~]$ sudo systemctl restart sen2agri-executor sen2agri-orchestrator sen2agri-scheduler
Warning: sen2agri-executor.service changed on disk. Run 'systemctl daemon-reload' to reload units.
Warning: sen2agri-orchestrator.service changed on disk. Run 'systemctl daemon-reload' to reload units.
Warning: sen2agri-scheduler.service changed on disk. Run 'systemctl daemon-reload' to reload units.
[@localhost ~]$ sudo systemctl deamon-reload
Unknown operation 'deamon-reload'.
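
Note that the systemctl operation is spelled daemon-reload, which is why the command above was rejected. A minimal sketch of the reload-then-restart sequence follows; the function names and the DRY_RUN switch are illustrative assumptions, and the service list is simply the one used earlier in this thread.

```shell
#!/bin/sh
# Reload systemd unit files, then restart the Sen2Agri services named in this thread.
# run, reload_and_restart and DRY_RUN are hypothetical names used for illustration.
SERVICES="sen2agri-executor sen2agri-orchestrator sen2agri-scheduler"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

reload_and_restart() {
    # Note the spelling: daemon-reload, not deamon-reload.
    run sudo systemctl daemon-reload &&
    run sudo systemctl restart $SERVICES
}

# Dry run by default; set DRY_RUN=0 to execute the commands for real.
DRY_RUN=${DRY_RUN:-1}
reload_and_restart
```

Running it as shown only prints the two commands, so they can be checked before anything is restarted.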

Any thoughts on how to proceed from here?

One more error that keeps popping up is this:

sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused

Does this ring a bell? I hope I can get the system back online :)
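
"Connection refused" on localhost:6819 would be consistent with slurmdbd not running, as the failed status above suggests. One way to check whether anything is listening on that port is sketched here. It assumes the ss tool is available; the helper name listening_on is hypothetical.

```shell
#!/bin/sh
# Succeed if the `ss -ltn` output read from stdin shows a listener on the given TCP port.
# (listening_on is a hypothetical helper written for this example.)
listening_on() {
    awk -v port="$1" '
        $1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit !found }
    '
}

# 6819 is the port named in the sacctmgr error above.
if command -v ss >/dev/null 2>&1 && ss -ltn | listening_on 6819; then
    echo "something is listening on port 6819"
else
    echo "nothing found listening on port 6819 (or ss is unavailable)"
fi
```

If nothing is listening, "sudo journalctl -u slurmdbd" may show why the daemon exited.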

Dear Niek,

Have you managed to get SLURM running? On our side we were not able to reproduce the issue, but we assume it is related to the MariaDB installation.
What happens if you run:

mysql -h 127.0.0.1 -u root -p

(password empty)

MariaDB [(none)]> show databases;

Should display something like this:

+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| slurm_acct_db      |
+--------------------+

MariaDB [(none)]> use slurm_acct_db;
MariaDB [(none)]> show tables;

Should display something like this:

+----------------------------------+
| Tables_in_slurm_acct_db          |
+----------------------------------+
| acct_coord_table                 |
| acct_table                       |
| clus_res_table                   |
| cluster_table                    |
| convert_version_table            |
| federation_table                 |
| qos_table                        |
| res_table                        |
| sen2agri_assoc_table             |
| sen2agri_assoc_usage_day_table   |
| sen2agri_assoc_usage_hour_table  |
| sen2agri_assoc_usage_month_table |
| sen2agri_event_table             |
| sen2agri_job_table               |
| sen2agri_last_ran_table          |
| sen2agri_resv_table              |
| sen2agri_step_table              |
| sen2agri_suspend_table           |
| sen2agri_usage_day_table         |
| sen2agri_usage_hour_table        |
| sen2agri_usage_month_table       |
| sen2agri_wckey_table             |
| sen2agri_wckey_usage_day_table   |
| sen2agri_wckey_usage_hour_table  |
| sen2agri_wckey_usage_month_table |
| table_defs_table                 |
| tres_table                       |
| txn_table                        |
| user_table                       |
+----------------------------------+
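
The comparison with the listing above could also be scripted. This is a rough sketch only: the helper name missing_tables and the short EXPECTED_TABLES list are assumptions made for illustration (a handful of names taken from the listing, not a complete schema check).

```shell
#!/bin/sh
# A few of the tables from the listing above (illustrative subset, not the full schema).
EXPECTED_TABLES="acct_table cluster_table qos_table user_table sen2agri_job_table"

# Read table names (one per line) on stdin and print the expected ones that are absent.
# (missing_tables is a hypothetical helper written for this example.)
missing_tables() {
    present=$(cat)
    for t in $EXPECTED_TABLES; do
        printf '%s\n' "$present" | grep -qx "$t" || echo "$t"
    done
}

# Example call against the live database (prompts for the MariaDB root password):
#   mysql -h 127.0.0.1 -u root -p -N -B -e 'SHOW TABLES IN slurm_acct_db' | missing_tables
```

An empty result means all of the sampled tables exist; any names printed are the ones that were not found.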

Please let me know.

Best regards,
Cosmin

Hi Niek,

Additionally you can try:

sudo journalctl -u mariadb

And after the command:

mysql -h 127.0.0.1 -u root -p

You could try:

MariaDB [(none)]> repair table mysql.proc;

Best regards,
Cosmin

Hi Cosmin, thanks for the reply. I think you're correct about the MariaDB issue: show tables gives an empty result '-> '

MariaDB server version is 5.5.68, connection ID = 3.

Might it be better to remove MariaDB too and do a full fresh install? My concern is that I would lose 2 years of data, with no AWS connection to get it back so far…

Hi Niek,

You can also try reinstalling MariaDB. Please note that MariaDB is used only by SLURM and not by the system. The system uses a PostgreSQL database, which is completely independent of MariaDB.
Also, even if you deleted the entries in the system database by mistake, you would not lose the real data, which is stored on disk (unless you explicitly remove the products from disk as well).
If you do not succeed in solving the issue, could you give us access to the machine so we can have a look at what is going on there (you can write me an email)?

Best regards,
Cosmin