Sen4CAP had been running without any issues, but recently we started getting the error below when we try to generate an L3A product from a custom job. SLURM is running fine and there is no error in the log.
Here is the error output I got. How can we solve it?
Unable to create job output path /mnt/archive/orchestrator_temp/l3b/17705/128446-lai-processor-mask-flags/
Could you check that you are not actually running out of disk space?
Also, is /mnt/archive correctly mounted (if it is a separate mount)?
If you didn't change any permissions on /mnt/archive/orchestrator_temp (parents or children), I see no other reason.
Actually, I think it is SLURM that is having issues (this is a known issue with SLURM when it runs out of disk space). Please see if this post solves your issue:
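In case it helps, here is a rough sketch of the checks above (free space, permissions, and whether a directory can actually be created). The function name check_job_dir is made up for illustration and is not part of Sen4CAP. It walks up to the nearest existing parent of the failing path and tests that one.

```shell
#!/bin/sh
# Hypothetical helper, not part of Sen4CAP: checks the nearest existing
# parent of a job output path for free space, permissions and writability.
check_job_dir() {
    dir="$1"
    # The job directory itself may not exist yet, so walk up to an ancestor that does.
    while [ ! -d "$dir" ] && [ "$dir" != "/" ]; do
        dir=$(dirname "$dir")
    done
    echo "Nearest existing directory: $dir"
    # Free space and free inodes on the filesystem holding that directory.
    df -h "$dir"
    df -i "$dir"
    # Owner and permissions of that directory.
    ls -ld "$dir"
    # Try to create (and immediately remove) a subdirectory as the current user.
    probe="$dir/.write_probe_$$"
    if mkdir "$probe" 2>/dev/null; then
        rmdir "$probe"
        echo "OK: $dir is writable"
        return 0
    fi
    echo "FAIL: cannot create a directory under $dir"
    return 1
}
```

For example: check_job_dir /mnt/archive/orchestrator_temp/l3b/17705. It is worth running this as the account the Sen4CAP services run under, which I am assuming may differ from your login user. Separately, "mountpoint /mnt/archive" reports whether /mnt/archive is a mount point at all.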
Thank you for getting back to me. I had already tried that, but it still didn't work. I also rebooted the server. Here are the SLURM-related logs I found. What could be the reason it is not starting?
tail -f /var/log/slurm/slurmdbd.log
[2021-04-19T07:37:31.460] Terminate signal (SIGINT or SIGTERM) received
[2021-04-19T07:37:31.961] error: mysql_real_connect failed: 2002 Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
[2021-04-19T07:37:31.961] error: unable to re-connect to as_mysql database
[2021-04-19T07:37:31.961] error: mysql_real_connect failed: 2002 Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
[2021-04-19T07:37:31.961] error: unable to re-connect to as_mysql database
[2021-04-19T07:37:31.961] Unable to remove pidfile '/var/run/slurmdbd.pid': Permission denied
[2021-04-19T07:37:31.985] error: mysql_real_connect failed: 2002 Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
[2021-04-19T07:37:31.985] error: The database must be up when starting the MYSQL plugin. Trying again in 5 seconds.
[2021-04-19T07:37:37.535] Accounting storage MYSQL plugin loaded
[2021-04-19T07:37:37.538] slurmdbd version 15.08.7 started
tail -f /var/log/slurm/slurmd.log
[2021-04-19T06:16:21.409] _run_prolog: run job script took usec=7
[2021-04-19T06:16:21.409] _run_prolog: prolog with lock for job 47046 ran for 0 seconds
[2021-04-19T06:16:21.909] [47046.0] done with job
[2021-04-19T07:37:31.460] Slurmd shutdown completing
[2021-04-19T07:37:31.916] Message aggregation disabled
[2021-04-19T07:37:31.917] CPU frequency setting not configured for this node
[2021-04-19T07:37:31.917] Resource spec: Reserved system memory limit not configured for this node
[2021-04-19T07:37:31.917] slurmd version 15.08.7 started
[2021-04-19T07:37:31.918] slurmd started on Mon, 19 Apr 2021 07:37:31 +0000
[2021-04-19T07:37:31.918] CPUs=8 Boards=1 Sockets=8 Cores=1 Threads=1 Memory=64265 TmpDisk=245693 Uptime=77276 CPUSpecList=(null)
tail -f /var/log/slurm/slurm.log
[2021-04-19T07:37:32.135] Recovered state of 0 reservations
[2021-04-19T07:37:32.135] read_slurm_conf: backup_controller not specified.
[2021-04-19T07:37:32.135] cons_res: select_p_reconfigure
[2021-04-19T07:37:32.135] cons_res: select_p_node_init
[2021-04-19T07:37:32.135] cons_res: preparing for 2 partitions
[2021-04-19T07:37:32.135] Running as primary controller
[2021-04-19T07:37:32.135] Registering slurmctld at port 6817 with slurmdbd.
[2021-04-19T07:37:32.136] Recovered information about 0 sicp jobs
[2021-04-19T07:37:35.140] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=0
[2021-04-19T07:37:37.541] Registering slurmctld at port 6817 with slurmdbd.
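A note on reading the slurmdbd log above: the mysql_real_connect errors appear while the daemon is restarting, and a few seconds later the MYSQL plugin loads and slurmdbd reports that it started. A small sketch to check that ordering automatically is below. The function name slurmdbd_recovered is made up for illustration, and the two patterns are taken only from the log lines in this thread, so other SLURM versions may word them differently.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the last "slurmdbd ... started" line in the
# log comes after the last MySQL connection error, i.e. the daemon recovered.
slurmdbd_recovered() {
    awk '
        /mysql_real_connect failed/    { last_err = NR }
        /slurmdbd version .* started/  { last_start = NR }
        END {
            printf "last MySQL connect error: line %d, last slurmdbd start: line %d\n", last_err, last_start
            # Exit status 0 only when a start line follows the last error.
            exit !(last_start > last_err)
        }
    ' "$1"
}
```

Usage would be: slurmdbd_recovered /var/log/slurm/slurmdbd.log. If it succeeds, "scontrol ping" and "sinfo" are the usual ways to confirm that the controller answers and the node is not down or drained.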
Apparently, SLURM is working now.
Could you try launching a custom job again? You can also try resuming the current one(s): go to System overview -> Pause (for a job) -> wait a few seconds and refresh the page -> Resume
I found the error below when I checked the status of sen2agri-orchestrator, so I restarted the service. It's working as expected after that. Thanks a lot!
Apr 19 10:19:47 teo-sen4capv2.novalocal sen2agri-orchestrator[3357]: Network error while invoking executor function 1
Apr 19 10:39:07 teo-sen4capv2.novalocal sen2agri-orchestrator[3357]: Processing job submitted event with job id 17722 and processor id 3
Apr 19 10:39:08 teo-sen4capv2.novalocal sen2agri-orchestrator[3357]: Using L2A tile: /mnt/archive/maccs_def/prov14north/l2a/2019/03/21/S2A_MSIL2A_20190321T034531_N0207_R104_T47QPB_20190321T090904.SAFE/SENTINEL2A_20190321-040251-997_L2A_T47QPB_C_V1-0/SENTINEL2A_20190321-040251-997_L2A_T47QPB_C_V1-0_MTD_ALL.xml
Apr 19 10:39:08 teo-sen4capv2.novalocal sen2agri-orchestrator[3357]: SubmitTask took 131 ms
Apr 19 10:39:08 teo-sen4capv2.novalocal sen2agri-orchestrator[3357]: Processing task runnable event with processor id 3 task id 134067 and job id 17722
Apr 19 10:39:08 teo-sen4capv2.novalocal sen2agri-orchestrator[3357]: Network error while invoking executor function 1
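For anyone who wants to check for the same symptom, here is a minimal sketch that counts the "Network error while invoking executor function" lines in the orchestrator's journal output. The function name executor_errors is made up for illustration. The message text and the service name sen2agri-orchestrator are the ones shown in this thread.

```shell
#!/bin/sh
# Hypothetical helper: reads journal text on stdin, prints how many lines
# contain the executor network error and shows the most recent one.
executor_errors() {
    awk '
        /Network error while invoking executor function/ { n++; last = $0 }
        END {
            printf "%d matching line(s)\n", n
            if (n) print "latest: " last
        }
    '
}
```

Usage would be something like: journalctl -u sen2agri-orchestrator --since today | executor_errors. If the count is non-zero, "sudo systemctl restart sen2agri-orchestrator" is the restart that fixed it in my case.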
Hi,
I am having the same problem but restarting sen2agri-orchestrator did not solve the problem. Did you do anything else to fix this?
Best regards
Bastian
// update 2021-08-11
Never mind, I found the problem. I had to reinstall SLURM and now it's working.