Generating L3A error

Hello Sen4Cap and @cudroiu,

Sen4Cap had been running without any issues, but recently we started getting the error below when trying to generate an L3A product from a custom job. SLURM is running fine and there is no error in the log.

Here is the error output I got. How can we solve it?

Unable to create job output path /mnt/archive/orchestrator_temp/l3b/17705/128446-lai-processor-mask-flags/

Thanks and Regards,
Henry

Hello,

Could you check that you are not actually running out of disk space?
Also, is /mnt/archive correctly mounted (if it is a separate mount)?
If you didn’t change any permissions on /mnt/archive/orchestrator_temp (or on its parent or child directories), I see no other reason.
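
A minimal sketch of these checks, in case it helps. The helper name check_output_dir and the free-space threshold are only assumptions for illustration and are not part of Sen4Cap:

```shell
# check_output_dir DIR MIN_FREE_KB
# Hypothetical helper: reports whether DIR exists, is writable by the
# current user, and has at least MIN_FREE_KB kilobytes free.
check_output_dir() {
    dir=$1
    min_kb=$2
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    if [ ! -w "$dir" ]; then
        echo "not writable: $dir"
        return 1
    fi
    # df -Pk prints one POSIX-format line per file system, sizes in KB;
    # field 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    if [ "$avail_kb" -lt "$min_kb" ]; then
        echo "low space: ${avail_kb} KB available on $dir"
        return 1
    fi
    echo "ok: $dir"
}
```

For example, check_output_dir /mnt/archive/orchestrator_temp 1048576 asks for roughly 1 GB free and prints a single status line. Run it as the same user that runs the jobs so the writability test is meaningful. If /mnt/archive is a separate file system, the mountpoint command from util-linux (mountpoint /mnt/archive) should also report it as a mount point.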

Best regards,
Cosmin

Dear @cudroiu,

Thank you for your advice. Disk space was indeed low, as you suspected, so I deleted some files, as shown in the screenshot below.

image

This time, even though the job is created, Docker is not running. How can I solve this?

image

Thanks and Regards,
Henry

Dear Henry,

Actually, I think it is SLURM that is having issues (this is a known problem with SLURM when it runs out of disk space). Please see if this post solves your issue:

Best regards,
Cosmin

Dear @cudroiu,

Thank you for getting back to me. I had tried that earlier, but it still didn’t work. I also rebooted the server. Here are the SLURM-related logs I found. What could be the reason it is not starting?

tail -f /var/log/slurm/slurmdbd.log
[2021-04-19T07:37:31.460] Terminate signal (SIGINT or SIGTERM) received
[2021-04-19T07:37:31.961] error: mysql_real_connect failed: 2002 Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
[2021-04-19T07:37:31.961] error: unable to re-connect to as_mysql database
[2021-04-19T07:37:31.961] error: mysql_real_connect failed: 2002 Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
[2021-04-19T07:37:31.961] error: unable to re-connect to as_mysql database
[2021-04-19T07:37:31.961] Unable to remove pidfile '/var/run/slurmdbd.pid': Permission denied
[2021-04-19T07:37:31.985] error: mysql_real_connect failed: 2002 Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
[2021-04-19T07:37:31.985] error: The database must be up when starting the MYSQL plugin. Trying again in 5 seconds.
[2021-04-19T07:37:37.535] Accounting storage MYSQL plugin loaded
[2021-04-19T07:37:37.538] slurmdbd version 15.08.7 started

tail -f /var/log/slurm/slurmd.log
[2021-04-19T06:16:21.409] _run_prolog: run job script took usec=7
[2021-04-19T06:16:21.409] _run_prolog: prolog with lock for job 47046 ran for 0 seconds
[2021-04-19T06:16:21.909] [47046.0] done with job
[2021-04-19T07:37:31.460] Slurmd shutdown completing
[2021-04-19T07:37:31.916] Message aggregation disabled
[2021-04-19T07:37:31.917] CPU frequency setting not configured for this node
[2021-04-19T07:37:31.917] Resource spec: Reserved system memory limit not configured for this node
[2021-04-19T07:37:31.917] slurmd version 15.08.7 started
[2021-04-19T07:37:31.918] slurmd started on Mon, 19 Apr 2021 07:37:31 +0000
[2021-04-19T07:37:31.918] CPUs=8 Boards=1 Sockets=8 Cores=1 Threads=1 Memory=64265 TmpDisk=245693 Uptime=77276 CPUSpecList=(null)

tail -f /var/log/slurm/slurm.log
[2021-04-19T07:37:32.135] Recovered state of 0 reservations
[2021-04-19T07:37:32.135] read_slurm_conf: backup_controller not specified.
[2021-04-19T07:37:32.135] cons_res: select_p_reconfigure
[2021-04-19T07:37:32.135] cons_res: select_p_node_init
[2021-04-19T07:37:32.135] cons_res: preparing for 2 partitions
[2021-04-19T07:37:32.135] Running as primary controller
[2021-04-19T07:37:32.135] Registering slurmctld at port 6817 with slurmdbd.
[2021-04-19T07:37:32.136] Recovered information about 0 sicp jobs
[2021-04-19T07:37:35.140] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=0
[2021-04-19T07:37:37.541] Registering slurmctld at port 6817 with slurmdbd.

Regards,
Henry

What error do you get when you run "srun ls -l" as the sen2agri-service user?

Dear @cudroiu ,

That’s what I got.

[sen2agri-service@teo-sen4capv2 ~]$ srun ls -l
total 4
drwxrwxr-x. 15 sen2agri-service sen2agri-service 4096 Jul 13 2020 miniconda3

Thanks and Regards,
Henry

Dear Henry,

It appears that SLURM is working now.
Could you try launching a custom job again? You can also try resuming the current one(s): go to System overview, click Pause for the job, wait a few seconds, refresh the page, and then click Resume.

Hope this helps.

Best regards,
Cosmin

Dear @cudroiu,

I have created new jobs and also tried resuming the existing ones. They are still stuck. :( Any advice?

Thanks and Regards,
Henry

Dear Henry,

Could you please provide the logs of sen2agri-orchestrator and sen2agri-executor, captured right after launching a custom job?
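
One possible way to capture them, assuming both run as systemd units with these names and log to the journal (collect_service_logs is a hypothetical helper name, not something shipped with the system):

```shell
# collect_service_logs OUTFILE [journalctl options...]
# Writes the journal entries of both services to OUTFILE.
collect_service_logs() {
    out=$1
    shift
    # Repeating -u selects entries from either unit; --no-pager avoids
    # opening an interactive pager when run from a terminal.
    journalctl -u sen2agri-orchestrator -u sen2agri-executor --no-pager "$@" > "$out"
}
```

Calling collect_service_logs /tmp/sen2agri-logs.txt --since "10 minutes ago" shortly after submitting the job should give a file that can be attached to a reply. Reading the system journal may require sudo or membership in a journal-reading group, depending on the setup.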

Best regards,
Cosmin

Dear @cudroiu,

It’s working now. You are the best! :)

I found the error below when I checked the status of sen2agri-orchestrator, so I restarted the service. It has been working as expected since then. Thanks a lot!

Apr 19 10:19:47 teo-sen4capv2.novalocal sen2agri-orchestrator[3357]: Network error while invoking executor function 1
Apr 19 10:39:07 teo-sen4capv2.novalocal sen2agri-orchestrator[3357]: Processing job submitted event with job id 17722 and processor id 3
Apr 19 10:39:08 teo-sen4capv2.novalocal sen2agri-orchestrator[3357]: Using L2A tile: /mnt/archive/maccs_def/prov14north/l2a/2019/03/21/S2A_MSIL2A_20190321T034531_N0207_R104_T47QPB_20190321T090904.SAFE/SENTINEL2A_20190321-040251-997_L2A_T47QPB_C_V1-0/SENTINEL2A_20190321-040251-997_L2A_T47QPB_C_V1-0_MTD_ALL.xml
Apr 19 10:39:08 teo-sen4capv2.novalocal sen2agri-orchestrator[3357]: SubmitTask took 131 ms
Apr 19 10:39:08 teo-sen4capv2.novalocal sen2agri-orchestrator[3357]: Processing task runnable event with processor id 3 task id 134067 and job id 17722
Apr 19 10:39:08 teo-sen4capv2.novalocal sen2agri-orchestrator[3357]: Network error while invoking executor function 1
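
For anyone who runs into the same symptom, here is a rough sketch of the check. It assumes the orchestrator is a systemd unit named sen2agri-orchestrator, as the log prefix above suggests; count_executor_errors is a hypothetical helper name:

```shell
# count_executor_errors: reads log text on stdin and prints how many lines
# report that the orchestrator could not reach the executor.
count_executor_errors() {
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c 'Network error while invoking executor function'
}

# Example use on a live system (left commented out on purpose):
#   journalctl -u sen2agri-orchestrator --since today --no-pager | count_executor_errors
#   sudo systemctl restart sen2agri-orchestrator
```

A non-zero count after a disk-full incident is a hint that the orchestrator has lost its connection to the executor. In this case, restarting the service was enough to clear it.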

Best Regards,
Henry