Description
Hopefully this solves several points: #2223
Containers not removed
- 11/02/2026: submission containers staying up forever
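One way to stop containers from piling up could be to combine `docker run --rm` with an unconditional cleanup in a `finally` block. This is a minimal sketch, not the worker's actual code: `image`, `cmd`, and `name` are hypothetical stand-ins for whatever the real worker passes to `docker run`.

```python
import subprocess

def build_run_command(image, cmd, name):
    # --rm asks the Docker daemon to delete the container when it exits,
    # so finished or crashed submissions do not stay up forever.
    return ["docker", "run", "--rm", "--name", name, image] + cmd

def run_and_cleanup(image, cmd, name):
    try:
        subprocess.run(build_run_command(image, cmd, name), check=True)
    finally:
        # Belt and braces: force-remove by name in case --rm did not take
        # effect (e.g. the worker process was killed mid-run).
        subprocess.run(["docker", "rm", "-f", name],
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
```

Naming containers after the submission also makes stale ones easy to find with `docker ps`.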
Wrong log when storage is full
When docker pull fails because of full storage, we have no clear logs.
See:
- Have the right error logs, and surface them in the platform's UI
- Detect errors in `_get_container_image()`; when it fails, the submission currently gets stuck in the Running state.
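A sketch of what detecting pull errors could look like, assuming the worker shells out to `docker pull` (the helper names and the exception type are illustrative, not the existing API): interpret the pull result explicitly and raise a typed error, so the caller can mark the submission Failed instead of leaving it in Running.

```python
import subprocess

class DockerImagePullError(Exception):
    """Raised when `docker pull` fails, so the caller can mark the
    submission Failed instead of leaving it stuck in Running."""

def interpret_pull_result(image, returncode, stderr):
    # Pure helper so the error path is easy to unit-test without Docker.
    if returncode == 0:
        return None
    msg = f"Pull for image: {image} returned exit code {returncode}: {stderr.strip()}"
    if "no space left on device" in stderr.lower():
        msg += " (worker disk is full)"
    return msg

def pull_image(image):
    proc = subprocess.run(["docker", "pull", image],
                          capture_output=True, text=True)
    error = interpret_pull_result(image, proc.returncode, proc.stderr)
    if error:
        raise DockerImagePullError(error)
```

Special-casing "no space left on device" gives the full-storage failure a clear message instead of a generic non-zero exit code.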
Progress bar
Related: `show_progress` and the progress bar add to the mess:
- Make `show_progress()` more robust (not treating missing keys as errors)
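A defensive version could treat missing or malformed payloads as "no update" rather than errors. This is a sketch under assumed field names (`current`, `total`), not the worker's actual schema:

```python
def show_progress(payload):
    """Render a progress bar tolerantly: missing keys or non-dict
    payloads mean 'nothing to show yet', not an error."""
    if not isinstance(payload, dict):
        return None  # e.g. a bare int slipping through, as seen in the logs
    current = payload.get("current")
    total = payload.get("total")
    if current is None or not total:
        return None  # incomplete update; skip silently
    percent = min(100, int(100 * current / total))
    return f"[{'#' * (percent // 10):<10}] {percent}%"
```

Returning `None` for incomplete updates keeps the loop quiet instead of spamming ERROR lines.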
Sometimes it also produces a lot of errors like this:

```
2026-02-28 02:38:37.854 | ERROR | compute_worker:show_progress:137 - There was an error showing the progress bar
2026-02-28 02:38:37.854 | ERROR | compute_worker:show_progress:138 - 6
2026-02-28 02:38:37.955 | ERROR | compute_worker:show_progress:137 - There was an error showing the progress bar
2026-02-28 02:38:37.955 | ERROR | compute_worker:show_progress:138 - 1
```

Logs
- Sometimes no submission logs
- Add logs at the start of submission container with metadata of the competition and submission
- Add a clear log in the compute worker container with the competition title when receiving a submission
- Similarly to other problems reported, sometimes we only have "Time limit exceeded" and no other logs (e.g. Stuck at "Preparing submission... this may take a few moments.." #1994)
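The two "add logs" items above could amount to one grep-able banner line per submission. A sketch, assuming a dict-like task payload whose keys (`submission_id`, `competition_title`, `phase_id`) are illustrative:

```python
import logging
import socket

logger = logging.getLogger("compute_worker")

def format_submission_banner(run_args, hostname):
    # Pure formatter, so the banner is easy to unit-test.
    return ("Received submission {sid} for competition '{title}' "
            "(phase {phase}) on worker {host}").format(
        sid=run_args.get("submission_id", "?"),
        title=run_args.get("competition_title", "?"),
        phase=run_args.get("phase_id", "?"),
        host=hostname,
    )

def log_submission_received(run_args):
    # One clear INFO line at the very start of handling a submission,
    # so there is always at least this trace even if the run dies early.
    logger.info(format_submission_banner(run_args, socket.gethostname()))
```

Emitting this before any download or docker call would guarantee some log output even in the "Time limit exceeded with no other logs" cases.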
No space left
How should we manage the disks? Should we limit Docker image sizes?
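Whatever the policy, a cheap first step could be checking free space before pulling, so a full disk produces a clear error up front instead of a cryptic pull failure. A sketch; the path and the 10 GiB threshold are assumptions:

```python
import shutil

MIN_FREE_BYTES = 10 * 1024**3  # assumed threshold: 10 GiB

def enough_disk_for_pull(path="/var/lib/docker", min_free=MIN_FREE_BYTES):
    """Return True if the filesystem holding `path` has at least
    `min_free` bytes available before we attempt a docker pull."""
    usage = shutil.disk_usage(path)
    return usage.free >= min_free
```

If this returns False, the worker can log "worker disk is full" and fail the submission cleanly rather than hitting `OSError: [Errno 28]` mid-run.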
Submissions not marked as Failed
Submissions stuck in "Running" or "Scoring" status
- Submissions stuck in "Scoring" state instead of "Failed" when the compute worker crashes (Worker status to FAILED instead of SCORING or FINISHED in case of failure #2030)
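A server-side watchdog could catch this class of problem: if a submission sits in a non-terminal state with no worker heartbeat for too long, mark it Failed. A minimal sketch; the status names follow the UI labels above, and the 2-hour timeout is an assumption:

```python
import time

STUCK_TIMEOUT_SECONDS = 2 * 3600  # assumed: 2 hours without news means the worker died

NON_TERMINAL = {"Preparing", "Running", "Scoring"}

def should_mark_failed(status, last_update_ts, now=None,
                       timeout=STUCK_TIMEOUT_SECONDS):
    """True if a submission has been silent in a non-terminal state
    longer than `timeout`, and should be flipped to Failed."""
    if status not in NON_TERMINAL:
        return False
    now = time.time() if now is None else now
    return (now - last_update_ts) > timeout
```

Run periodically on the server, this would cover worker crashes that never send a final status update.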
Related issues:
- submission in "Scoring" status for multiple hours on default queue #1184
- Stuck at "Preparing submission... this may take a few moments.." #1994
- Similarly, it looks like the status gets stuck at "Preparing" when failing during this process.
Example failure during "Preparing":

```
[2025-09-18 11:25:05,234: ERROR/ForkPoolWorker-2] Task compute_worker_run[fd956bf5-3e2d-4168-ab48-f0896dc80993] raised unexpected: OSError(28, 'No space left on device')
Traceback (most recent call last):
[...]
OSError: [Errno 28] No space left on device
```

Duplication of submission files
To check
The log level is defined in this way in compute_worker.py:

```python
configure_logging(
    os.environ.get("LOG_LEVEL", "INFO"), os.environ.get("SERIALIZED", "false")
)
```

Generally we want as many logs as possible, so we may want to default to the "DEBUG" log level.
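Changing the default is one line, but it may also be worth tolerating typos in the env var rather than crashing at startup. A sketch of such a resolver (the function name is hypothetical; only `LOG_LEVEL` comes from the existing code):

```python
import os

VALID_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}

def resolve_log_level(env=os.environ, default="DEBUG"):
    """Default to DEBUG (most verbose) and fall back to the default
    when LOG_LEVEL is unset or not a recognized level name."""
    level = env.get("LOG_LEVEL", default).upper()
    return level if level in VALID_LEVELS else default
```

The result would then be passed to `configure_logging` in place of the raw env lookup.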
Directory structure problem
Docker pull failing
- Docker pull failing with:

```
Pull for image: codalab/codalab-legacy:py39 returned a non-zero exit code! Check if the docker image exists on docker hub.
```
Related issues:
- submission in "Scoring" status for multiple hours on default queue #1184
- Solution is always running #1278
- Submission stuck on scoring status (Twice) #1263
Solution:
- To have more logs, we need to update `compute_worker.py` so we print more logs in the logger (More logs when docker pull fails in compute_worker.py #1283).
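One way to get those logs could be streaming the pull output into the worker logger as it arrives, so a failed pull leaves a full trace instead of only the generic non-zero-exit message. A sketch; the `docker_cmd` parameter is there only to make the helper testable without Docker:

```python
import logging
import subprocess

logger = logging.getLogger("compute_worker")

def pull_with_logs(image, docker_cmd=("docker", "pull")):
    """Run `docker pull <image>`, forwarding every output line to the
    logger, and log the collected output again at ERROR on failure."""
    proc = subprocess.Popen(
        list(docker_cmd) + [image],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True,
    )
    lines = []
    for line in proc.stdout:
        line = line.rstrip()
        lines.append(line)
        logger.info("docker pull %s: %s", image, line)
    code = proc.wait()
    if code != 0:
        logger.error("Pull for image: %s returned exit code %s:\n%s",
                     image, code, "\n".join(lines))
    return code
```

Merging stderr into stdout (`stderr=subprocess.STDOUT`) keeps the daemon's error text in the same stream, so "no space left on device" style failures show up in the log.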
Logs at the wrong place
- Docker pull errors during scoring are written in the ingestion stderr instead of the scoring stderr (Docker pull in scoring #1204)
Solved by: Show error in scoring std_err #1214
No hostname in server status when status is "Preparing"
- The "Preparing" status means that the worker is downloading the necessary data and programs to run the submission. We should already have a hostname in the server status page during this phase, but it is not the case. (fixed in Worker status to FAILED instead of SCORING or FINISHED in case of failure #2030)
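Including the hostname is cheap: the worker knows it from the moment it picks up the task. A sketch of a "Preparing" status payload that carries it (the field names are illustrative, not the actual API):

```python
import socket

def preparing_status_update(submission_id):
    """Status payload sent as soon as a submission enters 'Preparing',
    so the server status page can already show which worker took it."""
    return {
        "submission_id": submission_id,
        "status": "Preparing",
        "hostname": socket.gethostname(),
    }
```

Sending this before any download starts would let the server status page show the hostname during the whole "Preparing" phase.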