Improve compute_worker.py - Dataset submission bundle stuck at scoring sometimes
TODOs
- Re-organize the code of compute_worker.py and add comments to each function and tricky part to make the code clearer.
- Wrap everything in try/except to catch every error
- Fix the size / timing problem with ingestion outputting big files (see symptoms below) (#1130)
Related problems:
- No logs when compute worker crashes (https://github.com/codalab/codabench/issues/902)
- We need to show the hostname of scoring job in server status (https://github.com/codalab/codabench/issues/744)
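As a rough illustration of the "catch every error" and "no logs on crash" items, here is a minimal sketch of a decorator that logs the full traceback together with the worker hostname before re-raising. The names `log_failures` and `run_step` are hypothetical and do not exist in compute_worker.py; the logger name is also an assumption.

```python
import functools
import logging
import socket

# Assumed logger name; the real worker may configure logging differently.
logger = logging.getLogger("compute_worker")


def log_failures(func):
    """Log any exception with its traceback and the hostname, then re-raise."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # logger.exception records the full traceback at ERROR level.
            logger.exception(
                "%s failed on host %s", func.__name__, socket.gethostname()
            )
            raise
    return wrapper


@log_failures
def run_step(should_fail):
    # Hypothetical stand-in for a worker step such as prepare().
    if should_fail:
        raise RuntimeError("simulated crash")
    return "ok"
```

Re-raising keeps the existing failure handling unchanged; the decorator only guarantees that something is written to the log first.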
Symptoms
- Submission stuck at scoring
- BadZipFile error in the compute worker when trying to unpack the output from the ingestion program
Examples are given below.
Lisheng's bundle
Lisheng's bundle, which is a modified Mini-AutoML with dataset submission, is sometimes stuck at scoring.
What we have investigated:
- If there is only one worker in the queue, then it works fine
- If there are multiple workers in the queue (default queue), then we have the problem, and we get this error message:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 385, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 650, in __protected_call__
    return self.run(*args, **kwargs)
  File "/compute_worker.py", line 91, in run_wrapper
    run.prepare()
  File "/compute_worker.py", line 655, in prepare
    zip_file = self._get_bundle(url, path, cache=cache_this_bundle)
  File "/compute_worker.py", line 363, in _get_bundle
    with ZipFile(bundle_file, 'r') as z:
  File "/usr/local/lib/python3.8/zipfile.py", line 1269, in __init__
    self._RealGetContents()
  File "/usr/local/lib/python3.8/zipfile.py", line 1336, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
Another possibility is that the restarting of the worker is what solved the issue for the new queue.
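For reference, one way to get this exact error message is to open an archive that was cut short: the zip end-of-central-directory record sits at the end of the file, so a partially written file is not recognized as a zip at all. The sketch below only reproduces the error with in-memory data. It does not prove that truncation is what happens on the worker, and the helper names are hypothetical.

```python
import io
import zipfile


def build_zip(payload: bytes) -> bytes:
    """Return the bytes of a small zip archive holding one file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("agent.bin", payload)
    return buffer.getvalue()


def try_open(data: bytes) -> str:
    """Return 'ok' if data opens as a zip, else the BadZipFile message."""
    try:
        with zipfile.ZipFile(io.BytesIO(data), "r"):
            return "ok"
    except zipfile.BadZipFile as exc:
        return str(exc)


complete = build_zip(b"\0" * 1000)
# Simulate a partial upload/download by keeping only the first half.
truncated = complete[: len(complete) // 2]

print(try_open(complete))
print(try_open(truncated))
```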
L2RPN
It seems that uploading/downloading the ingestion output is what is failing.
The same issue is happening with the Sim2Real track of L2RPN 2023. The saved agent, passed from ingestion to scoring, weighs 5 GB. I confirm that the problem does not happen when we do not save the 5 GB file. It is a problem of size and timing.
Hypothesis: since ingestion and scoring run in parallel, scoring may be trying to download and unpack the ingestion output too early, while the file is still not completely saved and zipped. When we try the same URL later, it works fine.
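If the hypothesis holds, a possible workaround is to re-download the bundle until it is a valid zip instead of failing on the first attempt. This is a minimal sketch, not the current behaviour of `_get_bundle`: `fetch_zip_with_retry` and its `download` callable are hypothetical, and the attempt count and delay are arbitrary.

```python
import time
import zipfile


def fetch_zip_with_retry(download, path, attempts=5, delay=1.0):
    """Call download(path) until path holds a valid zip, then open it.

    `download` is a hypothetical callable that writes the bundle to `path`.
    """
    for attempt in range(1, attempts + 1):
        download(path)
        # is_zipfile only checks for the end-of-central-directory record,
        # which is missing when the archive has been cut short.
        if zipfile.is_zipfile(path):
            return zipfile.ZipFile(path, "r")
        if attempt < attempts:
            time.sleep(delay * attempt)  # linear backoff: delay, 2*delay, ...
    raise zipfile.BadZipFile(
        f"{path} is still not a valid zip after {attempts} attempts"
    )
```

Retrying only hides the race. A more robust fix would likely be for scoring to wait until ingestion has finished uploading its output before downloading it.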