Compute worker dataset caching
@ mention of reviewers
@jimmykodes
A brief description of the purpose of the changes contained in this PR.
Adds dataset caching to speed up compute worker runs (issue #334, closed).
Known issues/discussion
- This cached everything; it could probably skip submissions and maybe some other things? Update: it now only caches input_data and reference_data.
- Should I move these utility functions somewhere else? The structure is getting kind of gross. A simple "entry point" module (e.g. worker.py) with some peripheral modules (e.g. utils.py, run.py?) seems easier to grok.
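For discussion, here is a rough sketch of the "only cache input_data and reference_data" behaviour described in the first point. It is not the code in this PR: `get_dataset`, `CACHED_KINDS`, the `fetch` callable and the use of `key` as a file name are all hypothetical names chosen for illustration.

```python
import os
import tempfile

# Assumption: these are the only dataset kinds worth caching, per the note above.
CACHED_KINDS = {"input_data", "reference_data"}


def get_dataset(kind, key, fetch, cache_dir):
    """Return a local path for a dataset, reusing a cached copy when allowed.

    `fetch(dest_path)` is a hypothetical callable that writes the dataset
    to `dest_path`. `key` is assumed to be a safe, unique file name.
    """
    if kind not in CACHED_KINDS:
        # Not cacheable (e.g. submissions): always fetch into a fresh temp file.
        fd, path = tempfile.mkstemp()
        os.close(fd)
        fetch(path)
        return path

    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, key)
    if not os.path.exists(path):
        # Cache miss: fetch once, later calls reuse the file.
        fetch(path)
    return path
```

Keeping a helper like this in a small peripheral module (something like utils.py) would leave the entry point module focused on running submissions.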
A checklist for hand testing
- submissions should process more quickly with large datasets, ideally test with large AutoDL datasets
- cache cleaned when cache size exceeds MAX_CACHE_DIR_SIZE_GB
- changing MAX_CACHE_DIR_SIZE_GB env var works properly (converts str to float for comparison)
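To make the last two checks concrete, here is a minimal sketch of the size check, assuming the cache is a plain directory that gets wiped entirely once it exceeds the limit. The function names, the 10 GB default and the wipe-everything strategy are assumptions for illustration; only the MAX_CACHE_DIR_SIZE_GB variable and the str-to-float conversion come from this PR.

```python
import os
import shutil


def max_cache_size_gb(default=10.0):
    # Env vars are always strings, so convert to float before comparing.
    # The default of 10.0 is an assumption, not a value from the PR.
    return float(os.environ.get("MAX_CACHE_DIR_SIZE_GB", default))


def dir_size_gb(path):
    """Total size of regular files under `path`, in GB (1024**3 bytes)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total / (1024 ** 3)


def clean_cache_if_needed(cache_dir):
    """Wipe and recreate `cache_dir` if it exceeds the limit. Returns True if cleaned."""
    if not os.path.isdir(cache_dir):
        return False
    if dir_size_gb(cache_dir) > max_cache_size_gb():
        shutil.rmtree(cache_dir)
        os.makedirs(cache_dir)
        return True
    return False
```

For hand testing, setting MAX_CACHE_DIR_SIZE_GB to a very small value (for example `0.000001`) should trigger a clean on the next check without needing a genuinely large dataset.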
Checklist
- Code review by me
- Hand tested by me
- I'm proud of my work
- Code review by reviewer
- Hand tested by reviewer
- Ready to merge