Compute worker dataset caching

Eric Carmichael requested to merge 334-dataset-caching into develop

@ mention of reviewers

@jimmykodes

A brief description of the purpose of the changes contained in this PR.

Adds dataset caching to speed up compute worker runs. (issue #334 (closed))

Known issues/discussion

  1. This originally cached everything, though submissions and maybe some other things probably don't need caching. It now caches only input_data and reference_data.
  2. Should I move these utility functions somewhere else? The structure is getting kind of gross. A simple "entry point" module (e.g. worker.py) with some peripheral modules (e.g. utils.py, run.py?) seems easier to grok.

A checklist for hand testing

  • submissions should process more quickly with large datasets, ideally tested with large AutoDL datasets
  • cache is cleaned when its size exceeds MAX_CACHE_DIR_SIZE_GB
  • changing MAX_CACHE_DIR_SIZE_GB env var works properly (converts str to float for comparison)
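
For reviewers hand testing the checklist above, here is a rough sketch of the behavior being checked; it is not the code in this MR. The function names, the 10 GB default, and the "empty the whole cache directory" cleanup strategy are all assumptions made for illustration. Only the MAX_CACHE_DIR_SIZE_GB env var and the str-to-float conversion come from the description.

```python
import os
import shutil

# Hypothetical default; the real default is not stated in this MR.
DEFAULT_MAX_CACHE_DIR_SIZE_GB = 10.0


def get_max_cache_size_gb():
    # Environment variables are always strings, so convert to float
    # before comparing against a measured size.
    return float(os.environ.get("MAX_CACHE_DIR_SIZE_GB", DEFAULT_MAX_CACHE_DIR_SIZE_GB))


def get_dir_size_bytes(path):
    # Sum the sizes of all regular files under path, skipping symlinks.
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path) and not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def clean_cache_if_too_large(cache_dir):
    # Assumed strategy: wipe every entry in cache_dir once the limit is exceeded.
    # Returns True if the cache was cleaned, False otherwise.
    max_bytes = get_max_cache_size_gb() * 1024 ** 3
    if get_dir_size_bytes(cache_dir) <= max_bytes:
        return False
    for entry in os.listdir(cache_dir):
        entry_path = os.path.join(cache_dir, entry)
        if os.path.isdir(entry_path) and not os.path.islink(entry_path):
            shutil.rmtree(entry_path)
        else:
            os.remove(entry_path)
    return True
```

To exercise the cleanup path without a huge dataset, set MAX_CACHE_DIR_SIZE_GB to a very small value (for example 0.000001) and confirm the cache directory is emptied on the next run.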

Checklist

  • Code review by me
  • Hand tested by me
  • I'm proud of my work
  • Code review by reviewer
  • Hand tested by reviewer
  • Ready to merge
