Where to store datasets

A stub of a guide to come. TL;DR: if the dataset is small and non-proprietary, you can keep it in the repo alongside your code. Otherwise, store it externally (for sensitive data, presumably on networked storage). Refer to the data location through an environment variable rather than a hard-coded path, so that whoever runs the code can specify it at runtime and, for example, point it at a small sample dataset during code review.
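As a minimal sketch of the environment-variable approach, the snippet below resolves the data location at runtime. The variable name `DATA_DIR`, the fallback path `data/sample`, and the helper `get_data_dir` are all hypothetical names chosen for illustration; adapt them to whatever convention your project uses.

```python
import os
from pathlib import Path

# Assumption: a small, non-sensitive sample dataset lives in the repo at this path.
DEFAULT_DATA_DIR = Path("data/sample")


def get_data_dir(env_var: str = "DATA_DIR", default: Path = DEFAULT_DATA_DIR) -> Path:
    """Return the directory named by the environment variable, or the default.

    An unset or empty variable falls back to the in-repo sample data.
    """
    value = os.environ.get(env_var)
    return Path(value) if value else default
```

With this in place, someone reviewing the code can run it with no configuration and get the sample data, while a full run might set the variable first, e.g. `DATA_DIR=/path/to/full/dataset python analysis.py` (script name and path are placeholders).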

That said, Git LFS is worth investigating for version-controlling large datasets.
