cache data

Cache data to avoid downloading it on each run of the script. Caching makes your workflow more reproducible, saves time, and is considerate of data providers. To cache data, first check whether the data exists locally and download it only if it does not. I recommend caching data in a `data/` directory. Always use `os.path.join` to specify paths so that your code works on multiple operating systems.

> [!tip]
> See [[project organization for data science]] for more suggestions on project organization.

```python
import os

import requests

def download_and_cache(dir_name, file_name, url):
    """
    Downloads and caches data.

    Parameters
    ----------
    dir_name: str
        Directory name to save data
    file_name: str
        File name to save data
    url: str
        URL to data source

    Returns
    -------
    None
    """
    # Create directory
    if not os.path.exists(dir_name):
        print(f'Creating directory {dir_name}')
        os.makedirs(dir_name)

    # Download & cache data
    file_path = os.path.join(dir_name, file_name)
    if not os.path.exists(file_path):
        response = requests.get(url)  # Use appropriate download method
        response.raise_for_status()  # Do not cache an error response
        # response.content is bytes, so open the file in binary mode
        with open(file_path, 'wb') as f:
            f.write(response.content)  # Use appropriate save method
        print(response.status_code)
    else:
        print(f'Data previously cached at {file_path}')
```

## Cache in home directory

Some prefer to cache data in the user's home directory rather than the project directory. Most operating systems have a home directory, so the location is predictable; the downside of this approach is that the data are stored outside of the project directory. This approach is most common in teams that share analyses through Jupyter Notebooks: because you don't know where a user will run the Notebook from, a path relative to the working directory could store the data somewhere unexpected.

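```python
import os
from pathlib import Path

import requests

def download_and_cache(dir_name, file_name, url):
    # Create directory inside the user's home directory
    dir_path = os.path.join(Path.home(), dir_name)
    if not os.path.exists(dir_path):
        print(f'Creating directory {dir_path}')
        os.makedirs(dir_path)

    # Download and cache data
    file_path = os.path.join(dir_path, file_name)
    if not os.path.exists(file_path):
        # Use appropriate download method
        response = requests.get(url)
        response.raise_for_status()
        # Use appropriate save method
        with open(file_path, 'wb') as f:
            f.write(response.content)

    # Use appropriate read method; works for new and cached files alike
    with open(file_path, 'rb') as f:
        data = f.read()
    return data
```

When using a Jupyter Notebook, you can also set the working directory relative to the user's home directory. This makes it easier to download from multiple sources, since each file can then be referenced by its file name alone. The snippet below assumes that `dir_name`, `file_name`, and `url` are already defined and that `os`, `Path`, and `requests` are imported as above.

```python
dir_path = os.path.join(Path.home(), dir_name)
os.makedirs(dir_path, exist_ok=True)
os.chdir(dir_path)

# Use appropriate download, save, and read methods
if not os.path.exists(file_name):
    response = requests.get(url)
    response.raise_for_status()
    with open(file_name, 'wb') as f:
        f.write(response.content)
    data = response.content
else:
    with open(file_name, 'rb') as f:
        data = f.read()
```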
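
One way to reuse the check-then-download pattern across data sources is to pass the download step in as a function. The sketch below is only an illustration and not part of the functions in this note: `load_cached`, `fetch`, and `fake_download` are hypothetical names, and `fake_download` stands in for a real download (for example, a function that returns `requests.get(url).content`) so that the example runs without network access.

```python
import os
import tempfile


def load_cached(file_path, fetch):
    """Return the bytes stored at file_path, calling fetch() only on a cache miss.

    `fetch` is a hypothetical zero-argument callable that returns bytes.
    """
    # Cache hit: read the local copy and skip the download
    if os.path.exists(file_path):
        with open(file_path, 'rb') as f:
            return f.read()

    # Cache miss: download, create the parent directory if needed, then save
    data = fetch()
    parent = os.path.dirname(file_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(file_path, 'wb') as f:
        f.write(data)
    return data


# Stand-in for a real download; records how many times it is called
calls = []


def fake_download():
    calls.append(1)
    return b'col_a,col_b\n1,2\n'


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data', 'example.csv')
    first = load_cached(path, fake_download)   # cache miss: calls fake_download
    second = load_cached(path, fake_download)  # cache hit: reads the saved file

print(first == second, len(calls))  # prints: True 1
```

Keeping the download step separate from the caching logic means the same helper can wrap any data source, and the cache behavior can be checked without contacting the data provider.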