Cache data to avoid downloading it on each run of a script. Caching helps make your workflow reproducible, saves time, and is considerate of data providers.
To cache data, first check whether the file exists locally and download it only if it does not.
I recommend caching data in a `data/` directory. Always use `os.path.join` to specify paths so that your code works on multiple operating systems.
> [!tip]
> See [[project organization for data science]] for more suggestions on project organization.
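As a quick illustration of the path advice above, the snippet below builds a cache path with `os.path.join`. The file name `measurements.csv` is a made-up placeholder, not something from a real project.

```python
import os

# 'measurements.csv' is a hypothetical file name used only for illustration
file_path = os.path.join('data', 'measurements.csv')

# os.path.join inserts the separator for the current operating system
# (os.sep), so the same line works on Windows, macOS, and Linux
print(file_path)
```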
```python
import os

import requests


def download_and_cache(dir_name, file_name, url):
    """
    Downloads and caches data.

    Parameters
    ----------
    dir_name: str
        Directory name to save data
    file_name: str
        File name to save data
    url: str
        URL to data source

    Returns
    -------
    None
    """
    # Create directory
    if not os.path.exists(dir_name):
        print(f'Creating directory {dir_name}')
        os.makedirs(dir_name)

    # Download & cache data
    file_path = os.path.join(dir_name, file_name)
    if not os.path.exists(file_path):
        response = requests.get(url)  # Use appropriate download method
        print(response.status_code)
        response.raise_for_status()  # Avoid caching an error page
        # response.content is bytes, so open the file in binary mode
        with open(file_path, 'wb') as f:
            f.write(response.content)  # Use appropriate save method
    else:
        print(f'Data previously cached at {file_path}')
```
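The check-then-download logic can be tried without a network connection by swapping the download step for a stand-in. The sketch below is a simplified variant of the pattern, not the function above: `cache_file` and its `fetch` argument are hypothetical names, and `fetch` is assumed to be any callable that takes a URL and returns bytes (for example, `lambda u: requests.get(u).content`). The URL is a placeholder.

```python
import os
import tempfile


def cache_file(dir_name, file_name, url, fetch):
    """Return the cached file path, calling fetch(url) only on a cache miss.

    fetch is a hypothetical hook: any callable that takes a URL and
    returns the file contents as bytes.
    """
    os.makedirs(dir_name, exist_ok=True)  # no error if the directory exists
    file_path = os.path.join(dir_name, file_name)
    if not os.path.exists(file_path):
        content = fetch(url)  # download before opening the file
        with open(file_path, 'wb') as f:
            f.write(content)
    return file_path


calls = []


def fake_fetch(url):
    """Stand-in downloader that records each call instead of using the network."""
    calls.append(url)
    return b'a,b\n1,2\n'


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = os.path.join(tmp, 'data')
    url = 'https://example.com/example.csv'  # placeholder URL
    first = cache_file(cache_dir, 'example.csv', url, fake_fetch)
    second = cache_file(cache_dir, 'example.csv', url, fake_fetch)
    print(len(calls))  # the second call finds the cached file, so this prints 1
```

Passing the downloader in as an argument is one possible design choice; it keeps the caching logic independent of the download library and makes the cache-hit path easy to verify.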
## Cache in home directory
Some prefer to cache data in the user's home directory rather than the project directory. This is most common in teams that share analyses through Jupyter Notebooks: because you don't know where the user is running the Notebook from, you might not know where the data will be stored, whereas most operating systems provide a home directory. The downside of this approach is that the data are stored outside the project directory.
```python
import os
from pathlib import Path

import requests


def download_and_cache(dir_name, file_name, url):
    # Create directory under the user's home directory
    dir_path = os.path.join(Path.home(), dir_name)
    if not os.path.exists(dir_path):
        print(f'Creating directory {dir_path}')
        os.makedirs(dir_path)

    # Download and cache data
    file_path = os.path.join(dir_path, file_name)
    if not os.path.exists(file_path):
        # Use appropriate download method
        response = requests.get(url)
        response.raise_for_status()
        data = response.content
        # Use appropriate save method
        with open(file_path, 'wb') as f:
            f.write(data)
    else:
        # Use appropriate read method
        with open(file_path, 'rb') as f:
            data = f.read()
    return data
```
When using a Jupyter Notebook, you can also set the working directory relative to the user's home directory. This makes it easier to download from multiple sources, because each file can then be referenced by its name alone.
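```python
import os
from pathlib import Path
import requests

dir_path = os.path.join(Path.home(), dir_name)  # dir_name, file_name, url defined earlier
os.makedirs(dir_path, exist_ok=True)  # os.chdir fails if the directory is missing
os.chdir(dir_path)
if not os.path.exists(file_name):
    data = requests.get(url).content  # Use appropriate download method
    with open(file_name, 'wb') as f:  # Use appropriate save method
        f.write(data)
else:
    with open(file_name, 'rb') as f:  # Use appropriate read method
        data = f.read()
```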
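Because `os.chdir` affects every relative path used later in the Notebook, one alternative is to build full paths under the home directory instead of changing the working directory. The following is a sketch of that alternative for several sources at once. `cache_sources`, the `sources` mapping, the `fetch` callable, and the `base_dir` override are all hypothetical names introduced for illustration; `fetch` is assumed to take a URL and return bytes.

```python
import os
from pathlib import Path


def cache_sources(dir_name, sources, fetch, base_dir=None):
    """Cache every file in sources (file name -> URL) and return their paths.

    fetch is an assumed callable that takes a URL and returns bytes.
    base_dir defaults to the user's home directory.
    """
    base = Path.home() if base_dir is None else Path(base_dir)
    dir_path = os.path.join(base, dir_name)
    os.makedirs(dir_path, exist_ok=True)

    paths = {}
    for file_name, url in sources.items():
        file_path = os.path.join(dir_path, file_name)
        if not os.path.exists(file_path):  # download only on a cache miss
            content = fetch(url)
            with open(file_path, 'wb') as f:
                f.write(content)
        paths[file_name] = file_path
    return paths
```

The `base_dir` parameter exists mainly so the function can be pointed at a temporary directory when trying it out; leaving it as `None` keeps the home-directory behavior described in this section.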