Data science is a science because it involves experimentation. The best machine learning models, hyperparameters, and architectures cannot (typically) be known *a priori*. The data scientist must experiment to find the best approach.
The first requirement then is an evaluation framework. Before even starting to code, the data scientist will select one or more evaluation metrics. If a simple metric is not available, an evaluation approach should be defined. All experiments will be measured against this metric or approach.
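As a minimal sketch of such a framework (assuming a classification task with macro-F1 chosen as the metric; the function name is illustrative), every experiment can report its results through one shared function:

```python
def macro_f1(y_true: list, y_pred: list) -> float:
    """Macro-averaged F1: the unweighted mean of per-class F1 scores."""
    labels = set(y_true) | set(y_pred)
    f1s = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```

In practice you would call an equivalent from `scikit-learn`; the point is that every experiment is scored by the same function, so numbers are comparable across runs.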
Next, the data scientist must set up a system for tracking experiments conducted. Which model was used, with which hyperparameters, in which architectures? Configuration files can help both track experiments and quickly adjust the machine learning approach.
Configuration files are typically written in YAML (the most popular in ML), TOML, or JSON.
For more power, consider one of these libraries or services.
- Hydra: composes configs, supports command-line overrides, and creates a working directory per run
- OmegaConf: typed, hierarchical configs; the configuration backbone of Hydra
- MLflow: tracks configs, metrics, and artifacts automatically
- [[Weights and Biases]]: a hosted alternative to MLflow, popular for deep learning
- Sacred: lightweight experiment tracker
## YAML configuration file
[[YAML]] is a popular choice for writing configuration files.
Create a `config.yaml` file with your configuration settings in the root of your project.
```yaml
experiment:
  name: "bert-ft-lr2e-5"
  seed: 42

model:
  pretrained: "bert-base-uncased"
  num_labels: 20

training:
  batch_size: 16
  epochs: 4
  learning_rate: 2.0e-5  # PyYAML needs the decimal point to parse this as a float
  eval_strategy: "epoch"

data:
  train_file: "./data/train.csv"
  val_file: "./data/val.csv"
```
In your `train.py` file, load the configuration with the `yaml` library and read in configuration settings.
```python
import yaml

def load_config(path: str) -> dict:
    """Load a YAML configuration file into a dictionary."""
    with open(path, "r") as f:
        return yaml.safe_load(f)

cfg = load_config("config.yaml")
lr = cfg["training"]["learning_rate"]
```
At the start of each run:
1. Copy the config into the run's results folder, so you always know exactly which parameters produced those results.
2. Archive a copy under `configs/`, renamed with the timestamp and experiment name.

Each archived config is a snapshot of how you trained, making it fast to repeat a previous experiment.
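Both copies can be made with a few lines of standard-library Python at the start of the training script (the helper name and folder layout below are illustrative, matching the tree that follows):

```python
import shutil
from datetime import date
from pathlib import Path

def snapshot_config(config_path: str, experiment_name: str) -> Path:
    """Copy the config into the run's results folder and archive it under configs/."""
    run_id = f"{date.today().isoformat()}-{experiment_name}"

    # 1. copy into the results folder for this run
    results_dir = Path("results") / run_id
    results_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(config_path, results_dir / "config.yaml")

    # 2. archive under configs/, renamed with timestamp and experiment name
    Path("configs").mkdir(exist_ok=True)
    shutil.copy(config_path, Path("configs") / f"{run_id}.yaml")
    return results_dir
```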
```
.
+-- config.yaml                         # current run
+-- results/
|   +-- 2025-08-28-bert-ft-lr2e-5/
|       +-- config.yaml
|       +-- metrics.json
|       +-- logs.txt
|       +-- checkpoint.pt
+-- configs/
    +-- 2025-08-28-bert-ft-lr2e-5.yaml  # renamed and archived
```
Use `argparse` to run `train.py` from the command line and easily swap configs.
```python
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--config", required=True)
args = parser.parse_args()
cfg = load_config(args.config)
```
Run the script with
```bash
python train.py --config config.yaml
```
## TOML configuration file
[[TOML]] is another popular format for configuration files.
Create a `config.toml` file in the root of your project.
```toml
[llm]
model = "gemma3"
base_url = "http://localhost:11434/v1/chat/completions"

[chunking]
chunk_size = 100  # number of words per chunk
```
In `train.py`, read the configuration file with the `tomli` library (available in the standard library as `tomllib` from Python 3.11).
```python
import tomli  # on Python 3.11+: import tomllib as tomli

config_file = "config.toml"

# TOML files must be opened in binary mode
with open(config_file, "rb") as f:
    config = tomli.load(f)

# typical configuration lookups
model = config["llm"]["model"]
base_url = config["llm"]["base_url"]

# configuration with a fallback value
chunk_size = config.get("chunking", {}).get("chunk_size", 500)
```
The same patterns used for YAML configuration files (copying the config at the start of each run, passing the config path on the command line) also apply to TOML configuration files.
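A step beyond raw dictionary lookups, and equally applicable to YAML or TOML, is to validate the loaded settings with a dataclass. This is a sketch of that idea; the field names mirror the `[chunking]` table above, and the `from_dict` helper is an assumption, not a library API:

```python
from dataclasses import dataclass

@dataclass
class ChunkingConfig:
    chunk_size: int = 500  # default mirrors the dictionary .get() fallback

    @classmethod
    def from_dict(cls, raw: dict) -> "ChunkingConfig":
        # fail fast on typos instead of silently ignoring unknown keys
        unknown = set(raw) - {"chunk_size"}
        if unknown:
            raise KeyError(f"Unknown chunking settings: {unknown}")
        return cls(**raw)

chunking = ChunkingConfig.from_dict({"chunk_size": 100})
```

Misspelled or missing keys now surface at startup rather than mid-training, which matters when runs take hours.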