[[good software]]

Deploy first, write tests, then develop. This turns [[Waterfall]] on its head.

[[Kernighan's Law]]

Software development lifecycle

## programming paradigm

There are many approaches to software engineering, and these are typically described as programming paradigms. Examples include

- Imperative
    - Procedural (Fortran, C, [[Python]])
    - [[object oriented programming]] (C#, [[Java]], C++, Ruby, also Python)
- Declarative
    - Logic (Prolog, Datalog)
    - Functional (Lisp, Haskell, Clojure, F#)

## design pattern

A design pattern is a repeatable approach to solving problems in [[software engineering]]. If you can reduce a real-world problem to one that a design pattern already solves, you can reuse the existing solution. Using design patterns also facilitates communication within software development teams.

[[functional decomposition]]
[[use case]]
[[requirements]]
[[the V-model]]
[[Waterfall]]
[[Agile]]
[[minimum viable product]]
[[service level agreement]]
[[refactor]]
[[test driven development]]
[[continuous integration]]
[[continuous delivery]]
[[retrospective]]
[[best practices in software engineering]]

## advanced concepts

- Environment management
    - optional and group dependencies (dev/prod)
- project organization/file management
- Version control (git, GitHub, branches, git commit comments)
- Collaborating (GitHub issues, forking, pull requests, upstream remotes)
- Ruff/linting
- Testing
- Logging
- Make/Poe
- Package as a python package
    - editable imports
- containers
- documentation
    - README, requirements, spec, etc.
- collaborating (GitHub, Issues, PR, etc.)
- CI/CD, DevOps
- [[scope]] and software estimation

# Best Practices in Code Development for Data Science

*ChatGPT list*

## 1. Version Control & Collaboration

- **Git basics**
    - Branching strategy (feature branches, `main`/`develop`)
    - Commit messages (clear, imperative style, e.g., *"Add parser for minerU structured output"*)
    - Pull requests & code review
- **GitHub workflow**
    - Remote repositories (push/pull)
    - Using issues and project boards
    - Tags/releases for versioned milestones

## 2. Software Engineering Practices

- **Test-Driven Development (TDD)**
    - Write tests before implementation
    - Unit tests for small methods
    - Integration tests for full parser → ChromaDB flow
- **Object-Oriented Design**
    - Classes for Parser, Chunk, Storage, etc.
    - Encapsulation of functionality
    - Interfaces/abstract base classes for extensibility
- **Type Hints & Docstrings**
    - PEP 484 typing (`list[str]`, `Optional[dict[str, Any]]`)
    - PEP 257 docstring conventions
    - Tools: `mypy`, `pydoc`, IDE integration
- **Logging**
    - `logging` module instead of `print`
    - Log levels: DEBUG, INFO, WARNING, ERROR
    - Log to file vs console for debugging and production

## 3. Project Structure

- **Python package layout**
    - `src/` or flat layout (`src/parser/`, `src/storage/`)
    - `tests/` for unit and integration tests
- **Importable library design**
    - `__init__.py` for clean namespace
    - Editable install with `pip install -e .` (or `uv pip install -e .`)
- **Configuration management**
    - Config files (`config.yaml`, `.env`)
    - Avoid hard-coding paths or API keys

## 4. Testing & Quality Assurance

- **Testing frameworks**
    - `pytest` (fixtures, parametrization, mocks)
    - Coverage tools (`pytest-cov`)
- **Continuous Integration (CI)**
    - GitHub Actions for automated testing
    - Linting and formatting checks in CI
- **Code Quality**
    - Linters: `flake8`, `ruff`
    - Formatters: `black`
    - Static type checks: `mypy`

## 5. Documentation & Communication

- **Code documentation**
    - Inline comments for tricky logic
    - Docstrings for functions/classes
- **External documentation**
    - README with installation & usage
    - CONTRIBUTING.md if collaborative
- **API Documentation**
    - Tools like `Sphinx` or `mkdocs` if project grows

## 6. Development Environment

- **Environment management**
    - `uv`/`mamba`/`venv` for reproducible environments
    - `pyproject.toml` for dependencies
- **Editable installs**
    - `pip install -e .` for live development
- **Pre-commit hooks**
    - Auto-run `black`, `ruff`, and tests on commit

## 7. Data Science–Specific Practices

- **Experiment tracking**
    - Save parser configurations (chunk size, filters) in YAML/JSON
    - Log metrics for chunking quality
- **Reproducibility**
    - Deterministic runs (random seeds)
    - Versioning of datasets and embeddings
- **Integration with ChromaDB**
    - Schema for metadata
    - Tests for persistence and retrieval
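
The experiment-tracking and reproducibility items in section 7 might look roughly like the sketch below. It uses only the standard library and JSON (YAML would need a third-party package). `ParserConfig`, its fields, `save_config`, `load_config`, and `sample_chunks` are hypothetical names chosen for illustration, not part of any existing parser.

```python
import json
import random
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class ParserConfig:
    """Hypothetical parser settings; the field names are illustrative."""

    chunk_size: int = 512
    min_chunk_chars: int = 20
    seed: int = 42


def save_config(config: ParserConfig, path: Path) -> None:
    """Write the configuration as JSON so a run can be reproduced later."""
    path.write_text(json.dumps(asdict(config), indent=2, sort_keys=True))


def load_config(path: Path) -> ParserConfig:
    """Read a configuration previously written by save_config."""
    return ParserConfig(**json.loads(path.read_text()))


def sample_chunks(chunks: list[str], k: int, seed: int) -> list[str]:
    """Pick k chunks for a spot check; the same seed gives the same sample."""
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return rng.sample(chunks, k)
```

Storing the seed next to the other settings means one file describes the whole run. A seeded local `random.Random` instance keeps the sampling deterministic without affecting other code that uses the global generator.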
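
The type-hint, docstring, and logging items in section 2 can be combined in one small function. This is a minimal sketch: `chunk_text` is a hypothetical helper that splits on fixed character counts, whereas a real chunker would likely respect sentence or section boundaries.

```python
import logging

# Module-level logger; handlers and levels are configured by the application.
logger = logging.getLogger(__name__)


def chunk_text(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size characters.

    Hypothetical helper for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    logger.debug("Split %d characters into %d chunks", len(text), len(chunks))
    return chunks


if __name__ == "__main__":
    # Console logging at DEBUG for development; a FileHandler could be
    # added instead when logs should go to a file.
    logging.basicConfig(level=logging.DEBUG)
    logger.info("Chunks: %s", chunk_text("abcdefg", 3))
```

Under TDD, a test such as `assert chunk_text("abcdefg", 3) == ["abc", "def", "g"]` would be written before the function body.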