[[good software]]
Deploy first, write tests, then develop. This turns [[Waterfall]] on its head.
[[Kernighan's Law]]
Software development lifecycle
## programming paradigm
There are many approaches to software engineering, typically described as programming paradigms. Examples include:
- Imperative
  - Procedural (Fortran, C, [[Python]])
  - [[object oriented programming]] (C#, [[Java]], C++, Ruby, also Python)
- Declarative
  - Logic (Prolog, Datalog)
  - Functional (Lisp, Haskell, Clojure, F#)
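To make the difference between paradigms concrete, here is a small sketch that computes the same result (the sum of the squares of the even numbers) in an imperative style and in a functional style. The function names are invented for this note and are only an illustration.

```python
from functools import reduce


def sum_even_squares_imperative(numbers: list[int]) -> int:
    """Imperative style: update an accumulator one step at a time."""
    total = 0
    for n in numbers:
        if n % 2 == 0:
            total += n * n
    return total


def sum_even_squares_functional(numbers: list[int]) -> int:
    """Functional style: compose filter, map, and reduce without mutation."""
    evens = filter(lambda n: n % 2 == 0, numbers)
    squares = map(lambda n: n * n, evens)
    return reduce(lambda acc, sq: acc + sq, squares, 0)
```

Both functions return the same value for the same input; what differs is whether the computation is described as a sequence of state changes or as a composition of transformations.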
## design pattern
A design pattern is a repeatable approach to solving a recurring problem in [[software engineering]]. If you can reduce a real-world problem to one that a known design pattern already solves, you can reuse the existing solution. Design patterns also give development teams a shared vocabulary, which makes communication easier.
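As one example, the Strategy pattern lets the caller swap an algorithm without changing the code that uses it. The sketch below assumes a hypothetical text-chunking task; `ChunkingStrategy`, `SentenceChunker`, `FixedSizeChunker`, and `chunk_document` are names made up for illustration.

```python
from typing import Protocol


class ChunkingStrategy(Protocol):
    """Anything with a compatible split() method counts as a strategy."""

    def split(self, text: str) -> list[str]: ...


class SentenceChunker:
    """Hypothetical strategy: naive split on full stops."""

    def split(self, text: str) -> list[str]:
        return [s.strip() for s in text.split(".") if s.strip()]


class FixedSizeChunker:
    """Hypothetical strategy: pieces of at most `size` characters."""

    def __init__(self, size: int) -> None:
        self.size = size

    def split(self, text: str) -> list[str]:
        return [text[i:i + self.size] for i in range(0, len(text), self.size)]


def chunk_document(text: str, strategy: ChunkingStrategy) -> list[str]:
    """The caller depends only on the interface, not on a concrete chunker."""
    return strategy.split(text)
```

Calling `chunk_document("One. Two.", SentenceChunker())` and `chunk_document("abcdef", FixedSizeChunker(4))` exercises two different algorithms through the same entry point.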
[[functional decomposition]]
[[use case]]
[[requirements]]
[[the V-model]]
[[Waterfall]]
[[Agile]]
[[minimum viable product]]
[[service level agreement]]
[[refactor]]
[[test driven development]]
[[continuous integration]]
[[continuous delivery]]
[[retrospective]]
[[best practices in software engineering]]
## advanced concepts
- Environment management
- optional and group dependencies (dev/prod)
- project organization/file management
- Version control (git, GitHub, branches, git commit comments)
- Collaborating (GitHub issues, forking, pull requests, upstream remotes)
- Ruff/linting
- Testing
- Logging
- Make/Poe
- Package as a python package
- editable imports
- containers
- documentation
- README, requirements, spec, etc.
- collaborating (GitHub, Issues, PR, etc.)
- CI/CD, DevOps
- [[scope]] and software estimation
# Best Practices in Code Development for Data Science
*ChatGPT list*
## 1. Version Control & Collaboration
- **Git basics**
- Branching strategy (feature branches, `main`/`develop`)
- Commit messages (clear, imperative style, e.g., *"Add parser for minerU structured output"*)
- Pull requests & code review
- **GitHub workflow**
- Remote repositories (push/pull)
- Using issues and project boards
- Tags/releases for versioned milestones
## 2. Software Engineering Practices
- **Test-Driven Development (TDD)**
- Write tests before implementation
- Unit tests for small methods
- Integration tests for full parser → ChromaDB flow
- **Object-Oriented Design**
- Classes for Parser, Chunk, Storage, etc.
- Encapsulation of functionality
- Interfaces/abstract base classes for extensibility
- **Type Hints & Docstrings**
- PEP 484 type hints, with PEP 585 built-in generics (`list[str]`, `Optional[dict[str, Any]]`)
- PEP 257 docstring conventions
- Tools: `mypy`, `pydoc`, IDE integration
- **Logging**
- `logging` module instead of `print`
- Log levels: DEBUG, INFO, WARNING, ERROR
- Log to file vs console for debugging and production
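
Several of the practices in this section can be combined in one small sketch: an abstract base class as the interface, type hints and docstrings on the public method, and the `logging` module in place of `print`. `ParagraphParser` and `configure_logging` are hypothetical names, and the blank-line splitting rule is an assumption made only to keep the example short.

```python
import logging
from abc import ABC, abstractmethod
from typing import Any, Optional

logger = logging.getLogger(__name__)


class Parser(ABC):
    """Abstract interface that every concrete parser implements."""

    @abstractmethod
    def parse(self, raw: str, metadata: Optional[dict[str, Any]] = None) -> list[str]:
        """Split raw text into chunks.

        Args:
            raw: The unparsed input text.
            metadata: Optional extra information about the source.

        Returns:
            A list of non-empty text chunks.
        """


class ParagraphParser(Parser):
    """Hypothetical parser that splits on blank lines."""

    def parse(self, raw: str, metadata: Optional[dict[str, Any]] = None) -> list[str]:
        chunks = [p.strip() for p in raw.split("\n\n") if p.strip()]
        logger.debug("Parsed %d chunks (metadata=%s)", len(chunks), metadata)
        if not chunks:
            logger.warning("Input produced no chunks")
        return chunks


def configure_logging(log_file: Optional[str] = None, level: int = logging.INFO) -> None:
    """Log to the console, and also to a file when a path is given."""
    handlers: list[logging.Handler] = [logging.StreamHandler()]
    if log_file is not None:
        handlers.append(logging.FileHandler(log_file))
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        handlers=handlers,
        force=True,  # replace any handlers configured earlier
    )
```

During debugging one might call `configure_logging(level=logging.DEBUG)`; in production, passing a `log_file` path keeps a persistent record alongside the console output.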
## 3. Project Structure
- **Python package layout**
- `src/` or flat layout (`src/parser/`, `src/storage/`)
- `tests/` for unit and integration tests
- **Importable library design**
- `__init__.py` for clean namespace
- Editable install with `pip install -e .` (or `uv pip install -e .`)
- **Configuration management**
- Config files (`config.yaml`, `.env`)
- Avoid hard-coding paths or API keys
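
One way to avoid hard-coded paths and keys is to read them from the environment into a typed settings object. This is a minimal sketch: the variable names `APP_API_KEY`, `APP_DATA_DIR`, and `APP_CHUNK_SIZE` and the defaults are assumptions, and loading a `.env` file into the environment is left to whatever tool the project uses.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class Settings:
    """Immutable runtime configuration (field names are illustrative)."""

    data_dir: Path
    api_key: str
    chunk_size: int = 512


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables instead of literals in code."""
    env = os.environ if env is None else env
    api_key = env.get("APP_API_KEY")
    if not api_key:
        # Fail early and loudly rather than running with a missing secret.
        raise RuntimeError("APP_API_KEY is not set")
    return Settings(
        data_dir=Path(env.get("APP_DATA_DIR", "data")),
        api_key=api_key,
        chunk_size=int(env.get("APP_CHUNK_SIZE", "512")),
    )
```

Accepting an optional mapping makes the function easy to test without touching the real environment.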
## 4. Testing & Quality Assurance
- **Testing frameworks**
- `pytest` (fixtures, parametrization, mocks)
- Coverage tools (`pytest-cov`)
- **Continuous Integration (CI)**
- GitHub Actions for automated testing
- Linting and formatting checks in CI
- **Code Quality**
- Linters: `flake8`, `ruff`
- Formatters: `black`
- Static type checks: `mypy`
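
A rough sketch of what a small `pytest`-style test module can look like. `chunk_text` is a hypothetical function under test; in a real project it would live in the package and the tests in `tests/`. The tests are plain functions with bare `assert` statements, which `pytest` collects by their `test_` prefix; fixtures and parametrization could shorten them but are left out to keep the example dependency-free.

```python
def chunk_text(text: str, size: int) -> list[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


def test_chunks_have_expected_boundaries():
    assert chunk_text("abcdef", 4) == ["abcd", "ef"]


def test_empty_text_yields_no_chunks():
    assert chunk_text("", 4) == []


def test_chunks_reassemble_to_original():
    text = "the quick brown fox"
    assert "".join(chunk_text(text, 5)) == text
```

In a test-driven workflow the three `test_` functions would be written first and fail until `chunk_text` is implemented.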
## 5. Documentation & Communication
- **Code documentation**
- Inline comments for tricky logic
- Docstrings for functions/classes
- **External documentation**
- README with installation & usage
- CONTRIBUTING.md if collaborative
- **API Documentation**
- Tools like `Sphinx` or `mkdocs` if project grows
## 6. Development Environment
- **Environment management**
- `uv`/`mamba`/`venv` for reproducible environments
- `pyproject.toml` for dependencies
- **Editable installs**
- `pip install -e .` for live development
- **Pre-commit hooks**
- Auto-run `black`, `ruff`, and tests on commit
## 7. Data Science–Specific Practices
- **Experiment tracking**
- Save parser configurations (chunk size, filters) in YAML/JSON
- Log metrics for chunking quality
- **Reproducibility**
- Deterministic runs (random seeds)
- Versioning of datasets and embeddings
- **Integration with ChromaDB**
- Schema for metadata
- Tests for persistence and retrieval
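
The reproducibility points can be sketched with the standard library alone: a seeded random generator for deterministic runs, a content hash as a lightweight dataset version, and a JSON record of the configuration. The function names and the shape of the record are assumptions, and the random sample stands in for real work.

```python
import hashlib
import json
import random


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest that identifies one exact dataset version."""
    return hashlib.sha256(data).hexdigest()


def run_experiment(config: dict, dataset: bytes) -> dict:
    """Run a seeded, deterministic stand-in experiment and describe it."""
    rng = random.Random(config["seed"])  # local generator, no global state
    sample = [rng.randint(0, 99) for _ in range(5)]  # placeholder for real work
    return {
        "config": config,
        "dataset_sha256": fingerprint(dataset),
        "sample": sample,
    }


def save_record(record: dict, path: str) -> None:
    """Persist the record as JSON so the run can be repeated later."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(record, fh, indent=2, sort_keys=True)
```

Two calls with the same configuration and dataset produce identical records, which is the property a reproducibility test would check.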