Test-driven development (TDD) is a software development paradigm in which tests are written before code.
> [!NOTE] Why use TDD?
> TDD adds significant overhead to software development (in the short term). However, TDD can be worth the investment by preventing expensive bugs down the line. A bug found late in development can delay delivery, increase costs, and frustrate stakeholders.
The TDD process is as follows:
1. [[Scope]] the work down to a tractable problem (e.g., one class or method)
2. Review the inputs and outputs (e.g., a custom parser in LlamaIndex will take as input a `Document` and return a series of `Nodes`). Familiarize yourself with these internals so as not to create integration problems later on.
3. Write the [[requirements]]. These are brief statements of what the code is expected to do, how to handle edge cases, etc. These become [[acceptance criteria]] (almost like [[user stories]]). Identify any invariants early. If you discover a requirement later on, that's ok. Just add it to your list, create a test stub, and make sure to work it in.
4. Sketch the design in [[UML]] or pseudocode. I recommend using an LLM to create the design based on your requirements in a language like [[PlantUML]] so the diagram can become part of the project [[documentation]]. Follow the [[diagrams and documentation as code]] philosophy.
5. Write all of the unit tests as stubs (see the sketch after this list). Tests live in the `tests/` directory, in a file named after the file containing the code under test with a `test_` prefix. Don't worry about writing implementations or even fixtures yet. Tests should check *behavior*, not necessarily implementation details.
6. Pick one test and follow [[red-green-refactor]] workflow. Create a simple fixture if necessary inline with the test to capture all core and edge cases.
7. Continue with all other tests, one at a time, until the code is complete. You might want to focus on bare-bones implementations at this stage rather than incorporate external tools. Add `TODO`s wherever you want to expand the functionality later. [[Complexity is earned]].
8. Add or improve **logging** and **debugging**.
9. Lock in functionality with [[regression testing]]. Add larger fixtures for integration testing and/or golden output tests with real-world examples. Make sure to normalize your golden outputs (e.g., remove stochastic elements like UUIDs).
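As a sketch of steps 5 and 6, the stubs below assume a hypothetical `slugify` function in a hypothetical `mypackage` module; the first test has been taken through a red-green cycle while the remaining tests stay as stubs.
```python
# tests/test_slugify.py -- illustrative only; names are hypothetical
import pytest

from mypackage.text import slugify  # assumed module under test


def test_lowercases_and_replaces_spaces():
    # First test driven through red-green: minimal behavior check
    assert slugify("Hello World") == "hello-world"


def test_strips_punctuation():
    # TODO: implement after the first test passes (red-green-refactor)
    pytest.skip("stub: not yet implemented")


def test_empty_string_returns_empty_slug():
    # TODO: edge case captured as a stub so the requirement is not lost
    pytest.skip("stub: not yet implemented")
```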
Tests should be integrated into a [[continuous integration]] system (like [[GitHub Actions]]).
If set up properly, tests written for local development will be incorporated into the test suite automatically once merged to the `main` or `dev` branch.
## code coverage
Full code coverage means tests exercise all possible states of the code. Components of coverage include the following (a sketch follows the list):
- **statement coverage**: all statements have been executed by a test
- **branch (edge) coverage**: each edge in the program's control flow graph has been executed
- **condition coverage**: each boolean sub-expression (e.g., in compound `if` statements and guard clauses) has evaluated to both true and false
- **path coverage**: all possible paths through the program's control flow graph have been executed
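As an illustration, consider the hypothetical function below: the first test takes and skips the `if` branch, giving full branch coverage, while condition coverage additionally requires each sub-condition (`is_admin`, `is_owner`, `locked`) to evaluate both true and false, which the second test adds.
```python
def can_delete(is_admin: bool, is_owner: bool, locked: bool) -> bool:
    # One branch guarded by three conditions
    if (is_admin or is_owner) and not locked:
        return True
    return False


def test_branch_coverage():
    assert can_delete(True, False, False) is True    # if-branch taken
    assert can_delete(False, False, False) is False  # if-branch not taken


def test_condition_coverage_adds_more_cases():
    assert can_delete(False, True, False) is True    # is_owner drives the outcome
    assert can_delete(True, False, True) is False    # locked blocks deletion
```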
## test double
A test double is an object that stands in ("doubles") for a real object in code for the purpose of testing. Common kinds include the following (sketched after the list):
- **stub**: replaces the real component and returns canned answers
- **spy**: a stub or mock that also records how it was called, typically implementing only the two or three methods of interest
- **mock**: replaces the real component with pre-programmed expectations about how it will be called
- **fake**: replaces the real component with a simplified working shortcut (e.g., an in-memory database)
- **dummy**: a placeholder object that is passed around but never actually used
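A minimal sketch of a stub and a spy for a hypothetical email-sending dependency:
```python
class StubEmailSender:
    """Stub: returns a canned answer instead of sending anything."""

    def send(self, to: str, body: str) -> bool:
        return True  # canned success


class SpyEmailSender(StubEmailSender):
    """Spy: a stub that also records how it was called."""

    def __init__(self):
        self.sent = []

    def send(self, to: str, body: str) -> bool:
        self.sent.append((to, body))
        return True


def test_spy_records_calls():
    spy = SpyEmailSender()
    spy.send("user@example.com", "hello")  # in real code, the system under test calls this
    assert spy.sent == [("user@example.com", "hello")]
```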
## test fixture
A test fixture is a setup of preconditions or initial conditions for a software application that is needed to run a test. It provides the context in which the tests are executed and can include things like creating objects, setting up data, or initializing systems. A good test fixture should be able to be easily reset to its original state between tests, allowing multiple tests to use the same fixture without interference.
In data science, a common test fixture is a formatted dataset that provides a consistent and reproducible example for testing data processing algorithms. This dataset can be used to test various aspects of a model or algorithm, such as its ability to handle missing values, outliers, or edge cases.
For unit tests, a good test fixture is a minimal example that contains all of the edge cases and core cases. Provided the fixture is small, it can be stored directly in the code with the test. Small inline fixtures make it easy to include all edge cases and simplify debugging. However, a change that affects the fixture format may require updating many small fixtures scattered throughout the test suite.
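For example, a small inline fixture for a hypothetical record-chunking test might look like the following sketch using [[pytest]] (the `chunk_records` function is assumed):
```python
import pytest


@pytest.fixture
def sample_records():
    # Minimal inline fixture: one core case plus edge cases (empty and very long text)
    return [
        {"id": 1, "text": "A normal sentence."},
        {"id": 2, "text": ""},             # edge case: empty text
        {"id": 3, "text": "word " * 500},  # edge case: very long text
    ]


def test_chunker_skips_empty_text(sample_records):
    chunks = chunk_records(sample_records)
    assert all(chunk["text"] for chunk in chunks)
```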
In contrast to unit tests, which often rely on minimal and self-contained fixtures, integration tests benefit from more realistic test fixtures that mimic real-world scenarios.
After the code is fully working, a **golden output** can be saved to illustrate exactly how the output should look for the realistic test fixture. An "assert equals" test can be used for regression testing.
## golden output
Golden outputs serve as a reference point for future tests, ensuring that any changes to the code or test fixture don't break the expected behavior. For integration and regression testing, create a golden output with [[pytest]] as shown below.
In your `tests/` folder, create or edit `conftest.py`.
```python
import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--update-golden",
        action="store_true",
        default=False,
        help="Update golden output files",
    )


@pytest.fixture
def update_golden(request):
    return request.config.getoption("--update-golden")
```
Then write a test in your test file to create the golden output (if it doesn't exist) or compare to the golden output. Importantly, normalize any non-deterministic behavior to ensure the tests pass.
This example shows a golden output for a custom parser in [[LlamaIndex]].
```python
import json
from pathlib import Path

import pytest

# assumes CustomParser is imported from your project


def test_custom_parser_golden_output(sample_document, update_golden):
    parser = CustomParser(
        include_metadata=True,
        include_prev_next_rel=True,
        chunk_overlap=20,
    )
    nodes = parser.get_nodes_from_documents([sample_document], max_chars=200)

    def normalize(node):
        # Strip or replace nondeterministic fields (e.g., UUIDs) and return a
        # plain, JSON-serializable structure that is stable across runs.
        # Add code here
        pass

    actual = [normalize(n) for n in nodes]
    golden_file = Path(__file__).parent / "fixtures" / "golden_output.json"

    if update_golden or not golden_file.exists():
        with open(golden_file, "w") as f:
            json.dump(actual, f, indent=2)
        pytest.skip("Golden file updated, re-run test without --update-golden")
    else:
        with open(golden_file) as f:
            expected = json.load(f)
        assert actual == expected
```
To create the first output, run the test and use the `--update-golden` flag.
```bash
pytest -k test_custom_parser_golden_output --update-golden
```
In future runs, if the parser logic changes intentionally, regenerate the golden output.
```bash
pytest --update-golden
```
Otherwise, whenever you run the test suite, the output is compared to the golden output.
If the test fails, you can see a diff of the golden output with the `-vv` flag.
```bash
pytest -vv -s -k test_custom_parser_golden_output
```
## logging during tests
Ideally you will build code up iteratively in small cycles and rely on the tests themselves, rather than visualizing output, for debugging. You can run most test suites with the verbose flag `-v` to get detailed information on test failures.
However, sometimes you just need to see what's happening to get the code to work. Two options are adding print statements and logging.
Don’t leave raw debug logging in your tests permanently. That’s what asserts are for.
For longer term use, consider building a **debug harness** that prints outputs.
### print statement
Adding a print statement in the test code is sometimes the easiest way to get a picture of what your code is doing. Run [[pytest]] with the `-s` flag to show these print statements when running tests.
```bash
uv run pytest -s
```
### logging
Logging allows more detailed investigation. Configure logging in the test module with the Python `logging` library:
```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
```
Then run [[pytest]] with live log output to display these logs:
```bash
uv run pytest --log-cli-level=INFO
```
Again, don't forget to remove these logs from the tests before committing.
### debug harness
A **debug harness** is a test function that prints or logs for the purposes of debugging. Mark with `@pytest.mark.skipif` so it never clutters your [[continuous integration]].
For example,
```python
import os
import pytest


@pytest.mark.skipif(
    not os.getenv("RUN_DEBUG", False),
    reason="Set RUN_DEBUG=1 to enable",
)
def test_debug_print_nodes():  # hypothetical harness; print or log freely here
    ...
```
To run the debug harness locally with [[pytest]], use the `-k` flag followed by the function name.
```bash
RUN_DEBUG=1 uv run pytest -s -k <debug_function>
```
## dependency injection
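Dependency injection means passing a component's dependencies in (e.g., through its constructor) rather than constructing them inside the component. This makes code much easier to test, because a test can inject a [[test double]] in place of a slow or external dependency. A minimal sketch with hypothetical names:
```python
class ReportService:
    def __init__(self, db):
        # The database client is injected rather than constructed here,
        # so a test can pass in a fake instead of a real connection.
        self.db = db

    def count_users(self) -> int:
        return len(self.db.fetch_all("users"))


class FakeDB:
    """Fake: an in-memory stand-in for the real database client."""

    def fetch_all(self, table):
        return [{"id": 1}, {"id": 2}]


def test_count_users_with_injected_fake():
    service = ReportService(db=FakeDB())
    assert service.count_users() == 2
```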
## verification
Verification is the process of ensuring your code meets engineering requirements. Unit tests and integration tests fall under verification.
**Formal verification** entails constructing a proof that the program is correct. This is very rare but necessary for highly sensitive applications.
## validation
Validation is the process of ensuring your code meets application requirements.
Acceptance tests and usability tests fall under validation.
## unit test
A unit test covers a low-level piece of the system (e.g., a single function or class), focusing on its expected behavior.
Unit tests follow the arrange, act, assert paradigm (a sketch follows the list).
- **Arrange**: get things ready to test by instantiating an object or generating data
- **Act**: perform the single action under test
- **Assert**: use an appropriate check (with `assert`) to ensure the code worked as expected
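A minimal arrange-act-assert sketch (the `ShoppingCart` class is hypothetical):
```python
def test_shopping_cart_total():
    # Arrange: instantiate the object and set up data
    cart = ShoppingCart()
    cart.add_item("apple", price=1.50, quantity=2)

    # Act: perform the single action under test
    total = cart.total()

    # Assert: check the expected behavior
    assert total == 3.00
```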
## integration test
Integration tests check that the modules work together in combination.
## acceptance test
Acceptance tests are tests performed by the user (or product owner) to check that the delivered system meets their needs. Often these are written first to inform integration tests and unit tests. This is called **outside-in testing**.
## regression test
Regression testing refers to tests that are included to ensure the code base does not "regress" and trip on a previous bug.
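For example, a regression test is often added alongside a bug fix and named after the bug it guards against. A minimal sketch (the `parse_blocks` helper is hypothetical):
```python
def test_parse_blocks_skips_empty_text_regression():
    # Regression guard: an empty text block used to raise an exception;
    # this test pins the fixed behavior so the bug cannot silently return.
    nodes = parse_blocks([{"type": "text", "content": ""}])
    assert nodes == []
```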
> [!Tip]- Additional Resources
> - [Test Driven Development by Kent Beck](https://www.amazon.com/Test-Driven-Development-Addison-Wesley-Signature-ebook/dp/B0CW1JBTHM/)
# TDD Lessons Learned for `CustomParser`
This document captures what we learned while developing the `CustomParser` for MinerU structured output → LlamaIndex nodes, using **test-driven development (TDD)**.
---
## ✅ What We Did
1. **Design First (UML sketch)**
- Started with a basic UML diagram of `CustomParser` inheriting from `NodeParser`.
- Clarified responsibilities: parse text, tables, images into nodes.
2. **Test Case Enumeration**
- Identified all element types and edge cases:
- Concatenate short text.
- Split long text.
- Skip empty text.
- Tables/images atomic.
- Metadata propagation (pages, captions, headers).
- Prev/Next relationships.
- No chunking across headers.
3. **TDD Cycle (Red → Green → Refactor)**
- Wrote failing tests (red).
- Implemented the minimal logic to pass (green).
   - Cleaned up and refactored the code (e.g., buffer-flushing logic) (refactor).
4. **Progressive Refinement**
- Started with a simple regex sentence splitter for control and visibility.
- Left a TODO to swap in LlamaIndex’s `SentenceSplitter`/`SemanticSplitter`.
- Kept tables/images atomic from the beginning.
5. **Specification Correction Mid-Stream**
- Discovered a missed requirement: **text should not be chunked across headers**.
- Added a failing test, fixed the code, and locked in the invariant.
6. **Regression Safety**
- Added a **golden-output test**, normalized to remove nondeterminism (IDs, tuples, enums).
- Ensures future refactors won’t silently change behavior.
---
## 🔹 Lessons Learned
- **Capture invariants early.**
Example: “Text must never cross headers.” If we had written that down up front, we’d have avoided one missed requirement.
- **Golden tests are essential.**
Unit tests are great for local invariants, but only golden regression tests protect the *overall shape* of output against unintended changes.
- **Normalization matters.**
Nondeterministic fields (UUIDs, enums, tuples vs. lists) must be normalized to make golden tests stable.
- **Iterative design is okay.**
Starting with a simple implementation (regex splitter) was the right call. It kept complexity down until we were ready for LlamaIndex’s advanced splitters.
- **Logging/debug harness helps understanding.**
Adding selective logging/tests that print intermediate nodes sped up debugging and built confidence in correctness.
---
## 🔹 Suggested Structured Approach Next Time
1. **Requirements Capture**
- Write down all invariants and acceptance criteria before coding.
2. **Design Sketch**
- UML or pseudocode to clarify responsibilities and boundaries.
3. **Test Plan First**
- Write test stubs (with TODOs) for every case.
- Don’t write implementation yet.
4. **TDD Loop**
- Implement tests one by one: red → green → refactor.
5. **Refactor for Quality**
- Swap naive implementations for library-backed ones.
- Add chunk overlaps, clean metadata, improve performance.
6. **Lock with Regression Safety**
- Golden tests and integration tests with real MinerU outputs.
- Run in CI to prevent regressions.
---
## 🎯 Takeaway
We successfully followed a **structured TDD process**:
- Planned with UML.
- Enumerated tests before code.
- Iterated in red-green-refactor cycles.
- Corrected specs quickly when a requirement was missed.
- Added golden regression safety.
The process worked well. The only improvement would be to **write down invariants earlier** to avoid mid-stream corrections.