## Dense X Retrieval: What Retrieval Granularity Should We Use?
> [!Abstract]-
> Dense retrieval has become a prominent method to obtain relevant context or world knowledge in open-domain NLP tasks. When we use a learned dense retriever on a retrieval corpus at inference time, an often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. document, passage, or sentence. We discover that the retrieval unit choice significantly impacts the performance of both retrieval and downstream tasks. Distinct from the typical approach of using passages or sentences, we introduce a novel retrieval unit, proposition, for dense retrieval. Propositions are defined as atomic expressions within text, each encapsulating a distinct factoid and presented in a concise, self-contained natural language format. We conduct an empirical comparison of different retrieval granularity. Our experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks. Moreover, constructing prompts with fine-grained retrieved units for retrieval-augmented language models improves the performance of downstream QA tasks given a specific computation budget.
> [!Cite]-
> Chen, Tong, Hongwei Wang, Sihao Chen, et al. “Dense X Retrieval: What Retrieval Granularity Should We Use?” arXiv:2312.06648. Preprint, arXiv, October 4, 2024. [https://doi.org/10.48550/arXiv.2312.06648](https://doi.org/10.48550/arXiv.2312.06648).
>
> [link](http://arxiv.org/abs/2312.06648) [online](http://zotero.org/users/17587716/items/C6A2X87U) [local](zotero://select/library/items/C6A2X87U) [pdf](file://C:\Users\erikt\Zotero\storage\VBBEX4X9\Chen%20et%20al.%20-%202024%20-%20Dense%20X%20Retrieval%20What%20Retrieval%20Granularity%20Should%20We%20Use.pdf)
## Notes
%% begin notes %%
Proposes decomposing text into propositions (self-contained, contextualized statements of fact) for indexing and retrieval in RAG and Q&A systems. Shows improvements over sentence- and passage-level chunking, largely driven by performance on questions about low-frequency entities in the corpus. They trained a model called the Propositionizer, available on HuggingFace [here](https://huggingface.co/chentong00/propositionizer-wiki-flan-t5-large).
%% end notes %%
%% begin annotations %%
### Imported: 2025-08-08 1:23 pm
Dense retrievers are a popular class of techniques for accessing external information sources for open domain NLP tasks (Karpukhin et al., 2020).
Before we use a learned dense retriever to retrieve from a corpus, an imperative design decision we have to make is the retrieval unit – i.e. the granularity at which we segment and index the retrieval corpus for inference.
In this paper, we investigate an overlooked research question with dense retrieval inference – at what retrieval granularity should we segment and index the retrieval corpus?
Based on our empirical experiments, we discover that selecting the proper retrieval granularity at inference time can be a simple yet effective strategy for improving dense retrievers’ retrieval and downstream QA performance.
To address these shortcomings of typical retrieval units such as passages or sentences, we propose using proposition as a novel retrieval unit for dense retrieval. Propositions are defined as atomic expressions within text, where each encapsulates a distinct factoid and is presented in a concise, self-contained natural language format.
We conduct experiments on five different open-domain QA datasets and empirically compare the performance of four dual-encoder retrievers when Wikipedia is indexed by passages, sentences, and our proposed propositions. Notably, our findings indicate that proposition-based retrieval outperforms sentence and passage-based retrieval, especially in terms of generalization.
Furthermore, we observe a distinct advantage of proposition-based retrieval in downstream QA performance when using retrieval-augmented language models.
Here, propositions represent atomic expressions of meanings in text (Min et al., 2023) with three defining principles:
1. Each proposition should correspond to a distinct piece of meaning in text, where the composition of all propositions would represent the semantics of the entire text.
2. A proposition should be minimal, i.e. it cannot be further split into separate propositions.
3. A proposition should be contextualized and self-contained (Choi et al., 2021). A proposition should include all the necessary context from the text (e.g. coreference) to interpret its meaning.
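As an illustration of these three principles, the paper's running example (its Figure 1, reproduced here approximately from memory, so wording may differ slightly) decomposes a passage about the Leaning Tower of Pisa, with coreferences like "the tower" resolved so each proposition stands alone:

```text
Passage:
  Prior to restoration work performed between 1990 and 2001, the tower
  leaned at an angle of 5.5 degrees, but the tower now leans at about
  3.99 degrees. This means the top of the tower is displaced horizontally
  3.9 meters (12 ft 10 in) from the center.

Propositions:
  1. Prior to restoration work performed between 1990 and 2001, the
     Leaning Tower of Pisa leaned at an angle of 5.5 degrees.
  2. The Leaning Tower of Pisa now leans at about 3.99 degrees.
  3. The top of the Leaning Tower of Pisa is displaced horizontally
     3.9 meters (12 ft 10 in) from the center.
```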
The use of proposition as a retrieval unit is inspired by a recent line of work (Min et al., 2023; Kamoi et al., 2023; Chen et al., 2023a,b), which finds success in representing and evaluating text semantics at the level of propositions.
To segment the Wikipedia pages into propositions, we finetune a text generation model, which we refer to as the Propositionizer. The Propositionizer takes a passage as input and generates the list of propositions within the passage.
Following Chen et al. (2023b), we train the Propositionizer with a two-step distillation process. We first prompt GPT-4 (OpenAI, 2023) with an instruction containing the proposition definition and 1-shot demonstration. We include the details of the prompt in Figure 8. We start with a set of 42k passages and use GPT-4 to generate the seed set of paragraph-to-proposition pairs. Next, we use the seed set to finetune a Flan-T5-large model (Chung et al., 2022). We refer to the processed corpus as FACTOIDWIKI.
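A minimal inference sketch for the released checkpoint; the `Title/Section/Content` input format and the JSON-list output follow my reading of the HuggingFace model card for `chentong00/propositionizer-wiki-flan-t5-large`, so treat both as assumptions:

```python
import json
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "chentong00/propositionizer-wiki-flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.eval()

# The model card formats each passage with its page title and section.
title = "Leaning Tower of Pisa"
section = ""
content = ("Prior to restoration work performed between 1990 and 2001, "
           "the tower leaned at an angle of 5.5 degrees.")
input_text = f"Title: {title}. Section: {section}. Content: {content}"

input_ids = tokenizer(input_text, return_tensors="pt").input_ids
with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=512)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

# The model emits a JSON list of proposition strings.
try:
    propositions = json.loads(output_text)
except json.JSONDecodeError:
    propositions = [output_text]  # fall back to raw text if parsing fails
print(propositions)
```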
We estimate the frequency of error cases where (1) a proposition is not fully supported by the passage, (2) a proposition can be further split into separate propositions, and (3) propositions are not self-contained.
Surprisingly, despite all of the dense retrieval models being trained on only passage-level documents, all the models demonstrate on-par or superior performance when the corpus is indexed at the proposition level. Our results suggest that indexing the corpus at the finer-grained units improves the cross-task generalization on passage retrieval.
Our results show the advantage of retrieval on proposition-level index in cross-task generalization settings.
Across all four dense retrievers, we observe that retrieving by proposition shows a much larger advantage over retrieving by passages with questions targeting less common entities. As the frequency of entities increases, the performance gap decreases. Our findings indicate that the performance gain from retrieval by proposition can mostly be attributed to queries for long-tailed information.
Intuitively, compared to sentences or passages as retrieval units, the advantage of propositions is that the retrieved propositions have a higher density of information relevant to the query. With finer-grained retrieval units, the correct answer to the query would more likely appear in the top-l retrieved words by a dense retriever.
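This is the intuition behind evaluating answer recall within the first l retrieved words rather than per retrieved unit. A minimal sketch of that style of check; the helper name and the simple substring match are my own simplifications, not the paper's exact metric:

```python
def answer_in_top_l_words(retrieved_units: list[str],
                          answers: list[str],
                          l: int = 100) -> bool:
    """Return True if any gold answer string appears within the first
    l words of the concatenated retrieved units (simplified matching)."""
    window = " ".join(" ".join(retrieved_units).split()[:l]).lower()
    return any(ans.lower() in window for ans in answers)

# Finer-grained units pack the answer into fewer retrieved words.
props = ["The Leaning Tower of Pisa now leans at about 3.99 degrees."]
print(answer_in_top_l_words(props, ["3.99 degrees"], l=20))  # True
```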
Recent works on dense retrievers typically adopt a dual-encoder architecture (Yih et al., 2011; Reimers and Gurevych, 2019; Karpukhin et al., 2020; Ni et al., 2022). With dual-encoders, each query and document is encoded into a low-dimensional feature vector respectively, and their relevance is measured by a non-parametric similarity function between the embedding vectors (Mussmann and Ermon, 2016). Due to the limited expressivity from the similarity function, dual encoder models often generalize poorly to new tasks with scarce training data (Thakur et al., 2021). Previous studies use techniques such as data augmentation (Wang et al., 2022; Yu et al., 2023a; Izacard et al., 2022; Gao and Callan, 2022; Lin et al., 2023; Dai et al., 2023), continual pre-training (Chang et al., 2020; Sachan et al., 2021; Oguz et al., 2022), task-aware training (Xin et al., 2022; Cheng et al., 2023), hybrid sparse-dense retrieval (Luan et al., 2021; Chen et al., 2022), or mixed strategy retrieval (Ma et al., 2022, 2023) and so on to improve cross-task generalization performance of dense retrievers.
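A toy sketch of the dual-encoder scoring described here: queries and units are embedded independently offline, and relevance is a non-parametric similarity (inner product) between the vectors. The embeddings below are random stand-ins, not output from any of the paper's retrievers:

```python
import numpy as np

def retrieve_top_k(query_vec: np.ndarray,
                   unit_vecs: np.ndarray,
                   k: int = 5) -> np.ndarray:
    """Rank pre-computed unit embeddings by inner-product similarity."""
    scores = unit_vecs @ query_vec   # non-parametric similarity function
    return np.argsort(-scores)[:k]   # indices of the top-k retrieval units

# Stand-in embeddings (in practice, produced by a trained dual encoder).
rng = np.random.default_rng(0)
query_vec = rng.normal(size=768)
unit_vecs = rng.normal(size=(10_000, 768))
print(retrieve_top_k(query_vec, unit_vecs, k=3))
```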
The motivation of our work echoes in part with multi-vector retrieval, e.g. ColBERT (Khattab and Zaharia, 2020), DensePhrase (Lee et al., 2021a,b), ME-BERT (Luan et al., 2021), and MVR (Zhang et al., 2022), where the retrieval model learns to encode a candidate retrieval unit into multiple vectors to increase model expressivity and improve retrieval granularity (Seo et al., 2019; Humeau et al., 2019). Our work instead focuses on the setting where we do not update the dense retriever model or its parameters. We show that indexing the retrieval corpus by different granularity can be a simple and orthogonal strategy for improving the generalization of dense retrievers at inference time.
The English Wikipedia dump used in this study, released by Bohnet et al., 2022, was selected because it has been filtered to remove figures, tables, and lists, and is organized into paragraphs.
We have segmented Wikipedia into three retrieval units for this study: 100-word passage chunks, sentences, and propositions. Paragraphs are divided into 100-word passage chunks using a greedy method. We divide only at the end of sentences to ensure each passage chunk contains complete sentences.
Each passage is further segmented into sentences using the widely used Python spaCy en_core_web_lg model.
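A minimal sketch covering both segmentation steps described in the two notes above, assuming spaCy and its `en_core_web_lg` model are installed; the greedy packing loop is my reading of "divide only at the end of sentences":

```python
import spacy

nlp = spacy.load("en_core_web_lg")  # the sentence segmenter named above

def passage_chunks(paragraph: str, max_words: int = 100) -> list[str]:
    """Greedily pack complete sentences into ~100-word passage chunks,
    splitting only at sentence boundaries."""
    sentences = [s.text.strip() for s in nlp(paragraph).sents]
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))  # close the current chunk
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```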
Decomposing the entire Wikipedia corpus requires approximately 500 GPU hours on NVIDIA P100 GPUs using the default implementation in the transformers package.
We generated a list of propositions from a given paragraph using GPT-4 with a prompt, as shown in Figure 8. After filtering, 42,857 pairs were used to fine-tune a Flan-T5-Large model. We named the model Propositionizer. The AdamW optimizer was used with a batch size of 64, learning rate of 1e-4, weight decay of 1e-4, and 3 epochs.
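A sketch of that fine-tuning configuration using Hugging Face's `Seq2SeqTrainingArguments`, plugging in the hyperparameters quoted above; the output directory, dataset wiring, and the per-device vs. global batch-size reading are placeholders/assumptions:

```python
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

args = Seq2SeqTrainingArguments(
    output_dir="propositionizer",     # placeholder path
    per_device_train_batch_size=64,   # "batch size of 64" (assumed per device)
    learning_rate=1e-4,               # learning rate 1e-4
    weight_decay=1e-4,                # weight decay 1e-4
    num_train_epochs=3,               # 3 epochs
    optim="adamw_torch",              # AdamW optimizer
)

# trainer = Seq2SeqTrainer(model=model, args=args,
#                          train_dataset=seed_pairs,  # the 42,857 GPT-4 pairs
#                          tokenizer=tokenizer)
# trainer.train()
```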
%% end annotations %%
%% Import Date: 2025-08-08T13:23:23.660-06:00 %%