corpus

A corpus (plural: corpora) is a collection of texts used in [[natural language processing|NLP]] tasks. The choice of corpus affects the outcome of those tasks: a model trained on Shakespeare is unlikely to do well at parsing Amazon reviews. When selecting a corpus, consider the provenance of the dataset, the types of text it contains, the language used, and the context in which it was generated.

Corpus creators can build a data statement (datasheet) to accompany a corpus. It specifies:

- metadata
- situation
- language variety
- collection process
- annotation process
- distribution restrictions
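
The fields listed above can be kept as a small structured record alongside the corpus. The following is a minimal sketch, not a standard format: the class name `DataStatement`, the helper `missing_fields`, and all example values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class DataStatement:
    """Hypothetical record with one free-text entry per data statement field."""
    metadata: str = ""
    situation: str = ""
    language_variety: str = ""
    collection_process: str = ""
    annotation_process: str = ""
    distribution_restrictions: str = ""

    def missing_fields(self) -> List[str]:
        """Return the names of fields that are still empty, in definition order."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Example values are invented for illustration only.
statement = DataStatement(
    metadata="Hypothetical product-review corpus, draft version",
    language_variety="Informal written English",
    collection_process="Sampled from publicly posted reviews",
)

# Report which parts of the data statement still need to be written.
print(statement.missing_fields())
```

Running the check before publishing a corpus makes it easy to see which parts of the documentation are still blank.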