A database organizes data through entities, attributes and relationships.
- **Entities** are the abstract representations of the objects of interest (analogous to a [[Class]] in programming).
- **Attributes** are the characteristics of interest for which data are stored.
- **Relationships** are associations between two or more entities. These are named with a verb by convention.
These three are often described using an [[entity-relationship diagram]].
Each individual record is referred to as an **instance**. Each instance will have an **identifier**, which is a special attribute used to identify a specific instance.
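These ideas map directly onto tables. A minimal sketch using Python's built-in `sqlite3` module, with a hypothetical `Customer` entity (the table name, attributes, and values are illustrative, not from any real schema):

```python
import sqlite3

# A "Customer" entity: each column is an attribute, customer_id is the
# identifier, and each row is one instance.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Customer (
        customer_id INTEGER PRIMARY KEY,  -- identifier
        name        TEXT,                 -- attribute
        email       TEXT                  -- attribute
    )
""")
conn.execute("INSERT INTO Customer VALUES (1, 'Ada', 'ada@example.com')")  # one instance
row = conn.execute("SELECT name FROM Customer WHERE customer_id = 1").fetchone()
print(row[0])  # -> Ada
conn.close()
```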
Databases include user data, metadata, application metadata, and indices.
- **User data** refers to the data users work with, often stored in tables.
- **Metadata** are data about data. These describe the data including table name, column name, data type, primary keys, etc. Metadata are stored in system tables accessible only to administrators.
- **Indices** store representations of data to facilitate storing, searching and sorting.
- **Application metadata** include forms, reports and queries.
## buffer cache
A [[database management system]] stages data in a buffer cache: data read from disk is loaded into the cache first, and changes are written back to the [[database]] from the cache.
## transactions
Operations on the database are grouped into transactions. This helps ensure [[ACID]] compliance. A transaction is also known as a logical unit of work (LUW). Either all the actions in a transaction are committed or none are committed.
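The all-or-nothing property can be sketched with Python's `sqlite3` module. This is a hypothetical transfer between two accounts (the table and amounts are illustrative): both updates commit together, or neither does.

```python
import sqlite3

# One logical unit of work: move 50 from account 1 to account 2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 100), (2, 0)")
conn.commit()

try:
    conn.execute("UPDATE account SET balance = balance - 50 WHERE id = 1")
    conn.execute("UPDATE account SET balance = balance + 50 WHERE id = 2")
    conn.commit()    # both changes become permanent together
except sqlite3.Error:
    conn.rollback()  # on failure, neither change is applied

balances = [r[0] for r in conn.execute("SELECT balance FROM account ORDER BY id")]
print(balances)  # -> [50, 50]
```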
Transactions are **interleaved**, meaning the system rapidly switches the CPU among tasks so that a portion of each task is carried out in a given interval, while the transactions appear to run simultaneously. In other words, multiple transactions can execute at the same time without compromising data integrity. This is called **concurrency**.
Concurrency entails challenges, including:
- **Lost update**: one user's changes overwrite or interfere with another's.
- **Inconsistent reads**: two users can read the same data and get different results. Reading another transaction's uncommitted changes (e.g. from the buffer cache) is known as a **dirty read**.
- **Deadlocks**: also known as a "deadly embrace"; occurs when two transactions each hold a lock on a resource the other needs.
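A lost update can be illustrated without any database at all. In this hypothetical sketch, two "sessions" each read the same balance, compute a new value, and write it back; the second write silently discards the first session's deposit:

```python
# Two sessions both start from the same read, so the last writer wins.
balance = {"acct": 100}

read_a = balance["acct"]  # session A reads 100
read_b = balance["acct"]  # session B reads 100

balance["acct"] = read_a + 10  # A writes 110
balance["acct"] = read_b + 20  # B writes 120; A's +10 is lost

print(balance["acct"])  # -> 120, not the expected 130
```

Locking (below) prevents this by forcing session B to wait until A's change is committed before reading.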
### serializable transactions
Two transactions are **serializable** if they execute in parallel but the result is the same as if they had run one after the other, leaving the database in a consistent state. Serializable transactions are fully isolated from one another.
When a second transaction requires a resource locked by another transaction, the second transaction will wait on the first transaction to complete.
If the two transactions are in a deadlock, the DBMS will abort one of the transactions, requiring it to be resubmitted.
### locking
The DBMS will lock any resources required by a transaction to preserve data integrity. Resources can be locked at multiple levels of granularity, from row-level locks to database-level locks. Row-level locks are most common but do involve more overhead.
### transaction logs
Operations on the database are recorded in the transaction log. The transaction logs can be used to roll back a transaction, recover a database, or back up (replicate) a database.
Log records are written before the operations they describe are applied. Periodically the log files are written ("flushed") to disk: on a schedule, after a certain amount of data has accumulated, or manually (e.g. with MySQL's `FLUSH LOGS` command).
For each operation, the log file includes the transaction ID, a sequence number, the previous sequence number, the object, the type of operation, an image of the row before the operation, an image of the row after the operation, and other metadata.
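The fields listed above can be sketched as a record structure. This is a hypothetical illustration only; real DBMS log formats are binary and vendor-specific, and the field names here are invented:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LogRecord:
    txn_id: int                   # transaction ID
    seq_no: int                   # sequence number
    prev_seq_no: Optional[int]    # links to the previous record of this transaction
    obj: str                      # object operated on, e.g. a table name
    op: str                       # operation type: INSERT / UPDATE / DELETE
    before_image: Optional[dict]  # row before the operation (None for INSERT)
    after_image: Optional[dict]   # row after the operation (None for DELETE)

rec = LogRecord(txn_id=42, seq_no=7, prev_seq_no=6, obj="account",
                op="UPDATE", before_image={"balance": 100},
                after_image={"balance": 50})
print(rec.op)  # -> UPDATE
```

The before-image supports rollback (undo); the after-image supports recovery and replication (redo).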
Transaction logs go by a variety of names: "binary logs" ("binlogs") in [[MySQL]], the "write-ahead log" (WAL) in [[Postgres]], "redo logs" in Oracle, and "transaction logs" in MS SQL Server.
Log files are binary but can be made human-readable with a utility provided by the DBMS (e.g. `mysqlbinlog` for MySQL).
### commit
A **commit** makes a transaction's changes permanent: the transaction's log records and updated cache buffers are written to disk.
### rollback
A **rollback** undoes a transaction's changes by replaying its log records in reverse, restoring each row's before-image.
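In practice the DBMS exposes this as a single operation. A minimal sketch with `sqlite3` (hypothetical table and values): an uncommitted update is undone and the row's before-image is restored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 100)")
conn.commit()

conn.execute("UPDATE account SET balance = 0 WHERE id = 1")
conn.rollback()  # undo the uncommitted UPDATE

balance = conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]
print(balance)  # -> 100
```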
## database backup
For most organizations, data loss and downtime are not acceptable, so database backups are essential to ensuring data availability. When availability is critical, a standby copy of the primary database is maintained. Applications can read from either copy at any time provided the standby is updated synchronously (called a "hot standby").
A **full backup** is when the entire database is copied. An **incremental backup** will copy only changes made since the last backup.
A **cold backup** requires shutting down the primary database while it is being backed up. A **warm backup** does not require downtime but is slower.
In many cases, transaction logs are used to replicate transactions that occur on the primary database to the secondary database. This is called **replication**.
**Synchronous replication** applies each transaction to the secondary before the primary's commit completes, which is safer. **Asynchronous replication** applies transactions to the secondary after the fact, which is faster but the secondary may temporarily be inconsistent with the primary.
During a backup, the transaction logs are also backed up.
To recover a database, restore the most recent backup, then use the transaction logs to replay any transactions committed after that backup was taken.
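Recovery can be sketched as "restore, then redo". A hypothetical illustration in plain Python (the snapshot and log records are invented): the backup is restored and each logged after-image is reapplied in order.

```python
# Snapshot taken at backup time: row id -> row contents.
backup = {1: {"balance": 100}}

# Log records for transactions committed after the backup.
log = [
    {"txn_id": 7, "obj_id": 1, "after_image": {"balance": 50}},
    {"txn_id": 8, "obj_id": 2, "after_image": {"balance": 25}},
]

db = {k: dict(v) for k, v in backup.items()}  # restore the backup
for rec in log:                               # replay the log in order
    db[rec["obj_id"]] = rec["after_image"]

print(db)  # -> {1: {'balance': 50}, 2: {'balance': 25}}
```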
## CAP
[[google sheets]]
[[database management system]]
[[knowledge discovery in databases]]
[[from data to insight]]
[[relational database]]
[[Postgres]]
[[SQL]]
[[database normalization]]
[[database markup language]]
[[schema]]
[[entity-relationship diagram]]
[[data warehouse]]
[[Snowflake]]
[[vector database]]
[[Chroma]]
[[project organization for databases]]
[[ACID]]
https://core0.staticworld.net/assets/media-resource/16281/infoworld_which_database_deep_dive.pdf