#rough
YT nullQueries Data Pipelines: How to make them better
Data Pipeline steps
1. Collection
2. Ingestion
    1. ETL
    2. Streaming
3. Storage (data lake)
4. Prep & Analyze
5. Storage (database)
6. Presentation
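
A minimal sketch of how these stages might chain together in code. Every function, table, and path name here is made up for illustration, not taken from the video:

```python
# Hypothetical skeleton: one function per pipeline stage, chained by a run() driver.
from datetime import date


def collect(run_date: date) -> list[dict]:
    """1. Collection: pull raw records from a source system (stubbed here)."""
    return [{"id": 1, "value": "a", "run_date": run_date.isoformat()}]


def ingest_to_lake(records: list[dict], run_date: date) -> str:
    """2-3. Ingestion + lake storage: land the raw, untransformed data, partitioned by date."""
    path = f"lake/raw/sales/{run_date:%Y/%m/%d}/part-000.json"
    # A real pipeline would write to S3 / ADLS / GCS here instead of just returning the path.
    return path


def transform(raw_path: str) -> list[dict]:
    """4. Prep & analyze: clean, type, and enrich the raw data."""
    return [{"id": 1, "value": "A"}]


def load_to_warehouse(rows: list[dict]) -> None:
    """5. Storage (database): upsert the curated rows into the warehouse."""
    # e.g. MERGE / INSERT ... ON CONFLICT, or a bulk COPY into a staging table first.


def run(run_date: date) -> None:
    raw = collect(run_date)
    raw_path = ingest_to_lake(raw, run_date)
    curated = transform(raw_path)
    load_to_warehouse(curated)  # 6. Presentation: the BI tool reads from the warehouse


if __name__ == "__main__":
    run(date.today())
```
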
Cloud data platforms
- Azure
    - Event Hubs, IoT Hub, Data Factory; Data Lake Storage; Databricks, HDInsight, Machine Learning, Analysis Services, Synapse Analytics; Power BI
- AWS
    - Lambda, Kinesis, IoT; S3, S3 Glacier; Glue, EMR, Athena, Redshift; QuickSight
- Google
    - Pub/Sub, Cloud Functions, IoT Core; Cloud Storage, Dataflow, Dataproc, Datalab, Dataprep; BigQuery; Data Studio
Features of good data pipelines
1. Auditing and logging: know what was processed and when, which credentials were used, how much data was in each file (validate against expected parameters), and how long each task took (see the logging/alerting sketch below)
2. Error handling: notify someone when a task fails instead of failing silently (also covered in the sketch below)
3. Repeatability: avoid loading duplicate records when the pipeline is re-run. (This was the big issue with my dwr-sidewinder project.) Options:
    1. Detect what is already loaded and skip it
    2. Land in a staging table and upsert (see the staging/upsert sketch below)
    3. De-duplicate after loading (worst option)
4. Self-healing: be able to re-load old data when it was loaded incorrectly or an error is discovered later (see the partition re-load sketch below)
5. Decouple EL and T (see the EL/T sketch below)
    1. Store raw data in a data lake; transform in the data warehouse (add data types, clean, etc.)
6. CI/CD: the pipeline is made of code, committed to Git, versioned, etc.
    1. Handle rollbacks
Truncate and Load is a common ETL practice (truncate or drop the target table, then re-load everything)
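
Auditing/logging and error handling (features 1 and 2): a minimal sketch that wraps each task so row counts and timings are always logged, and a failure alerts someone. `notify` and `load_daily_sales` are stand-ins I made up, not anything from the video:

```python
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("pipeline")


def notify(message: str) -> None:
    """Stand-in for a real alert channel (Slack webhook, PagerDuty, email)."""
    log.error("ALERT: %s", message)


def audited(task):
    """Log what ran, how many rows it touched, and how long it took; alert on failure."""
    @wraps(task)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            rows = task(*args, **kwargs)  # convention here: tasks return a row count
            log.info("task=%s rows=%s seconds=%.1f",
                     task.__name__, rows, time.monotonic() - start)
            return rows
        except Exception as exc:
            notify(f"task {task.__name__} failed: {exc}")
            raise  # re-raise so the scheduler still marks the run as failed
    return wrapper


@audited
def load_daily_sales() -> int:
    return 1250  # stub: pretend 1,250 rows were loaded


if __name__ == "__main__":
    load_daily_sales()
```
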
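Repeatability, option 2 (land in staging, then upsert): load every run into a staging table and merge on the business key so re-runs never create duplicates. This sketch uses an in-memory SQLite database and Postgres/SQLite-style `ON CONFLICT`; the `sales` table and its columns are invented:

```python
import sqlite3  # in-memory stand-in for a real warehouse connection

DDL = """
CREATE TABLE IF NOT EXISTS sales (sale_id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);
CREATE TABLE IF NOT EXISTS sales_staging (sale_id INTEGER, amount REAL, updated_at TEXT);
"""

# Merge staging into the target on the business key (upsert).
UPSERT = """
INSERT INTO sales (sale_id, amount, updated_at)
SELECT sale_id, amount, updated_at FROM sales_staging WHERE true
ON CONFLICT(sale_id) DO UPDATE SET
    amount = excluded.amount,
    updated_at = excluded.updated_at;
"""


def load(conn: sqlite3.Connection, batch: list[tuple]) -> None:
    with conn:  # one transaction per run
        conn.execute("DELETE FROM sales_staging")  # staging is emptied every run
        conn.executemany("INSERT INTO sales_staging VALUES (?, ?, ?)", batch)
        conn.execute(UPSERT)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    batch = [(1, 9.99, "2024-01-01"), (2, 5.00, "2024-01-01")]
    load(conn, batch)
    load(conn, batch)  # re-running the same batch does not create duplicates
    print(conn.execute("SELECT COUNT(*) FROM sales").fetchone())  # (2,)
```

Because the merge keys on `sale_id`, loading the same extract twice is harmless, which is also what makes backfills safe.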
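Self-healing via idempotent partition re-loads: if each run replaces a whole date partition, a bad day can be fixed by simply running that day again. A sketch under the assumption that the target table is keyed (here just filtered) by date; the `events` table is made up:

```python
import sqlite3
from datetime import date


def reload_partition(conn: sqlite3.Connection, day: date, rows: list[tuple]) -> None:
    """Replace one day's partition so a bad load can be fixed by running the day again."""
    with conn:
        conn.execute("DELETE FROM events WHERE event_date = ?", (day.isoformat(),))
        conn.executemany("INSERT INTO events (event_date, payload) VALUES (?, ?)", rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (event_date TEXT, payload TEXT)")
    bad_day = date(2024, 1, 5)
    rows = [(bad_day.isoformat(), "corrected record")]
    reload_partition(conn, bad_day, rows)
    reload_partition(conn, bad_day, rows)  # idempotent: safe to run as often as needed
    print(conn.execute("SELECT COUNT(*) FROM events").fetchone())  # (1,)
```
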
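Decoupling EL from T: land the extract exactly as received, and do typing/cleaning in a separate transform step that can always be re-run against the raw copy. In this sketch local files stand in for the lake and warehouse; the paths and fields are invented:

```python
import csv
import json
from pathlib import Path


def extract_load(raw_records: list[dict], lake_dir: Path, run_date: str) -> Path:
    """EL: write the data exactly as received, no cleaning, partitioned by date."""
    out = lake_dir / f"raw/orders/dt={run_date}/orders.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(raw_records))
    return out


def transform(raw_path: Path, curated_path: Path) -> None:
    """T: a separate step adds types and cleaning; it can always re-read the raw copy."""
    records = json.loads(raw_path.read_text())
    with curated_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["order_id", "amount"])
        writer.writeheader()
        for r in records:
            writer.writerow({"order_id": int(r["order_id"]),
                             "amount": round(float(r["amount"]), 2)})


if __name__ == "__main__":
    raw = [{"order_id": "1", "amount": "19.999"}, {"order_id": "2", "amount": "5"}]
    raw_path = extract_load(raw, Path("lake"), "2024-01-05")
    transform(raw_path, Path("orders_curated.csv"))
```
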