#rough
YT nullQueries Data Pipelines: How to make them better
Data Pipeline steps
1. Collection
2. Ingestion
    1. ETL
    2. Streaming
3. Storage (data lake)
4. Prep & Analyze
5. Storage (database)
6. Presentation
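
A minimal sketch of how these stages might chain together in code. Every function, table, and path name here is made up for illustration, not taken from the video:

```python
# Hypothetical skeleton: one function per pipeline stage, chained by a run() driver.
from datetime import date


def collect(run_date: date) -> list[dict]:
    """1. Collection: pull raw records from a source system (stubbed here)."""
    return [{"id": 1, "value": "a", "run_date": run_date.isoformat()}]


def ingest_to_lake(records: list[dict], run_date: date) -> str:
    """2-3. Ingestion + lake storage: land the raw, untransformed data, partitioned by date."""
    path = f"lake/raw/sales/{run_date:%Y/%m/%d}/part-000.json"
    # A real pipeline would write to S3 / ADLS / GCS here instead of just returning the path.
    return path


def transform(raw_path: str) -> list[dict]:
    """4. Prep & analyze: clean, type, and enrich the raw data."""
    return [{"id": 1, "value": "A"}]


def load_to_warehouse(rows: list[dict]) -> None:
    """5. Storage (database): upsert the curated rows into the warehouse."""
    # e.g. MERGE / INSERT ... ON CONFLICT, or a bulk COPY into a staging table first.


def run(run_date: date) -> None:
    raw = collect(run_date)
    raw_path = ingest_to_lake(raw, run_date)
    curated = transform(raw_path)
    load_to_warehouse(curated)  # 6. Presentation: the BI tool reads from the warehouse


if __name__ == "__main__":
    run(date.today())
```
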
Cloud data platforms
- Azure
    - Event Hubs, IoT Hub, Data Factory; Data Lake Storage; Databricks, HDInsight, Machine Learning, Analysis Services, Synapse Analytics; Power BI
- AWS
    - Lambda, Kinesis, IoT; S3, S3 Glacier; Glue, EMR, Athena, Redshift; QuickSight
- Google
    - Pub/Sub, Cloud Functions, IoT Core; Cloud Storage, Dataflow, Dataproc, Datalab, Dataprep; BigQuery; Data Studio
Features of good data pipelines
1. Auditing and logging: know what was processed and when, which credentials were used, how much data was in each file (validate against expected parameters), and how long each task took (see the logging/alerting sketch below)
2. Error handling: notify someone when a task fails instead of failing silently (also covered in the sketch below)
3. Repeatability: avoid loading duplicate records when the pipeline is re-run. (This was the big issue with my dwr-sidewinder project.) Options:
    1. Detect what is already loaded and skip it
    2. Land in a staging table and upsert (see the staging/upsert sketch below)
    3. De-duplicate after loading (worst option)
4. Self-healing: be able to re-load old data when it was loaded incorrectly or an error is discovered later (see the partition re-load sketch below)
5. Decouple EL and T (see the EL/T sketch below)
    1. Store raw data in a data lake; transform in the data warehouse (add data types, clean, etc.)
6. CI/CD: the pipeline is made of code, committed to Git, versioned, etc.
    1. Handle rollbacks
Truncate and Load is a common ETL practice (truncate or drop the target table, then re-load everything)
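
Auditing/logging and error handling (features 1 and 2): a minimal sketch that wraps each task so row counts and timings are always logged, and a failure alerts someone. `notify` and `load_daily_sales` are stand-ins I made up, not anything from the video:

```python
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("pipeline")


def notify(message: str) -> None:
    """Stand-in for a real alert channel (Slack webhook, PagerDuty, email)."""
    log.error("ALERT: %s", message)


def audited(task):
    """Log what ran, how many rows it touched, and how long it took; alert on failure."""
    @wraps(task)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            rows = task(*args, **kwargs)  # convention here: tasks return a row count
            log.info("task=%s rows=%s seconds=%.1f",
                     task.__name__, rows, time.monotonic() - start)
            return rows
        except Exception as exc:
            notify(f"task {task.__name__} failed: {exc}")
            raise  # re-raise so the scheduler still marks the run as failed
    return wrapper


@audited
def load_daily_sales() -> int:
    return 1250  # stub: pretend 1,250 rows were loaded


if __name__ == "__main__":
    load_daily_sales()
```
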
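Repeatability, option 2 (land in staging, then upsert): load every run into a staging table and merge on the business key so re-runs never create duplicates. This sketch uses an in-memory SQLite database and Postgres/SQLite-style `ON CONFLICT`; the `sales` table and its columns are invented:

```python
import sqlite3  # in-memory stand-in for a real warehouse connection

DDL = """
CREATE TABLE IF NOT EXISTS sales (sale_id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);
CREATE TABLE IF NOT EXISTS sales_staging (sale_id INTEGER, amount REAL, updated_at TEXT);
"""

# Merge staging into the target on the business key (upsert).
UPSERT = """
INSERT INTO sales (sale_id, amount, updated_at)
SELECT sale_id, amount, updated_at FROM sales_staging WHERE true
ON CONFLICT(sale_id) DO UPDATE SET
    amount = excluded.amount,
    updated_at = excluded.updated_at;
"""


def load(conn: sqlite3.Connection, batch: list[tuple]) -> None:
    with conn:  # one transaction per run
        conn.execute("DELETE FROM sales_staging")  # staging is emptied every run
        conn.executemany("INSERT INTO sales_staging VALUES (?, ?, ?)", batch)
        conn.execute(UPSERT)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    batch = [(1, 9.99, "2024-01-01"), (2, 5.00, "2024-01-01")]
    load(conn, batch)
    load(conn, batch)  # re-running the same batch does not create duplicates
    print(conn.execute("SELECT COUNT(*) FROM sales").fetchone())  # (2,)
```

Because the merge keys on `sale_id`, loading the same extract twice is harmless, which is also what makes backfills safe.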
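Self-healing via idempotent partition re-loads: if each run replaces a whole date partition, a bad day can be fixed by simply running that day again. A sketch under the assumption that the target table is keyed (here just filtered) by date; the `events` table is made up:

```python
import sqlite3
from datetime import date


def reload_partition(conn: sqlite3.Connection, day: date, rows: list[tuple]) -> None:
    """Replace one day's partition so a bad load can be fixed by running the day again."""
    with conn:
        conn.execute("DELETE FROM events WHERE event_date = ?", (day.isoformat(),))
        conn.executemany("INSERT INTO events (event_date, payload) VALUES (?, ?)", rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (event_date TEXT, payload TEXT)")
    bad_day = date(2024, 1, 5)
    rows = [(bad_day.isoformat(), "corrected record")]
    reload_partition(conn, bad_day, rows)
    reload_partition(conn, bad_day, rows)  # idempotent: safe to run as often as needed
    print(conn.execute("SELECT COUNT(*) FROM events").fetchone())  # (1,)
```
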
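Decoupling EL from T: land the extract exactly as received, and do typing/cleaning in a separate transform step that can always be re-run against the raw copy. In this sketch local files stand in for the lake and warehouse; the paths and fields are invented:

```python
import csv
import json
from pathlib import Path


def extract_load(raw_records: list[dict], lake_dir: Path, run_date: str) -> Path:
    """EL: write the data exactly as received, no cleaning, partitioned by date."""
    out = lake_dir / f"raw/orders/dt={run_date}/orders.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(raw_records))
    return out


def transform(raw_path: Path, curated_path: Path) -> None:
    """T: a separate step adds types and cleaning; it can always re-read the raw copy."""
    records = json.loads(raw_path.read_text())
    with curated_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["order_id", "amount"])
        writer.writeheader()
        for r in records:
            writer.writerow({"order_id": int(r["order_id"]),
                             "amount": round(float(r["amount"]), 2)})


if __name__ == "__main__":
    raw = [{"order_id": "1", "amount": "19.999"}, {"order_id": "2", "amount": "5"}]
    raw_path = extract_load(raw, Path("lake"), "2024-01-05")
    transform(raw_path, Path("orders_curated.csv"))
```
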