Big data is principally defined by three Vs:
- **Volume**: more data than can practically be stored in a [[relational database]] or [[data warehouse]].
- **Variety**: such a variety of data that it cannot be modeled with a [[schema]]; largely unstructured data (e.g., text, audio, video).
- **Velocity**: data are generated at such a high rate that traditional storage and analysis schemes are not practical.
Many other "V"s have been proposed to define and describe big data, including veracity, value, variability, visualization, validity, vulnerability, and volatility. Big data can be hard to clean; difficult to derive value from; inconsistent across time or data sources; hard to visualize; occasionally invalid, irrelevant, or incorrect; hard to secure; and may need to be deleted (or simplified) over time to limit data storage requirements.
## big data architecture
A big data project will involve multiple layers.
- **Data sources:** the set of data sources that must be ingested (e.g., relational databases, social networks, text, multimedia)
- **Data ingestion:** the process of gathering and validating data
- **Data collector:** transports data to storage
- **Data storage:** a physical store of the data that supports all data types and formats; designed to support the volume, variety and velocity of the data
- **Data processing:** data are selected, cleaned and processed; processing can be batched, real-time, or hybrid.
- **Query & Analysis:** returns desired data and supports analysis for decision making
- **Visualization:** communicates insights through custom visualizations and real-time dashboards.
- **Data security:** protects the data through authentication, access control, encryption and auditing
- **Data monitoring:** a layer for overseeing all of the above layers.
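The layers above can be sketched as a toy pipeline. Everything here is illustrative, not a real big data framework: `ingest`, `Storage`, `process`, and `query` are hypothetical names standing in for the ingestion, storage, processing, and query/analysis layers.

```python
# Toy sketch of the layered architecture; all names and records are illustrative.

def ingest(raw_records):
    """Data ingestion: gather records and validate them (keep only well-formed ones)."""
    return [r for r in raw_records if "id" in r and "value" in r]

class Storage:
    """Data storage: an in-memory stand-in for a distributed store (e.g., HDFS or S3)."""
    def __init__(self):
        self._records = []

    def write(self, records):
        """Data collector: transport validated records into storage."""
        self._records.extend(records)

    def read(self):
        return list(self._records)

def process(records):
    """Data processing: select and clean (here, drop records with negative values)."""
    return [r for r in records if r["value"] >= 0]

def query(records, min_value):
    """Query & analysis: return records matching a condition for decision making."""
    return [r for r in records if r["value"] >= min_value]

# Wire the layers together: source -> ingestion -> storage -> processing -> query.
raw = [{"id": 1, "value": 10}, {"bad": True}, {"id": 2, "value": -5}, {"id": 3, "value": 7}]
store = Storage()
store.write(ingest(raw))
clean = process(store.read())
result = query(clean, min_value=8)
```

In a real system each function would be a separate service or cluster (and processing could be batch, real-time, or hybrid), but the data flow between the layers is the same.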
## best practices in big data
- Make sure you actually have big data and require big data solutions
- Be clear about the objectives, make sure the system will support decision making
- Authorize file access with a predefined security policy
- Safeguard data at rest
- Implement testing and start small
- Use [[Agile]] processes
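The "predefined security policy" practice above can be as simple as a deny-by-default lookup before every file access. A minimal sketch, assuming a hypothetical role-to-paths policy table (the roles and paths are invented for illustration):

```python
# Hypothetical role-based policy: maps each role to the data paths it may read.
# Anything not explicitly listed is denied.
POLICY = {
    "analyst": {"/data/processed", "/data/reports"},
    "engineer": {"/data/raw", "/data/processed"},
}

def authorized(role, path):
    """Check a read request against the predefined security policy (deny by default)."""
    return path in POLICY.get(role, set())

# Unknown roles and unlisted paths are rejected without any special-casing.
analyst_ok = authorized("analyst", "/data/reports")
analyst_raw = authorized("analyst", "/data/raw")
guest_any = authorized("guest", "/data/reports")
```

Real deployments would pair a check like this with authentication, encryption at rest, and audit logging, as described in the data security layer above.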
## big data technologies
Specific technologies have been developed for big data storage, query and analysis.
- **Storage options** include [[Hadoop]], Gluster, [[Amazon S3]], and others.
- **Query & Analysis** options include Hive, [[SPARQL]], [[Elasticsearch]], Redshift, Presto, and others.
- **Visualization** options include Kibana (for Elasticsearch) and others.
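Several of these tools, Hadoop and Hive in particular, build on the MapReduce model: a map step emits key-value pairs in parallel, and a reduce step aggregates them by key. A minimal in-memory sketch of that model (the function names are illustrative, not part of any framework):

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def reduce_phase(pairs):
    """Reduce: sum the counts for each word (the shuffle step is implicit here)."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data big value", "data velocity"]
counts = reduce_phase(map_phase(docs))
# counts == {"big": 2, "data": 2, "value": 1, "velocity": 1}
```

On a real cluster the map and reduce phases run across many machines over data in distributed storage, which is what makes the model suit the volume and velocity described above.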