Data is the new oil: there is a lot of it out there, and more is created every day. Data mining is the process of exploiting this resource for knowledge discovery. A discovery is interesting if it is valid, previously unknown, potentially useful, and understandable by humans.
Data mining originated from the field of [[statistics]] with the rise of [[database]] technologies. As organizations began storing massive volumes of data in relational databases in the 1980s, the challenge became: How can we extract useful insights from all this stored data? The concept of [[knowledge discovery in databases]] (KDD), which attempted to situate data mining as a technique within a broader framework, was formalized in the 1990s.
The field of data mining developed alongside the field of [[base/Statistical Learning/machine learning|machine learning]]. In the early 2000s and 2010s, as data volumes exploded and computing power increased, the two fields began to merge under the emerging field of [[data science]].
## data mining pipeline
A data mining pipeline "transports" data from raw resource to useful application (a minimal code sketch follows the list below). It consists of:
1. Understanding
2. Pre-processing
3. Warehousing
4. Modeling (Analysis)
5. Pattern evaluation
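Below is a minimal sketch of these five stages in Python, assuming a toy pandas/scikit-learn workflow; the file names, the "label" column, and the choice of model are all illustrative, not part of any canonical pipeline.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1-2. Understanding and pre-processing: inspect the raw data, drop missing rows
df = pd.read_csv("sensor_readings.csv")  # hypothetical raw data
df = df.dropna()                         # crude missing-value handling

# 3. Warehousing stand-in: persist the cleaned table for downstream use
df.to_parquet("sensor_readings_clean.parquet")

# 4. Modeling (analysis): fit a simple classifier on the cleaned data
X = df.drop(columns=["label"])           # "label" is an assumed target column
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = DecisionTreeClassifier().fit(X_train, y_train)

# 5. Pattern evaluation: check whether the learned patterns generalize
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```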
When designing a data mining pipeline, consider these four dimensions:
1. **Data**: the types of data available and the "5Vs": Volume, Variety, Velocity, Veracity, and Value
2. **Application**: the intended use of the data and the analysis
3. **Knowledge**: the type of discovery desired (descriptive, predictive, or prescriptive)
4. **Technique**: frequent pattern analysis, classification/prediction, clustering, anomaly detection, or trend and evolution analysis
### the 5Vs of data
- Volume
- Variety
- Velocity
- Veracity
- Value
### types of data
**Relational**: data that can be stored in relations (tables)
**Sequential** (temporal): data that has a natural ordering (for example, time)
**Streaming**: sequential data that arrives continuously over time
**Spatial**: data that relates to locations in space (especially geospatial data)
**Spatio-temporal**: data that relates to locations in both space and time
**Text**: natural language
**Multimedia**: images, audio, video
**Hypertext**: web data or wiki-style linked data
**Graph**: network information
**Key-value**: semi-structured data such as JSON documents in NoSQL stores
### issues in data mining
- Diverse data are needed for diverse knowledge
- Data quality
- Supervised vs unsupervised learning
- Performance evaluation
- Effectiveness vs efficiency
- Incremental, interactive mining
- Integration of domain knowledge
- Visual analytics
- Privacy
- Ethics
- Data ownership
- Model validity
- Model bias (algorithmic fairness)
- Interpretation, application, and societal consequence
### data quality
- relevance
- accessibility
- interpretability
- reliability
- timeliness
- accuracy
- consistency
- precision
- granularity (resolution or scale)
- completeness
### data quality issues
- incomplete (missing values or missing attributes)
- noisy data (imprecisions, errors, outliers)
- inconsistent (e.g., a recorded age that contradicts the birth date, or answers on mismatched rating scales)
Causes include errors during data collection, transmission, and processing; human, hardware, and software faults; and changes over time.
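A few of these issues can be surfaced mechanically; here is a rough pandas sketch, where the dataset and its "age" and "birth_year" columns are hypothetical.

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical dataset

# Incomplete data: count missing values per attribute
print(df.isna().sum())

# Noisy data: flag values more than three standard deviations from the mean
z = (df["age"] - df["age"].mean()) / df["age"].std()
print(df[z.abs() > 3])

# Inconsistent data: recorded age should agree with the recorded birth year
current_year = pd.Timestamp.now().year
print(df[df["age"] != current_year - df["birth_year"]])
```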
## outlier
Outliers or anomalies are data points that differ significantly from the normal data pattern. If this sounds underdefined, it is.
**Global outliers** differ significantly from the mass of the data; the value alone is enough to tell whether a point is an outlier for the dataset. **Contextual outliers** are anomalous only within the context of the data pattern (e.g., a very low temperature reading in summer). **Collective outliers** are a group of points that together indicate an anomaly (e.g., the traffic pattern of a distributed denial-of-service attack).
Detection of outliers or anomalies will depend on the dataset and its intended application. [[Classification]] approaches would learn to detect anomalies using labeled examples. [[Clustering]] approaches would learn to detect anomalies based on distance or density to other data points.
To make the global case concrete: an RGB value of 9999 is a global outlier in a remote sensing dataset whose pixels take values in the range (0, 255) on each of the R, G, and B channels.
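A sketch of both styles of detection on synthetic data: a z-score test for global outliers, and scikit-learn's LocalOutlierFactor as one density-based (clustering-flavored) option. The threshold of 3 is a common convention, not a rule.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(loc=128, scale=20, size=(500, 3))  # synthetic "pixel" data
X[0] = [9999, 9999, 9999]                         # inject a global outlier

# Global outliers: large z-score against the bulk of the data
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
print("global outliers:", np.where(z.max(axis=1) > 3)[0])

# Density-based outliers: points lying in locally sparse regions
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                       # -1 marks an outlier
print("density outliers:", np.where(labels == -1)[0])
```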
## sequence data
Sequence data are ordered data. When the ordering is time, the data is a [[time-series]]. Sequence data can typically be decomposed into an overall trend, cyclic patterns, random noise, and [[outlier|outliers]].
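As a sketch of such a decomposition, statsmodels' `seasonal_decompose` splits a series into trend, seasonal, and residual parts; the synthetic monthly series and the period of 12 below are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series: linear trend + yearly cycle + noise
t = np.arange(120)
noise = np.random.default_rng(0).normal(0, 2, 120)
series = pd.Series(
    0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + noise,
    index=pd.date_range("2010-01-01", periods=120, freq="MS"),
)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())    # slow-moving trend component
print(result.seasonal.head())          # repeating yearly cycle
print(result.resid.dropna().head())    # leftover noise (and any outliers)
```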
BLAST (Basic Local Alignment Search Tool) is a pairwise sequence alignment technique, best known from bioinformatics.
The [[Markov chain]] model is a popular model for sequence data like stock prices.
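A toy first-order Markov chain over three hypothetical price-movement states; the transition probabilities here are made up for illustration.

```python
import numpy as np

states = ["up", "flat", "down"]
# Row i gives P(next state | current state i); values are invented
P = np.array([
    [0.6, 0.3, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.3, 0.5],
])

rng = np.random.default_rng(1)
state = 0                              # start in "up"
walk = []
for _ in range(10):
    state = rng.choice(3, p=P[state])  # sample the next state
    walk.append(states[state])
print(walk)
```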
For frequent pattern mining over transaction and sequence data, see the [[apriori algorithm]] and the [[FP-growth algorithm]].
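As a rough illustration of frequent pattern mining, here is a tiny pure-Python, Apriori-style two-pass count of frequent pairs over made-up transactions; real implementations generalize to larger itemsets and prune far more aggressively.

```python
from itertools import combinations
from collections import Counter

transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
]
min_support = 3  # minimum number of transactions containing an itemset

# Pass 1: frequent single items
item_counts = Counter(item for t in transactions for item in t)
frequent_items = {i for i, c in item_counts.items() if c >= min_support}

# Pass 2 (Apriori property): only pairs of frequent items can be frequent
pair_counts = Counter(
    pair
    for t in transactions
    for pair in combinations(sorted(frequent_items & t), 2)
)
print({p: c for p, c in pair_counts.items() if c >= min_support})
```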