Data mining is the process of applying algorithms to search for patterns within collections of data.
Fundamentally, data mining is the deployment of an automated process for analyzing amounts of data too large to be handled manually. This process generally involves several steps that may include data collection, data cleaning and validation (to address any errors, inconsistencies, and other quality issues), model development and testing, and final deployment.
In this context, model development means the application of one or more machine learning (ML) algorithms, such as regression, decision trees, and support vector machines, as well as many other tools that are part of the modern ML repertoire.
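The steps described above can be illustrated with a minimal sketch. It assumes a synthetic data set (a noisy line, y = 2x + 1, with a few missing values) and uses ordinary least-squares regression as the model; the function names `collect`, `clean`, `fit_line`, and `mean_squared_error` are hypothetical and chosen only for illustration. Deployment is left out, since it depends entirely on the setting.

```python
import random

def collect(n, seed=0):
    """Simulate data collection: y = 2x + 1 plus noise, with ~5% missing values."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        y = 2.0 * x + 1.0 + rng.gauss(0, 1)
        if rng.random() < 0.05:
            y = None  # a record whose measurement was lost
        rows.append((x, y))
    return rows

def clean(rows):
    """Data cleaning: drop records with a missing target value."""
    return [(x, y) for x, y in rows if y is not None]

def fit_line(rows):
    """Model development: ordinary least squares for a single predictor."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_squared_error(model, rows):
    """Model testing: average squared prediction error on the given rows."""
    slope, intercept = model
    return sum((y - (slope * x + intercept)) ** 2 for x, y in rows) / len(rows)

rows = clean(collect(500))
split = int(0.8 * len(rows))
train, test = rows[:split], rows[split:]  # hold out 20% for testing
model = fit_line(train)
print("slope, intercept:", model)
print("held-out MSE:", mean_squared_error(model, test))
```

The held-out 20% plays the role of data the model has not seen, so the reported error is an estimate of how the fitted line would perform on new records rather than on the ones used to fit it.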
Data mining is one of the more recent applications of statistical analysis for extracting information from data. The scientific and social domains have always needed to collect and process data, and have continually refined their methods for doing so. Previously, both the collection and the processing of data were largely manual efforts. However, with modern computers, and with computing embedded in many everyday devices, there are now far more possibilities for creating and processing data than ever before. One of the most prevalent and challenging issues for businesses and individuals alike is the abundance of data. The modern problem is determining, amongst all that is available, what is relevant and worthwhile, and what it means.
The popularization of machine learning methods has come about as a result of several recent developments: increases in computational processing power, the availability of economical storage, and the production of large data sets from digital devices and other sources. As these methods have grown in popularity, an ever greater number of domains have come to leverage them, which has made data mining a ubiquitous technique in modern analytics. While these techniques and the computational resources now available offer enormous processing potential and efficiency in working with large quantities of data, their implementation also involves trade-offs.
One of the most significant drawbacks of data mining is its use in indiscriminate quests to discover patterns, combined with the improper or incomplete vetting of results. Model evaluation should follow best practices such as testing models on new or “out of sample” data. Otherwise, models are likely to return spurious relationships that are not based on any underlying causal structure. In the age of big data there is an inherent paradox: the larger the data set, the greater the likelihood of finding incidental and meaningless relationships.
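The risk of spurious relationships can be illustrated with a small simulation, offered here as a sketch under stated assumptions rather than a definitive method. It generates a target and many candidate features that are all pure random noise, so no real relationship exists by construction. It then picks the feature with the strongest in-sample correlation and checks that same feature on held-out data. The function names `pearson` and `spurious_search`, along with the sample sizes, are hypothetical choices for illustration.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def spurious_search(n_features=1000, n_train=30, n_test=500, seed=0):
    """Search pure-noise features for the best in-sample correlate of a noise target.

    Returns the in-sample |r| of every feature, the index of the best one,
    and that feature's |r| on the held-out (out-of-sample) rows.
    """
    rng = random.Random(seed)
    n = n_train + n_test
    target = [rng.gauss(0, 1) for _ in range(n)]
    features = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_features)]

    # Indiscriminate pattern search: score every feature on the training rows only.
    in_sample = [abs(pearson(f[:n_train], target[:n_train])) for f in features]
    best = max(range(n_features), key=in_sample.__getitem__)

    # Vetting step: re-check the "winning" feature on rows it was not selected on.
    out_of_sample = abs(pearson(features[best][n_train:], target[n_train:]))
    return in_sample, best, out_of_sample

in_sample, best, out_of_sample = spurious_search()
print("best |r| among first 10 features:", max(in_sample[:10]))
print("best |r| among all features:     ", in_sample[best])
print("same feature, out of sample:     ", out_of_sample)
```

Because the first ten features are a subset of the full thousand, the best correlation found can only stay the same or grow as more candidates are searched. The out-of-sample check is what exposes the apparent pattern as noise.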