A key driver of the rise of the so-called data-driven organisation is the increased awareness and use of data-driven decision-making (DDD) at all levels of the business. This is in a large part because of the increasing availability and quality of collected data, as well as increasing opportunities to make use of it. In this blog post I would like to discuss what DDD is, how DDD relates to data mining, and finally how to approach data mining projects. Much of the material for this post comes from my reading of “Data Science for Business”, by Foster Provost and Tom Fawcett.
Data-Driven Decision-Making
DDD refers to the use of data analysis to drive meaningful decision making. For example, a supermarket may be able to identify triggers that lead to changes in purchasing decisions and manage stock accordingly. Businesses following DDD principles have been found to demonstrate statistically significant improvements in their productivity (See “Data Science for Business” and references therein). However, that is not to say use of DDD should entirely preclude the use of intuition to help inform business decision: rather DDD should be another component to decision processes.
DDD itself relates not just to the use of data science and data mining techniques in isolation, but also to its automation, sometimes referred to specifically as “automated DDD”. It is with automated DDD that practices typical of data science or data mining really come into their own as distinct from other analytical techniques such as statistics, database queries etc. This is precisely because data mining allows a business to automate the search for knowledge and pattern recognition, although in reality though, there may always be an unavoidable manual aspect to this knowledge discovery process.
Previously I referred to “data science” and “data mining” separately. The distinction between the two concepts is always a bit unclear, however this separation is often useful to be able to refer to specific subtasks comprising the data mining cycle (see CRISP-DM below) as separate from the broader field of data science.
CRISP-DM
Having established the desirability of DDD, how do we achieve it? This is where we need to begin by first identifying the business problems we are facing or want to investigate, and see how to approach these using data mining.
Despite the seeming variety of business problems, they largely fit into one of a number of well known data mining tasks, such as classification, regression, and clustering. It is of course quite a skill to correctly identify precisely the kind of problems being addressed so as to decide early on what kind of approach to follow. This process of identification largely follows using a combination of understanding the kind of business problem being addressed and what data is available. Once the kind of data mining task is established, it becomes possible to approach it in a systematic way.
The high-level approach to engaging with business problems with data mining techniques is fairly well established, and is formalised by the framework known as the “Cross Industry Standard Process for Data Mining” or CRISP-DM. The following process diagram gives the relationship between the different phases of CRISP-DM. Diagram by Kenneth Jensen, distributed under a CC BY-SA 3.0 license
There is a lot of detail that can be adeed to this, but to highlight the main features,
- CRISP-DM is cyclical in its very nature, given by the large circle bounding the diagram. This indicates an iterative approach to data projects
- The starting point for any iteration should be with “business understanding” step
- There are also multiple cycles within the overall CRISP-DM cycle, such as that between the tasks of “business understanding” and “data understanding”
- A complete cycle from “deployment” to “business understanding” will usually follow from the successful completion of a project. This can follow from as new insights are generated by a previously successful model
The CRISP-DM diagram however does not perhaps do a good job of capturing the exploratory nature of data mining projects. This is much more so the case with data mining projects than typical software development projects due to the greater degree of inherent uncertainty e.g. overall project expectations may change from subtask to subtask. This may require a greater reliance on prototyping as opposed to iterative releases.
All the best,
Tom