Data science may seem like a brand-new term, but it is not. We have always had data science, typically defined as the principles, processes and techniques used to understand the world around us through the analysis of data.
Data analysis does not necessarily result in decision making, however. So what do we need to do to become a data-driven decision-making organization? The first step is to understand what is generally involved in data science and data-driven decision making.
I would say that two groups of data-based decisions are generally identified:
- “Discover” or understand data: This group is often ignored, or is not identified as a key element, by most organizations. This probably comes from a place of hubris: “Well, we know our data well!” However, the new norm (helped by the fact that more data are available) is to continuously discover data.
- Decisions that repeat: This group is a very popular candidate for data-driven decisions. Customer churn, for example, is an age-old problem that has haunted even the best marketers.
During the past few years, we have seen tremendous improvements in technology and the natural rise of “Big Data”. So how can we make use of these advances, think analytically at a massive scale and process giant volumes of data on a daily basis?
The answer is mostly related to data processing. It is important to understand that data processing and data science are two separate yet related disciplines. Data processing is critical to the maturation of data science.
We previously identified two separate classes of data based decisions.
- “Discover” or understand data: This group requires somewhat traditional approaches to data processing. Generally speaking, data have to be sourced from a wide variety of applications and/or systems. These data tend to come in a wide array of formats (although they tend to be mostly structured), and that variety makes them difficult to process. In the past, data warehouses were typically used for data discovery. Now, with Big Data, a wider variety of toolsets is available for data processing.
- Decisions that repeat: This type of decision requires a slightly different approach to data processing. Generally, reporting/monitoring and alerting tools are required and should be used for repeating decisions based on well-understood data. However, data warehouses, data lakes or other architectural approaches can be used as well. These types of decisions are also often based on data in motion (as opposed to data at rest).
With this basic difference between data processing and data science in mind, it will be interesting to look at data science approaches and at what can be done to fulfill the promise of purely data-based decision making.
Now that we have reviewed the basic categories of data-driven decision making and discussed how each one depends on data processing, we are ready to jump into a small subset of data mining techniques that are foundational to the data science process.
Following are brief descriptions of data mining techniques:
- Regression or Estimation: Generally you would use regression to predict the value of a variable (such as the readmission probability for a patient). This technique is quite useful when you are trying to predict a single numeric value for a variable.
- Similarity matching: Often used to match an individual or group with another individual or group, given a finite set of dimensional and measurable attributes. Organizations can often use this to identify customer groups or peer groups.
- Classification: This technique is useful when you are attempting to segment or categorize a population of candidates or things. It is generally used by marketers to identify segments for positioning and targeting.
- Clustering: Typically used for identifying “natural” groups in the data. This is fundamentally different from similarity matching, which is done for a specific purpose.
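To make the distinctions above more concrete, here is a minimal Python sketch of three of these techniques (similarity matching, clustering and regression) on a tiny, made-up customer data set. Everything in it is an assumption for illustration only: the `customers` data, the function names, the use of Euclidean distance, the choice of a basic k-means loop for clustering and a simple least-squares line for regression. Real projects would normally rely on an established library rather than hand-written code like this.

```python
import math
from statistics import fmean

# Hypothetical toy data: (monthly_visits, monthly_spend) for each customer.
customers = {
    "a": (2.0, 20.0),
    "b": (3.0, 25.0),
    "c": (2.5, 22.0),
    "d": (10.0, 95.0),
    "e": (11.0, 100.0),
    "f": (9.5, 90.0),
}


def most_similar(target, population):
    """Similarity matching: find the closest peer of one specific customer."""
    others = (name for name in population if name != target)
    return min(
        others,
        key=lambda name: math.dist(population[target], population[name]),
    )


def k_means(points, centroids, iterations=10):
    """Clustering: look for "natural" groups, with no specific target in mind."""
    groups = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        groups = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            groups[nearest].append(point)
        # Move each centroid to the mean of its group (keep it if the group is empty).
        centroids = [
            tuple(fmean(axis) for axis in zip(*group)) if group else centroid
            for group, centroid in zip(groups, centroids)
        ]
    return groups


def fit_line(xs, ys):
    """Regression: least-squares line y = slope * x + intercept."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


points = list(customers.values())

# Similarity matching answers a targeted question about customer "a".
print("Closest peer of a:", most_similar("a", customers))

# Clustering is given only starting centroids and finds groups on its own.
print("Clusters:", k_means(points, centroids=[points[0], points[3]]))

# Regression estimates one numeric value: spend for a customer with 6 visits.
visits, spend = zip(*points)
slope, intercept = fit_line(visits, spend)
print("Estimated spend at 6 visits:", slope * 6.0 + intercept)
```

Note how similarity matching starts from one named customer, while clustering is handed the whole population and only an initial guess at the group centers. Classification is left out of the sketch because it needs labeled examples, which this toy data set does not have.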