Data Mining Tutorial

Data mining is the process of extracting useful information from large databases.

Learn the concepts of data mining with this complete Data Mining Tutorial. Useful for beginners, this tutorial discusses the basic and advanced concepts and techniques of data mining with examples. Freshers and BE, BTech, MCA and other college students will find it useful for developing notes, preparing for exams, and solving lab questions, assignments and viva questions.

Who is this Data Mining Tutorial designed for?

This tutorial is especially useful for beginners wanting to learn data mining for their studies and exams.

What do I need to know to begin with?

To start learning data mining, you should have a good knowledge of database and data warehousing concepts.

Data Mining syllabus covered in this tutorial

This tutorial covers patterns and technologies in data mining, KDD, OLAP, knowledge representation, associations in data mining, classification, regression, clustering, text and web mining, reinforcement learning, etc.

Data Mining

  • Data mining is the process of extracting useful information that is stored in large databases.
  • It is a powerful tool that helps organizations retrieve useful information from their data warehouses.
  • Data mining can be applied to relational databases, object-oriented databases, data warehouses, structured and unstructured databases, etc.
  • Data mining is used in numerous areas such as banking, insurance and pharmaceuticals.

Patterns in Data Mining

1. Association
Association mining finds associations or correlations among the items or objects stored in relational databases, transactional databases or any other information repositories.
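
As a rough illustration, the strength of an association between items is commonly measured with support and confidence. The sketch below assumes a small, made-up list of shopping transactions; the helper names `support` and `confidence` are hypothetical and not part of this tutorial.

```python
# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

# How often bread and milk are bought together.
bread_milk_support = support({"bread", "milk"}, transactions)
# Of the baskets that contain bread, the share that also contain milk.
bread_to_milk_confidence = confidence({"bread"}, {"milk"}, transactions)
print(bread_milk_support, bread_to_milk_confidence)
```

A rule such as "bread implies milk" is usually kept only when both values exceed thresholds chosen by the analyst.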

2. Classification
  • The goal of classification is to construct a model from historical data that can accurately predict the class of new data.
  • It maps the data into predefined groups or classes and searches for new patterns.
For example:
The weather on a particular day can be classified as sunny, rainy, or cloudy.
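
The weather example can be sketched with a very simple classifier. The code below assumes a nearest-neighbour rule and invented historical records of (humidity, cloud cover); both the technique and the data are illustrative choices, not something this tutorial prescribes.

```python
import math

# Hypothetical historical records: (humidity %, cloud cover %) -> weather class.
history = [
    ((30, 10), "sunny"),
    ((35, 20), "sunny"),
    ((60, 80), "cloudy"),
    ((65, 70), "cloudy"),
    ((90, 95), "rainy"),
    ((85, 90), "rainy"),
]

def classify(features, history):
    """Return the class of the closest historical record (1-nearest neighbour)."""
    nearest = min(history, key=lambda record: math.dist(record[0], features))
    return nearest[1]

# Classify a new day from its measurements.
prediction = classify((88, 92), history)
print(prediction)
```

The predefined classes come from the historical data; the model only assigns new observations to one of them.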

3. Regression
  • Regression creates predictive models. Regression analysis fits a formula to existing data and uses it to make predictions, usually of a numeric value.
  • Regression is very useful for predicting unknown values on the basis of previously known information.
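
A minimal sketch of the idea, assuming simple linear regression fitted by ordinary least squares on made-up data. The function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Illustrative data that lies exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
# Use the fitted formula to predict the value at an unseen x.
predicted = slope * 5 + intercept
print(slope, intercept, predicted)
```

Real data rarely lies exactly on a line, so the fitted formula then gives the best straight-line approximation rather than an exact match.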

4. Cluster analysis
  • It is the process of partitioning a set of data into meaningful subclasses, called clusters.
  • It is used to place data elements into related groups without advance knowledge of the group definitions.
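
One common way to form such groups is k-means. The sketch below assumes one-dimensional data and fixed starting centers so that the result is repeatable; the function name `k_means_1d` and the data are hypothetical.

```python
def k_means_1d(values, centers, iterations=10):
    """Group numbers around the nearest center, then move each center to its group's mean."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for value in values:
            # Assign the value to the closest current center.
            index = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[index].append(value)
        # Recompute each center; keep the old one if its cluster is empty.
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters

values = [1, 2, 3, 10, 11, 12]
centers, clusters = k_means_1d(values, centers=[0, 5])
print(centers, clusters)
```

No group labels are supplied; the two groups emerge only from how close the values are to each other.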

5. Forecasting
Forecasting is concerned with the discovery of knowledge or information patterns in data that can lead to reasonable predictions about the future.
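
As a hedged illustration, one of the simplest forecasting methods is a moving average, which predicts the next value as the mean of the most recent observations. The sales figures and the function name below are assumptions made for this sketch.

```python
def moving_average_forecast(series, window):
    """Forecast the next value as the mean of the last `window` observations."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return sum(series[-window:]) / window

# Hypothetical monthly sales figures.
sales = [100, 110, 120, 130, 140]
next_value = moving_average_forecast(sales, window=3)
print(next_value)
```

A moving average smooths out noise but lags behind a steady trend, so it serves here only as the most basic example of predicting from past patterns.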

Technologies used in data mining

Several technologies are used in the development of data mining methods. Some of them are mentioned below:

1. Statistics:

  • It uses mathematical analysis to represent, model and summarize empirical data or real-world observations.
  • Statistical analysis is a collection of methods that are applied to large amounts of data to draw conclusions and report trends.
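
A small sketch of summarizing observations, using Python's standard `statistics` module on made-up values. The choice of mean, median and population standard deviation is an assumption about which summaries are of interest.

```python
import statistics

# Hypothetical observations.
observations = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    "mean": statistics.mean(observations),
    "median": statistics.median(observations),
    # Spread of the values, treating them as the whole population.
    "population_stdev": statistics.pstdev(observations),
}
print(summary)
```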

2. Machine learning

  • Arthur Samuel defined machine learning as a field of study that gives computers the ability to learn without being explicitly programmed.
  • When new data is entered into the computer, machine learning algorithms use it to update and improve the model.
  • In machine learning, an algorithm is constructed to make predictions from the available data (predictive analysis).
  • It is related to computational statistics.
The four types of machine learning are:

1. Supervised learning
  • It is typically used for classification.
  • It is also called inductive learning. In this method, the desired outputs are included in the training dataset.
2. Unsupervised learning
Unsupervised learning is typically used for clustering. Clusters are formed on the basis of similarity measures, and the desired outputs are not included in the training dataset.

3. Semi-supervised learning
Semi-supervised learning includes only some desired outputs in the training dataset and uses them together with unlabeled examples to generate the appropriate functions. This method generally avoids the need for a large number of labeled examples (i.e. desired outputs).

4. Active learning
  • Active learning is a powerful approach for analyzing data efficiently.
  • The algorithm itself decides which examples it needs desired outputs for and asks the user to supply them, so the user plays an important role in this type of learning.
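
A common way for an active learner to pick what to ask the user about is uncertainty sampling: it queries the example its current model is least sure of. The sketch below assumes a binary task and invented model probabilities; `most_uncertain` is a hypothetical helper.

```python
def most_uncertain(probabilities):
    """Return the id of the example whose predicted probability is closest to 0.5."""
    return min(probabilities, key=lambda name: abs(probabilities[name] - 0.5))

# Hypothetical model outputs: probability that each unlabeled document is relevant.
predicted = {"doc_a": 0.95, "doc_b": 0.52, "doc_c": 0.10}

# The learner asks the user to label this example next.
query = most_uncertain(predicted)
print(query)
```

Labeling the most uncertain example first tends to improve the model with fewer user-supplied labels than labeling examples at random.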

3. Information retrieval

Information retrieval deals with uncertain representations of the semantics of objects (text, images).
For example: finding relevant information in a large document.
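
A minimal sketch of retrieval, assuming documents are ranked by how often the query terms occur in them. Real systems use more refined weighting; the documents and function names here are made up for illustration.

```python
from collections import Counter

def score(query, document):
    """Count how many times the query terms occur in the document."""
    words = Counter(document.lower().split())
    return sum(words[term] for term in query.lower().split())

def rank(query, documents):
    """Return document names ordered from most to least relevant."""
    return sorted(documents, key=lambda name: score(query, documents[name]), reverse=True)

# Hypothetical document collection.
documents = {
    "d1": "data mining extracts patterns from data",
    "d2": "cooking recipes for pasta",
    "d3": "mining gold in the mountains",
}

ranking = rank("data mining", documents)
print(ranking)
```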

4. Database systems and data warehouse

  • Databases are used for recording data as well as for data warehousing.
  • Online Transaction Processing (OLTP) uses databases for day-to-day transactions.
  • To remove redundant data and save storage space, data is normalized and stored in the form of tables.
  • Entity-Relationship modeling techniques are used for relational database management system design.
  • Data warehouses are used to store historical data, which helps in making strategic business decisions.
  • They are used for online analytical processing (OLAP), which helps to analyze the data.
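
The difference between recording transactions and analyzing them can be sketched with Python's built-in `sqlite3` module. The table name, columns and figures are assumptions for this example, and an in-memory SQLite database merely stands in for a real data warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")

# OLTP-style work: individual transactions are recorded as rows.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("north", 2022, 100.0),
        ("north", 2023, 150.0),
        ("south", 2022, 80.0),
        ("south", 2023, 120.0),
    ],
)

# OLAP-style work: historical rows are summarized (rolled up) by region.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()
print(totals)
```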

5. Decision support system

  • A decision support system is a category of information system. It is very useful for decision making in organizations.
  • It is an interactive software-based system that helps decision makers extract useful information from data and documents in order to make decisions.

KDD and Data mining

  • The overall process of discovering knowledge in data, including the application of data mining techniques, is referred to as Knowledge Discovery in Databases (KDD).
  • KDD draws on several fields, such as artificial intelligence, pattern recognition, machine learning and data visualization.
  • The main goal of KDD is to extract knowledge from large databases with the help of data mining methods.
The different steps of KDD are as given below:

1. Data cleaning:
In this step, noise and irrelevant data are removed from the database.

2. Data integration:
In this step, the heterogeneous data sources are merged into a single data source.

3. Data selection:
In this step, the data which is relevant to the analysis process gets retrieved from the database.

4. Data transformation:
In this step, the selected data is transformed into forms that are suitable for data mining.

5. Data mining:
In this step, various techniques are applied to extract data patterns.

6. Pattern evaluation:
In this step, the discovered data patterns are evaluated to identify those that represent useful knowledge.

7. Knowledge representation:
This is the final step of KDD, in which the discovered knowledge is presented to the user, for example through visualization.
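
The seven steps can be sketched as a small pipeline. Everything below is a toy under stated assumptions: the records, the field names, and the threshold rule used as the "mining" step are invented for illustration and stand in for real techniques.

```python
from collections import Counter

# Hypothetical customer records from two separate sources.
source_a = [{"id": 1, "age": 25, "spend": 200.0}, {"id": 2, "age": None, "spend": 150.0}]
source_b = [{"id": 3, "age": 40, "spend": 900.0}, {"id": 4, "age": 35, "spend": 850.0}]

def clean(records):
    """1. Data cleaning: drop records with missing values."""
    return [r for r in records if None not in r.values()]

def integrate(*sources):
    """2. Data integration: merge several sources into one list."""
    return [record for source in sources for record in source]

def select(records, fields):
    """3. Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records, field):
    """4. Data transformation: rescale a field to the range 0..1."""
    values = [r[field] for r in records]
    low, high = min(values), max(values)
    span = (high - low) or 1.0
    return [{**r, field: (r[field] - low) / span} for r in records]

def mine(records, field, threshold=0.5):
    """5. Data mining: a toy rule that labels each record as high or low."""
    return ["high" if r[field] >= threshold else "low" for r in records]

def evaluate(labels):
    """6. Pattern evaluation: count how often each label occurs."""
    return Counter(labels)

def represent(summary):
    """7. Knowledge representation: present the result as readable text."""
    return ", ".join(f"{label}: {count}" for label, count in sorted(summary.items()))

records = integrate(clean(source_a), clean(source_b))
records = transform(select(records, ["spend"]), "spend")
report = represent(evaluate(mine(records, "spend")))
print(report)
```

Each function corresponds to one step in the list above, so the order of the calls mirrors the order of the KDD process.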

Figure: Knowledge discovery in databases (KDD) process

KDD vs Data Mining

KDD: It is a field of computer science that helps to extract useful and previously undiscovered knowledge from large databases by using various tools and theories.

Data Mining: Data mining is one of the important steps in the KDD process. It applies a suitable algorithm, chosen according to the objective of the KDD process, to identify patterns in the database.