Introduction to Data Mining and Machine Learning Techniques

Harshali Patel
12 min read · Aug 7, 2018


1. Objective

In this blog, we will study what Data Mining is. We will also cover its scope, foundation, techniques, and terminology. Along the way, we will look at data mining architecture and the knowledge discovery process, and then at data mining applications and their pros and cons.

2. Introduction to Data Mining

Data Mining is a set of methods applied to large and complex databases in order to eliminate randomness and discover hidden patterns. These methods are almost always computationally intensive. We use data mining tools, methodologies, and theories to reveal patterns in data. Many driving forces are at work here, which is why data mining has become such an important area of study.

3. Data Mining History

In the 1960s, statisticians used the terms “Data Fishing” or “Data Dredging” to refer to what they considered the bad practice of analyzing data without a prior hypothesis. The term “Data Mining” appeared around 1990 in the database community.

4. Foundation of Data Mining

Data mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers, and it now allows users to navigate through their data in real time. Data mining is ready for use in the business community because it is supported by three technologies that are now mature:

  • Massive data collection
  • Powerful multiprocessor computers
  • Data mining algorithms

5. Types of data gathered

a. Business transactions

In business, every transaction is “memorized” for perpetuity. Many of these transactions are time-related and can be inter-business deals such as purchases, exchanges, banking, and stock trades.

b. Scientific data

Our society is amassing colossal amounts of scientific data, and that data needs to be analyzed. Unfortunately, we can capture and store new data faster than we can analyze the old data already accumulated.

c. Medical and personal data

From government records to customer files and personal needs, large amounts of information about individuals and groups are gathered.

When correlated with other data, this information can shed light on customer behavior.

d. Surveillance video and pictures

With the collapse of video camera prices, video cameras are becoming ubiquitous. Videotapes from surveillance cameras are usually recycled. However, it has become a trend to store the tapes and even digitize them for future use and analysis.

e. Games

Our society collects a huge amount of data and statistics about games, players, and athletes. Commentators and journalists use this information for reporting.

f. Digital media

There are many reasons for the explosion in digital media repositories, such as cheap scanners, desktop video cameras, and digital cameras. Associations such as the NHL and the NBA have already started converting their huge game collections into digital form.

g. CAD and Software engineering data

There are many CAD systems that architects use to design buildings. These systems generate a huge amount of data.

Moreover, software engineering (S.E.) is a source of considerable similar data, with code and objects that need powerful tools for management and maintenance.

h. Virtual Worlds

Nowadays many applications use three-dimensional virtual spaces. These spaces and the objects they contain are described with special languages such as VRML. Ideally, virtual spaces are defined so that they can share objects and places. A remarkable number of virtual reality objects are already available.

i. Text reports and memos (e-mail messages)

In many companies, communication is based on reports and memos in textual form, often exchanged by e-mail. These messages are stored in digital form for future use and reference, creating formidable digital libraries.

6. Uses of Data Mining

a. Automated prediction of trends and behaviors

Data mining automates the process of finding predictive information in large databases. Questions that once required extensive hands-on analysis can now be answered directly from the data. Targeted marketing is a typical example of a predictive problem: data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.

b. Automated discovery of previously unknown patterns

Data mining tools sweep through databases and identify previously hidden patterns in one step. A very good example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.
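To make the retail example concrete, here is a minimal sketch of how products that are often purchased together could be found by counting item pairs. The function name `frequent_pairs`, the `min_support` parameter, and the sample baskets are all invented for illustration; real tools use more scalable algorithms.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return item pairs that appear together in at least min_support baskets."""
    pair_counts = Counter()
    for basket in baskets:
        # Sort the unique items so each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}

# Hypothetical sales data, invented for this example.
baskets = [
    ["bread", "milk", "diapers"],
    ["beer", "diapers", "bread"],
    ["beer", "diapers", "milk"],
    ["bread", "milk"],
]
print(frequent_pairs(baskets, min_support=2))
```

Raising `min_support` keeps only the pairs that co-occur more often, which is the usual way to filter out coincidental combinations.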

7. Data Mining Techniques

a. Artificial neural networks

Artificial neural networks are non-linear predictive models that learn through training and resemble biological neural networks in structure.

b. Decision trees

Decision trees are tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID).
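A full decision tree is built by repeatedly choosing the best split. As a rough sketch of that single step, the code below picks the threshold on one numeric feature that gives the lowest weighted Gini impurity, which is the criterion CART commonly uses. The helper names `gini` and `best_split` and the sample data are assumptions made for this example.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Find the threshold on one numeric feature with the lowest weighted Gini."""
    best = (None, float("inf"))
    for t in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        if not left or not right:
            continue  # a split must put records on both sides
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical data: customer age and whether they responded to a mailing.
ages = [22, 25, 31, 38, 45, 52]
responded = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_split(ages, responded)
print(f"best split: age <= {threshold} (weighted Gini {impurity:.2f})")
```

Each such split corresponds to one decision in the tree, and the path from the root to a leaf reads as a classification rule.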

c. Genetic algorithms

Genetic algorithms are optimization techniques that use processes such as genetic combination, mutation, and natural selection, in a design based on the concepts of evolution.
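The three ingredients named above can be sketched in a few lines. This is only a toy illustration, assuming candidate solutions are encoded as bit lists and that the caller supplies a `fitness` function; the function name `evolve` and all parameter values are invented for this example.

```python
import random

def evolve(fitness, n_bits=20, pop_size=30, generations=40,
           mutation_rate=0.02, seed=0):
    """Toy genetic algorithm over bit lists; returns the fittest individual."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # Natural selection: keep the fitter half of the population as parents.
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            # Genetic combination: one-point crossover of two parents.
            cut = rng.randrange(1, n_bits)
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

# Example objective (an assumption): maximize the number of 1 bits.
best = evolve(fitness=sum)
print(best, sum(best))
```

Because the parents survive into the next generation, the best fitness found never decreases from one generation to the next.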

d. Nearest neighbor method

A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
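As a minimal sketch of the idea, the function below classifies a record by majority vote among its k closest historical records. It assumes numeric features and Euclidean distance; the name `knn_classify` and the sample data are made up for illustration.

```python
from collections import Counter
import math

def knn_classify(record, history, k=3):
    """Classify `record` by majority vote among its k nearest historical records.

    `history` is a list of (features, label) pairs.
    """
    neighbors = sorted(history, key=lambda item: math.dist(record, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical historical dataset with two numeric features per record.
history = [
    ((1.0, 1.0), "low"), ((1.5, 2.0), "low"), ((2.0, 1.5), "low"),
    ((6.0, 6.5), "high"), ((7.0, 6.0), "high"), ((6.5, 7.5), "high"),
]
print(knn_classify((1.8, 1.2), history, k=3))
print(knn_classify((6.2, 6.8), history, k=3))
```

In practice the features are usually scaled first, since a feature with a large numeric range would otherwise dominate the distance.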

e. Rule induction

The extraction of useful if-then rules from data based on statistical significance.
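The sketch below shows one simple way such if-then rules could be extracted. Note the assumption: instead of a formal significance test, it uses support and confidence thresholds as a simplified stand-in. The function name `induce_rules`, its parameters, and the sample records are hypothetical.

```python
from collections import Counter

def induce_rules(records, target, min_support=2, min_confidence=0.8):
    """Return (attribute, value, predicted_class, confidence) rules."""
    condition_counts = Counter()  # how often attribute == value occurs
    joint_counts = Counter()      # how often it occurs together with a class
    for row in records:
        for attr, value in row.items():
            if attr == target:
                continue
            condition_counts[(attr, value)] += 1
            joint_counts[(attr, value, row[target])] += 1
    rules = []
    for (attr, value, cls), n in joint_counts.items():
        confidence = n / condition_counts[(attr, value)]
        # Thresholds stand in for a statistical significance test here.
        if n >= min_support and confidence >= min_confidence:
            rules.append((attr, value, cls, confidence))
    return rules

# Hypothetical records, invented for this example.
records = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "no", "play": "yes"},
]
for attr, value, cls, conf in induce_rules(records, target="play"):
    print(f"IF {attr} = {value} THEN play = {cls} (confidence {conf:.2f})")
```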

8. Data Mining Terminologies

a. Notation

Input X: X is often multidimensional.

Each dimension of X is denoted by Xj and is referred to as a feature, predictor, or independent variable.

Output Y: called the response or dependent variable.

A response is available only when learning is supervised.

b. Nature of Data Sets

a. Quantitative: Measurements or counts, recorded as numerical values, e.g. height, temperature, number of red M&M’s in a bag

b. Qualitative: Groups or categories, which are either ordinal or nominal

c. Ordinal: Possesses a natural ordering, e.g. shirt sizes (S, M, L, XL)

d. Nominal: Just the names of the categories, e.g. marital status, gender, color of M&M’s in a bag
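The difference between ordinal and nominal data matters when values are turned into numbers for a mining algorithm. Here is a small sketch, assuming a rank encoding for ordinal values and a one-hot encoding for nominal ones; the helper names are invented for illustration.

```python
SHIRT_ORDER = ["S", "M", "L", "XL"]  # ordinal: the order carries meaning

def encode_ordinal(value, order=SHIRT_ORDER):
    """Map an ordinal category to its rank so comparisons stay meaningful."""
    return order.index(value)

def encode_nominal(value, categories):
    """One-hot encode a nominal category, which has no natural order."""
    return [1 if value == c else 0 for c in categories]

print(encode_ordinal("L"))
print(encode_nominal("red", ["brown", "red", "yellow"]))
```

Encoding a nominal value such as a color as a single rank would suggest an ordering that does not exist, which is why the one-hot form is used for it here.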

9. Why Data Mining

Data mining has wide-ranging applications, which makes it a young and promising field for the present generation. It has attracted a great deal of attention in the information industry and in society, due to the wide availability of huge amounts of data and the imminent need to turn such data into useful information and knowledge. This information and knowledge can then be used for applications such as market analysis. This is why data mining is also called knowledge discovery from data.

10. Data Mining Architecture

To apply advanced techniques to best effect, they must be fully integrated with a data warehouse and with business analysis tools. Data mining tools that operate outside the warehouse need extra steps for extracting and importing the data.

Furthermore, when new insights need operational implementation, integration with the warehouse simplifies their application. The resulting analytic data warehouse can be applied to improve business processes, particularly in areas such as promotional campaign management.

The figure below illustrates an architecture for advanced analysis in a large data warehouse.

The ideal starting point is a data warehouse containing a combination of internal data tracking all customer contact, coupled with external market data about competitor activity. Background information on potential customers also provides an excellent basis for prospecting. This warehouse can be implemented in a variety of relational database systems, such as Sybase, Oracle, Redbrick, and so on, and should be optimized for flexible and fast data access.

An OLAP (On-Line Analytical Processing) server enables a more sophisticated end-user business model to be applied when navigating the data warehouse. Its multidimensional structures allow users to analyze the data as they want to view their business, for example by summarizing by product line or region.

Further, the Data Mining Server must be integrated with the data warehouse and the OLAP server to embed ROI-focused business analysis directly into this infrastructure. Integration with the data warehouse also enables operational decisions to be implemented and tracked.

As the warehouse grows with new decisions and results, the organization can mine the best practices and apply them to future decisions.

These results enhance the metadata in the OLAP server by providing a dynamic metadata layer that represents a distilled view of the data. Reporting, visualization, and other tools can then be applied to plan future actions and confirm the impact of those plans.

11. Data Mining Process

Data Mining is also popularly known as Knowledge Discovery in Databases (KDD): the nontrivial extraction of implicit information from data in databases.

This process comprises a few steps that lead from raw data collections to some form of new knowledge. The iterative process consists of the following steps:

a. Data cleaning

This is also called data cleansing. In this phase, noisy and irrelevant data are removed from the collection.

b. Data integration

In this phase, data from multiple sources is combined in one place.

c. Data selection

In this phase, the data relevant to the analysis is decided on and retrieved from the data collection.

d. Data transformation

This is also known as data consolidation. In this phase, the selected data is transformed into forms that are appropriate for the mining procedure.

e. Data mining

In this step, clever techniques are applied to extract potentially useful patterns.

f. Pattern evaluation

In this step, interesting patterns representing knowledge are identified based on given measures.

g. Knowledge representation

This is the final phase, in which the discovered knowledge is presented to the user. This essential step uses visualization techniques that help users understand and interpret the data mining results.
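The steps above can be sketched as a chain of small functions. This is only a toy walk-through under simple assumptions: the two "stores", their records, and every function name are invented, and each step is far simpler than in a real system.

```python
from collections import Counter

# Hypothetical raw data from two sources; None marks a noisy/missing value.
store_a = [{"customer": "c1", "item": "milk", "amount": 3.0},
           {"customer": "c2", "item": "bread", "amount": None},
           {"customer": "c2", "item": "eggs", "amount": 1.8}]
store_b = [{"customer": "c1", "item": "bread", "amount": 2.5},
           {"customer": "c3", "item": "milk", "amount": 3.2},
           {"customer": "c3", "item": "bread", "amount": 2.4}]

def clean(rows):                  # a. data cleaning: drop noisy records
    return [r for r in rows if r["amount"] is not None]

def integrate(*sources):          # b. data integration: combine the sources
    return [row for source in sources for row in source]

def select(rows):                 # c. data selection: keep only relevant fields
    return [(r["customer"], r["item"]) for r in rows]

def transform(pairs):             # d. data transformation: one basket per customer
    baskets = {}
    for customer, item in pairs:
        baskets.setdefault(customer, set()).add(item)
    return baskets

def mine(baskets):                # e. data mining: count customers per item
    return Counter(item for items in baskets.values() for item in items)

def evaluate(counts, min_count):  # f. pattern evaluation against a given measure
    return {item: n for item, n in counts.items() if n >= min_count}

def present(patterns):            # g. knowledge representation
    return [f"{item}: bought by {n} customers" for item, n in sorted(patterns.items())]

rows = integrate(clean(store_a), clean(store_b))
report = present(evaluate(mine(transform(select(rows))), min_count=2))
print("\n".join(report))
```

Since the process is iterative, in practice the output of the evaluation step often sends us back to an earlier step, for example to select different data or change the transformation.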

12. Categories of Data Mining Systems

There are many data mining systems available. Some are specialized systems dedicated to a given data source. Data mining systems can be categorized according to various criteria.

a. Classification according to the type of data source mined

This classification categorizes data mining systems according to the type of data handled, such as spatial data, multimedia data, time-series data, text data, World Wide Web, etc.

b. Classification according to the data model drawn on

This classification is based on the data model involved, such as relational database, object-oriented database, data warehouse, transactional, etc.

c. Classification according to the kind of knowledge discovered

This classification is based on the kind of knowledge discovered, such as characterization, discrimination, association, classification, clustering, etc.

d. Classification according to mining techniques used

Data mining systems employ and provide different techniques. This classification is based on the data analysis approach used, such as machine learning, neural networks, genetic algorithms, etc.

13. Issues in Data Mining

a. Mining methodology issues

These issues pertain to the data mining approaches applied and their limitations. Topics such as the versatility of the mining approaches can dictate mining methodology choices.

b. Performance issues

Many artificial intelligence and statistical methods exist for data analysis. However, these methods were often not designed for the very large datasets that data mining deals with today, where terabyte sizes are common.

This raises the issues of scalability and efficiency of data mining methods when processing considerably large data. Linear algorithms are usually the norm. In the same vein, sampling can be used for mining instead of the whole dataset.

However, issues like completeness and choice of samples may arise. Other topics in the issue of performance are incremental updating and parallel programming. Parallelism can help solve the size problem if the dataset can be subdivided and the results can be merged later.

Incremental updating is important for merging results from parallel mining, or for updating data mining results when new data becomes available, without having to re-analyze the complete dataset.
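Both ideas can be illustrated with a result that is easy to merge: item counts. In this sketch the partitions are mined one after another, but each could be handled by a separate worker. The names `mine_partition` and `merge` and the sample data are assumptions made for this example.

```python
from collections import Counter

def mine_partition(transactions):
    """Count in how many baskets of one partition each item occurs."""
    counts = Counter()
    for basket in transactions:
        counts.update(set(basket))
    return counts

def merge(*partial_counts):
    """Merge partial results from partitions that were mined separately."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

# Hypothetical dataset, subdivided into two partitions.
partition_1 = [["milk", "bread"], ["milk"]]
partition_2 = [["bread", "eggs"], ["milk", "eggs"]]
model = merge(mine_partition(partition_1), mine_partition(partition_2))

# Incremental update: fold in new data without re-reading the old partitions.
new_data = [["bread"], ["milk", "bread"]]
model = merge(model, mine_partition(new_data))
print(model)
```

This works so simply because counts are additive. Many mining results, such as a trained tree or a set of clusters, cannot be merged this directly, which is part of why incremental and parallel mining are listed as open issues.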

c. Data source issues

There are many issues related to the data sources. Some are practical, such as the diversity of data types, while others are philosophical, like the data glut problem.

We certainly have an excess of data, since we already have more data than we can handle and we are still collecting data at an even higher rate. The spread of database management systems has helped increase the gathering of information, and the advent of data mining is certainly encouraging more data harvesting. The current practice is to collect as much data as possible now and process it, or try to process it, later.

Regarding the practical issues related to data sources, there is the subject of databases holding diverse, complex data types. We store different types of data in a variety of repositories. It is difficult to expect a data mining system to achieve good mining results on all kinds of data and sources, since different kinds of data and sources may require distinct algorithms and methodologies. Currently, there is a focus on relational databases and data warehouses.

A versatile data mining tool for all sorts of data may not be realistic. Moreover, the diversity of data sources, at structural and semantic levels, poses important challenges not only to the database community but also to the data mining community.

14. Applications of Data Mining

  • Weather forecasting.
  • E-commerce.
  • Self-driving cars.
  • Hazards of new medicine.
  • Space research.
  • Fraud detection.
  • Stock trade analysis.
  • Business forecasting.
  • Social networks.
  • Customers likelihood.

More applications include:

  • A credit card company can leverage its vast warehouse of customer transaction data to identify the customers most likely to be interested in a new credit product. Using a small test mailing, the attributes of customers with an affinity for the product can be identified. Recent projects have indicated more than a 20-fold decrease in costs for targeted mailing campaigns over conventional approaches.
  • A diversified transportation company can apply data mining to identify the best prospects for its services. Applying this segmentation to a general business database, such as those provided by Dun & Bradstreet, can yield a prioritized list of prospects by region.
  • A large consumer packaged goods company can apply data mining to improve its sales process to retailers. Data from consumer panels and competitor activity can be applied to understand the reasons for brand and store switching. Through this analysis, the manufacturer can select promotional strategies that best reach its target customer segments.

15. Areas where Data Mining had Good and Bad Effects

a. Good Effects

  • Predict future trends, customer purchase habits
  • Help with decision making
  • Improve company revenue and lower costs
  • Market basket analysis
  • Fraud detection

b. Bad Effects

  • User privacy/security
  • Amount of data is overwhelming
  • Great cost at implementation stage
  • Possible misuse of information
  • Possible inaccuracy of data

16. Data Mining Advantages and Disadvantages

Data mining advantages

  • Banks and financial institutions use data mining to find probable defaulters. This is done based on past transactions, user behavior, and data patterns.
  • It helps advertisers push the right advertisements to internet surfers on web pages, based on machine learning algorithms. This way, data mining benefits both prospective buyers and sellers of various products.
  • Retail malls and grocery stores use data mining to arrange and keep the most sellable items in the most prominent positions. This is made possible by inputs obtained from data mining software, and it helps increase revenue.
  • Data mining methods are cost-effective compared to other applications.
  • Data mining is used in many areas, such as bioinformatics, medicine, genetics, etc.
  • Law enforcement agencies use data mining to identify criminal suspects.

Data Mining disadvantages

  • Security: Users spend a lot of time online for various purposes, but the systems that hold their data do not always have security in place to protect them.
  • Some data mining analytics software is difficult to operate, so users need knowledge-based training.
  • The techniques of data mining are not 100% accurate. Hence, it may cause serious consequences in certain conditions.

17. Conclusion

To sum up, we have studied an introduction to Data Mining and its main concepts, along with its pros and cons and its applications. If you have any questions, feel free to ask in the comment section.


Harshali Patel

Big Data Trainer at Dataflair Web Services Pvt. Ltd., blogger at https://data-flair.training/blogs/ and a technology freak. Knowledge sharing is my passion.