BIG DATA — ANALYTICS (abstract / study)

Translation (Portuguese to English) of


Above, a sample layout drawing of a BIG DATA type computer network. The design was made with Packet Tracer, the network simulation software used by the Cisco Networking Academy of Cisco Systems. The layout is by Bunyamin Önel; in the words of Anastasia at #Cisco: “Bunyamin Önel shared this Cisco Packet Tracer scenario… (If you have #PacketTracer scenarios to share, send them over to with some info and we will share them on this page)”

Below is my summary of the BIG DATA — ANALYTICS study, removing a degree of technical complication, to simplify my understanding of this subject.


Statistical studies involve planning, data collection, organization of information, analysis of the information collected, interpretation, and disclosure in a clear and objective manner. Research methods can be classified in two ways: opinion surveys or market research. In opinion surveys, the main goal is to gather information about a given subject through personal interviews. Market research is conducted through the market analysis of a particular product.

The collection, organization and description of data, and the calculation and interpretation of coefficients, belong to Descriptive Statistics. The analysis and interpretation of data, associated with a margin of uncertainty, is the responsibility of Inductive or Inferential Statistics, whose methods are based on probability theory and which is also described as the measurement of uncertainty.

The use of tables and graphs is frequent in Statistics. Tables serve to organize and tabulate the data, and graphs convey the information with clarity and transparency, contributing to an objective reading.


DATA ANALYSIS — What is it?

It is the process by which order, structure and meaning are given to the data (information).

It consists of transforming the collected data into useful, valid conclusions and/or lessons.

Starting from the pre-established topics, the data are processed to look for trends, differences and variations in the information obtained.

The processes, techniques and tools used are based on certain assumptions and as such have limitations.

The process is used to describe and summarize the data, identify the relationships and differences between variables, compare variables and make predictions.


Qualitative analysis:

It asks why a certain fact or problem is occurring;
It studies the motivations;
It is inductive;
It helps to define hypotheses;
It is exploratory;
It reveals trends, behaviors, attitudes, etc.;
It provides detailed information on trends, activities, etc.;
It provides detailed answers to questions or problems about a project or one of its activities;
It does not allow the results to be inferred for a whole population.

Describe the population sample using frequency tables by sex, age group and occupation.

Organize comments and responses into similar categories (for example: concerns, suggestions, strengths, weaknesses, etc.).

Identify patterns, trends and relationships, as well as cause-effect associations.



Narratives of participants’ responses
Cause-Effect Diagrams
Diagrams relating the various categories and the meaning participants give to them


Identify the characteristics common to the work group, and the differences in relation to the other groups.
Infer the processes of work socialization among the co-workers, and whether such processes are associated in some way with how they are perceived at work today.

Identify common experiences arising from their joining the cooperative, and the impact of those experiences on personal, family and social life.


Quantitative analysis:

It studies the actions or interventions;
It is deductive;
It provides data to prove hypotheses;
It is conclusive;
It measures the level of interventions, trends, activities, etc.;
It produces quantifiable information about the magnitude of a problem but does not provide information on why the problem is occurring;
It allows the results to be inferred for a whole population.

Statistical methods are used to represent the data (information)

Descriptive statistics involves: collecting data, presenting data and characterizing data, for the purpose of describing the data.

Inferential statistics involves: estimation and hypothesis testing, with the purpose of making decisions about the characteristics of a population from the sample.


Organize the data in a logical order, placing all the elements of the sample and the variables under study in a table.
Group and summarize the data using frequency tables, e.g. for age, the relative frequency of class i is fi = ni / n (ni observations in the class out of n observations in total)

Summarize the main statistics of each variable (mean, median, mode, standard deviation and variance)
Measures of central tendency
Measures of dispersion
Analyze and interpret data
Do a correlation analysis
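
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample data, the column meanings and the helper names (frequency_table, summarize, pearson_correlation) are hypothetical and not part of the original study. It uses the sample (n - 1) versions of variance and standard deviation from the standard statistics module.

```python
from collections import Counter
from math import sqrt
from statistics import mean, median, mode, stdev, variance

# Hypothetical sample: (age group, hours of training, score) per respondent.
sample = [
    ("18-29", 2, 55),
    ("18-29", 4, 62),
    ("30-44", 6, 70),
    ("30-44", 8, 81),
    ("30-44", 8, 79),
    ("45-59", 10, 90),
]


def frequency_table(values):
    """Return {value: (absolute frequency ni, relative frequency fi = ni / n)}."""
    n = len(values)
    counts = Counter(values)
    return {value: (ni, ni / n) for value, ni in counts.items()}


def summarize(values):
    """Measures of central tendency and dispersion for one numeric variable."""
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": mode(values),
        "variance": variance(values),  # sample variance (divides by n - 1)
        "std_dev": stdev(values),      # sample standard deviation
    }


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two numeric variables."""
    mx, my = mean(xs), mean(ys)
    covariance_sum = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mx) ** 2 for x in xs))
    spread_y = sqrt(sum((y - my) ** 2 for y in ys))
    return covariance_sum / (spread_x * spread_y)


age_groups = [row[0] for row in sample]
hours = [row[1] for row in sample]
scores = [row[2] for row in sample]

age_table = frequency_table(age_groups)
hours_summary = summarize(hours)
r = pearson_correlation(hours, scores)

print(age_table)
print(hours_summary)
print(round(r, 3))
```

A correlation coefficient close to 1 or -1 suggests a strong linear relationship between the two variables; by itself it does not show that one causes the other.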


Generally, data analysis and interpretation involves making comparisons of statistical quantities of the variables of interest.

The conclusions of these comparisons are based on the rejection or acceptance of hypotheses formulated from the evaluation questions.

The acceptance or rejection of hypotheses is based on the results obtained in the so-called statistical tests.
The most commonly used tests are Student's t-test, the chi-square test and ANOVA.
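
As an illustration of how a hypothesis is accepted or rejected, here is a minimal sketch of a two-sample Student's t-test with pooled variance. The two groups and their scores are invented for the example, the function name pooled_t_statistic is hypothetical, and the test assumes roughly normal data with similar variances in both groups. The critical value is the standard two-tailed table value for a 5% significance level and 10 degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """Student's two-sample t statistic with pooled variance; returns (t, df)."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    standard_error = sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(group_a) - mean(group_b)) / standard_error, df


# Hypothetical scores for two groups of six people each.
group_a = [78, 82, 85, 88, 90, 93]
group_b = [70, 72, 75, 77, 80, 82]

t, df = pooled_t_statistic(group_a, group_b)

# Two-tailed critical value of the t distribution for alpha = 0.05 and df = 10,
# taken from a standard t table (valid only for this df).
T_CRITICAL_DF10 = 2.228

# Null hypothesis H0: both groups have the same mean.
reject_null = abs(t) > T_CRITICAL_DF10
print(f"t = {t:.3f}, df = {df}, reject H0: {reject_null}")
```

If the absolute value of t exceeds the critical value, the hypothesis of equal means is rejected at the chosen significance level; otherwise the data do not provide enough evidence to reject it.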

Baseline vs. Results Achieved: compares the situation before and after the implementation of the program.

Target Group versus Control Group: compares attitudes or practices between participants and non-participants in a program.

Treatment group: a group of intervention participants whose outcome measures are compared with those of a control group.

Control group: a group of “untreated” targets whose outcomes are compared with those of the experimental groups.
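
When the outcome being compared is a category (for example, whether a practice was adopted or not), the comparison between participants and non-participants can be sketched with a chi-square test on a contingency table. The counts below are invented, the function name chi_square_statistic is hypothetical, and no continuity correction is applied; the critical value is the standard table value for a 5% significance level and 1 degree of freedom.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            # Count expected in this cell if group and outcome were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed_count - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Hypothetical counts: columns are "adopted the practice" / "did not adopt it".
observed = [
    [30, 20],  # participants (target group)
    [15, 35],  # non-participants (control group)
]

chi2, df = chi_square_statistic(observed)

# Critical value of the chi-square distribution for alpha = 0.05 and df = 1.
CHI2_CRITICAL_DF1 = 3.841

groups_differ = chi2 > CHI2_CRITICAL_DF1
print(f"chi2 = {chi2:.3f}, df = {df}, groups differ: {groups_differ}")
```

A statistic above the critical value indicates that the proportion adopting the practice differs between the two groups by more than chance alone would explain at that significance level.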

TOOLS FOR DATA ANALYSIS (information system)

Currently there are several technology companies that provide software for data analysis and processing.

Big Data Frameworks and Platforms

The big data ecosystem includes many libraries and frameworks that interoperate with each other. Libraries usually provide solutions to specific problems; for instance, applying neural-network methods on your data. Frameworks integrate various libraries to provide even more functionality. Here are a few examples:

  • Frameworks: Hadoop Ecosystem, Apache Spark, Apache Storm, Apache Pig, Facebook Presto
  • Patterns: MapReduce, Actor Model, Data Pipeline
  • Platforms: Cloudera, Pivotal, Amazon Redshift, Hortonworks, IBM, Google Compute Engine
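
To make the MapReduce pattern listed above more concrete, here is a small single-process sketch of a word count. It only illustrates the three phases (map, shuffle, reduce); the function names and the sample documents are hypothetical, and a real framework such as Hadoop would distribute these phases across many machines rather than run them in one Python process.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in grouped.items()}


# Hypothetical input: each string stands for one document.
documents = ["big data analytics", "big data frameworks", "data pipeline"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can run in parallel, which is what makes the pattern suitable for large datasets.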

The Hadoop ecosystem is complemented and surrounded by many different tools. Some of them are covered in our series of courses, such as:

  • Apache Mahout: a scalable machine learning and data mining library
  • Apache Pig: a high-level data-flow language and execution framework for parallel computation
  • Apache Spark: a fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including Extract, Transform and Load (ETL), machine learning, stream processing, and graph computation.

Where to go for information

The company website is a good place to start when you need detailed information about a particular tool. The websites we’ve listed below provide user documentation and other sources of support. We’ve also compiled a list of recommended books for you. These are books you’re likely to find on a big data ninja’s bookshelf that you might like to borrow from your library.



  1. Gregory Piatetsky-Shapiro (Analytics, Data Mining, Data Science Expert, KDnuggets President), with calculation formula examples, in “Which Big Data, Data Mining, and Data Science Tools go together?”
  2. Kirk Borne (Principal Data Scientist at Booz Allen Hamilton) in “With Prescriptive Analytics, the future ain’t what it used to be”
  3. Kirk Borne @KirkDBorne (Principal Data Scientist at Booz Allen Hamilton) in “50+ Free #DataScience Books: #abdsc #BigData #MachineLearning #DataMining”
  4. Bernard Marr @BernardMarr (Advanced Performance Institute) in “Big Data Possibilities: What is Big Data?”
  5. Bernard Marr @BernardMarr (Advanced Performance Institute) in “Supervised V Unsupervised Machine Learning: What’s The Difference?” via @forbes
  6. Ronald van Loon @Ronald_vanLoon (Top10 Influencer #BigData #DataScience #IoT #Analytics) in “My Brief Guide to Big Data and Predictive Analytics for non-experts | #BigData #PredictiveA”
  7. Ronald van Loon @Ronald_vanLoon (Top10 Influencer #BigData #DataScience #IoT #Analytics) in “12 Statistical and Machine Learning Methods that Every Data Scientist Should Know | #MachineLearning #DataScientis …”
  8. Evan Sinar, PhD @EvanSinar (Chief Scientist & VP at Development Dimensions International (@DDIworld); Author & Top Influencer on #leadership | #dataviz | HR #analytics | #bigdata | #iot) in “Awesome tutorial from @rnlanders on Natural Language Processing, check it out! #nlp #analytics #datascience”
  9. Evan Sinar, PhD @EvanSinar (Chief Scientist & VP at Development Dimensions International (@DDIworld); Author & Top Influencer on #leadership | #dataviz | HR #analytics | #bigdata | #iot) in “The seven deadly sins of statistical misinterpretation, and how to avoid them @DataScienceCtrl #datascience”

Recommended Data Science bibliography = Gregory Piatetsky-Shapiro (Analytics, Data Mining, Data Science Expert, KDnuggets President) in “More Free Data Mining, Data Science Books and Resources”

The list below is based on the list compiled by Pedro Martins, but we added the book authors and year, sorted alphabetically by title, fixed spelling, and removed the links that did not work.

  1. An Introduction to Data Science by Jeffrey Stanton, Robert De Graaf, 2013.
    An introductory level resource developed by Syracuse University
  2. An Introduction to Statistical Learning: with Applications in R by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, 2013.
    Overview of statistical learning based on large datasets of information. The exploratory data techniques are discussed using the R programming language.
  3. A Programmer’s Guide to Data Mining by Ron Zacharski, 2012.
    A guide through data mining concepts from a programming point of view. It provides several hands-on problems to practice and test the subjects taught in this online book.
  4. Bayesian Reasoning and Machine Learning by David Barber, 2012.
    Focuses on applying Bayesian reasoning to machine learning algorithms and processes. It is a hands-on resource, great for absorbing all the knowledge in the book.
  5. Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners by Jared Dean, 2014.
    This resource explores the reality of big data, and its benefits, from the marketing point of view. It also explains how to store this kind of data and the algorithms used to process it, based on data mining and machine learning.
  6. Data Mining and Analysis: Fundamental Concepts and Algorithms by Mohammed J. Zaki, Wagner Meira, Jr., Cambridge University Press, May 2014.
    Great coverage of exploratory data mining algorithms and machine learning processes. These explanations are complemented by some statistical analysis.
  7. Data Mining and Business Analytics with R by Johannes Ledolter, 2013.
    Another R-based book describing all the processes and implementations used to explore, transform and store information. It also focuses on the concept of Business Analytics.
  8. Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management by Michael J.A. Berry, Gordon S. Linoff, 2004.
    A data mining book oriented specifically to marketing and business management, with great case studies that show how to apply these techniques in the real world.
  9. Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery by Graham Williams, 2011.
    The objective of this book is to provide you with plenty of information on data manipulation. It focuses on the Rattle toolkit and the R language to demonstrate the implementation of these techniques.
  10. Gaussian Processes for Machine Learning by Carl Edward Rasmussen and Christopher K. I. Williams, 2006.
    This is a theoretical book approaching learning algorithms based on probabilistic Gaussian processes. It’s about supervised learning problems, describing models and solutions related to machine learning.

Read the full post on KDnuggets:

Gregory Piatetsky-Shapiro (Analytics, Data Mining, Data Science Expert, KDnuggets President)


Recommended Data Science bibliography = Kirk Borne @KirkDBorne: Download 50+ Free #DataScience Books: #abdsc #BigData #Analytics

Very interesting compilation published here, with a strong machine learning flavor (maybe machine learning book authors — usually academics — are more prone to making their books available for free). Many are O’Reilly books freely available. Here we display those most relevant to data science. I haven’t checked all the sources, but they seem legit. If you find some issue, let us know in the comment section below. Note that at DSC, we also have our free books:

There are several sections in the listing in question:

  1. Data Science Overviews (4 books)
  2. Data Scientists Interviews (2 books)
  3. How To Build Data Science Teams (3 books)
  4. Data Analysis (1 book)
  5. Distributed Computing Tools (2 books)
  6. Data Mining and Machine Learning (29 books)
  7. Statistics and Statistical Learning (5 books)
  8. Data Visualization (2 books)
  9. Big Data (3 books)

Here we mention #1, #5 and #6:

Data Science Overviews

Distributed Computing Tools

Data Mining and Machine Learning


The information management big data and analytics capabilities include:

  • Data Management & Warehouse: Gain industry-leading database performance across multiple workloads while lowering administration, storage, development and server costs; Realize extreme speed with capabilities optimized for analytics workloads such as deep analytics, and benefit from workload-optimized systems that can be up and running in hours.
  • Hadoop System: Bring the power of Apache Hadoop to the enterprise with application accelerators, analytics, visualization, development tools, performance and security features.
  • Stream Computing: Efficiently deliver real-time analytic processing on constantly changing data in motion and enable descriptive and predictive analytics to support real-time decisions. Capture and analyze all data, all the time, just in time. With stream computing, store less, analyze more and make better decisions faster.
  • Content Management: Enable comprehensive content lifecycle and document management with cost-effective control of existing and new types of content with scale, security and stability.
  • Information Integration & Governance: Build confidence in big data with the ability to integrate, understand, manage and govern data appropriately across its lifecycle.

Source: IBM.COM =


BIG DATA — Database Definition

BIG DATA — SQL (annual maintenance)

BIG DATA — Data Science

By: ANA MERCEDES GAUNA (13/10/2015)

Senior System Analyst | Webmaster | MCSE | MCDBA | CCNA2

Rio de Janeiro/RJ, Brazil. 27 years of professional experience (CLT)