SPPU DSBDA Unit 3: Big Data Analytics Life Cycle

Shantanu_khonde
6 min read · Apr 24, 2023


3.1 Introduction to Big Data.

Big Data can be defined as very large volumes of data (volume), available from a variety of sources (variety), in varying degrees of complexity, and generated at different speeds (velocity).

Big data describes a collection of data that is huge in size and growing exponentially with time. Big data processing begins with raw, unorganized data. Hadoop is an open-source implementation of MapReduce that is widely used for big data processing.
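As a rough illustration of the MapReduce idea (a pure-Python sketch, not Hadoop's actual API), here is a word count expressed as a map step that emits (word, 1) pairs and a reduce step that sums them:

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input document."""
    for word in document.lower().split():
        yield word, 1

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

documents = ["big data is big", "hadoop processes big data"]
pairs = (pair for doc in documents for pair in map_phase(doc))
print(reduce_phase(pairs))
# {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}
```

In Hadoop, the same map and reduce logic would run in parallel across many machines, which is what makes the volume requirement below manageable.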

3.1.1 Big Data Requirements.

Big data requirements are classified based on five main characteristics:

  1. Volume [data processing needs to be parallel across multiple systems.]
  2. Velocity [data processing should happen at streaming speed.]
  3. Variety [data arrives in different formats, types, structures, and regions.]
  4. Ambiguity [data is ambiguous by nature, i.e., it can have multiple meanings; for example, M and F may mean Male and Female or Monday and Friday. See the sketch after this list.]
  5. Complexity [big data's complexity requires many algorithms to process data quickly and efficiently.]
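To make the ambiguity point concrete, here is a tiny hypothetical Python sketch: the same raw code must be resolved differently depending on the field (context) it appears in.

```python
# Hypothetical code tables: the raw value "F" means different things
# depending on which column it comes from.
GENDER_CODES = {"M": "Male", "F": "Female"}
DAY_CODES = {"M": "Monday", "F": "Friday"}

def resolve(value, column):
    """Resolve an ambiguous code using the column (context) it belongs to."""
    table = GENDER_CODES if column == "gender" else DAY_CODES
    return table.get(value, "Unknown")

print(resolve("F", "gender"))        # Female
print(resolve("F", "delivery_day"))  # Friday
```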

3.1.2 Benefits & Challenges.

Benefits:

  1. Improved services.
  2. Improved business decisions.
  3. Reduced costs.
  4. Risk identification.
  5. Better efficiency.

Challenges:

  1. Existing data management solutions have to cope with the three V's (volume, velocity, variety).
  2. There are not enough skilled data professionals.

3.1.3 Current Data Analytical Architecture.

Analytics architecture refers to the systems, protocols, and technology used to collect, store, and analyze data.

fig: Data Analytical Architecture.

When building analytics architecture, organizations need to consider both the hardware (how data will be physically stored) and the software that will be used to manage and process it.

Analytics architecture focuses on multiple layers, starting with data warehouse architecture, which defines how users in an organization can access and interact with data. Storage is a key aspect of creating a reliable analytics process, as it establishes how your data is organized, who can access it, and how quickly it can be referenced.

For data sources to be loaded into the data warehouse, the data must be well understood, normalized with suitable data type definitions, and in a structured format.
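As a rough illustration of this step, here is a minimal Python sketch using pandas and SQLite (standing in for a real EDW); the file name and column names are purely hypothetical:

```python
import sqlite3
import pandas as pd

# Hypothetical raw extract; the file and column names are illustrative only.
raw = pd.read_csv("orders_extract.csv")

# Enforce explicit data type definitions and basic normalization
# before the data enters the warehouse.
typed = raw.assign(
    order_id=raw["order_id"].astype("int64"),
    order_date=pd.to_datetime(raw["order_date"]),
    amount=raw["amount"].astype("float64"),
    customer_id=raw["customer_id"].astype("string"),
).drop_duplicates(subset="order_id")

# SQLite stands in for the enterprise data warehouse in this sketch.
with sqlite3.connect("edw.db") as conn:
    typed.to_sql("fact_orders", conn, if_exists="append", index=False)
```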

Although this kind of centralization enables security, backup, and failover of highly critical data, it also means that data typically must go through significant preprocessing and checkpoints before it can enter this sort of controlled environment, which does not lend itself to data exploration and iterative analytics.

As a result of this level of control on the EDW, additional local systems may emerge in the form of departmental warehouses and local data marts that business users create to accommodate their need for flexible analysis.

Once in the data warehouse, data is read by additional applications across the enterprise for BI and reporting purposes.

At the end of this workflow, analysts get data provisioned for their downstream analytics.

Many times these tools are limited to in-memory analytics on desktops, analyzing samples of the data rather than the entire population of a dataset.

3.1.4 Big Data Ecosystem.

The big data ecosystem comprises massive functional components together with various enabling tools. Organizations realize that a new economy is emerging around data as a business, and it is growing rapidly.

This has led to the introduction of data vendors and data cleaners that use crowdsourcing to test the outcomes of machine learning techniques.

As the ecosystem grows, certain groups of interest have formed, which are as follows:

  1. Data Devices [data devices and sensors gather information, e.g., a mobile network.]
  2. Data Collectors [entities that collect data, e.g., a retail shop tracking customer footfall.]
  3. Data Aggregators [entities that compile and process the data from the earlier layers and make it understandable.]
  4. Data Users and Buyers [the final layer of the ecosystem, which consumes or purchases the aggregated data.]

3.2 Sources of Big Data

Machine data consists of information generated by industrial equipment, real-time data from sensors, and web logs that track user behavior online.

A physics research center can generate 40 terabytes of data every second during an experiment. Even B2B companies generate a multitude of data on a regular basis.

Some examples of big data sources are social media, stock exchanges, the aviation industry, and survey data.

3.2.1 Data Repository.

A data repository is also known as a data library or data archive. A data repository is a large database infrastructure, i.e., several databases that collect, manage, and store data sets for data analysis, sharing, and reporting.

Some types of data repositories are:

  1. Spreadsheets.
  2. Enterprise Data Warehouses (EDWs).
  3. Analytic Sandboxes.

Some examples of data repositories are data warehouses, data lakes, data marts, metadata repositories, and data cubes.

3.2.2 Analytical Sandbox.

An analytical sandbox is a testing environment that is used by data analysts and data scientists to experiment with data and explore various analytical approaches without affecting the production environment.

It is a separate, isolated environment that contains a copy of the production data, as well as the necessary tools and resources for data analysis and visualization.
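As a minimal sketch of setting up such an isolated copy, the snippet below samples a hypothetical transactions table from a production database into a sandbox database using SQLite; a real sandbox would live on separate infrastructure:

```python
import sqlite3

# Hypothetical database files; in practice these would be separate servers.
prod = sqlite3.connect("production.db")
sandbox = sqlite3.connect("analytics_sandbox.db")

# Copy roughly a 10% sample of a production table into the isolated sandbox,
# so analysts can experiment without touching the production environment.
rows = prod.execute(
    "SELECT * FROM transactions WHERE ABS(RANDOM()) % 10 = 0"
).fetchall()

sandbox.execute(
    "CREATE TABLE IF NOT EXISTS transactions "
    "(txn_id INTEGER, customer_id TEXT, amount REAL, txn_date TEXT)"
)
sandbox.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", rows)
sandbox.commit()
```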

fig: Analytical Sandbox Components.

Analytical Sandbox’s Essential Components Include:

  1. Business Analytics (Enterprise Analytics) — The self-service BI tools for situational analysis and discovery are part of business analytics.
  2. Analytical Sandbox Platform — The capabilities for processing, storing, and networking are provided by the analytical sandbox platform.
  3. Data Access and Delivery — Data collection and integration are made possible by data access and delivery from a number of data sources and data types.
  4. Data Sources — Big data (unstructured) and transactional data (structured) are two types of data sources that can come from both inside and outside of the company. Examples of these sources include extracts, feeds, messages, spreadsheets, and documents.

Importance of the Analytical Sandbox:

  1. Data from various sources, both internal and external, both unstructured and structured, can be combined and filtered using analytical sandboxes.
  2. Data scientists can carry out complex analytics with the help of analytical sandboxes.
  3. Analytical sandboxes allow analysts to start working with the data early in a project.
  4. Analytical sandboxes make it possible to use high-performance computing while processing databases, because the analytics takes place inside the database itself (a minimal sketch follows this list).
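A minimal sketch of this in-database style, assuming the same hypothetical transactions table as above: the aggregation is pushed down into the database so that only the summarized result is returned to the client.

```python
import sqlite3

conn = sqlite3.connect("analytics_sandbox.db")

# Push the aggregation down into the database instead of pulling every raw
# row into the client: only the summarized result crosses the wire.
query = """
    SELECT customer_id,
           COUNT(*)    AS txn_count,
           SUM(amount) AS total_spend
    FROM transactions
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
"""
for customer_id, txn_count, total_spend in conn.execute(query):
    print(customer_id, txn_count, total_spend)
```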

3.2.3 Factors responsible for data volume in big data.

  1. Machine data.
  2. Application logs.
  3. Business process logs.
  4. Clickstream data (see the sketch after this list).
  5. Third-party data.
  6. Electronic mail.
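As a small illustration of clickstream data, here is a hedged sketch that parses a few made-up web-server log lines and counts page hits (the log format and URLs are illustrative only):

```python
import re
from collections import Counter

# Illustrative log lines in a common web-server format (IP, timestamp, request).
log_lines = [
    '10.0.0.1 - - [24/Apr/2023:10:12:01] "GET /home HTTP/1.1" 200',
    '10.0.0.2 - - [24/Apr/2023:10:12:03] "GET /products HTTP/1.1" 200',
    '10.0.0.1 - - [24/Apr/2023:10:12:09] "GET /cart HTTP/1.1" 200',
]

pattern = re.compile(r'"GET (\S+) HTTP')

# Count page hits: a tiny slice of the clickstream behavior described above.
pages = Counter(
    match.group(1)
    for line in log_lines
    if (match := pattern.search(line))
)
print(pages.most_common())  # [('/home', 1), ('/products', 1), ('/cart', 1)]
```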

3.3 Data Analytic Lifecycle.

fig: Data Analytical Life cycle.

Data Analytics Lifecycle:
The data analytics lifecycle is designed for big data problems and data science projects. The cycle is iterative to represent a real project. To address the distinct requirements for performing analysis on big data, a step-by-step methodology is needed to organize the activities and tasks involved with acquiring, processing, analyzing, and repurposing data.

Phase 1: Discovery.

  • The data science team learns and investigates the problem.
  • The team develops context and understanding.
  • The team identifies the data sources needed and available for the project.
  • The team formulates initial hypotheses that can later be tested with data.

Phase 2: Data Preparation.

  • Steps to explore, preprocess, and condition data prior to modeling and analysis.
  • It requires the presence of an analytic sandbox; the team performs extract, load, and transform (ELT) operations to get data into the sandbox.
  • Data preparation tasks are likely to be performed multiple times and not in a predefined order.
  • Several tools commonly used for this phase are Hadoop, Alpine Miner, and OpenRefine (a minimal sketch of typical conditioning steps follows this list).
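A minimal sketch of such conditioning steps, using pandas as a stand-in for the tools above; the file name and column names are hypothetical:

```python
import pandas as pd

# Hypothetical raw extract already loaded into the analytic sandbox.
raw = pd.read_csv("sandbox_customers.csv")

# Typical conditioning steps before modeling: remove duplicates, fix types,
# fill missing values, filter invalid rows, and derive a simple feature.
prepared = (
    raw.drop_duplicates(subset="customer_id")
       .assign(
           signup_date=lambda df: pd.to_datetime(df["signup_date"]),
           age=lambda df: df["age"].fillna(df["age"].median()),
       )
       .query("age >= 18")
       .assign(tenure_days=lambda df: (pd.Timestamp.today() - df["signup_date"]).dt.days)
)
print(prepared.head())
```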

Phase 3: Model Planning.

  • The team explores the data to learn about the relationships between variables and subsequently selects the key variables and the most suitable models.
  • In this phase, the data science team also determines the methods, techniques, and workflow it intends to follow in the subsequent model building phase.
  • Several tools commonly used for this phase are MATLAB and STATISTICA (a minimal sketch of exploring variable relationships follows this list).
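A minimal sketch of this kind of exploration in Python (pandas is used here instead of the tools above; the dataset and the churned target column are hypothetical):

```python
import pandas as pd

# Hypothetical prepared dataset from the data preparation phase;
# 'churned' is assumed to be a numeric 0/1 target column.
df = pd.read_csv("sandbox_customers_prepared.csv")

# Inspect pairwise relationships between candidate variables to shortlist
# key predictors for the model building phase.
numeric = df.select_dtypes(include="number")
correlations = numeric.corr()["churned"].drop("churned")
print(correlations.sort_values(key=abs, ascending=False).head(5))
```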

Phase 4: Model Building.

  • The team develops datasets for testing, training, and production purposes.
  • The team also considers whether its existing tools will suffice for running the models or whether it needs a more robust environment for executing them.
  • Free or open-source tools: R and PL/R, Octave, WEKA.
  • Commercial tools: MATLAB, STATISTICA (a minimal model building sketch follows this list).
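A minimal model building sketch using scikit-learn (chosen here only for illustration; the dataset, feature columns, and churned target are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical prepared dataset; 'churned' is the target variable.
df = pd.read_csv("sandbox_customers_prepared.csv")
X = df[["age", "tenure_days", "total_spend"]]
y = df["churned"]

# Split into training and test sets, then fit and evaluate a simple model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```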

Phase 5: Communicate Results.

  • After executing the model, the team needs to compare the outcomes of the modeling to the criteria established for success and failure.
  • The team considers how best to articulate the findings and outcomes to the various team members and stakeholders, taking into account warnings and assumptions.
  • The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey the findings to stakeholders (a minimal sketch of checking results against success criteria follows this list).
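A minimal sketch of comparing model outcomes against agreed success criteria; the metric names and threshold values are placeholders:

```python
# Illustrative success criteria agreed with stakeholders during discovery.
SUCCESS_CRITERIA = {"accuracy": 0.80, "recall": 0.60}

# Metrics produced during model building (placeholder numbers).
results = {"accuracy": 0.84, "recall": 0.57}

# Report pass/fail for each criterion so the outcome is easy to communicate.
for metric, threshold in SUCCESS_CRITERIA.items():
    status = "PASS" if results[metric] >= threshold else "FAIL"
    print(f"{metric}: {results[metric]:.2f} (target {threshold:.2f}) -> {status}")
```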

Phase 6: Operationalize.

  • The team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening it to the full enterprise of users.
  • This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale and make adjustments before full deployment.
  • The team delivers final reports, briefings, and code.
  • Free or open-source tools: Octave, WEKA, SQL, MADlib.
