Big Data in Data Science

Chanaka

What is Big Data?

Big data is a term used to describe the massive volumes of digital data that organizations generate, collect, and process.

What is the meaning of the word BIG in Big Data?

The term big data describes data that is either

  • moving too quickly,
  • simply too large,
  • or too complex to be stored, processed, or analyzed with traditional data storage and analytics applications.

Size is only one of the characteristics that define big data. Other criteria include the speed of generated data and the variety of data collected and stored.

What are some examples of Big Data?

  • data generated by posts to social media platforms, such as Facebook and Twitter, and
  • ratings given to products on e-commerce sites such as the Amazon marketplace.

What are the characteristics of Big Data? (4Vs)

  1. Volume: the amount of data transported and stored. International Data Corporation (IDC) experts predict that data volume will increase at a compound annual growth rate of 23% over the next five years.
  2. Variety: the many forms data can take, most of which are rarely in a ready state for processing and analysis. A significant contributor to big data is unstructured data, such as video, images, and text documents, which is estimated to represent 80 to 90% of the world’s data.
  3. Velocity: the rate at which data is generated. For example, the data the New York Stock Exchange generates from a billion shares traded in a day cannot simply be stored for later analysis; it must be analyzed and reported immediately.
  4. Veracity: the trustworthiness of data, which means preventing inaccurate data from spoiling your data sets. For example, when people sign up for an online account, they often use false contact information.
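
To make Veracity concrete, here is a minimal sketch of a sign-up validation step in Python. The record fields and the plausibility rules are assumptions for illustration, not a production-grade check.

```python
import re

# Hypothetical sign-up records; the field names are assumptions for illustration.
signups = [
    {"name": "Alice", "email": "alice@example.com", "phone": "+1-555-0100"},
    {"name": "Bob", "email": "not-an-email", "phone": "000"},
]

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def is_plausible(record):
    """Flag records whose contact details look fabricated."""
    return bool(EMAIL_RE.match(record["email"])) and len(record["phone"]) >= 7

# Keep only records that pass the veracity filter.
clean = [r for r in signups if is_plausible(r)]
print(clean)  # only Alice's record survives
```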

Factors that have contributed to the growth of Big Data

  • the proliferation of Internet of Things (IoT) devices,
  • increased internet access and greater availability of broadband,
  • use of smartphones, and
  • the popularity of social media.

Benefits of Big Data

1. Healthcare

  • Improved diagnosis and treatment: By analyzing large datasets of patient medical records, doctors can identify patterns and trends that can help them diagnose diseases more accurately and develop more effective treatment plans.
  • Precision medicine: Big data allows for the development of personalized treatment plans that are tailored to each individual patient’s unique genetic makeup and medical history.
  • Drug discovery and development: Big data can be used to analyze large amounts of data from clinical trials and other sources to identify new drug targets and develop new drugs more quickly and efficiently.

2. Retail

  • Enhanced customer experience: Big data can be used to personalize the shopping experience for customers by recommending products that they are likely to be interested in and tailoring marketing campaigns to their specific needs.
  • Improved supply chain management: Big data can be used to optimize supply chains by predicting demand for products and ensuring that there is enough inventory on hand to meet customer needs.
  • Reduced fraud: Big data can be used to identify patterns of fraudulent activity and prevent fraud from occurring.

3. Finance

  • Improved risk management: Big data can be used to assess the risk of defaults on loans and other financial instruments, which can help financial institutions make more informed lending decisions.
  • Fraud detection: Big data can be used to identify patterns of fraudulent activity and prevent fraud from occurring.
  • Personalized financial products and services: Big data can be used to develop personalized financial products and services that meet the specific needs of individual customers.

4. Education

  • Personalized learning: Big data can be used to personalize the learning experience for students by identifying their strengths and weaknesses and tailoring instruction accordingly.
  • Improved teacher effectiveness: Big data can be used to track student progress and identify areas where teachers can improve their effectiveness.
  • Reduced dropout rates: Big data can be used to identify students who are at risk of dropping out of school and provide them with the support they need to stay on track.

Big Data Management

  • Realizing these potential benefits requires managing all of this data, and data engineers are the professionals responsible for that management.
  • This process involves developing infrastructure and systems to ingest the data, clean and transform it, and ultimately store it in ways that make it easy for the rest of the organization to access and query it to answer business questions.
  • This is where “data pipelines” come into play.

Data Pipeline

NOTE: A data pipeline is roughly equivalent to the ETL (extract, transform, load) process.

A data pipeline is a method in which raw data is ingested from various data sources, transformed and then ported to a data store, such as a data lake or data warehouse, for analysis. Before data flows into a data repository, it usually undergoes some data processing.
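
As a minimal sketch of that definition, the Python snippet below ingests raw rows from a CSV file, transforms them, and stores them in SQLite (standing in for a data warehouse). The file name, column names, and table schema are all assumptions for illustration.

```python
import csv
import sqlite3

def ingest(path):
    """Read raw rows from a source file (stands in for any data source)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Clean and normalize rows before they reach the data store."""
    return [
        {"name": r["name"].strip().title(), "amount": float(r["amount"])}
        for r in rows
        if r.get("amount")  # drop rows with missing amounts
    ]

def store(rows, db_path="warehouse.db"):
    """Load transformed rows into a store (SQLite stands in for a warehouse)."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (:name, :amount)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    # Assumes raw_sales.csv exists with "name" and "amount" columns.
    store(transform(ingest("raw_sales.csv")))  # ingest -> transform -> store
```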

3 phases of a Data Pipeline

(Figure: the three phases of a data pipeline. Credit: Cisco Skills For All)

1. Ingestion:

Data is ingested in batches from servers or databases (batch ingestion), or as real-time events streaming in from devices out in the world (streaming ingestion).
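
The difference between the two modes can be sketched in a few lines of Python. The file name and event shape below are assumptions for illustration, and the generator stands in for a real event source such as a device feed.

```python
import json

# Batch ingestion: pull a full extract at once (a JSON file stands in
# for a database dump; the path is an assumption for illustration).
def ingest_batch(path="daily_extract.json"):
    with open(path) as f:
        return json.load(f)

# Streaming ingestion: consume events one at a time, as they arrive.
def ingest_stream(events):
    for event in events:
        yield event  # each event is handled the moment it appears

for event in ingest_stream([{"sensor": 1, "temp": 21.5}]):
    print(event)
```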

2. Transformation:

  • Data often needs to be cleaned up: values may be missing, and dates may be in the wrong format.
  • Data quickly becomes outdated: you might have gathered data on individuals who have since changed roles or companies.
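
Here is a hedged sketch of such a cleanup step, assuming pandas and hypothetical column names: rows missing a customer name are dropped, and unparseable dates are coerced and removed.

```python
import pandas as pd

# Hypothetical raw records: one missing name, one malformed date.
df = pd.DataFrame({
    "customer": ["Ann", "Ben", None],
    "signup_date": ["2023-01-15", "not a date", "2023-03-01"],
})

df = df.dropna(subset=["customer"])  # drop rows with missing names
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df = df.dropna(subset=["signup_date"])  # drop rows whose date failed to parse
print(df)
```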

3. Storage:

Data needs to be stored in places and forms that make it easy for analysts to run reports on weekly sales and for data scientists to create predictive recommendation models. There are two primary locations for businesses to store their data: on-premises or in the cloud. Often, companies use a hybrid of both.
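
A small illustration of the storage choices, assuming pandas: the same DataFrame can be written on-premises as a local Parquet file or, with the right packages and credentials, to cloud object storage. The bucket name below is hypothetical.

```python
import pandas as pd

sales = pd.DataFrame({
    "week": ["2024-W01", "2024-W01", "2024-W02"],
    "product": ["A", "B", "A"],
    "revenue": [120.0, 80.0, 150.0],
})

# On-premises: write a columnar file analysts can query directly
# (requires pyarrow or fastparquet to be installed).
sales.to_parquet("sales.parquet")

# Cloud: the same call can target object storage, e.g. an S3 bucket
# (hypothetical bucket; requires the s3fs package and AWS credentials).
# sales.to_parquet("s3://my-company-warehouse/sales.parquet")
```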

Popular tools used in Big Data

  • Storage: Apache Hadoop (https://hadoop.apache.org/) is a popular framework for storing large amounts of data across clusters of computers. Other distributed databases, such as Cassandra, are used for datasets that change frequently.
  • Processing: Apache Spark (https://spark.apache.org/) is a powerful tool for real-time processing and analysis of large datasets; see the sketch after this list. Tools like Apache Flink and Apache Storm are also used for real-time data processing.
  • Integration and Management: Talend Open Studio (https://www.talend.com/products/talend-open-studio/) is an open-source platform for data integration and management. Tools in this category help move data between different systems and keep it organized.
  • Analytics and Visualization: Splunk (https://www.splunk.com/) is a popular tool for analyzing machine-generated data and generating reports and dashboards. Business intelligence (BI) tools like Looker help users make sense of big data and share insights with others.
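
As an example of the processing layer, here is a minimal PySpark sketch that aggregates a large CSV of orders. The file path and column names are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("big-data-demo").getOrCreate()

# Assumes an orders.csv with "product" and "amount" columns.
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

# Spark distributes this aggregation across the cluster's workers.
totals = (
    orders.groupBy("product")
          .agg(F.sum("amount").alias("total_amount"))
          .orderBy(F.desc("total_amount"))
)
totals.show()
spark.stop()
```

Because Spark evaluates transformations lazily, the aggregation is only executed across the cluster when `show()` triggers it.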
