ML — Batch processing and streaming

jayan chathuranga
Published in Techco
Nov 2, 2016

Batch Processing

Batch processing is a series of jobs that are connected to each other and executed in sequence or in parallel. Once all the jobs have run, their outputs are consolidated into a final result. Input data is collected into lots (chunks) over a period of time, and the output produced by one lot may be the input to the next. It is also called discrete data processing, and its purpose is to process large collections of data files. Because the latency between tasks is high by design, fast response time is not a critical requirement for batch workloads.
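
As a rough illustration of the idea of lots, here is a minimal Python sketch. The function names (make_batches, run_batch_job) and the choice of summing integers are assumptions made for the example, not part of any particular batch framework. Records are grouped into fixed-size lots, and the output of each lot is carried forward as input to the next.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def make_batches(records: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group an incoming sequence of records into fixed-size lots (chunks)."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def run_batch_job(records: Iterable[int], batch_size: int) -> int:
    """Process the lots in sequence; each lot's output feeds the next lot."""
    running_total = 0  # output of the previous lot, used as input to the next
    for batch in make_batches(records, batch_size):
        running_total = running_total + sum(batch)
    return running_total  # consolidated final result


if __name__ == "__main__":
    print(list(make_batches(range(7), 3)))
    print(run_batch_job(range(7), 3))
```

The last lot may be smaller than the others, which is usual when the input does not divide evenly into the chosen lot size.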

Each task in a batch is associated with a “window” of time in which it is allowed to run. The window is assigned according to the priority of the task, and usually falls in periods of less intensive system activity or outside regular hours. In many cases, batch tasks are scheduled to run at predefined intervals, such as at a certain time of day, month or year.

Some examples of tasks in batch processing:

  • Log analysis: logs from applications or systems (servers, OLTP software, etc.) are collected over a certain period (day, week or year), and the analysis of those logs is performed at a separate time (the time window) to derive key performance indicators for the system in question.
  • Billing applications: these calculate the use of a service over a period of time and generate billing information. Credit companies, for example, produce billing statements at the end of each month.
  • Data warehouses: the main goal of a data warehouse (DW) is to consolidate management information as a static snapshot of the time frame collected, and to provide aggregate views such as weekly, monthly and quarterly reports.
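
The log analysis case can be sketched in a few lines of Python. The log format used here (an ISO timestamp, an HTTP-style status code and an elapsed time in milliseconds, separated by spaces) and the function name daily_kpis are assumptions chosen for the example. Real logs would need their own parser.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable


def daily_kpis(log_lines: Iterable[str], day: str) -> Dict[str, float]:
    """Compute KPIs for the log lines whose timestamp falls on `day` (YYYY-MM-DD).

    Assumed line format: "<ISO timestamp> <status code> <elapsed ms>".
    """
    statuses = Counter()
    total_ms = 0.0
    for line in log_lines:
        timestamp, status, elapsed_ms = line.split()
        if datetime.fromisoformat(timestamp).date().isoformat() != day:
            continue  # outside the collection period for this batch run
        statuses[status] += 1
        total_ms += float(elapsed_ms)
    count = sum(statuses.values())
    errors = sum(n for s, n in statuses.items() if s.startswith("5"))
    return {
        "requests": count,
        "error_rate": errors / count if count else 0.0,
        "mean_ms": total_ms / count if count else 0.0,
    }


if __name__ == "__main__":
    logs = [
        "2016-11-01T10:00:00 200 12.0",
        "2016-11-01T10:00:01 500 30.0",
        "2016-11-02T00:00:01 200 8.0",
    ]
    print(daily_kpis(logs, "2016-11-01"))
```

A job like this would typically be run once per collection period, after that period's logs have been fully gathered.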

The examples above may give the impression that batch processing systems are not critical by nature, but that impression is wrong. Batch systems are critical to the business, even though their users do not expect instant or real-time answers. For example, a system that recommends product offerings to prospective customers can run every night, performing its calculations and storing the results in a database. Such processing becomes critical because it has to finish within a fixed period of time (minutes or hours). If it does not, users will not have access to an updated set of offers and recommendations, which hurts the business.

Streaming Processing

Real-time data processing continuously receives data that is under constant change, such as information relating to air traffic control, travel booking systems, social media updates and so on. The stream processor must be fast enough to keep up with the events arriving from its various data sources. The expected response time is near-instantaneous: events should be processed within milliseconds (or microseconds). A stream processor is classified as real time only when it produces logically correct results within a given time slice (in milliseconds) and meets the system requirements. Stream processing often uses the term latency, which is the time interval between a stimulus and a response, or the difference between the time at which the data was received and the time at which a response is generated. The lower the latency, the better the flow, or throughput, of the system.
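
To make the two terms concrete, here is a minimal Python sketch that measures per-event latency and overall throughput for an arbitrary handler. The function name measure and the handler argument are hypothetical. A real stream processor would take the "received" timestamp at ingestion rather than inside the loop.

```python
import time
from typing import Callable, Iterable, List, Tuple


def measure(events: Iterable[int],
            handler: Callable[[int], int]) -> Tuple[List[float], float]:
    """Return (per-event latencies in seconds, throughput in events per second)."""
    latencies: List[float] = []
    start = time.perf_counter()
    for event in events:
        received = time.perf_counter()   # time the event was received
        handler(event)                   # produce the response
        latencies.append(time.perf_counter() - received)
    elapsed = time.perf_counter() - start
    throughput = len(latencies) / elapsed if elapsed > 0 else float("inf")
    return latencies, throughput


if __name__ == "__main__":
    latencies, throughput = measure(range(1000), lambda x: x * 2)
    print(f"max latency: {max(latencies) * 1e3:.4f} ms")
    print(f"throughput:  {throughput:.0f} events/s")
```

Checking the maximum (or a high percentile) of the latencies against the required time slice is one simple way to see whether a handler stays within its deadline.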

Real-time data processing is also often referred to as near real time, because of latency introduced by the system or because the SLAs for producing the desired results are relaxed. Here are some examples of real-time systems that receive data in near real time, process the data and send back the results:

  • Bank ATMs: they receive input from users and instantly apply the transactions (withdrawals, transfers or any other transaction) to a centralized account.
  • Real-time monitoring: captures and analyzes data issued by various sources such as sensors, news feeds, clicks on web pages, etc.
  • Real-time business intelligence: the process of delivering business intelligence and decision-making information as operations are taking place.
  • Operational intelligence: uses real-time data processing and complex event processing (CEP) to extract value from operations by running queries against the events that enter the system.
  • Point of sale (POS) systems: update stock, provide the inventory history of items and allow an organization to register payments in real time.
  • Assembly lines: process data in real time to reduce time and cost and to mitigate errors. Errors are captured instantly and appropriate actions are taken before they impact the business, which increases product quality and business productivity.

Some of the complexities involved in stream processing (real-time) systems:

  • Responsiveness of the system: expectations for real-time processing systems revolve around their ability to process data as it enters the system (in the order of microseconds or milliseconds) without any delay in producing results.
  • Fault tolerance: failures can occur, but real-time processing systems cannot afford to lose a single event.
  • Scalability: the system needs a scale-out architecture, so that the growing demands of stream processing are met by adding new nodes or computing resources without having to rebuild the entire environment.
  • Processing memory: because disk access has high latency, real-time systems cannot tolerate disk reads and writes during processing, so data processing must be performed entirely in main memory. To make this possible, the systems need enough memory to hold the ingested data in the distributed memory of the servers in a cluster.
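
The memory point can be illustrated on a single node with a bounded in-memory window. This is a minimal Python sketch, and the class name SlidingWindowAverage is made up for the example. It keeps only the most recent events in main memory and updates its result per event, without touching disk. A clustered system would spread such state across servers, which is not shown here.

```python
from collections import deque


class SlidingWindowAverage:
    """Running average over the last `max_events` values, held in memory only."""

    def __init__(self, max_events: int) -> None:
        if max_events < 1:
            raise ValueError("max_events must be at least 1")
        self._window = deque(maxlen=max_events)  # bounded memory footprint
        self._total = 0.0

    def add(self, value: float) -> float:
        """Ingest one event and return the updated average."""
        if len(self._window) == self._window.maxlen:
            self._total -= self._window[0]  # oldest value is about to be evicted
        self._window.append(value)
        self._total += value
        return self._total / len(self._window)


if __name__ == "__main__":
    window = SlidingWindowAverage(max_events=3)
    for reading in [1, 2, 3, 4]:
        print(window.add(reading))
```

Because the window is bounded, memory use stays constant no matter how long the stream runs, at the cost of forgetting older events.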

Measuring data stream classification performance is a three-dimensional problem involving processing speed, memory and accuracy. It is not possible to enforce and measure all three at the same time, so it is necessary to fix one and then record the other two.
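
One way to read "fix one and record the other two" is sketched below in Python: the memory budget is fixed, and accuracy and processing time are recorded. The test-then-train loop and the 1-nearest-neighbour classifier over a bounded window are assumptions chosen to keep the example small, and the name evaluate_stream is hypothetical. The same loop would work with any incremental classifier.

```python
import time
from collections import deque
from typing import Iterable, Tuple


def evaluate_stream(stream: Iterable[Tuple[float, str]],
                    memory_budget: int) -> Tuple[float, float]:
    """Fix memory (number of stored examples); record accuracy and elapsed time."""
    window = deque(maxlen=memory_budget)  # the fixed dimension: memory
    correct = 0
    seen = 0
    start = time.perf_counter()
    for x, label in stream:
        if window:
            # Test first: predict with the nearest stored example.
            _, predicted = min(window, key=lambda item: abs(item[0] - x))
            correct += predicted == label
            seen += 1
        # Then train: store the labelled example, evicting the oldest if full.
        window.append((x, label))
    elapsed = time.perf_counter() - start
    accuracy = correct / seen if seen else 0.0
    return accuracy, elapsed


if __name__ == "__main__":
    data = [(0.1, "a"), (0.9, "b"), (0.2, "a"), (0.8, "b"), (0.15, "a")]
    accuracy, seconds = evaluate_stream(data, memory_budget=2)
    print(f"accuracy={accuracy:.2f} time={seconds:.6f}s")
```

Repeating the run with different values of memory_budget shows how accuracy and speed move when the fixed dimension changes.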
