Ten Tips To Turbo Charge Your Streaming

Tim Spann
12 min read · Dec 19, 2023


Apache NiFi, Apache Kafka, Apache Flink

Apache NiFi Tips

Tip 1 — Always use the latest version and set up automated backups.

Lots of issues melt away once you are running the latest Apache NiFi, especially Cloudera's release, which always has the latest patches. You also want to make sure you are using an updated operating system, enough RAM and CPU, and the latest possible JDK. JDK 17 with NiFi 1.23 is great. For NiFi 2.0.0-M1 you need JDK 21.
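Not sure what you are actually running? The NiFi REST API will tell you. Here is a minimal sketch, assuming a local single-user NiFi at https://localhost:8443 and placeholder credentials; double-check the endpoints against your NiFi version:

```python
# Hedged sketch: check the running NiFi version via the REST API.
# Assumptions: single-user auth at https://localhost:8443; credentials are placeholders.
import requests

NIFI = "https://localhost:8443/nifi-api"

# Exchange username/password for a bearer token (single-user authentication).
token = requests.post(
    f"{NIFI}/access/token",
    data={"username": "admin", "password": "changeme12345"},
    verify=False,  # use a real CA bundle outside of local testing
).text

about = requests.get(
    f"{NIFI}/flow/about",
    headers={"Authorization": f"Bearer {token}"},
    verify=False,
).json()

print("NiFi version:", about["about"]["version"])
```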

Tip 2 — Run Isolated on Kubernetes — One Flow at A Time

The fully automated version running on Cloudera Public Cloud handles this for you and makes everything super easy, with monitoring and ReadyFlows included.

See the list of mega tips here:

Tip 3 — No Spaghetti Flows

I am Italian and I love pasta. That said, don't build flows like pasta. This goes with Tip 2 on running less per cluster. Also take a look at the Stateless NiFi runtime for "job" / "FaaS" style flows. If you have too many flows or too many steps, break things into Process Groups, distribute them to other clusters, or run them on demand.

Data in Motion.dev

Apache Kafka Tips

Tip 4 — One Topic, One Schema, One Idea

I know we don't have millions of topics like Pulsar, but keep a topic to one thing. Mixing different data formats (Avro, JSON) or schemas in one topic to overload it and use it many ways is problematic. Many tools, such as Flink and NiFi, need to examine the data to derive schemas for SQL. You also don't want to overload what's going in there and surprise your consumers. Keep your data governed, clean, and on topic. Some people create a superset schema and end up with a lot of sparse data. I am not sure that's a good idea, but it may work if you have thousands of similar, small records, even if they only ever populate 20 of the 256 possible fields. Watch out for weird side effects from nulls.
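A small sketch of "one topic, one schema" at the producer side: validate every record against the topic's single schema before sending it. This assumes a local broker at localhost:9092 and uses kafka-python with plain JSON plus the jsonschema package; in a governed setup you would more likely use Avro with a schema registry.

```python
# Hedged sketch: one topic <-> one schema, enforced at the producer.
# Assumptions: local broker at localhost:9092; kafka-python and jsonschema installed.
import json
from jsonschema import validate          # pip install jsonschema
from kafka import KafkaProducer          # pip install kafka-python

SENSOR_SCHEMA = {                        # the one schema for the "sensors" topic
    "type": "object",
    "required": ["device_id", "ts", "temp_c"],
    "properties": {
        "device_id": {"type": "string"},
        "ts": {"type": "integer"},
        "temp_c": {"type": "number"},
    },
}

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

record = {"device_id": "rpi-400", "ts": 1703000000, "temp_c": 21.5}
validate(record, SENSOR_SCHEMA)          # reject anything that does not match the topic's schema
producer.send("sensors", record)
producer.flush()
```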

Tip 5 — Experiment and study

The performance and setup of your Kafka cluster to support each use case is critical. Before you just throw some data into a topic, measure for at least a week. Many streams are intermittent, bursty, or occasionally stopped. Collect at least a full week (longer is better) to see the average record size, consistency of data, average/minimum/maximum load, odd cases, and weird problems. I have seen a 5-records-a-second system burst to 100,000 records a second for one hour a day. I have seen systems stream thousands of records a second, but only Monday to Friday from 9 am to 5 pm EST. Make sure you know the data types, event time at the producer, time zones, and other i18n items. Rebuilding Kafka clusters or setups because no one checked the real sizing is a painful problem.
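Before you size anything, instrument the topic. Here is a rough sampling sketch with kafka-python, assuming a local broker and a placeholder topic name, that reports per-minute record counts and min/average/max sizes:

```python
# Hedged sketch: sample a topic and report per-minute record counts and sizes.
# Assumptions: local broker at localhost:9092, topic "sensors", kafka-python installed.
import time
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "sensors",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="latest",
    consumer_timeout_ms=5000,            # exit the loop when the topic goes quiet
)

window_start, count, sizes = time.time(), 0, []
for msg in consumer:
    count += 1
    sizes.append(len(msg.value))
    if time.time() - window_start >= 60:  # one-minute window
        print(f"{count} records/min, "
              f"min={min(sizes)}B avg={sum(sizes)//len(sizes)}B max={max(sizes)}B")
        window_start, count, sizes = time.time(), 0, []
```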

Tip 6 — Should It Be in Kafka?

Can it be completed with NiFi or basic Java or Python code at the time of data creation? Will other consumers of this data exist? If there is a complex pipeline with many downstream consumers, then definitely use Kafka. CDC doesn't even always require Kafka; Debezium can feed change data records right to Flink or NiFi.

Kafka is an awesome buffer and lets us receive data asynchronously and loosely coupled. These are great things if you need them. If you are in a hybrid cloud or multicloud environment, Kafka replication is great. If you are doing local single-node processing or running on your laptop, you probably don't need it. Keep your setup simple. I can run an entire country's transit flow on just one NiFi node.

Apache Flink Tips

Tip 7 — The Future is Fast Big Data

It’s time to start looking at Apache Paimon and Apache Flink with Apache Iceberg.

Tip 8 — SQL First, Last and Always

You should try to build all of your applications in Flink with SQL first. When you are ready for production, look at SQL again. And finally, just do SQL unless it is really something that doesn't fit. You can also embed your SQL inside your Flink Java applications, so it's a no-brainer for most use cases.
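A minimal SQL-first sketch using PyFlink's Table API (the same SQL drops into a Flink Java application). The built-in datagen and print connectors stand in for Kafka or Iceberg here; check which connectors your Flink or Cloudera version ships with:

```python
# Hedged sketch: a SQL-first Flink job using PyFlink (pip install apache-flink).
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: built-in datagen connector standing in for Kafka.
t_env.execute_sql("""
    CREATE TABLE sensors (
        device_id STRING,
        temp_c    DOUBLE,
        ts        TIMESTAMP(3)
    ) WITH ('connector' = 'datagen', 'rows-per-second' = '5')
""")

# Sink: print connector standing in for Iceberg/Kudu/Kafka.
t_env.execute_sql("""
    CREATE TABLE hot_devices (device_id STRING, temp_c DOUBLE)
    WITH ('connector' = 'print')
""")

# The whole pipeline is just SQL.
t_env.execute_sql("""
    INSERT INTO hot_devices
    SELECT device_id, temp_c FROM sensors WHERE temp_c > 30
""").wait()
```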

General Tips

Tip 9 — Isolate, Test, Run Local, Repeat

Don't be afraid to test locally with Docker, Docker Compose, or minikube. Test and automate before you deploy to any production environment.

Final Tip 10 — Run With Security on Full Always

So many forgotten test instances left running in Amazon have become bots and miners. Fortunately, the latest Apache NiFi runs with SSL and passwords mandated from the start. Always raise the level of security and run with SSL everywhere. Docker and all the developer editions support this. If you don't develop locally with security enabled, something can fall through the cracks when you move to production.

Encrypt everywhere you can, use good passwords, use secure password managers and enable all the security that Apache Ranger and your data platforms allow.
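As a small example of keeping security on even for local work, here is a TLS-enabled kafka-python producer. The broker address and certificate paths are placeholders, and your cluster may also require SASL on top of TLS:

```python
# Hedged sketch: a TLS-enabled producer with kafka-python.
# Assumptions: the broker exposes a TLS listener on 9093; cert/key paths are placeholders.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker.example.com:9093",
    security_protocol="SSL",
    ssl_cafile="/path/to/ca.pem",          # CA that signed the broker certificate
    ssl_certfile="/path/to/client.pem",    # client certificate (if mutual TLS is enforced)
    ssl_keyfile="/path/to/client.key",
)

producer.send("secure-topic", b"encrypted in transit")
producer.flush()
```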

REFERENCES

And now for ChatGPT’s NiFi Tips.

Certainly! Here’s a top ten tips list for working with Apache NiFi:

1. Understand the Flow Design:
NiFi operates on a flow-based model, where data is represented as flow files that move through a directed graph of processors. Understand the flow design paradigm and design your data flows with a clear understanding of how data will move through your system.

2. Start with Simple Processors:
Begin by using simple processors to get familiar with NiFi. Gradually introduce more complex processors and features as needed. NiFi’s extensive set of processors provides flexibility, but starting simple helps in building a solid foundation.

3. Effective Use of Connection Queues:
NiFi uses connection queues between processors. Pay attention to the settings of these queues, especially when dealing with backpressure. Adjust queue size and flow control settings to handle data efficiently and avoid bottlenecks.

4. Security Best Practices:
Implement security measures such as SSL/TLS encryption, access controls, and user authentication to protect sensitive data. Regularly review and update security configurations to align with best practices and organizational policies. (Tim point — Cloudera Public Cloud DataFlow does this for you)

5. Monitor and Manage Resources:
Monitor the resource utilization of your NiFi instance using NiFi’s web-based user interface or external tools. Adjust configuration settings, such as thread pools and buffer sizes, based on your system’s performance characteristics to ensure optimal resource utilization. (Tim point — Cloudera Public Cloud DataFlow does this for you)

6. Version Control and Backups:
Implement version control for your NiFi flows using tools like Git. This helps in tracking changes, collaborating with team members, and rolling back to previous versions if needed. Additionally, regularly back up your NiFi configuration files for quick recovery in case of failures. (Tim point — use the NiFi CLI and REST API for DevOps automation, or use Cloudera Public Cloud DataFlow for magic; a hedged REST backup sketch follows this list)

7. Use Templates for Reusability:
Leverage NiFi templates to encapsulate and reuse data flow components. Templates simplify the process of sharing and deploying data flows across different environments, making it easier to maintain consistency and manage changes. (Tim point — use JSON flow definitions, not XML templates, which are deprecated)

8. Explore NiFi Registry:
NiFi Registry allows you to manage, version, and deploy flows across different NiFi instances. Explore NiFi Registry for enhanced flow versioning, collaboration, and deployment management. It provides additional capabilities for maintaining flow version history and promoting flows between environments. (Tim point — you can export and back up to Git right from here automagically)

9. Custom Processor Development:
If your use case requires custom functionality, consider developing custom processors. NiFi provides a software development kit (SDK) for building custom processors. Develop, test, and deploy custom processors to extend NiFi’s capabilities according to your specific requirements. (Tim point — definitely check out mine https://github.com/tspannhw?tab=repositories&q=nifi-+processor&type=source&language=&sort=)

10. Community Engagement and Documentation:
Stay engaged with the NiFi community through forums, mailing lists, and other channels. NiFi’s community is active and supportive, providing valuable insights and assistance. Regularly refer to the official NiFi documentation for in-depth guidance on features, configurations, and best practices. (Tim point — check out Cloudera Community).

Remember that these tips are general guidelines, and your specific use case may require additional considerations. Keep up with the NiFi community and updates to stay informed about new features and improvements in Apache NiFi.
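Tying tips 6 and 8 together, here is a hedged sketch of a REST-based flow backup that downloads the root process group's flow definition as JSON so you can commit it to Git. Host and credentials are placeholders, and the endpoints should be verified against your NiFi version; NiFi Registry and the NiFi CLI remain the more complete options.

```python
# Hedged sketch: back up the root process group's flow definition via the NiFi REST API.
# Assumptions: single-user auth at https://localhost:8443; endpoints may differ by version.
import datetime
import requests

NIFI = "https://localhost:8443/nifi-api"

token = requests.post(
    f"{NIFI}/access/token",
    data={"username": "admin", "password": "changeme12345"},
    verify=False,  # use a real CA bundle outside of local testing
).text
headers = {"Authorization": f"Bearer {token}"}

# Resolve the root process group id, then download its flow definition as JSON.
root_id = requests.get(f"{NIFI}/flow/process-groups/root",
                       headers=headers, verify=False).json()["processGroupFlow"]["id"]
flow_json = requests.get(f"{NIFI}/process-groups/{root_id}/download",
                         headers=headers, verify=False).text

stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
with open(f"nifi-flow-backup-{stamp}.json", "w") as f:
    f.write(flow_json)  # commit this file to Git for versioned backups
```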

And now for ChatGPT’s Kafka Tips.

Apache Kafka is a distributed event streaming platform that is widely used for building real-time data pipelines and streaming applications. Here are some tips for working with Kafka:

1. Understanding Kafka Basics:
— Familiarize yourself with Kafka’s core concepts, such as topics, partitions, brokers, producers, consumers, and consumer groups.

2. Cluster Configuration:
— Ensure that your Kafka cluster is properly configured with an appropriate number of brokers, replication factor, and partition count based on your use case and scalability requirements. (Tim point — probably best to have Cloudera host).

3. Topic Design:
— Plan your topic design carefully. Topics are the fundamental units of parallelism in Kafka, and the number of partitions in a topic determines the parallelism of your system.

4. Data Retention:
— Set appropriate data retention policies for your topics based on your storage and compliance requirements. Be mindful of the disk space consumed by Kafka logs. (Tim point — with tiered storage this will change).

5. Monitoring:
— Implement robust monitoring using tools like Prometheus, Grafana, or Kafka Manager to keep track of key Kafka metrics such as broker health, topic lag, and consumer lag. (Tim point — Cloudera Streams Messaging Manager will do this for you, plus alerts).

6. Consumer Group Lag:
— Monitor consumer lag to ensure that consumers are keeping up with the production rate. Lag indicates the difference between the latest offset in a partition and the offset that the consumer has processed. (Tim point — Cloudera Streams Messaging Manager will do this for you, plus alerts).

7. Message Serialization:
— Choose an efficient serialization format for your messages. Avro, JSON, or Protobuf are common choices. Consider the trade-offs between human readability, schema evolution, and serialization/deserialization performance. (Tim point — JSON if you have no schema registry. Avro for most cases with Spring, Python, Flink, NiFi and Spark as consumers).

8. Producer Configuration:
— Tweak producer configurations such as batch size, linger time, and compression settings based on your specific use case to optimize performance and resource utilization. (Tim point — https://docs.cloudera.com/runtime/7.2.0/kafka-developing-applications/topics/kafka-develop-producers.html; a hedged tuning sketch follows this list)

9. Consumer Configuration:
— Adjust consumer configurations, including parameters like `fetch.min.bytes`, `fetch.max.wait.ms`, and `max.poll.records` to fine-tune the balance between throughput and latency.

10. Fault Tolerance:
— Ensure high availability and fault tolerance by configuring replication appropriately. Understand how Kafka handles leader election and replication to prevent data loss in case of broker failures.

11. Security:
— Implement proper security measures, including SSL/TLS encryption, SASL authentication, and proper access controls to secure your Kafka cluster.

12. Upgrading Kafka:
— Before upgrading Kafka, carefully review the release notes and thoroughly test the new version in a non-production environment to identify any potential issues. (Tim point — time to think about KRaft and leave your Zookeeper behind)

13. Schema Registry:
— If using a schema registry, ensure that schemas are properly managed and evolve gracefully to maintain compatibility with older data and consumers. (Tim point — Cloudera, Aiven and many others have open source schema registries, choose wisely or use Pulsar)

14. Kafka Streams:
— If using Kafka Streams for stream processing, understand the processing semantics, and optimize your application for performance by adjusting configuration parameters and considering state store requirements. (Tim point — where are you hosting and running your KS apps?)

15. Documentation and Community:
— Refer to Kafka’s official documentation regularly, and participate in the Kafka community to stay informed about best practices, updates, and common issues.

Remember that the effectiveness of these tips depends on your specific use case and requirements. Always test changes in a controlled environment before applying them to a production Kafka cluster.
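For producer configuration (tip 8 above), here is a hedged sketch of those knobs with kafka-python. The values are illustrative starting points, not recommendations; measure against your own workload first (see Tip 5 earlier).

```python
# Hedged sketch: tuning batch size, linger, and compression on a kafka-python producer.
# Values are illustrative placeholders; measure against your own workload.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    batch_size=64 * 1024,        # batch up to 64 KB per partition before sending
    linger_ms=20,                # wait up to 20 ms to fill a batch (latency vs. throughput)
    compression_type="lz4",      # lz4/snappy/gzip/zstd; trades CPU for network and disk
    acks="all",                  # wait for all in-sync replicas for durability
)

for i in range(1000):
    producer.send("events", f"event-{i}".encode("utf-8"))
producer.flush()
```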

And now for ChatGPT’s Flink Tips.

Apache Flink is a powerful and flexible stream processing framework for big data processing and analytics. Here are some tips for working with Apache Flink:

1. Understand Flink’s Core Concepts:
— Familiarize yourself with key concepts like streams, transformations, windows, event time processing, state, and checkpoints.

2. Cluster Configuration:
— Configure your Flink cluster according to your requirements, considering factors like parallelism, task managers, memory settings, and high availability. (Tim point — probably best to have Cloudera host).

3. Checkpointing:
— Implement and configure checkpointing to provide fault tolerance and exactly-once processing semantics. Adjust the checkpoint interval based on the latency and throughput requirements of your application. (See the hedged configuration sketch after this list.)

4. Event Time Processing:
— Use event time processing for handling out-of-order events and ensuring accurate windowing. Configure watermarks appropriately to progress event time.

5. State Management:
— Understand Flink’s state management and choose the appropriate state backend (e.g., RocksDB, MemoryStateBackend) based on your performance and durability requirements.

6. Tuning Parallelism:
— Adjust the parallelism of your Flink job components (operators, sources, sinks) to achieve optimal performance. Consider the available resources and the nature of your processing logic.

7. Optimize Stateful Operations:
— Be mindful of stateful operations and their impact on performance. Optimize stateful operations by partitioning state, using keyed state efficiently, and leveraging Flink’s state TTL (Time-To-Live) feature.

8. Flink UI and Metrics:
— Monitor your Flink application using the Flink Web UI and metrics. Pay attention to key metrics such as checkpoint duration, throughput, and resource utilization to identify bottlenecks and performance issues.

9. Flink Savepoints:
— Understand how to use Flink savepoints for managing state snapshots and facilitating updates or restarts of your Flink applications without losing state.

10. Dynamic Scaling:
— Leverage Flink’s dynamic scaling capabilities to adjust the parallelism of your job while it’s running based on the incoming workload.

11. Fault Tolerance:
— Test your Flink application’s fault tolerance by simulating failures and ensuring that checkpoints are working as expected. Be familiar with Flink’s recovery mechanisms.

12. Table API and SQL:
— If applicable, consider using Flink's Table API or SQL for a more declarative and SQL-like approach to define your processing logic. (Tim point — Flink SQL is awesome, use Cloudera SQL Stream Builder).

13. Connectivity to External Systems:
— Use Flink connectors to integrate with external systems such as Kafka, Hadoop, Elasticsearch, and others. Be aware of the capabilities and limitations of each connector. (Tim point — Kafka, Kudu, Hive, Iceberg, JDBC database are all great connectors).

14. Version Compatibility:
— Before upgrading Flink, review the release notes and ensure that your Flink job is compatible with the new version. Perform thorough testing in a controlled environment.

15. Community and Documentation:
— Engage with the Flink community, ask questions, and refer to the official documentation. The Flink community is active and can provide valuable insights and assistance.

Remember to test your Flink applications thoroughly in a staging environment before deploying them to production, and consider the specific requirements and characteristics of your use case when applying these tips.
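For checkpointing (tip 3 above), here is a hedged sketch of the main knobs in PyFlink. The interval and timeout values are placeholders to tune against your latency and throughput requirements.

```python
# Hedged sketch: enabling and tuning checkpointing in a PyFlink job.
# Interval/timeout values are placeholders; tune them for your workload.
from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode

env = StreamExecutionEnvironment.get_execution_environment()

env.enable_checkpointing(60_000)                       # checkpoint every 60 seconds
cfg = env.get_checkpoint_config()
cfg.set_checkpointing_mode(CheckpointingMode.EXACTLY_ONCE)
cfg.set_min_pause_between_checkpoints(30_000)          # breathing room between checkpoints
cfg.set_checkpoint_timeout(120_000)                    # abandon checkpoints slower than 2 minutes
cfg.set_max_concurrent_checkpoints(1)

# ... define sources, transformations, and sinks here ...
# env.execute("checkpointed-job")
```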


Tim Spann

Principal Developer Advocate, Zilliz. Milvus, Attu, Towhee, GenAI, Big Data, IoT, Deep Learning, Streaming, Machine Learning. https://www.datainmotion.dev/