Beam College 2024: Worth Your Time?
I recently attended Beam College for the second time, and I’m happy to share my thoughts. Fair warning: this review comes from a huge fan of Apache Beam, so I might be a little biased! 😄
🎓 Beam College
Beam College is a free, hands-on training program designed to enhance data processing skills. It offers flexible workshops led by industry experts, focusing on Apache Beam. Participants learn everything from basic concepts to advanced use cases and best practices, gaining practical experience in building efficient data pipelines. The program aims to bridge the gap between theoretical knowledge and real-world application, providing valuable insights for both beginners and experienced data professionals looking to master Apache Beam.
You can find all the sessions on the Apache Beam YouTube channel.
Beam College 2024 offered 3 days of sessions (the schedule is linked here):
- Apache Beam Overview (July 23)
- Apache Beam for AI (July 24)
- Making the jump from batch to streaming (July 25)
🐝 Day 1: Apache Beam Overview
The first day of Beam College provided a comprehensive overview of Apache Beam. Sessions focused first on understanding the fundamentals of Apache Beam, exploring its unique features and how it differs from other tools in the data processing ecosystem. Participants learned to identify scenarios where Beam would be an ideal fit for their projects or organizations. The day concluded with practical instruction on getting started with Apache Beam, guiding attendees through the process of building their first pipeline. This blend of theoretical knowledge and hands-on experience set a solid foundation for the rest of the program.
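For readers who have never written a Beam pipeline, here’s roughly what that “first pipeline” looks like with the Python SDK. This is a minimal word-count sketch of my own, not the exact notebook from the session, and the file paths are placeholders:

```python
# A minimal "first pipeline": read a text file, count words, write results.
# Runs locally on the default DirectRunner; the file paths are placeholders.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read lines" >> beam.io.ReadFromText("input.txt")
        | "Split into words" >> beam.FlatMap(lambda line: line.split())
        | "Count each word" >> beam.combiners.Count.PerElement()
        | "Format output" >> beam.MapTuple(lambda word, count: f"{word}: {count}")
        | "Write results" >> beam.io.WriteToText("word_counts")
    )
```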
About 120 people joined the first day of Beam College. It was a pleasure to see familiar faces from last year’s event alongside many new attendees, fostering a sense of community and collaboration. The day was filled with engaging presentations, and I’d like to highlight some of the most interesting ones:
- Marc Howard’s presentation, “Project Shield: Defending against DDoS with Beam”, showcased a critical application of Apache Beam and Google Cloud Dataflow in safeguarding global access to election information. As the founding engineer of Project Shield, a free service protecting vulnerable online content from DDoS attacks, Marc detailed how their system processed over 3 TB of daily log data, handling 10,000+ queries per second and scaling up to 400 million queries per second during major attacks. This impressive infrastructure not only powered real-time user analytics but also strengthened Project Shield’s defenses, playing a crucial role in maintaining internet freedom, particularly during democratic processes. The presentation highlighted an important and compelling use case for Beam, demonstrating its power in protecting free speech and information in the digital age.
- Jeff Kinard’s presentation, “Beam YAML Bootcamp: Effortless pipeline design for data processing”, generated significant interest and prompted numerous questions from the audience. Jeff introduced YAML as a simplified approach for expressing pipelines, highlighting its straightforward syntax and ease of use.
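To give a flavor of that syntax, here is a minimal sketch of a Beam YAML pipeline. It’s my own toy example rather than anything from Jeff’s bootcamp, and the file paths and column name are made up:

```yaml
# Minimal Beam YAML sketch: read a CSV, keep some rows, write JSON.
# The file paths and the "amount" column are placeholders.
pipeline:
  type: chain
  transforms:
    - type: ReadFromCsv
      config:
        path: input.csv
    - type: Filter
      config:
        language: python
        keep: "amount > 100"
    - type: WriteToJson
      config:
        path: output.json
```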
🤖 Day 2: Apache Beam for AI
On day 2, participants learned how to use Apache Beam to implement AI pipelines. In the first series of lessons, they built a machine learning pipeline all the way from conceptualization to coding and running it in a notebook. An additional session covered using Beam to interact with Google Gemini via Google AI Studio.
- “How to Implement a ML pipeline using Beam. Part 1: Concepts and defining our pipeline & Part 2: Coding” by Danny McCormick and Kerry Donny-Clark. This was a two-part session on implementing Machine Learning pipelines using Apache Beam. The first part introduced key concepts and pipeline definition, focusing on the RunInference transform and ModelHandler class. These tools facilitate easy model inference and adaptation from various ML frameworks. The second part demonstrated the practical application of these concepts, showcasing Beam code for a complex pipeline that processed voice input, classified text, applied different models based on classification, and converted text back to voice. The presenters illustrated how to map classification outputs to specific language models and execute the complete pipeline. This comprehensive session provided attendees with both theoretical knowledge and hands-on experience in integrating ML capabilities within Beam pipelines, using a practical voice-to-text-to-voice example throughout (a minimal RunInference sketch follows after this list).
- Israel Herraiz conducted a session titled “Implementing a complex ML pipeline: Demo w/ Google AI Studio”, demonstrating how to call an AI model (Gemini, in this instance) from Beam with RunInference, running in a Google Colab notebook.
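To make the RunInference pattern from Danny and Kerry’s session concrete, here is a minimal sketch. It uses the scikit-learn model handler purely as an example; the model path is a placeholder, and their actual demo chained several models for the voice-to-text-to-voice flow:

```python
# Minimal RunInference sketch with a scikit-learn ModelHandler.
# The pickled model path is a placeholder; other frameworks (PyTorch,
# TensorFlow, etc.) follow the same pattern with their own handlers.
import apache_beam as beam
import numpy as np
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy

model_handler = SklearnModelHandlerNumpy(model_uri="gs://my-bucket/model.pkl")

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create examples" >> beam.Create([np.array([1.0, 2.0]), np.array([3.0, 4.0])])
        | "Run inference" >> RunInference(model_handler)
        # Each output is a PredictionResult with .example and .inference fields.
        | "Format" >> beam.Map(lambda r: f"{r.example} -> {r.inference}")
        | "Print" >> beam.Map(print)
    )
```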
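I don’t have Israel’s exact Colab code, but the general shape of wrapping a Gemini call (via Google AI Studio) in a custom ModelHandler looks roughly like this; the model name, prompt, and API-key handling are my own placeholder assumptions:

```python
# Rough sketch of calling Gemini from Beam via a custom ModelHandler.
# Not the session's actual code: model name, prompt, and API-key handling
# are placeholders.
import apache_beam as beam
import google.generativeai as genai
from apache_beam.ml.inference.base import ModelHandler, RunInference


class GeminiModelHandler(ModelHandler):
    def load_model(self):
        genai.configure(api_key="YOUR_API_KEY")  # placeholder key
        return genai.GenerativeModel("gemini-1.5-flash")

    def run_inference(self, batch, model, inference_args=None):
        # One request per prompt; batching and retries omitted for brevity.
        return [model.generate_content(prompt).text for prompt in batch]


with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Prompts" >> beam.Create(["Explain Apache Beam in one sentence."])
        | "Call Gemini" >> RunInference(GeminiModelHandler())
        | "Print" >> beam.Map(print)
    )
```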
📡 Day 3: Making the Jump from Batch to Streaming
The final day of Beam College concentrated on Apache Beam’s unified approach to batch and streaming pipelines. Participants explored key concepts for implementing streaming pipelines in Beam, followed by a practical demonstration. The sessions concluded with an introduction to Beam Quest, an advanced learning resource for mastering complex concepts in Apache Beam.
- Yi Hu’s session, “Making the Jump from Batch to Streaming: Concepts and code”, explored the transition from batch to streaming pipelines using Apache Beam. The talk covered fundamental concepts differentiating batch and streaming processing, introduced Beam primitives for streaming applications, and concluded with practical examples using Pub/Sub (see the streaming sketch after this list).
- Surjit Singh’s session, “CI/CD with Dataflow Templates”, provided a walkthrough of Dataflow’s capabilities for implementing continuous integration and continuous delivery (CI/CD) pipelines.
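To illustrate the kind of streaming primitives Yi walked through, here’s a minimal sketch of a pipeline that reads from Pub/Sub and counts elements in fixed windows. The subscription path is a placeholder, and this is not the session’s actual demo code:

```python
# Minimal streaming sketch: read from Pub/Sub, window, count per window.
# The subscription path is a placeholder.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read messages" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/my-sub")
        | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
        | "Fixed 60s windows" >> beam.WindowInto(beam.window.FixedWindows(60))
        | "Count per window" >> beam.combiners.Count.PerElement()
        | "Log" >> beam.Map(print)
    )
```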
Beam College is a must-attend event for Apache Beam and Google Cloud enthusiasts, offering invaluable learning and networking opportunities. I highly recommend prioritizing it in your schedule for next year. This event significantly transformed my perspective on Beam. When I initially began learning Beam/Dataflow, I noticed a scarcity of resources and hands-on projects. Recognizing this gap, I took the initiative to create Apache Beam projects myself. If you’re interested in exploring these projects, you can find them here:
Batch processing:
- ☁️GCP Data Engineering Project: Building and Orchestrating an ETL Pipeline for Online Food Delivery Industry with Apache Beam and Apache Airflow🍕🚚
- ☁️GCP Data Engineering Project: Connect Four game with Python and Apache Beam 🔴⚫️
Stream processing:
Feel free to connect with me on LinkedIn if you’d like to exchange ideas about Apache Beam and Google Cloud! 💬😊