Stories by Bargav Jagatha on Medium

Distributed Training: Pipeline Parallelism

Bargav Jagatha — Wed, 19 Mar 2025 07:00:08 GMT

When training very large models, we often run into memory limits on a single GPU. Model parallelism helps us overcome these limits by splitting the model across multiple GPUs. There are two main types:

Tensor Parallelism

Here, each large tensor (for example, a weight matrix) is split into slices and distributed across GPUs. Each GPU holds only a slice of every large tensor. During operations like matrix multiplication, the GPUs work on their slice and then collaborate to aggregate the full result.
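To make the slicing idea concrete, here is a minimal sketch in plain Python that simulates two GPUs with ordinary lists. It assumes a column-wise split of the weight matrix and that the number of columns divides evenly; the helper names (matmul, split_columns) are made up for illustration and do not come from any library.

```python
# Sketch of tensor (column) parallelism with plain Python lists.
# Each "GPU" is simulated by a list entry; no real devices are used.

def matmul(x, w):
    """Multiply a (rows x k) matrix by a (k x cols) matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_columns(w, parts):
    """Split the columns of w into `parts` equal slices, one per simulated GPU."""
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]

x = [[1, 2], [3, 4]]                 # input activations (2 x 2)
w = [[1, 0, 2, 1], [0, 1, 1, 3]]     # full weight matrix (2 x 4)

shards = split_columns(w, parts=2)   # each simulated GPU holds a 2 x 2 slice
partials = [matmul(x, shard) for shard in shards]  # local matmuls, one per GPU

# "Aggregate": concatenate the per-GPU column blocks row by row.
combined = [sum((p[r] for p in partials), []) for r in range(len(x))]

full = matmul(x, w)                  # reference result on a single device
print(combined == full)
```

On real hardware the concatenation step would be a collective operation between devices, which is where the communication cost of tensor parallelism comes from.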

Pipeline Parallelism

In contrast, pipeline parallelism splits the model vertically, assigning entire groups of layers to different GPUs. For example, in a simple 4-layer model:

output = L4(L3(L2(L1(input))))

We can assign layers L1 and L2 to GPU0 and layers L3 and L4 to GPU1. The forward pass flows as follows:

GPU0: Computes intermediate = L2(L1(input))

GPU1: Receives intermediate, computes output = L4(L3(intermediate))

During backpropagation, gradients from GPU1 are sent back to GPU0 so that each layer gets its correct gradient.
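The two-stage split above can be sketched as follows. The layers are toy arithmetic functions chosen only for illustration, and "GPU0" and "GPU1" are just labels in comments, so this shows the data flow rather than a real multi-device setup.

```python
# Sketch: a 4-layer model split into two pipeline stages.
# Layers are toy functions; no real GPUs are involved.

def L1(x): return x + 1
def L2(x): return x * 2
def L3(x): return x - 3
def L4(x): return x * x

def stage0(x):
    """Would run on GPU0: layers L1 and L2."""
    return L2(L1(x))

def stage1(intermediate):
    """Would run on GPU1: layers L3 and L4."""
    return L4(L3(intermediate))

def pipeline_forward(x):
    intermediate = stage0(x)      # computed on GPU0
    # In a real setup the intermediate would be copied from GPU0 to GPU1 here.
    return stage1(intermediate)   # computed on GPU1

output = pipeline_forward(5)
print(output == L4(L3(L2(L1(5)))))
```

The backward pass follows the same boundary in reverse: stage1 would hand the gradient of its input back to stage0.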

The Challenge with Naive Implementation

In a naive model parallel setup, only one GPU is active at a time:

Low GPU Utilization:

While GPU0 is busy processing its layers, GPU1 is idle waiting for the output to be transferred. As more GPUs are added, each device might only be active a small fraction of the time.

Communication Overhead:

Every time data moves from one GPU to another (e.g., from GPU0 to GPU1), a transfer occurs. On a single machine these transfers are relatively fast, but if the GPUs are on different machines, the overhead can significantly slow down training.

Imagine training on four GPUs: with naive model parallelism, each GPU might only be busy about 25% of the time (ignoring transfer times), which is not very efficient.
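A quick way to see where the 25% comes from is to draw the timeline. This sketch assumes every stage takes the same amount of time and ignores transfers, as in the paragraph above; the function names are hypothetical.

```python
def naive_utilization(num_gpus):
    """Fraction of time each GPU is busy when stages run strictly one after another.

    Assumes equal compute time per stage and ignores transfer time.
    """
    total_steps = num_gpus          # one time step per stage
    busy_steps_per_gpu = 1          # each GPU works only during its own stage
    return busy_steps_per_gpu / total_steps

def naive_timeline(num_gpus):
    """Rows are GPUs, columns are time steps; 'X' = busy, '.' = idle."""
    return ["".join("X" if t == g else "." for t in range(num_gpus))
            for g in range(num_gpus)]

for row in naive_timeline(4):
    print(row)
print(naive_utilization(4))
```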

Enter GPipe: Smarter Pipeline Parallelism

GPipe addresses these inefficiencies by splitting each mini-batch into smaller micro-batches. Instead of waiting for an entire batch to be processed layer-by-layer, each micro-batch can be pipelined through the layers concurrently. Here’s how it works:

1. Micro-batching:

The original batch is divided into several micro-batches. For example, if you set chunks=4, a mini-batch is split into 4 micro-batches.

2. Pipeline Scheduling:

• Forward Pass:

Each micro-batch flows through the layers in a staggered fashion. While GPU0 is processing micro-batch 1 on its assigned layers, GPU1 might already be processing micro-batch 0 on its part of the model.

• Backward Pass:

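After the forward passes, gradients are computed in reverse order, from the last GPU back to the first. In the GPipe schedule, the last GPU starts the backward pass for a micro-batch once its forward work is done, and each earlier GPU starts its own backward computation as soon as the gradients for that micro-batch arrive, even if other micro-batches are still in progress elsewhere in the pipeline.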

This interleaving of computation and communication greatly reduces idle time. GPUs are busy processing different micro-batches simultaneously, which increases overall utilization and speeds up training.
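The two steps above can be sketched in a few lines of Python. This is a simplified model, not GPipe's actual implementation: it assumes the batch size is divisible by chunks and that every stage takes one time step per micro-batch. The function names are made up for this example.

```python
def split_into_chunks(batch, chunks):
    """Split a mini-batch (a list of samples) into `chunks` micro-batches.

    Assumes len(batch) is divisible by chunks, to keep the sketch short.
    """
    size = len(batch) // chunks
    return [batch[i * size:(i + 1) * size] for i in range(chunks)]

def forward_schedule(num_stages, chunks):
    """Return {time_step: [(stage, micro_batch), ...]} for the forward pass.

    Stage s can start micro-batch m once stage s-1 has finished it,
    so with equal step times it runs at time step s + m.
    """
    schedule = {}
    for stage in range(num_stages):
        for mb in range(chunks):
            schedule.setdefault(stage + mb, []).append((stage, mb))
    return schedule

micro_batches = split_into_chunks(list(range(8)), chunks=4)
schedule = forward_schedule(num_stages=2, chunks=4)
for step in sorted(schedule):
    print(step, schedule[step])
```

At time step 1 the schedule contains both (0, 1) and (1, 0): GPU0 is on micro-batch 1 while GPU1 is on micro-batch 0, which is exactly the overlap described above.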

In the diagram above, note how the bubbles — representing idle periods — are minimized compared to the naive approach.

With a pipeline parallelism degree of 4 (4 GPUs), each GPU handles multiple micro-batches in an overlapping manner: first processing several forward passes and then, as work on other GPUs completes, beginning the backward passes.

For example, GPU0 runs the forward pass on chunks 0, 1, 2 and 3 (F0,0, F0,1, F0,2, F0,3) and then waits for the other GPUs to do their work. Only when their results start coming back does GPU0 resume, running the backward pass for chunks 3, 2, 1 and 0 (B0,3, B0,2, B0,1, B0,0).
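Here is a small sketch that prints this schedule as a text timeline, using the same F and B labels. It makes two simplifying assumptions that real training does not satisfy exactly: every forward and backward step takes one time unit, and all forward steps finish before any backward step starts. The function name gpipe_timeline is hypothetical.

```python
def gpipe_timeline(num_gpus, chunks):
    """Build a per-GPU timeline of forward (F) and backward (B) steps.

    Simplifying assumptions: every forward and backward step takes one
    time unit, and all forward passes finish before any backward starts.
    """
    fwd_steps = chunks + num_gpus - 1
    total_steps = 2 * fwd_steps
    timeline = [["...."] * total_steps for _ in range(num_gpus)]
    for gpu in range(num_gpus):
        for mb in range(chunks):
            # Forward: GPU g handles micro-batch m at step g + m.
            timeline[gpu][gpu + mb] = f"F{gpu},{mb}"
            # Backward: the last GPU goes first, micro-batches in reverse order.
            step = fwd_steps + (num_gpus - 1 - gpu) + (chunks - 1 - mb)
            timeline[gpu][step] = f"B{gpu},{mb}"
    return timeline

timeline = gpipe_timeline(num_gpus=4, chunks=4)
for row in timeline:
    print(" ".join(row))
```

Each "...." cell in the printed rows is an idle step, so the output gives a rough picture of where the bubble sits for each GPU.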

With chunks=1 you end up with naive model parallelism, which is very inefficient. With a very large chunks value you end up with tiny micro-batches, which may not be very efficient either. So you have to experiment to find the value that gives the highest GPU utilization.

The diagram shows a bubble of “dead” time that can’t be parallelized, because the backward pass can only begin once the last stage has finished its forward pass, and the earlier stages must wait for gradients to flow back to them. The purpose of finding the best value for chunks is to keep all participating GPUs busy concurrently, which translates to minimizing the size of the bubble.
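Under idealized assumptions (equal-cost stages, no communication time), the idle fraction for this kind of schedule with p GPUs and m micro-batches works out to (p - 1) / (m + p - 1). The sketch below just evaluates that expression for a few values of chunks; the function name is made up for illustration.

```python
def bubble_fraction(num_gpus, chunks):
    """Idle fraction per GPU for a GPipe-style schedule.

    Assumes equal-cost stages and ignores communication time, so each GPU
    is busy for `chunks` of every `chunks + num_gpus - 1` steps.
    """
    return (num_gpus - 1) / (chunks + num_gpus - 1)

for chunks in (1, 4, 16, 64):
    print(chunks, round(bubble_fraction(4, chunks), 3))
```

With chunks=1 on four GPUs this gives 0.75 idle, which matches the 25% utilization of the naive setup. The formula keeps shrinking as chunks grows, but it does not model the per-step inefficiency of very small micro-batches, so it cannot replace experimenting on real hardware.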

Wrapping Up

Now that we understand the common terminology and goals of pipeline parallelism, it’s worth noting that there are several possible ways of scheduling forward and backward micro-batches across devices, and each approach offers different tradeoffs between pipeline bubble size, amount of communication, and memory footprint.

In future posts, we’ll delve into advanced pipeline scheduling strategies and discuss how they further improve performance and scalability.

Thanks for reading :)

3D Human Pose Estimation using LSTM and Transformer based models

Bargav Jagatha — Tue, 31 Dec 2024 06:08:34 GMT

Ever wondered how computers can understand and track human movement in 3D space? I recently developed a system that does exactly that, combining the power of LSTM networks and Transformers to create accurate 3D pose estimates from video footage using a 2D-to-3D lifting approach.

The Challenge of Understanding Human Movement

Tracking human movement in 3D space is a complex problem that has applications ranging from animation to medical analysis. While 2D pose estimation has made significant strides, accurately predicting 3D poses brings additional challenges:

Depth perception from 2D videos
Handling occlusions and self-occlusions
Maintaining temporal consistency
Processing long sequences efficiently

Monocular 3D Pose Estimation

Our Approach: Comparing Classic and Modern Architecture

Since 2D-to-3D lifting is a sequential problem, we used the following neural network architectures as our models:

LSTM Networks: Perfect for understanding sequential data and temporal relationships in movement
Transformer Models: Excellent at capturing long-range dependencies and parallel processing

The models achieved the following impressive MPJPE (Mean Per Joint Position Error) scores:

55mm with our LSTM-based model
64mm with our Transformer-based approach

To put this in perspective, the current state-of-the-art achieves around 30mm — showing that our implementation provides robust performance while remaining accessible and adaptable.
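For readers new to the metric, MPJPE is the average Euclidean distance between predicted and ground-truth joint positions. The sketch below computes it for a single made-up pose in plain Python. It is not the evaluation code from the project, and it skips steps that evaluation protocols often apply first, such as aligning the root joint and averaging over all frames.

```python
import math

def mpjpe(predicted, target):
    """Mean Per Joint Position Error.

    Both inputs are lists of (x, y, z) joint positions in the same units
    (millimetres here). Returns the mean Euclidean distance over joints.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same number of joints")
    distances = [math.dist(p, t) for p, t in zip(predicted, target)]
    return sum(distances) / len(distances)

# Toy 3-joint pose; the numbers are made up for illustration.
target = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (100.0, 200.0, 0.0)]
predicted = [(0.0, 0.0, 0.0), (103.0, 4.0, 0.0), (100.0, 200.0, 12.0)]
error_mm = mpjpe(predicted, target)
print(round(error_mm, 2))
```

Here the per-joint errors are 0, 5 and 12 mm, so the result is their mean.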

Building on Giants’ Shoulders

Our work builds upon several outstanding projects in the field:

PoseFormerV2 by QitaoZhao
VideoPose3D by Facebook Research
3D Pose Baseline

By incorporating insights from these projects and adding our own innovations, we’ve created a flexible framework that researchers and developers can easily adapt to their needs.

Real-World Applications

This technology has practical applications across multiple fields:

Animation: Creating realistic character movements
Sports Analysis: Studying athlete performance
Medical Assessment: Tracking patient movement patterns
Human-Computer Interaction: Building more intuitive interfaces

Technical Implementation

We trained our model on the Human3.6M dataset, processing videos in windows of 81 frames to capture complex motion patterns. The system processes these sequences through:

Initial pose detection using YOLOv3 and HRNet
Temporal modeling with our model architectures
Final 3D pose estimation
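The 81-frame windowing can be sketched as a simple sliding window over per-frame keypoints. The stride parameter and the function name are assumptions made for this example; the post only states the window size, and the real pipeline may pad or sample frames differently.

```python
def sliding_windows(frames, window=81, stride=1):
    """Yield consecutive windows of `window` frames from a sequence.

    `stride` is a hypothetical parameter; the post only states the window size.
    """
    for start in range(0, len(frames) - window + 1, stride):
        yield frames[start:start + window]

# Stand-in for per-frame 2D keypoints: just frame indices here.
frames = list(range(100))
windows = list(sliding_windows(frames))
print(len(windows), len(windows[0]))
```

Each window would then be fed to the LSTM or Transformer model as one input sequence.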

Future Directions

Reducing the computational requirements
Improving real-time performance
Extending to multi-person scenarios
Handling more challenging viewpoints

Join the Journey

This project represents a step forward in making 3D pose estimation more accessible to researchers and developers. Whether you’re interested in computer vision, deep learning, or practical applications of AI, there’s something here for you to explore and build upon.

Want to learn more or contribute to the project? Check out our GitHub repository or reach out to discuss potential collaborations!

GitHub - bargav25/3D-Human-Pose-Estimation

The code and pretrained models are available on GitHub, along with detailed setup instructions and documentation.

KeypointNeRF: A New Approach to 3D Motion Capture Using Neural Radiance Fields

Bargav Jagatha — Tue, 31 Dec 2024 05:45:17 GMT

Overall Pipeline

Ever wondered how we could capture the intricate movements of animals in 3D without complex multi-camera setups or pre-defined skeletal models? That’s exactly what we tackled in my directed study, developing KeypointNeRF — a novel approach that’s changing how we think about 3D motion capture.

Comparing rendered images

The Challenge: Why Traditional Methods Fall Short

Think about capturing a rat’s movement in 3D. Traditional methods typically require either:

Multiple cameras capturing the subject from different angles
Pre-defined skeleton models (like those used for human motion capture)
Complex setups that aren’t practical in many real-world scenarios

This becomes especially challenging when you’re working with animals, where you can’t simply apply human-based skeletal models, and setting up multiple cameras might disturb their natural behavior.

Rendered Rat

Our Innovation: Keypoint-Based Neural Radiance Fields

We developed a solution that combines the power of Neural Radiance Fields (NeRFs) with a flexible keypoint-based approach. Instead of relying on rigid skeletal models or multiple camera views, we use 3D keypoints and their relationships to capture motion. Here’s what makes it special:

Single Camera Solution: Unlike traditional methods, our approach works with footage from just one camera — making it much more practical for real-world applications.
No Skeleton Required: Rather than forcing a pre-defined skeleton model, we use keypoints that can adapt to any articulated object, whether it’s a rat, a robot, or any other moving subject.
Smart Background Handling: We integrated SAM2 (Segment Anything Model v2) to automatically remove backgrounds, letting us focus purely on the subject’s motion.

The Technical Magic Behind It

The real innovation lies in how we handle the 3D space. For each point in space, we compute:

Relative distances to keypoints
Directional relationships
View-dependent effects

This creates a rich representation that captures not just position, but the complete dynamic nature of the subject’s motion. Think of it as creating a dynamic 3D map that updates with every movement.
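As a rough illustration of the first two items, the sketch below computes, for one query point, its distance and unit direction to each keypoint. It is a simplified stand-in, not the code from the project: the function name and the feature layout are assumptions, and view-dependent effects are left out entirely.

```python
import math

def keypoint_features(point, keypoints):
    """For one 3D query point, return its distance and unit direction to each keypoint."""
    features = []
    for kp in keypoints:
        offset = tuple(k - p for k, p in zip(kp, point))
        distance = math.sqrt(sum(c * c for c in offset))
        # Guard against a query point that coincides with a keypoint.
        direction = tuple(c / distance for c in offset) if distance > 0 else (0.0, 0.0, 0.0)
        features.append((distance, direction))
    return features

keypoints = [(0.0, 0.0, 2.0), (3.0, 4.0, 0.0)]  # made-up 3D keypoints
features = keypoint_features((0.0, 0.0, 0.0), keypoints)
for distance, direction in features:
    print(distance, direction)
```

Because the features are expressed relative to the keypoints, they move with the subject as the keypoints move from frame to frame.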

Real-World Applications

This research opens up exciting possibilities across multiple fields:

Animal Behavior Studies: Scientists can now capture and analyze animal movements more naturally
Computer Animation: Create more realistic animations without complex rigging
Biomechanics Research: Study movement patterns with less invasive equipment
Medical Motion Analysis: Track patient movements for physical therapy or diagnosis

Looking Ahead

While our current results are promising, we’re already thinking about future improvements:

Enhancing motion consistency across frames
Implementing real-time processing capabilities
Extending the framework to handle even more complex movements

The Bigger Picture

This project represents more than just a technical achievement — it’s about making 3D motion capture more accessible and practical. By removing the need for complex multi-camera setups and pre-defined skeletons, we’re opening up new possibilities for researchers, animators, and scientists across various fields.

Would you like to learn more about this research or discuss potential applications? Feel free to reach out or check out our project materials on GitHub!

GitHub - bargav25/RatNeRF

This research was conducted as part of my directed study at Boston University, building upon recent advances in Neural Radiance Fields and 3D computer vision technology.

Plan As You Go: How We Built an AI-Powered Boston Trip Planner in One Hour

Bargav Jagatha — Tue, 31 Dec 2024 05:26:43 GMT

Ever tried planning a trip and felt overwhelmed by endless browser tabs, conflicting reviews, and the constant fear of missing out on the best experiences? That’s exactly what drove my roommate and me to create Plan As You Go during a recent hackathon — a smart trip planner that combines real-time Boston events with AI-powered personalization.

GitHub - bargav25/weekend_planner

The “Aha!” Moment

As Boston residents, we’ve seen countless tourists (and even locals) struggle to piece together the perfect itinerary. Sure, everyone knows about the Freedom Trail and Fenway Park, but what about that underground jazz concert happening next weekend? Or that pop-up food festival in Cambridge? That’s when it hit us — why not create a tool that blends the best of both worlds: AI’s comprehensive knowledge of Boston’s attractions and real-time data about current events?

Building the Time Machine

The most exciting part? We built this entire project in just one hour! Here’s how we did it:

First, we tapped into https://www.thebostoncalendar.com/ to get real-time event data, ensuring our users would never miss out on the city’s latest happenings.
Then, we leveraged Gemini’s Flash API to create a smart recommendation engine. We crafted our prompts to generate structured JSON responses, making it easy to parse and display personalized recommendations based on:

Food preferences (because no one should miss out on Boston’s incredible culinary scene)

Date flexibility (weekend warriors, we’ve got you covered)

Budget constraints (from student-friendly to luxury experiences)

Personal interests (history buff? Food enthusiast? Art lover? Check, check, and check!)
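The structured-JSON idea can be sketched without calling any API. Everything specific here is hypothetical: the prompt wording, the JSON keys, and the made-up reply standing in for a model response are illustrations, not taken from the project.

```python
import json

def build_prompt(preferences, events):
    """Compose a prompt asking the model to answer with JSON only.

    The wording and JSON keys are hypothetical, not taken from the project.
    """
    return (
        "You are a Boston trip planner. Using the user preferences and events below, "
        'reply with JSON only, shaped as {"recommendations": '
        '[{"name": str, "reason": str, "estimated_cost": number}]}.\n'
        f"Preferences: {json.dumps(preferences)}\n"
        f"Events: {json.dumps(events)}"
    )

def parse_recommendations(response_text):
    """Parse the model's reply, returning an empty list if it is not valid JSON."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        return []
    return data.get("recommendations", []) if isinstance(data, dict) else []

preferences = {"food": "seafood", "dates": "weekend", "budget": "student", "interests": ["history"]}
prompt = build_prompt(preferences, events=[{"name": "Jazz night", "date": "Saturday"}])

# A made-up reply standing in for what the API might return.
fake_reply = '{"recommendations": [{"name": "Jazz night", "reason": "Fits a weekend trip", "estimated_cost": 15}]}'
recommendations = parse_recommendations(fake_reply)
print(recommendations[0]["name"])
```

Falling back to an empty list on malformed output keeps the display code simple, since a model can occasionally return text that is not valid JSON.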

The Magic Behind the Scenes

What makes Plan As You Go special isn’t just its comprehensive database — it’s how it understands what makes each trip unique. By combining real-time events with AI-powered recommendations, we created a system that doesn’t just list attractions; it crafts experiences.

Want to catch a Red Sox game and find the perfect pre-game dinner spot in Fenway? Our AI considers everything from walking distance to reservation availability. Interested in contemporary art? It might pair a visit to the ICA with an upcoming gallery opening in SoWa that perfectly matches your interests.

The Learning Experience

While we didn’t win the hackathon (note to self: always read the fine print about AI usage declarations!), we created something we’re genuinely proud of. Plan As You Go demonstrates how AI can transform the way we explore cities, making travel planning more personalized and spontaneous.

What’s Next?

This one-hour project opened our eyes to the possibilities of AI-powered travel planning. Could this be scaled to other cities? Could we add more real-time data sources? The possibilities are endless, and we’re just getting started.

For those interested in the technical details or wanting to contribute, check out our GitHub repository. Who knows? Maybe your contribution will help someone discover their next favorite Boston experience!

Remember, sometimes the best projects don’t come from months of planning — they come from recognizing a simple problem and realizing you have the tools to solve it right now.

Have you used AI to build something cool in record time? Share your story in the comments below!