Weeks 10, 11 & 12 of GSoC 2024: Medicine Embeddings with PolyPhy/Polyglot

Ayush Sharma
Sep 3, 2024

Introduction

Hello, everyone! I’m Ayush Sharma, a Google Summer of Code (GSoC) 2024 contributor at UC OSPO. I’m excited to share my progress on the project “Unveiling Medicine Patterns: 3D Clustering with PolyPhy/Polyglot.” Guided by my mentors, Oskar Elek and Kiran Deol, I’ve been working on medical data visualization using PolyPhy/Polyglot.

Week 10

Embeddings for the whole dataset

  • Llama 3.1 : Created embeddings for all 55,000 medicines using Llama 3.1 and UMAP.
  • Gemini-Pro : Created embeddings for all 55,000 medicines using Gemini-Pro’s embedding model and UMAP.
  • Gemma 2 : Tried to create embeddings for all 55,000 medicines using Gemma 2, but could not process the full dataset because of the very high computation time, so I created embeddings on a subset instead.
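
To make the workflow above more concrete, here is a minimal sketch of batched embedding, reproducible subset sampling for a slow model, and a 3D UMAP projection. It is not the project's notebook code: `placeholder_embed` is a hypothetical stand-in for a real model call (Llama 3.1, Gemini-Pro or Gemma 2), the helper names and batch size are assumptions, and `project_to_3d` assumes the `umap-learn` and `numpy` packages are installed.

```python
import hashlib
import random
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


def placeholder_embed(texts: Sequence[str], dim: int = 8) -> List[Vector]:
    """Hypothetical stand-in for a real embedding model.

    Maps each text to a deterministic pseudo-embedding built from its
    SHA-256 digest (so dim must be at most 32).
    """
    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        vectors.append([byte / 255.0 for byte in digest[:dim]])
    return vectors


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_all(
    names: Sequence[str],
    embed_fn: Callable[[Sequence[str]], List[Vector]],
    batch_size: int = 256,
) -> List[Vector]:
    """Embed every name, one batch at a time, preserving input order."""
    embeddings: List[Vector] = []
    for batch in batched(names, batch_size):
        embeddings.extend(embed_fn(batch))
    return embeddings


def sample_subset(names: Sequence[str], k: int, seed: int = 0) -> List[str]:
    """Draw a reproducible random subset for models too slow for the full set."""
    rng = random.Random(seed)
    return rng.sample(list(names), min(k, len(names)))


def project_to_3d(embeddings: List[Vector], seed: int = 42):
    """Reduce embeddings to 3 dimensions with UMAP (needs umap-learn and numpy)."""
    import numpy as np
    import umap

    reducer = umap.UMAP(n_components=3, random_state=seed)
    return reducer.fit_transform(np.asarray(embeddings))
```

With a real model, `embed_fn` would wrap the model's embedding call, and the subset size passed to `sample_subset` would be chosen to fit the available compute budget.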

Week 11

Code optimization and cleaning

  • Code cleaning : Streamlined and refactored the embedding notebooks for better readability and efficiency.
  • Code optimization : Experimented with CUDA (GPU) acceleration to reduce processing time.
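
As an illustration of the kind of CUDA experiment mentioned above, the sketch below picks a GPU when PyTorch can see one and falls back to the CPU otherwise. The post does not say which framework was used, so PyTorch is an assumption here, and `pick_device` and `normalize_rows` are hypothetical helper names. L2-normalising embedding rows is only an example workload.

```python
import math
from typing import List, Optional, Sequence


def pick_device() -> str:
    """Return "cuda" when PyTorch is installed and sees a GPU, else "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def normalize_rows(
    rows: Sequence[Sequence[float]], device: Optional[str] = None
) -> List[List[float]]:
    """L2-normalise each row, on the GPU when one is available."""
    try:
        import torch
    except ImportError:
        # Pure-Python fallback so the sketch still runs without PyTorch.
        normalized = []
        for row in rows:
            norm = math.sqrt(sum(x * x for x in row))
            normalized.append([x / norm if norm > 0 else 0.0 for x in row])
        return normalized

    target = device or pick_device()
    tensor = torch.tensor(rows, dtype=torch.float32, device=target)
    tensor = torch.nn.functional.normalize(tensor, p=2, dim=1)
    # Move the result back to the CPU before converting to Python lists.
    return tensor.cpu().tolist()
```

Keeping the device choice in one helper means the same notebook can run unchanged on machines with and without a GPU.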

Week 12

MediGlot and GitHub

  • GitHub : Finalized and uploaded the complete code to the GitHub repository, ensuring all necessary changes were made.
  • MediGlot : Introduced MediGlot, a new sub-project derived from Polyglot, aimed at enhancing the scope and application of the original project.

Conclusion

Over the past few weeks, I’ve had the opportunity to explore advanced embedding techniques and dimensionality reduction in depth. Working with models like Gemma 2, Llama 3.1, and Gemini-Pro, and employing techniques such as t-SNE and UMAP, has been both challenging and rewarding. Switching to Jaccard distance also opened up new perspectives on clustering medicinal data.
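
For readers unfamiliar with Jaccard distance, here is a small sketch. It assumes each medicine is represented as a set of tokens (for example, ingredient names), which is an illustrative choice rather than a description of the project's actual features. The function name and the sample sets are hypothetical.

```python
from typing import Set


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """Return 1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical token sets for two medicines.
med_a = {"paracetamol", "caffeine"}
med_b = {"paracetamol", "ibuprofen"}

# One shared token out of three distinct tokens gives a distance of 2/3.
distance = jaccard_distance(med_a, med_b)
```

A distance of 0 means the two sets are identical, and 1 means they share nothing.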

As I wrap up my GSoC journey, I can confidently say that the learnings have been invaluable. The guidance and support from my mentors, Oskar Elek and Kiran Deol, have been crucial in navigating the complexities of this project. With the introduction of MediGlot and the finalization of our work on PolyPhy/Polyglot, I’m excited about the future possibilities for uncovering intricate patterns in medicinal data.

Thank you for following along, and stay tuned as the project continues to evolve beyond GSoC!
