satojkovic

Paper Review: Pixel Aligned Language Models (5d ago)
In previous research on vision and language alignment, most studies have used the entire image as input. In contrast, this paper proposes…
Prompt Ensemble in Zero-shot Classification using CLIP (May 6)
CLIP is pretrained to predict whether an image and text are paired within a dataset. As shown in (2) and (3) of the diagram, for zero-shot…
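As a rough illustration of the prompt-ensemble idea discussed in that post (not the article's code; `encode_text` here is a hypothetical stand-in for CLIP's text encoder): embed each class name under several prompt templates, average and re-normalize the embeddings to get one classifier weight per class, then classify an image embedding by cosine similarity.

```python
import numpy as np

def encode_text(s: str) -> np.ndarray:
    # Hypothetical stand-in for CLIP's text encoder: a deterministic
    # hash-seeded embedding so the sketch is self-contained and runnable.
    rng = np.random.default_rng(abs(hash(s)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

# Example prompt templates and class names (illustrative, not CLIP's full set).
templates = ["a photo of a {}.", "a drawing of a {}.", "a blurry photo of a {}."]
classes = ["cat", "dog"]

# Prompt ensemble: average the normalized embeddings of each class across
# templates, then re-normalize the mean so each class has a unit-norm weight.
weights = []
for c in classes:
    embs = np.stack([encode_text(t.format(c)) for t in templates])
    mean = embs.mean(axis=0)
    weights.append(mean / np.linalg.norm(mean))
W = np.stack(weights)  # shape: (num_classes, dim)

# Stand-in for an image embedding; logits are cosine similarities.
image_emb = encode_text("a photo of a cat.")
logits = W @ image_emb
pred = classes[int(np.argmax(logits))]
```

The averaging step is the ensemble: a single template is one noisy description of the class, while the mean over templates smooths out template-specific variation before the similarity comparison.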
Scenic: A JAX Library for Computer Vision Research and Beyond (Mar 20)
Scenic is an open-source JAX library focused on Transformer-based models. I recently came across the library while reading a paper on video…
Create your own GPT and generate text with OpenAI’s pre-trained parameters (Jan 14)
The first entry for 2024 is about GPT. Create your own GPT model, load the pre-trained parameters published by OpenAI, and perform a series…
Paper Review: Video-LLaMA (Sep 10, 2023)
Research on LLMs seems to be accelerating with Meta’s release of LLaMA and LLaMA2, and I read Video-LLaMA, a study on the…
JAX and composable program transformations (May 21, 2023)
The About section of https://github.com/google/jax states the following.
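The description that post quotes centers on JAX's composable function transformations, `jax.grad`, `jax.jit`, and `jax.vmap`. A minimal sketch of how they compose (my own toy example, not code from the post):

```python
import jax
import jax.numpy as jnp

def loss(w, x):
    # Simple scalar function: squared error of a linear model against 1.0.
    return (jnp.dot(w, x) - 1.0) ** 2

# Each transformation takes a function and returns a new function,
# so they stack freely in any order that type-checks.
grad_loss = jax.grad(loss)                         # gradient w.r.t. w
batched = jax.vmap(grad_loss, in_axes=(None, 0))   # vectorize over a batch of x
fast = jax.jit(batched)                            # JIT-compile the composition

w = jnp.ones(3)
xs = jnp.stack([jnp.arange(3.0), jnp.ones(3)])
g = fast(w, xs)  # per-example gradients, shape (2, 3)
```

For `x = [0, 1, 2]` the gradient is `2 * (w·x - 1) * x = [0, 4, 8]`, and `vmap` produces one such row per batch element without any explicit loop.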
Vision Transformer from scratch (JAX/Flax) (Jan 23, 2023)
Recently, I started to use JAX/Flax. Here, I would like to show how I implemented Vision Transformer (ViT) using JAX/Flax and how to train…