Altsoph’s blog - Medium

What’s Wrong with TTS Evaluation

Aleksey Tikhonov — Fri, 15 May 2026 15:40:51 GMT

Among other things, I am the Head of Evaluations at Inworld AI, where we also build TTS models. Our previous TTS model is still top 1 on the AA TTS leaderboard. Last week we shipped a new one, Realtime TTS-2.

To make any of that actually move, my team spent the last half year building a proper TTS evaluation system internally. Somewhere in the middle of that work I realized I had accumulated a mildly unhealthy amount of opinions about TTS eval. So here is my post about that:

https://altsoph.substack.com/p/whats-wrong-with-tts-evaluation

What’s Wrong with TTS Evaluation was originally published in Altsoph’s blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Quantum Persona and Test-time Mode Collapse

Aleksey Tikhonov — Fri, 25 Jul 2025 18:15:15 GMT

This time I want to explore a phenomenon I’ve been investigating: what happens when a model trained on contradictory information needs to give a single, coherent answer. The experiments demonstrate something I call “test-time mode collapse” — how models quickly lock into one consistent persona, and how we can influence which persona that is.


What’s Wrong with Chat-Templates Format for LLM

Aleksey Tikhonov — Fri, 18 Jul 2025 14:56:00 GMT

I’d like to discuss the current situation with LLM prompting standards; how we got into this mess; and how to live with it now.



BIG BANG OF AGENT RULES

Aleksey Tikhonov — Mon, 30 Jun 2025 21:15:17 GMT

TLDR: Last weekend, I spent several hours digging through publicly available cursor rules files and analyzing existing usage patterns. Despite the presence of a lot of garbage and auto-generated content, I found several distinct strategies that people use to shape AI agent behavior.

What Are Agent Rules, Anyway?

For those unfamiliar, these rules are configuration files that shape how AI coding agents behave in your project. Think of them as prompting for developers — you write instructions about your preferred coding style, project structure, or workflow, and the agent tries to follow them. And, well, yeah, it’s basically just prompting under the hood, which means not all of them are useful.

Intro

Last weekend, it was too hot outside, so I finally managed to investigate something that had been bothering me for a while: how people actually use cursor rules in practice. You’ve probably seen these `.cursorrules` files scattered across GitHub repos, but are they really useful, or are they just a cargo cult? I’ve read many recommendations and descriptions of different practices, and I’ve written such rules myself, but I still wondered whether there are systematic practices for using them. It’s hard to say without proper analysis.

You can probably guess what happened next — what started as a quick investigation turned into several hours of repository digging and analysis.

…this cartoon was e2e generated by my jokes+cartoons generation pipeline…



COMIC CARTOON GENERATION

Aleksey Tikhonov — Tue, 01 Apr 2025 15:41:54 GMT

Since April Fool’s Day is today, let me share some of my results on automated comic cartoon generation.

Last time, I shared how Pavel Shtykovskiy and I wrote a paper on Humor Mechanics, published at ICCC-2024, and how Alexey Ivanov and I launched HUMOR-ARENA to collect human labels and improve automated humor generation and ranking. After that, I decided to check whether AI can be used to generate and filter proper cartoons (based on one-liners I had previously generated with AI).

First, not every one-liner (even a good one) can be used as the basis for a cartoon. Some jokes are too abstract, and some wordplay cannot be adequately visualized. That means we need an automated way to tell whether a given joke is a good starting point. I collected some examples of good and bad ones and asked a reasoning model (o1) to generate an instruction, a guide. It gave me specific rules, including checks for Visual Clarity, Concrete Elements, Scene Foundation, and so on. So, I took our top generated jokes (from the Humor-Arena rating) and filtered them with claude-3.5-sonnet + o3-mini, both armed with that visual instruction. If either of these models thinks the joke is bad for visualizing, we reject it. That leaves us with 25% of the top jokes.
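The veto rule above is easy to sketch. This is a minimal version, assuming each judge is just a function that returns True when it thinks a joke can be drawn. The names `filter_visualizable`, `judge_a`, and `judge_b` are hypothetical, and the toy heuristics only stand in for the real calls to claude-3.5-sonnet and o3-mini, which are not shown here.

```python
from typing import Callable, Iterable, List

# A judge takes a joke and returns True if it thinks the joke can be visualized.
Judge = Callable[[str], bool]


def filter_visualizable(jokes: Iterable[str], judges: List[Judge]) -> List[str]:
    """Keep a joke only if every judge accepts it (a single veto rejects it)."""
    return [joke for joke in jokes if all(judge(joke) for judge in judges)]


# Toy stand-ins for the two LLM judges (hypothetical heuristics, not model calls).
def judge_a(joke: str) -> bool:
    return "pun" not in joke.lower()


def judge_b(joke: str) -> bool:
    return len(joke.split()) >= 4


jokes = [
    "My cat filed a complaint about the vacuum",
    "A pun about time flies",
    "Too abstract",
]
kept = filter_visualizable(jokes, [judge_a, judge_b])
print(kept)
```

Requiring unanimous approval trades recall for precision: some drawable jokes get lost, but fewer hopeless ones reach the image generation step.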

Next, we need to generate the cartoons. (Note: this was done before the recent releases of new image-gen LLMs, so it would be even easier now.)
For generation, I used a pair of models, o3-mini + DALLE-3; the trick is to provide enough details to develop a recognizable style and make a funny cartoon. Since I aimed to match a specific visual style, resembling classics like the New Yorker’s or Floyd Gottfredson’s, I took a bunch of examples and reverse-engineered a generalized visual style description.

As for funny image creation, vanilla o3-mini wasn’t creative enough to come up with interesting details without hints, so, again, I used a superior model (o1) to generate meta-instructions, a guide on how to create an interesting cartoon based on a given one-liner.

That gave me a lot of cartoons; personally, I found 20–30% of them pretty good. Using Cursor, I quickly sketched a script to resize the image and add the text of the original joke at the bottom, in Comic Sans, you know.


HUMOR ARENA

Aleksey Tikhonov — Sun, 08 Dec 2024 13:01:27 GMT

TLDR: We developed a novel approach to humor generation that gives human-level results on blind tests. To facilitate further progress in humor generation and understanding, we made HUMOR ARENA, a site where you can participate in side-by-side labeling of various generated one-liners, see the ranking of models, and read the automatic top of generated jokes.

Quick quiz: of these 7 one-liners, only 3 are human-written. Can you guess which ones?

Anyone who has asked an LLM to make a joke knows how bad the results usually are. Usually, it responds with one of the memorized standard dad jokes (Why did the scarecrow win an award?.. Why did the tomato turn red?.. Why don’t scientists trust atoms?..). (Jentzsch and Kersting 2023) show that over 90% of the 1000+ jokes generated by ChatGPT were the same 25 memorized jokes.

Does this mean that modern models are basically incapable of generating a good joke, or is the problem that we are not explaining the task well enough? A good joke should be original, but in the training data, models can only see old jokes, and the better the joke, the more copies of it are likely to be encountered. What if we isolate this signal and only consider original, new, unique jokes? How can we decide which joke is funnier?

A successful joke is often based on some pattern, such as broken expectations, a pun, or a play on words. (Warren, Barsky, and McGraw 2021) refer to more than 20 distinct humor theories attempting to explain humor appreciation. Indeed, the success of a joke also depends on context and audience. If two well-known stand-up comedians swapped texts, the audiences of both would likely be disappointed. There are psychological studies showing the polarization of audiences according to different types of perceived humor (thanks to Pavel Braslavski for discussions and advice on this topic). Long story short, there are workshops, conferences, and various research on these questions, but no one knows exactly how it works.

Earlier this year, my colleague from Inworld.AI, Pavel Shtykovskiy, and I decided to apply a data-driven approach to this problem. Taking a dataset of one-liners labeled by a large number of people with pairwise ratings (which of two jokes is funnier), we tried to reconstruct the set of rules behind the determination of the better joke, the so-called humor policy.
We then introduced a multi-step reasoning scheme that generates and subsequently refines associations to produce novel one-liners on a given topic.

As a result, in blind labeling by humans, our generated jokes were significantly funnier than a baseline set of good human jokes. For comparison, we used a subset of the dataset collected by (Weller and Seppi 2019) from Reddit jokes, filtered by user upvotes.

The results were pretty good to our taste, so we published all the details in our Humor Mechanics paper at The International Conference on Computational Creativity (ICCC) 2024.

After reading our paper, an old friend of mine, Alexey Ivanov from OpenAI, suggested we should create a platform where people can compare jokes generated by different models and thus form a ranking of models by their ability to make people laugh. After spending a few weekends, Alexey and I put together a prototype — Humor Arena. To aggregate pairwise scores into a single rating we used the new evalica library from Dmitry Ustalov (thanks Dmitry!).
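To give a feel for what "aggregating pairwise scores into a single rating" means, here is a minimal sketch of one classic option, the Bradley-Terry model fitted with simple iterative updates. It is an illustration only: it does not use the evalica API, I am not claiming this is the exact method behind the Humor Arena ranking, and the function name and toy votes are made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def bradley_terry(pairs: List[Tuple[str, str]], iters: int = 200) -> Dict[str, float]:
    """Estimate Bradley-Terry strengths from (winner, loser) pairs."""
    wins = defaultdict(float)   # total wins per item
    games = defaultdict(float)  # number of comparisons per unordered pair
    items = set()
    for winner, loser in pairs:
        wins[winner] += 1.0
        games[frozenset((winner, loser))] += 1.0
        items.update((winner, loser))

    strength = {item: 1.0 for item in items}
    for _ in range(iters):
        updated = {}
        for i in items:
            denom = 0.0
            for j in items:
                if i == j:
                    continue
                n_ij = games.get(frozenset((i, j)), 0.0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            # The tiny floor keeps items with zero wins strictly positive.
            updated[i] = max(wins[i], 1e-9) / denom if denom else strength[i]
        total = sum(updated.values())
        strength = {i: v * len(items) / total for i, v in updated.items()}
    return strength


# Toy votes: each tuple is (funnier joke source, less funny joke source).
votes = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "C"), ("B", "C"), ("C", "B")]
strength = bradley_terry(votes)
ranking = sorted(strength, key=strength.get, reverse=True)
print(ranking)
```

Unlike a plain win rate, this kind of model accounts for the strength of the opponents each model happened to face, which matters when the pairwise labels are sparse and uneven.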

To make the result more interesting, we also invented a way to automatically rank jokes based on the current partial pairwise labels, thus creating a beta version of the automatic top of jokes. The current top seems to be prone to dark humor and self-criticism. We’ll see when more pairwise scores are accumulated and the rankings are recalculated.

And yes, about the 7 jokes at the beginning of the post: 1, 3, 5, and 6 are model-generated, and so are all the others. There were no human-written jokes among them, sorry.


GALLERY OF UNSEEN

Aleksey Tikhonov — Sat, 30 Nov 2024 17:42:55 GMT

TLDR: It started as a bunch of strange experiments and ideas on unsupervised generation and converged into another NaNoGenMo project. This post contains a write-up on the nuances and technical details of this work.

Stage 1: Evolutionary search for aesthetics

First, I was playing with an evolutionary search for the optimal SDXL prompt, maximizing the aesthetic scores of the results as measured by Google’s pretrained NIMA model. As a starting set of prompt parts, I used something I had made before for my Freaking Architecture project, so initially the images were biased toward architecture and sculpture. Evolutionary search is not the fastest thing I know, so to speed up the process I used LCMScheduler with lcm-lora-sdxl to lower the number of inference steps to 4.

The genetic part was pretty straightforward: as the genome of an individual, I took a float vector of length 100 holding the probabilities of selecting each piece of the prompt; the pieces looked like this:

"fluid and dynamic forms", 
"golden silver elements", 
"googie motifs", 
"gray stone", 
"hexagonal pattern", 
"houses and roads", 
"in a ravaged library", 
"in a square", 
"limestone",
...

To evaluate individuals, I randomly sampled up to 15 pieces of the prompt with the corresponding probabilities, merged them together, sampled an image from SDXL, and scored it with NIMA. Since both prompt generation and image generation are stochastic, I repeated both steps up to 10 times and averaged scores. The size of a generation pool was 20; I used generic random mutations and standard cross-over.
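The loop described above can be sketched roughly as follows. This is not the original code: the SDXL + NIMA step is replaced by a toy scoring function, and the sampling details, the "keep the top half" selection, and the mutation rate and noise level are my assumptions for illustration. Only the genome length, the 15-piece cap, the 10 repeats, and the pool size of 20 come from the description.

```python
import random
from typing import Callable, List, Sequence, Tuple

GENOME_LEN = 100  # one selection probability per prompt piece
MAX_PIECES = 15   # at most 15 pieces per sampled prompt
REPEATS = 10      # stochastic evaluations averaged per individual
POOL_SIZE = 20    # individuals per generation

Genome = List[float]


def sample_prompt(genome: Genome, pieces: Sequence[str], rng: random.Random) -> str:
    """Pick each piece with its own probability, then cap the prompt at MAX_PIECES."""
    chosen = [p for p, prob in zip(pieces, genome) if rng.random() < prob]
    rng.shuffle(chosen)
    return ", ".join(chosen[:MAX_PIECES])


def fitness(genome: Genome, pieces: Sequence[str],
            score_prompt: Callable[[str], float], rng: random.Random) -> float:
    """Average the score over several stochastic prompt samples."""
    scores = [score_prompt(sample_prompt(genome, pieces, rng)) for _ in range(REPEATS)]
    return sum(scores) / len(scores)


def mutate(genome: Genome, rng: random.Random,
           rate: float = 0.05, sigma: float = 0.1) -> Genome:
    """Jitter a few genes with Gaussian noise, clamped to [0, 1] (assumed settings)."""
    return [min(1.0, max(0.0, g + rng.gauss(0.0, sigma))) if rng.random() < rate else g
            for g in genome]


def crossover(a: Genome, b: Genome, rng: random.Random) -> Genome:
    """Single-point crossover of two parents."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def evolve(pieces: Sequence[str], score_prompt: Callable[[str], float],
           generations: int = 5, seed: int = 0) -> Tuple[List[Genome], List[float]]:
    rng = random.Random(seed)
    pool = [[rng.random() for _ in pieces] for _ in range(POOL_SIZE)]
    best_scores = []
    for _ in range(generations):
        scored = sorted(((fitness(g, pieces, score_prompt, rng), g) for g in pool),
                        key=lambda pair: pair[0], reverse=True)
        best_scores.append(scored[0][0])
        parents = [g for _, g in scored[:POOL_SIZE // 2]]  # assumption: keep the top half
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
                    for _ in range(POOL_SIZE - len(parents))]
        pool = parents + children
    return pool, best_scores


# Toy stand-in for "render with SDXL, score with NIMA": counts stone-related pieces.
def toy_score(prompt: str) -> float:
    return float(prompt.count("stone"))


pieces = [f"motif {i}" for i in range(GENOME_LEN - 2)] + ["gray stone", "limestone"]
pool, best_scores = evolve(pieces, toy_score)
print(best_scores)
```

Averaging over repeated samples is what makes the fitness usable at all here: a single prompt and a single image are too noisy to tell a good genome from a lucky one.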

Both the average and the best scores slowly crawled up over time:

Manual inspection showed no signs of degeneration either:

So I ran it in my Colab Pro account and left it running overnight. In the morning I had 50K+ completely insane images; it was impossible to even look through them all.

I decided to focus on those with aesthetic scores greater than 6.0 (a pretty hard baseline), but still, there were 22K+ of them. I had to invent some way to find the most interesting images automatically.

Stage 2: Visual style clustering

To continue the experiments, I decided to try marimo. It’s a fresh Jupyter analog, and I wanted to give it a try (overall: so far, it looks interesting but a bit unpolished; maybe I should try it again in half a year).

Digging through the pile of images, I’ve noticed there are several distinctly different visual styles standing out — like photos, sketches, paintings, and so on. So, I decided to group pictures by visual style somehow, for starters.

The general plan was to embed images into vectors, then lower the dimensionality of latent space, then cluster these low-dimensional vectors, and, finally, explore the clusters.

I tried the DINO (v1) model embeddings first, but the resulting clusters were visually too internally diverse, so I didn’t see any clear corresponding style:

After a bit of research, I switched to the generic VGG16 model and took only the 16th layer’s output as a style embedding. That worked much better. After embedding, I made a UMAP projection of the embeddings into 2D space and ran DBSCAN over it.

The top 3 style-based clusters had more than 1k images each, and the top 10 had more than 150 images. Visually, these clusters were pretty consistent:

For the rest of my experiments, I took the images from the top 3 clusters — one was all about some huge empty gray rooms; another had drawings, collages, and sketches; the third was something like church interiors and strange semi-organic rooms.

Overall, almost all of these images were pretty good. However, there were still too many very similar ones among them — multiple images of very similar objects; perhaps it was the result of multiple (10x) runs of aesthetic scoring for each individual, so each prompt was used to generate multiple images.

Anyway, I decided to do some deduplication.

Stage 3: Semantic deduplication

To do the deduplication, I embedded the selected images again, this time with CLIP, since it should capture semantics better. Again, I applied UMAP + DBSCAN to get clusters across this subset of images.

Thus, I decided to take only one image from each semantic cluster, specifically the one with the maximum aesthetic score, and ended up with approximately 200 automatically selected images. They had

high aesthetic scores (scored by model),
more or less the same style,
and were diverse enough.
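The "one image per semantic cluster" step boils down to a group-by with a maximum. Here is a minimal sketch, assuming each image already carries a cluster label and an aesthetic score; the `Image` record and the file names are hypothetical. Skipping images labeled -1 (the label scikit-learn's DBSCAN gives to noise points) is my assumption, since the post does not say how unclustered images were handled.

```python
from typing import Dict, List, NamedTuple


class Image(NamedTuple):
    path: str
    cluster: int  # semantic cluster label; -1 marks unclustered noise
    score: float  # aesthetic score from the scoring model


def best_per_cluster(images: List[Image]) -> List[Image]:
    """Keep only the highest-scoring image from each cluster."""
    best: Dict[int, Image] = {}
    for img in images:
        if img.cluster == -1:
            continue  # assumption: drop noise points entirely
        if img.cluster not in best or img.score > best[img.cluster].score:
            best[img.cluster] = img
    return sorted(best.values(), key=lambda img: img.cluster)


images = [
    Image("room_01.png", 0, 6.4),
    Image("room_02.png", 0, 6.9),
    Image("sketch_01.png", 1, 6.2),
    Image("odd_one.png", -1, 7.5),
]
selected = best_per_cluster(images)
print([img.path for img in selected])
```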

They were also generally more or less about art, sculpture, and architecture, so I decided to convert them into something like a guidebook.

Stage 4: Essay generation

To complete the guidebook, I needed some textual descriptions of my selected images. Of course, I could just ask some VLM to write these descriptions, but the results of such approaches are usually weird: the texts are full of clichés and generic formulations, and when there are many of them, it is instantly obvious that they are very similar in structure.

To address these issues, I used an approach similar to the one from our recent humor generation project: association generation and a multi-step brainstorming framework. I also generated a short set of hints and recommendations for catalog writers (10 hints).

Finally, I asked OpenAI’s GPT-4o to generate a short essay about each image, providing it with the description, associations, and a random subset of the writer’s hints.

Stage 5: Final polishing and sharing

After the final cleanup, I ended up with 150 images and approximately 52K words of text. To make the layout look better, I searched for an automatic publishing framework. Eventually, I used WeasyPrint: this easy-to-use Python library lets you convert generated HTML to PDF and control page layout details with custom CSS.

Finally, I made a cover — this is the only part I’ve done manually (still using one of the generated images):

The resulting PDF and code are available on my GitHub. I’ve also made a NaNoGenMo submission to share it with the community; there, I promised to provide more technical details, so that’s the reason I wrote this post.


QR-DICE: 6-sided QR-cube

Aleksey Tikhonov — Sun, 27 Oct 2024 16:40:43 GMT

UPD: Got the Hack of the Month @ Berlin Hack&Tell #100 with this project

Here’s another project I never properly documented when I initially put it together. The idea hit me out of nowhere: could you arrange black cubes in a 3D grid inside a 21x21x21 cube so that projections from each face display six different QR codes, each with its own message?

Wait a minute, you might think — creating three unique projections sounds doable, but opposite faces would mirror each other, right? Not necessarily. I’d previously figured out how to make double-sided QR codes that can display different messages depending on the orientation, thanks to error correction. This trick works well for short messages (under, like, 8–11 characters), though some QR readers might struggle to interpret them.

Building the Cube: 3D challenge

The main question became: could we combine three pairs of projections along different axes, so all six QR codes would display as intended?

To do this, imagine having three projection candidates and needing to check if they’re compatible. If a cell on a projection is white, then every cell in that column should be white to avoid casting shadows. So, we start with a fully filled cube and erase rows or columns where a projection cell is white. If, after this, enough black cubes remain to cast the necessary shadows, the projection candidates work together.
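The carving check above can be written down in a few lines. This is a simplified sketch, not the original code: it treats each of the three projection candidates as a plain N x N grid of booleans (True means black), ignores mirroring between opposite faces and anything QR-specific, and all function names are mine.

```python
from itertools import product
from typing import List, Set, Tuple

Grid = List[List[bool]]      # True = black module
Cell = Tuple[int, int, int]  # (x, y, z) position of a small cube


def carve(xy: Grid, xz: Grid, yz: Grid) -> Set[Cell]:
    """Start from a full cube and keep a cell only if no projection needs it white."""
    n = len(xy)
    return {(x, y, z) for x, y, z in product(range(n), repeat=3)
            if xy[x][y] and xz[x][z] and yz[y][z]}


def shadows(cubes: Set[Cell], n: int) -> Tuple[Grid, Grid, Grid]:
    """Project the remaining cubes along each of the three axes."""
    xy = [[False] * n for _ in range(n)]
    xz = [[False] * n for _ in range(n)]
    yz = [[False] * n for _ in range(n)]
    for x, y, z in cubes:
        xy[x][y] = True
        xz[x][z] = True
        yz[y][z] = True
    return xy, xz, yz


def compatible(xy: Grid, xz: Grid, yz: Grid) -> bool:
    """Candidates work together if the carved cube still casts every black cell."""
    return shadows(carve(xy, xz, yz), len(xy)) == (xy, xz, yz)


full = [[True, True], [True, True]]
blank = [[False, False], [False, False]]
corner = [[True, False], [False, False]]
print(compatible(full, full, full), compatible(full, blank, blank))
```

Carving can never add a shadow where a projection is white, so comparing the resulting shadows with the targets only has to catch black cells that lost all their cubes.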

Projection candidates

The next step was to generate compatible projection candidates, which required a bit of creativity. To maximize possible combinations, I looked at padding options for each QR code. For instance, imagine I wanted to put the message “ONE” on one side of the cube and “TWO” on the opposite side. If I pad “TWO” with spaces on either side, I can create slight variations such as “TWO_____”, “_TWO____”, etc. Doing the same for “ONE” at the same time yields several hundred suitable codes, since each unique padding arrangement gives a distinct “double-sided” QR code.
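Generating the padded variants is the easy part. Here is a minimal sketch, assuming a fixed total width of 8 characters as in the example above; the function name and the fixed width are my assumptions, and the real search evidently explored more options, since a single fixed width gives far fewer pairs than the several hundred mentioned.

```python
from itertools import product
from typing import List


def padded_variants(message: str, width: int = 8, pad: str = " ") -> List[str]:
    """Return every way to place the message inside a field of the given width."""
    slack = width - len(message)
    return [pad * left + message + pad * (slack - left) for left in range(slack + 1)]


two_variants = padded_variants("TWO")
one_variants = padded_variants("ONE")

# Every (front, back) pair is a candidate for one double-sided QR code.
pairs = list(product(one_variants, two_variants))
print(len(two_variants), len(pairs))
```

A message longer than the width simply yields an empty list, so such messages drop out of the search on their own.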

Repeating this for 3 pairs of messages (ONE+TWO, THREE+FOUR, FIVE+SIX), I ended up with 216, 134, and 186 combinations, respectively. Altogether, that gives 216 × 134 × 186 ≈ 5.4 million configurations to test. While this sounded overwhelming, brute-forcing through these combinations revealed that around 1 in every 200 produced a compatible result.

The First Combination & Beyond

If you’d like to try it out yourself, here’s an interactive demo showcasing my first working combination.

Too lazy to try a QR reader? No problem; you can watch my quick 42-second video.

https://medium.com/media/d1af849a0f9a184939ce5219172026a7/href

After finding the first solutions, I wondered: can we achieve the same effect with fewer cubes? Starting with about 1,700 cubes (in an average solution), I applied a greedy algorithm, deleting cubes one by one until no more could be removed without disrupting the projections. The result: a minimalist cube with just 375 cubes, readable from all six sides, though it looks a bit rough, like an unfinished Death Star.
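The greedy pruning can be sketched like this. It is a simplified illustration rather than the original code: a cube is considered removable if every one of the three lines of sight through it still contains another cube, which keeps all three axis-aligned shadows intact. The function names and the optional `order` argument are mine.

```python
from collections import Counter
from itertools import product
from typing import Iterable, Optional, Set, Tuple

Cell = Tuple[int, int, int]


def shadow(cubes: Set[Cell]):
    """Return the three axis-aligned shadows as sets of 2D cells."""
    return (frozenset((x, y) for x, y, z in cubes),
            frozenset((x, z) for x, y, z in cubes),
            frozenset((y, z) for x, y, z in cubes))


def greedy_prune(cubes: Set[Cell], order: Optional[Iterable[Cell]] = None) -> Set[Cell]:
    """Delete cubes one by one while every projection cell stays covered."""
    kept = set(cubes)
    xy = Counter((x, y) for x, y, z in kept)
    xz = Counter((x, z) for x, y, z in kept)
    yz = Counter((y, z) for x, y, z in kept)
    for x, y, z in (order if order is not None else sorted(kept)):
        # Removable only if each line of sight through it has another cube.
        if xy[x, y] > 1 and xz[x, z] > 1 and yz[y, z] > 1:
            kept.remove((x, y, z))
            xy[x, y] -= 1
            xz[x, z] -= 1
            yz[y, z] -= 1
    return kept


full_cube = set(product(range(2), repeat=3))
pruned = greedy_prune(full_cube)
print(len(full_cube), len(pruned))
```

One pass is enough: the counts only ever go down, so a cube that could not be removed earlier will not become removable later. The result does depend on the visiting order, so passing a shuffled list as `order` and keeping the best of several runs is a cheap way to look for smaller solutions.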

Bonus: minimal, aesthetic solutions

Reducing cube count further posed a new question: how to create a visually balanced cube without sacrificing readability. A quick test with random sequences reduced the cube count to 294, yielding a much more appealing structure.

For anyone up for the challenge, feel free to explore further and find even better solutions. The code for experimenting with these cube combinations is on GitHub.


FREAKING ARCHITECTURE

Aleksey Tikhonov — Sat, 12 Oct 2024 15:25:15 GMT

TLDR: We built a pipeline that generates diverse images using neural networks and publishes them automatically on Telegram, Mastodon, Bluesky, and Tumblr. Later, we analyzed user reactions to improve prompts. Our findings were presented at HuMaIn@KI-2024. Follow the feeds or learn more on the project page.

A couple of years ago, I decided to investigate ways to generate high-quality yet diverse images with neural networks like Stable Diffusion without supervision. My goal was to create an automatic end-to-end pipeline that produced as few bad results as possible. I teamed up with an old friend, s0me0ne, and we began experimenting.

At first, we had a simple setup: random prompt generation based on a “kaleidoscopic” combination of a large list of keyphrases. But as the project developed, things got more complex. Over time, we arrived at a process where that initial prompt was just the starting point. The final image would go through several modality shifts, using three generative networks plus a couple of auxiliary networks to assess quality. Ablation studies showed that every step of the pipeline contributed to improving the results.

Early on, we set up automatic publishing of the images on Telegram to cut off our ability to moderate content. Later, we added feeds on Mastodon, Bluesky, and Tumblr (automatic posting to Twitter and Instagram hasn’t worked out yet).

About a year in, we had another idea. Since the Telegram feed stored a history of user reactions, we could download emoji responses and match them with elements from the original prompts via the images. This allowed us to identify keywords that statistically increased or decreased the chance of getting a reaction (thanks to Vadim Nikulin for helping with the history dumping).
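As a rough illustration of the keyword analysis, here is a minimal sketch that compares the reaction rate of posts containing a keyword with the overall reaction rate. It is not the analysis from the paper: the data layout, the function name, and the toy posts are assumptions, and a real version would also need a significance test before trusting any difference.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Post = Tuple[List[str], bool]  # (prompt keywords, whether the image got a reaction)


def keyword_lift(posts: List[Post]) -> Dict[str, float]:
    """Reaction rate of posts with a keyword minus the overall reaction rate."""
    counts = defaultdict(lambda: [0, 0])  # keyword -> [posts, posts with a reaction]
    for keywords, reacted in posts:
        for kw in set(keywords):
            counts[kw][0] += 1
            counts[kw][1] += int(reacted)
    base_rate = sum(1 for _, reacted in posts if reacted) / len(posts)
    return {kw: hits / n - base_rate for kw, (n, hits) in counts.items()}


posts = [
    (["gray stone", "limestone"], True),
    (["gray stone"], True),
    (["in a square"], False),
    (["in a square", "limestone"], False),
]
lift = keyword_lift(posts)
print(sorted(lift.items(), key=lambda item: item[1], reverse=True))
```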

Eventually, we even wrote a research paper, “Machine Apophenia: The Kaleidoscopic Generation of Architectural Images”, speculating on an idea of the Machine Apophenia effect, and presented it this past September at HuMaIn @ KI 2024.

You can subscribe to these feeds via the links above; we randomly publish 3–7 images a day to avoid spamming, and there are already over 4.5K images in the feed. For more technical details, check out the project page.

PS. It’s hard to say if this project is truly finished: every time we thought it was, new ideas emerged. Some, like latent space analysis and auto-generation of NeRF spaces, are still open.


ADMINISTRATIVIA

Aleksey Tikhonov — Sat, 12 Oct 2024 14:01:38 GMT

It’s been a while since I last posted here — almost four years, actually! A lot of different things have happened in that time, and I’ve collaborated on quite a few strange and interesting projects. So, I’ve decided to try and catch up by occasionally posting about some of my older, but still undescribed, works.

Also, for your convenience, I’ve set up a mirror of this blog on Substack.
