Z Potentials — Crafting the Future of 3D: Insights from VAST’s CEO

Tripo AI
12 min read · Jan 3, 2024


Welcome to our enlightening series brought to you in collaboration with Z Potentials, where we, at VAST, are excited to introduce our esteemed team of visionaries and innovators. This series offers a unique window into the minds of our key board members, starting with our CTO, Liang Ding, a luminary in technological development, and our CEO, Simon Song, a trailblazer in digital entrepreneurship. As we progress, you’ll gain insights from other pivotal figures shaping VAST’s journey. Each interview, conducted by Z Potentials, is a testament to our commitment to pioneering the future of 3D content creation. Join us in exploring the driving forces behind VAST’s groundbreaking advancements in the digital realm.

Following the previous issue on VAST, Z Potentials (hereafter “ZP”) held a deeper discussion with Simon Song Yachen, the founder of VAST. Born in 1997, Simon Song has always displayed an independent and multifaceted personality, blending playfulness with knowledge. He doesn’t consider himself a typical young entrepreneur, yet he embodies many qualities we associate with one: boldness, a willingness to act, and a disregard for authority. In the wave of generative AI startups, many young entrepreneurs lean towards applications and community building. Simon Song, however, has chosen to develop proprietary models and applications, specifically in the challenging yet highly imaginative field of 3D. He is passionate about the revolutionary applications of AI in the digital 3D world, firmly believing that technology can break creative boundaries and unlock human potential. Let’s dive into the entrepreneurial journey of this “atypical young entrepreneur.”

ZP: Could you introduce us to your student life?

Simon Song: Hello everyone, I am Simon Song, the founder and CEO of VAST.

I grew up in Hangzhou, so I was deeply influenced by traditional Chinese culture. In the first and second grades of elementary school, I stumbled upon classics like the “Tao Te Ching” and the “Platform Sutra” on my father’s bookshelf, which sparked my interest in this area. I even spent some time practicing Zen at Jinshan Temple during middle school, which ignited my passion for religious philosophy.

In high school, I applied to over thirty universities, mostly for theology-related programs. Eventually, I went to Johns Hopkins University to study Middle Eastern politics, as this major offered the opportunity to explore religions and humanities from around the world. To better understand this field, I learned Hebrew and Arabic. In my sophomore year, I spent a year in Israel and visited many unique places like Georgia, Bermuda, Morocco, and Cuba, experiencing diverse cultures and learning to embrace complexity.

Games also had a huge impact on me, as it seems hard for guys my age not to love gaming (laughs). My passion for gaming was as intense as my passion for religious philosophy. At Hopkins, I played games so much that my advisor had to bring my parents to the U.S. to address the issue. There was a huge dent in the middle of my dorm bed from the long hours I spent reading and gaming.

So, my experiences might seem contradictory to some: a mix of a religious philosophy enthusiast and a gamer who has traveled extensively.

ZP: When did you first get involved in the 3D field?

Simon Song: I joined SenseTime before graduating, where one of the business directions was 3D animation. I spent a lot of time communicating with various people in the domestic 3D animation industry because we were focused on AIGC, and thus very concerned about the efficiency of 3D content production.

I noticed that even the most talented content creators in China, graduates from top institutions like Tsinghua and CAFA, had to spend a lot of time on labor-intensive tasks, significantly limiting their creative potential.

The industry had long recognized this problem, but at that time, the only solution was to rely on better software tools to optimize efficiency. However, I realized there was a huge gap between content creators and software engineers. So, we used AIGC tools to serve many leading animation companies in China. At the same time, we also tried creating some animated content with AIGC tools, eventually building a social media account with millions of followers.

In this process, I gained some insights:

Firstly, people only consume top-notch, 10/10 content, not mediocre 9/10 content. So, flooding platforms with AI-generated content is meaningless if it’s not high-quality. Good content must be designed by humans: it needs aesthetics, a storyline, and control. Given that, AI’s role is to execute and realize these designs.

Secondly, the animation industry is labor-intensive. The technology at that time could let animators finish their overtime at 9 PM instead of 10 PM, but it couldn’t spark more creativity. Neither the quality nor the efficiency of content production had reached the Product-Market Fit (PMF) stage.

Therefore, I believe the fundamental solution is to liberate 3D content productivity. What’s needed is a revolutionary typewriter for 3D content: a large 3D model. This need follows from first principles. Creators should spend more time thinking and leave the manual work to the large 3D model.

ZP: When did you start thinking about entrepreneurship?

Simon Song: I had never thought about starting my own business before. During my work, I met many friends with rich experience and insights in 3D content and AIGC. In our discussions, I found our visions for 3D content surprisingly aligned. I am someone with strong initiative, so once we saw the same future, entrepreneurship seemed like a natural step, and that’s how VAST started.

ZP: Why did you decide to release some open-source work?

Simon Song: This involves a key understanding: 3D is a cornerstone of multimodal scenarios and can only realize its full value in specific applications. This is different from text and video content, which are information carriers and complete their experiential loop when viewed. But 3D content needs specific application scenarios to reveal better content and experiential value.

VAST’s vision is to liberate 3D content productivity. Besides solving foundational issues with our large 3D model, we also need more developers to unleash their creativity. We understand that the industry and many independent developers have long wanted applications in this area. 3D is a high-barrier field, so “saturation innovation” is crucial. Open-source models are essential to support this, and we feel obligated to push everyone to work in this direction.

ZP: As a model + application company, how do you consider your product line?

Simon Song: Currently, our products fall into two main categories:

The first category is our B2B2C business, designed to iterate our technology. We have dozens of strategic partners, covering almost the entire 3D industry pipeline, from industrial and 3D printing companies to gaming and animation companies. By serving these partners, we get specific and grounded feedback to iterate and optimize our technology, and even to refine our tech strategy. This is unique to us; from day one, we’ve insisted that technology must be applied to real scenarios and form a positive market iteration flywheel.

The second category is our Tripo brand. Tripo 1.0, launched on December 21, can generate high-precision 3D models with complete textures and meshes from text and images. The process is divided into two stages: the first stage takes about 5 to 10 seconds and produces rough results for screening, and the second stage performs high-precision generation, which currently takes about 5 minutes in our 1.0 version. Its main purpose is to showcase our technological strength and get more people to use it, while the usage data helps optimize our large model’s algorithm strategy. We will also provide a generic version of this large model capability to some strategic partners in the future.

Moreover, some core third-party developers have started using our beta version to design small games. At the end of the month, our developer community will release an AIGC game called Tripo GO. One of our beta users, a seasoned Pokémon player, used our beta large model to fulfill their dream of creating a game. Like our open-source strategy, we encourage and promote more such content experiments, hoping to see new content categories emerge in the era of large 3D models.

ZP: Why are you doing this?

Simon Song: We believe there are three key variables for the explosion of 3D content: 1) Lowering the barriers to distribution. 2) Reducing the production barriers. 3) Decreasing the experience barriers. Lowering the production barrier is achieved through our large 3D model; production is the critical bottleneck limiting the industry’s content explosion. Reducing the distribution barrier involves creating a hosting platform that provides a more native experience for 3D content, not hardware like XR, but a software platform closer to user experience scenarios. Lowering the experience barrier is about operational promotion, similar to the early days of TikTok and Kuaishou. Back then, the hardware and software platforms were ready, but people still needed a more concrete content paradigm to understand and experience short videos. We will focus on developing this in our developer ecosystem.

These three things seem complex, but they boil down to one belief: there will eventually be a 3D content platform that carries most 3D content. It will have extremely low creation barriers, with content that is shorter, flatter, faster, and more exhilarating. It might even break the traditional division of labor between DCC software and engines. This is the future we envision.

ZP: Can you explain more about this content paradigm?

Simon Song: We believe the content paradigm needs to find its place in both the changing and unchanging elements of content scenarios.

The changing part is the efficiency and method revolution brought by production tools, which will inevitably unlock many content spaces previously constrained by various efficiency and resource limitations. For example, a large game with hundreds of bosses and thousands of items would take years to produce assets, but now, basic asset production might take months or even weeks. This is just the change in pure resource production; the liberation of scenario design and creativity is even more exciting.

The unchanging part is the content experience. Once the bottleneck of production efficiency is solved, the well-trodden content path of being flatter, faster, and more exhilarating will re-emerge in the 3D content field. According to the evolution of past content industries, each such emergence signifies an industry explosion.

ZP: Who are VAST’s target users? How do you understand their needs/pain points?

Simon Song: Our team members themselves have rich project experience, and combined with our extensive communication and research in the industry, we have developed a user cognition model. We divide the entire 3D industry’s users into four levels:

The first level is professional pipeline users: users of traditional DCC software and some industrial engines. These are the elite producers in the industry, but the group is not large, around a hundred thousand people in China.

The second level is non-professional 3D creators who use professional 3D software like Blender but don’t specialize in this work. They might be doing a graduation project or just pursuing a hobby. These users are very sensitive to productivity tools and have an urgent need for them. Currently, the quality of content they produce is generally not as good as the first level, and most of it is not yet consumable, but they are real producers. This group has grown rapidly in recent years, already exceeding a million people in China and tens of millions globally.

The third level is game users, like me: sensitive to 3D content and experiences and finding it fun, but lacking the ability to create 3D content and not knowing how to learn it systematically. For example, if you ask me to generate a Pokémon, I’d find it interesting, but please don’t ask me to model it; it’s hard for me to learn in a short time. These are what we call the mainstream audience.

The fourth level is the novelty users, those who just want to play around and see what AI can generate, willing to try anything without a clear content goal.

ZP: So, which user levels are we mainly targeting?

Simon Song: We want to serve the first level through a B2B2C approach, but our focus will be on the second and third levels. The third level will be the main consumers in the future large 3D model ecosystem, and the second level, those who use Blender, will be the producers in the large 3D model ecosystem, similar to roles like YouTubers and streamers. They need a content paradigm to guide their production.

ZP: As a company focused on model plus application, and a self-developed model driven by end-to-end data, how would you define our competitors?

Simon Song: 3D is one of the most technically challenging tracks in the generative AI field. Currently, whether it’s industry giants or startups like us, we’re all in the early stages of technological implementation.

We can clearly see that global intellectual and computational resources are pouring into this direction, with related research and papers advancing rapidly.

At this stage, our talent reserve and technical capabilities are among the global leaders, and I believe everyone will continue to see our achievements.

In terms of our ultimate vision, we’ll be competing with other content platforms, essentially competing for users’ time.

ZP: Why do you think the 3D content paradigm has great potential? What is our long-term competitive edge in the 3D content paradigm?

Simon Song: First, regarding competitiveness, I don’t see a distinction between short-term and long-term. Once competitiveness emerges, it accumulates into long-term competitiveness. The key is forming that competitiveness, which is much harder than maintaining it, especially for startups. In terms of competitiveness, I think several things are very important:

First, technology is the most important. It truly lowers the barrier for users to create content. The key is that even the guy next door can create 3D content. Without smartphone cameras, there would be no TikTok, and our 3D generation can be seen as the smartphone camera. This is crucial; it’s like the typewriter or movable-type printing. Almost all content we experience traces back to a key technological breakthrough.

Second is finding the right content paradigm. You need to be sensitive to user needs. Only when a content paradigm is validated can a new content category be born. Once a content category is ignited, subsequent productivity will flow more efficiently into this track, and new content categories will emerge at an exponentially increasing rate, somewhat like the Cambrian explosion.

Of course, there are many other aspects, such as engineering capabilities. These are essential but difficult to turn into a competitive edge. A true understanding of real users and belief in what we’re doing are also crucial. In the field of innovation, “vision is capability.”

ZP: What is VAST’s long-term vision and mission?

Simon Song: Our company’s vision is: “To advance civilization for the world, to create happiness for humanity.” The phrase comes from Li Dazhao.

ZP: How do you interpret this phrase?

Simon Song: The core of a content platform is to provide users with the ultimate experience. Being human, with 24 hours in a day, is about choosing different experiences: sleeping, exercising, reading, watching movies, etc.

I hope that in the future, 3D content will expand infinitely, and everyone can become a super creator, creating new worlds like gods, with their own rules, a new social evaluation system, new worldviews, forming new social relations, ultimately creating a brand new world. Your goals, beliefs, and sense of achievement will completely change.

In this scenario, many of today’s social problems will be solved. Everyone’s life will be an ultimate flow experience: constantly encountering small difficulties and growing by solving them, like the protagonist of an idealistic anime.

ZP: Do you have any game recommendations?

Simon Song: Disco Elysium.

ZP: As a young entrepreneur born in 1997, do you feel there’s something different about you? Is there anything unique about the VAST team?

Simon Song: I think our team at VAST is divided into two groups: The first group consists of veterans who have successfully gone from 0 to 1 in entrepreneurship. The second group is made up of very vibrant, extremely young geniuses.

These young geniuses, fearless like newborn calves, work passionately and intensely, often producing excellent results quickly and sometimes achieving miraculous breakthroughs. The veterans are more steady, resilient, and experienced. They don’t panic; if they fail, they just move on to the next step without seeing it as a big deal. Also, having found oases before, they know which directions are more likely to lead to success again. I believe that a team shouldn’t consist solely of young people or veterans; a combination of both can form a mature and powerful team. Veterans help young people find direction and rhythm, while the young can achieve miracles with their energy and fresh perspectives. Both groups complement and enhance each other.

This interview has been edited and organized. If you want to learn more about VAST, feel free to visit tripo3d.ai for a free trial and to leave us a message.
