SimSimi Publicly Discloses Large-Scale A.I. Dataset for Human-Centered AI Development

simsimi_official
2 min read · Sep 23, 2022

Did you know that SimSimi is publicly releasing its large-scale human-centered AI dataset that has been accumulated over the past 20 years to the academic community?

In case you didn’t know, SimSimi already released an ultra-large conversational dataset in August 2021. About 15 billion pieces of conversational data were disclosed, which fostered the development of AI research in South Korea.

Now in September 2022, SimSimi is disclosing its dataset once again to contribute to the development of domestic AI research and innovation.

Human-Centered AI (HCAI) is a concept that has gained considerable attention since 2019, when the G20 Trade and Digital Economy Ministers outlined the G20’s commitment to a human-centric approach to AI.

Essentially, Human-Centered AI is artificial intelligence that puts humans at the forefront. Its goal is to augment human skills rather than replace humans.

In recognition of this goal, SimSimi decided to release four types of human-centered AI datasets.

The first type of dataset is derived from reported conversation scenarios. General users of SimSimi can report any chat that appears to violate content regulations, in which case the detailed reason for the report is attached as a label. These 14 million labeled records, together with other metadata, are to be released first.
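For researchers wondering how such labeled reports might be handled, here is a minimal Python sketch. The actual file format and field names of the SimSimi dataset are not described in this post, so the `ReportedChat` record, its `text` and `reason` fields, and the placeholder reason values are all assumptions made for illustration only.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedChat:
    """Hypothetical record layout; the real dataset schema may differ."""
    text: str    # the reported chat sentence (assumed field name)
    reason: str  # the report reason attached as a label (assumed field name)


def count_by_reason(records):
    """Count how many reported chats carry each reason label."""
    return Counter(record.reason for record in records)


# Placeholder rows; "reason_a" and "reason_b" are not real SimSimi labels.
sample = [
    ReportedChat("example sentence A", "reason_a"),
    ReportedChat("example sentence B", "reason_b"),
    ReportedChat("example sentence C", "reason_a"),
]
reason_counts = count_by_reason(sample)
```

Checking the label distribution this way is a common first step before training a classifier on reported content, since rare report reasons may need special handling.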

The second type of dataset involves general conversation scenarios. These 1.15 million entries were constructed by selecting and processing sentences from SimSimi chatbot dialogues (atext) that can be used by anyone. They can be utilized for voice applications that involve text-to-speech (TTS).

The third type of dataset deals with irregular general conversation scenarios. Simply put, it consists of sentences that were filtered out while processing the general conversation scenarios mentioned above, because they were confirmed to be unsuitable for TTS.

Lastly, the fourth type of dataset consists of sentences that were sorted through blind inspections. These 35 million entries have been scored by general users of SimSimi according to the severity of their violation of content regulations.
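As a rough illustration of how user-assigned severity scores could be used, the Python sketch below averages the scores per sentence and flags sentences above a threshold. It assumes that a sentence can receive several scores and that a higher number means a more severe violation; the score scale, the `(sentence, score)` pair layout, and the threshold of 2.0 are all hypothetical and not taken from the SimSimi dataset.

```python
from collections import defaultdict
from statistics import mean


def mean_severity(ratings):
    """Average the scores per sentence from (sentence, score) pairs."""
    by_sentence = defaultdict(list)
    for sentence, score in ratings:
        by_sentence[sentence].append(score)
    return {sentence: mean(scores) for sentence, scores in by_sentence.items()}


# Placeholder ratings; the scale and values are assumptions.
sample_ratings = [
    ("sentence A", 0),
    ("sentence A", 1),
    ("sentence B", 3),
    ("sentence B", 2),
    ("sentence B", 3),
]
averages = mean_severity(sample_ratings)

# Hypothetical cutoff for treating a sentence as a likely violation.
THRESHOLD = 2.0
flagged = sorted(s for s, avg in averages.items() if avg >= THRESHOLD)
```

Averaging is only one option. A median or a majority vote may be more robust when a few raters score very differently from the rest.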

Researchers who wish to access SimSimi’s conversational datasets can download the application form from our official Korean website and fill it out. Once approved, you can begin using our datasets for academic purposes.

About SimSimi:

With over 350 million downloads worldwide, SimSimi is a beloved chatbot which uses artificial intelligence to create messages and interact with users. You can talk to SimSimi anytime, anywhere, and sometimes you can teach him/her to say whatever you want. Click here to download SimSimi, your smartest AI chatbot companion.

Stay in touch with SimSimi

Website | Twitter | Instagram | Facebook | Telegram

Hello! This is SimSimi’s Official Medium. Follow the link to meet SimSimi, your smartest AI chatbot companion: https://campsite.bio/simsimi