Notes on The Web Conference 2020

danielequercia
Published in SocialDynamics
Apr 24, 2020


The Web Conference 2020 ended today. I took some notes on the talks I could attend; I hope they help. I have grouped them into six research challenges, summarized below so you can jump to the ones that interest you.

Summary

Challenge 1. Building trust on “sharing” economy platforms. @Airbnb is into the so-called “experience economy”, but is the platform really about social interactions or simply about business transactions? Can a user’s propensity to trust be modelled automatically? Can user reviews be made fairer (think: the last time you were unfairly pissed off at an @Uber driver)?

Challenge 2. Tackling misinformation. How did a Russian agency spread (mis)information during the 2016 US presidential election (@MSFTResearch scientists to the rescue)? Can misinformation be detected just from the way it propagates through a social network, without looking at the content at all?

Challenge 3. Data4Good: tech for societal issues. Is the UK facing food insecurity (and how could the UK food-sharing app @OLIO_ex help)? Can illicit drug demand be predicted across the world (spoiler: yes, by mining data from @Wikipedia & DarkNet markets)? Can posts on @Twitter be descriptive of maternal mortality in the US?

Challenge 4. Text: King of the Conference (i.e., how to make text processing a bit smarter). Text is generally the most widely studied type of data at The Web Conference, so interesting work emerges in this space every year. Take emotion extraction, for example. So far, most researchers have worked with an 8-emotion classification; with a new dataset presented this year, they could go for 239 emotions. In addition to this work, I would like to mention two others. First, for the very same conversation, intent and impact can be totally misaligned (think: the last time you argued with your partner): can a machine predict this difference by processing Facebook conversations (thanks @fb_research)? Second, can a machine automatically distinguish the figurative use of language (it isn’t easy)?

Challenge 5. Beyond text: the role of time in conversations. What’s the role of time when analysing conversations on the mental health forum @TalkLife, on the code-sharing platform @GitHub, and among parliamentarians at the @Europarl_EN (spoiler: our representatives put a lot of effort into GDPR; who said that nobody cares about privacy)?

Challenge 6. Which new forms of interaction will win? From bots to viz. Unsurprisingly, @GoogleAI & @Amazon folks are building smarter bots. We still have a long way to go though :)

Thanks!

Before going to the first challenge, let me thank two groups of people:

  1. My collaborators at Nokia @BellLabs for presenting a deep-learning work that mines conversations as humans understand them in the real world: for example, conversations offering social support, those mediated by power relationships, and those exchanging knowledge. The work argues that social support, power, and knowledge are three of the ten dimensions whose combination universally defines any social relationship. The work is beautifully described in this post. The resulting deep-learning algorithms are able to predict suicide rates from social media posts across US states, and the fall of Enron from email exchanges. They could also be used to rethink the way we model relationships in social networks, and to move on from the now-pervasive “sentiment analysis” within corporations [read frustration? only natural] :)
Figure: the presence of five types of conversation (plus sentiment) in emails exchanged by Enron employees during: i) the initial concerns about the financial stability of the company; ii) the first round of layoffs; iii) the start of financial losses; and iv) the declaration of bankruptcy. Sentiment classification does not tell the whole story!

2. Mia Cha (@nekozzang) and Marc Najork (@marc_najork) for inviting me to give a talk at the BIG track. The talk covered two main works: one about tracking biorhythm changes during Brexit and Trump with smart watches in London and San Francisco (the Hearts&Politics project), and the other about predicting food-related illnesses from grocery loyalty cards (the Food&Health project). I’m very happy that the entire 1-year grocery dataset for London has been made available in a recent Nature Scientific Data article. Other speakers in the track included: @ladamic of @fb_research, who mentioned exciting data-sharing initiatives at her company; Andrew Tomkins of @GoogleAI, who talked about Graph-RISE, a powerful graph-regularization approach; @Ee_Peng_LIM, who is working on food image detection; Christopher Ré, who is tackling the difficult problem of the lack of labeled training data (he built a weak-supervision approach called Snorkel); @ingmarweber, who showed how the “Facebook Ads platform” could be used for innovative demographic research, for example, to study the recent Venezuelan exodus or the urban-vs-rural divide in Italy; and finally @kgummadi, who has done excellent work on how humans and AI could collaborate in the near future, and could do so in fair ways.

The six research challenges

Challenge 1. Building trust on “sharing” economy platforms

Social Interactions or Business Transactions? What customer reviews disclose about Airbnb marketplace by Giovanni Quattrone & co. In the context of the sharing economy, a key question is whether Airbnb is a hospitality service that fosters social exchanges between hosts and guests, as the sharing economy manifesto originally stated, or whether it is (or is evolving into) a purely business-transaction platform, the way hotels have traditionally operated. To answer this question, the researchers built a machine-learning framework that mined millions of guest reviews to distinguish which ones were about social interactions and which were about business transactions. It turns out that, year after year, business transactions are overtaking social interactions. The good news is that early adopters who tended to be more social when they joined still tend to be so today.
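
As a rough idea of what such a social-vs-business review classifier could look like, here is a minimal sketch; the labels, example reviews, and the TF-IDF plus logistic-regression pipeline are my assumptions, not the authors’ actual framework.

```python
# Hypothetical sketch: classifying guest reviews as "social" vs "business".
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = [
    "Maria welcomed us with dinner and we chatted for hours",         # social
    "Check-in was smooth, the flat was clean, good value for money",  # business
    "We cooked together and the host showed us around the city",      # social
    "Great location, fast wifi, keys in a lockbox, no issues",        # business
]
labels = ["social", "business", "social", "business"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(reviews, labels)

print(clf.predict(["The host left the keys in a lockbox, no interaction needed"]))
```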

Designing for Trust: A Behavioral Framework for Sharing Economy Platforms by @natabarbosa, Emily Sun, @juddantin, and @PaoloParigi19. The main research question here is how to quantify an Airbnb user’s propensity to trust other users. The scientists at @Airbnb research came up with a method to create explanatory and predictive models of a user’s propensity to trust, which is extremely useful for platforms like Airbnb, where new users join every day. The method relies on the results of one experiment and on the analysis of user logs:

  • Experiment. This took the form of a trust game. Players made investments based on synthetic, controlled user profiles. Put simply, they saw a profile, were asked how many credits they wished to invest, and were told that the other player would make a return on the investment. The top-15 players (ranked by the number of points they had at the end of the game) received a 50-dollar gift card. Overall, the researchers ended up with 4.5K players (2.6K guests and 1.8K hosts).
  • UI logs to model user behaviour. The features extracted from the logs were chosen based on theories of trust, and included: reviews; interaction with listings; and past trust actions.

Based on their analysis, the researchers found that Airbnb users whose trust propensity is low tend to: exchange far too many messages (communication needs to improve), reject requests to reserve (acceptance could be made mandatory), and engage heavily with other users’ profiles (more transparency of profile reviews should be promoted). On the other hand, Airbnb users whose trust propensity is high, unsurprisingly, tend to receive high ratings and to engage more broadly with new users.
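
As a hedged illustration of what a predictive model of trust propensity built on such log features could look like, here is a minimal sketch; the feature names, toy data, and logistic-regression choice are my assumptions, not the study’s actual models.

```python
# Minimal sketch: predicting (and explaining) propensity to trust from
# behavioural log features. All data and feature names are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression

logs = pd.DataFrame({
    "messages_exchanged":     [25, 3, 40, 5],
    "reservation_rejections": [4, 0, 6, 1],
    "profile_views":          [30, 2, 45, 4],
    "avg_rating_received":    [4.1, 4.9, 3.8, 4.8],
})
high_trust_propensity = [0, 1, 0, 1]  # e.g. derived from trust-game investments

model = LogisticRegression().fit(logs, high_trust_propensity)
print(dict(zip(logs.columns, model.coef_[0].round(2))))  # the "explanatory" view
print(model.predict(logs.iloc[[0]]))                     # the "predictive" view
```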

Reputation Agent: Prompting Fair Reviews in Gig Markets by @ctoxtli & co. The researchers built a new tool called Reputation Agent to promote fairer reviews from customers of sharing economy platforms. In their evaluation, they focused on three platforms: @Uber (used to hire drivers), @Upwork (used to hire freelancers), and @Grubhub (used to order food deliveries). The Reputation Agent has two main components: a ‘Smart Validator’ (a set of machine learning models), which automatically detects elements of a review that involve factors outside a worker’s control; and a ‘Fairness Promoter’ (a set of interfaces), which guides the customer to focus the review on the factors that were within the worker’s control. If the customer is writing an unfair review, the Reputation Agent is triggered to explain that the review could be unfair towards the worker. To detect unfair reviews, the researchers needed training data. To that end, they: “(1) collected 1,000 real-world reviews from @SiteJabber (for the three scenarios); and (2) had two independent college graduate coders classify each of these reviews into whether they involved worker’s performance or factors outside the worker’s control.” They tested the agent by recruiting 480 participants who were divided into 12 conditions (a 3-by-4 experimental design): 3 gig platforms, and, for each platform, 4 different interfaces (control, control+rating, reputation agent, reputation agent+rating). To sum up the results, I would say that the agent promoted a great deal more fairness than current real-world approaches. The results also pointed to three different areas of UI improvement:

  1. Empathy of customers. “I like to think about all the circumstances before writing reviews. I like to use empathy. In my future review I will probably be a bit less on drivers. I will think about the driver and how they treated me as well. But before I make those change I will try to calm down first…”
  2. Agency of workers. “Drivers should be able to tell which routes are clean by instinct based on the day and the time of the day without even looking at the GPS. The drivers should know the city that they are driving very well…”
  3. Truthfulness from customers. “I always give honest and detailed reviews and will continue to do so. This interface will not impact on my reviewing process. I will always stand by how I have done things in the past: with the truth!”
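
Going back to the ‘Smart Validator’ idea, here is a rough, heavily simplified sketch of a classifier that flags complaints about factors outside a worker’s control; the training sentences, labels, and model choice are illustrative assumptions, not the authors’ trained models.

```python
# Hypothetical 'Smart Validator'-style sketch: flag review sentences that
# blame the worker for factors outside their control.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = [
    "The driver was rude and ignored my directions",          # within control
    "Traffic was terrible and the app crashed twice",         # outside control
    "The food arrived cold because the restaurant was slow",  # outside control
    "The freelancer missed the agreed deadline by a week",    # within control
]
outside_control = [0, 1, 1, 0]

validator = make_pipeline(TfidfVectorizer(), LogisticRegression())
validator.fit(sentences, outside_control)

new_review = "One star: the highway was closed and the ride took forever"
if validator.predict([new_review])[0]:
    print("Fairness Promoter: this complaint may be outside the worker's control.")
```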

OpenCrowd: A Human-AI Collaborative Approach for Finding Social Influencers via Open-Ended Answers Aggregation by Ines Arous of @unifr & co. It is hard to find online influencers because training data is scarce (e.g., it has been shown that “an expert can only recognize no more than 200 fashion influencers on @Twitter over a 3-week period of time”). The researchers proposed a human computation approach that crowdsources the task of finding social influencers in the form of open-ended question answering: the crowd is asked to name as many social influencers as possible in two domains (fashion and information technology), and, by aggregating the answers from a large number of crowd workers, they were able to “identify the identities of a large number of social influencers in an efficient and cost-effective manner.” To do that in practice, the researchers published question-answering tasks on @FigureEight, asking workers to name social influencers (@Twitter usernames) they know. A human computation approach then aggregates the open-ended answers while modeling both the quality of the workers’ answers (based on, e.g., each influencer’s number of followers and topical relevance) and the workers’ reliability. In the future, it would be cool to see how the approach generalizes to domains other than fashion and infotech.
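
To make the aggregation step concrete, here is a drastically simplified stand-in for the paper’s model: a reliability-weighted vote over the names each worker provides. The workers, reliabilities, and answers below are invented, and the real OpenCrowd approach infers reliability jointly rather than assuming it.

```python
# Toy sketch: rank candidate influencers by reliability-weighted support.
from collections import defaultdict

answers = {
    "worker_1": ["@fashionista", "@style_guru"],
    "worker_2": ["@fashionista"],
    "worker_3": ["@random_account", "@fashionista"],
}
reliability = {"worker_1": 0.9, "worker_2": 0.8, "worker_3": 0.3}

scores = defaultdict(float)
for worker, names in answers.items():
    for name in names:
        scores[name] += reliability[worker]

ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # candidate influencers ranked by weighted support
```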

Challenge 2. Tackling misinformation

Characterizing Search-Engine Traffic to Internet Research Agency Web Properties by @alexanderspangh, @besanushi, @adamfourney, @erichorvitz of @MSFTResearch & co. The Internet Research Agency (IRA), a Russia-based company focused on media and information propagation, “carried out a broad information campaign in the U.S. before and after the 2016 presidential election”. The researchers focused on IRA activities that received exposure through the Internet Explorer browser and the Bing search engine. The interesting bit is that “a substantial volume of Russian content was apolitical and emotionally-neutral in nature.” The IRA then used that as a “backdoor”: such neutral content “gave IRA web-properties considerable exposure through search-engines and brought readers to websites hosting inflammatory content and engagement hooks.” It is clear that, to counter such campaigns, cross-organization collaboration among tech companies, news organizations, and political bodies will be badly needed in the very near future :)

A Kernel of Truth: Determining Rumor Veracity on Twitter by Diffusion Pattern Alone by Nir Rosenfeld & co. “Recent work in the domain of misinformation detection has leveraged rich signals in the text and user identities associated with content on social media. But text can be strategically manipulated and accounts reopened under different aliases.” Instead, “can the veracity of an unverified rumor spreading online be discerned solely on the basis of its pattern of diffusion through the social network?” The answer coming out of this paper is a resounding “Yes!”, even in the early stages of propagation.
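
The paper builds a kernel over diffusion patterns; as a much cruder stand-in that conveys the “diffusion-only” idea, here is a sketch that summarises a cascade with a few hand-crafted features and feeds them to an off-the-shelf classifier. The features, toy cascades, and labels are all my assumptions.

```python
# Simplified sketch: predict rumor veracity from the shape of its cascade only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def cascade_features(retweet_times, depths):
    """Summarise a cascade by its size, speed, and depth in the diffusion tree."""
    t = np.sort(np.asarray(retweet_times, dtype=float))
    return [
        len(t),                  # cascade size
        t[min(9, len(t) - 1)],   # time to reach (up to) the 10th retweet
        float(np.mean(depths)),  # average depth
        float(np.max(depths)),   # maximum depth
    ]

X = [
    cascade_features([1, 2, 3, 5, 8, 13], [1, 1, 2, 2, 3, 3]),  # toy "false" rumor
    cascade_features([2, 10, 30, 60], [1, 1, 1, 2]),            # toy "true" rumor
]
y = [0, 1]  # 0 = false, 1 = true (illustrative labels)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.predict([cascade_features([1, 3, 4, 6], [1, 2, 2, 3])]))
```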

Challenge 3. Data4Good: tech for societal issues

FIMS: Identifying, Predicting and Visualising Food Insecurity by @Georgi2anna & co. The UK government does not measure food insecurity. The authors resorted to data from the popular peer-to-peer food-sharing application @OLIO_ex to produce national estimates of food insecurity.

Predicting Drug Demand with Wikipedia Views: Evidence from Darknet Markets by @sam_miller93 @josswright @dekstop @oiioxford @geoplace of @turinginst @oiioxford @warwickuni. Policy makers currently rely on annual surveys to monitor illicit drug demand (e.g., for the addictive Fentanyl), but surveys are too infrequent to detect rapid shifts in drug use. The authors tried to answer the following question: can we build a high-frequency measure? The first issue they faced concerned data availability. They took high-frequency sales data from the four largest DarkNet markets (Alphabay, Hansa, Traderoute and Valhalla) during June and July 2017, which covered 80% of global trade at the time. They then framed the demand prediction problem as an adaptive nowcasting problem, which basically means that, to predict June 2017, they used one variable reflecting the time window [beginning_of_study, April 2017] and another variable reflecting the previous month [May 2017]. It turned out that DarkNet data alone wasn’t enough to predict demand, particularly for less popular drugs. The authors’ cool intuition was to augment their DarkNet data with monthly Wikipedia page views, which greatly improved predictive accuracy and allowed them to successfully detect newly emerging substances. This step is not entirely obvious. Computer scientists generally resort to Google Trends to build proxies for web searches (i.e., for interest). The authors tried that out and found that “Google Trends may be problematic due to language ambiguity for drugs. For example, a search for Magic Mushrooms could feasibly be expressed as “mushrooms”, “shrooms”, “magic shrooms”, or “truffles”. Wikipedia is much simpler as there is a set page for each drug.” Pretty cool work; it might well help policy makers predict the next opioid epidemic!
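
To make the “adaptive nowcasting” framing concrete, here is a toy sketch that predicts this month’s sales from a long-run baseline, last month’s sales, and last month’s Wikipedia page views; all numbers are invented and the variable set is a simplification of the paper’s.

```python
# Toy adaptive-nowcasting sketch for one drug: predict monthly sales from
# past sales plus last month's Wikipedia page views. Data is made up.
import numpy as np
from sklearn.linear_model import LinearRegression

sales = np.array([120, 130, 150, 160, 200, 240, 300], dtype=float)
wiki_views = np.array([1000, 1100, 1500, 1600, 2500, 3200, 4100], dtype=float)

X, y = [], []
for t in range(2, len(sales)):
    X.append([
        sales[:t - 1].mean(),  # long-run baseline up to month t-2
        sales[t - 1],          # previous month's sales
        wiki_views[t - 1],     # previous month's Wikipedia views
    ])
    y.append(sales[t])

model = LinearRegression().fit(X, y)
next_month = [[sales[:-1].mean(), sales[-1], wiki_views[-1]]]
print(round(model.predict(next_month)[0], 1))
```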

Quantifying Community Characteristics of Maternal Mortality Using Social Media by @red_abebe, @sal_giorgi, @anibuff, & co. In recent years, maternal mortality has increased in the US. The authors used the County Tweet Lexical Bank to group tweets at the user level. They then limited their dataset to Twitter users with at least 30 posts and to U.S. counties with at least 100 such users, which left them with 2,041 U.S. counties. Pregnancy-related tweets were mainly about morning sickness, celebrity pregnancies, and abortion rights. The authors found that the rates of mentioning these topics predict maternal mortality rates with higher accuracy than standard socioeconomic and risk variables such as income, race, and access to health care: less trustful, more stressed, and more negative affective language is indeed associated with higher mortality rates.
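
The county-level comparison could be pictured as two competing regressions, one on topic-mention rates and one on socioeconomic baselines; the data frame below is entirely made up and the model choice is my assumption, not the paper’s.

```python
# Hypothetical sketch: compare topic-mention features against a socioeconomic
# baseline for predicting county-level maternal mortality. Data is invented.
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

counties = pd.DataFrame({
    "mention_rate_morning_sickness": [0.02, 0.05, 0.01, 0.04],
    "mention_rate_abortion_rights":  [0.03, 0.06, 0.02, 0.05],
    "median_income":                 [52000, 38000, 61000, 41000],
    "maternal_mortality_rate":       [18.0, 29.0, 14.0, 26.0],
})

y = counties["maternal_mortality_rate"]
feature_sets = {
    "topics": counties[["mention_rate_morning_sickness", "mention_rate_abortion_rights"]],
    "socioeconomic": counties[["median_income"]],
}
for name, X in feature_sets.items():
    score = cross_val_score(Ridge(), X, y, cv=2, scoring="r2").mean()
    print(name, round(score, 2))
```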

Analyzing the Use of Audio Messages in WhatsApp Groups by Alexandre Maros, Jussara Almeida, @fbenevenuto, and Marisa Vasconcelos. The authors collected and transcribed audio messages in public WhatsApp groups, and found that the ones that were shared the most were those with a sadder connotation and closely related to family, work, time and money. They also found that a significant fraction of such messages contained misinformation.

Facebook Ads as a Demographic Tool to Measure the Urban-Rural Divide by Daniele Rama, @yelenamejova, @mtizzoni, @KyriakiKalimeri, & @ingmarweber. This is a clever use of the Facebook Advertising platform (to which every Facebook user has access). The authors smartly used it to offer a digital “census” of the entire country of Italy. As one would expect, the behavioural signals extracted from Facebook likes are far richer than those coming from any census data: rural areas show a higher interest in gambling, while urban areas show a higher interest in health-related behaviors, in cooking at home, and in exercising. The project comes with a beautiful map where you can explore a variety of behavioural facets, including the fraction of Android vs. iPhone users, and how that fraction relates to wealth indicators ;)

Towards IP-based Geolocation via Fine-grained and Stable Webcam Landmarks by Zhihao Wang & co. I’m not 100% sure this belongs in the “Data4Good” challenge, but I liked the approach anyhow. They built a system called GeoCAM, which “periodically monitors those websites that host live webcams and uses the natural language processing technique to extract the IP addresses and latitudes/longitudes of webcams.”
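
The extraction step can be pictured as pattern matching over webcam pages; here is a rough sketch with an invented HTML snippet and illustrative regular expressions (not the authors’ code, which relies on richer NLP).

```python
# Rough sketch: pull candidate IP addresses and latitude/longitude pairs
# out of a webcam page. The snippet and patterns are illustrative only.
import re

page = """
Live cam hosted at 203.0.113.42 (Harbour view)
<meta name="geo.position" content="41.3851;2.1734">
"""

ip_pattern = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
latlon_pattern = re.compile(r"(-?\d{1,2}\.\d+)\s*[;,]\s*(-?\d{1,3}\.\d+)")

print(ip_pattern.findall(page))      # ['203.0.113.42']
print(latlon_pattern.findall(page))  # [('41.3851', '2.1734')]
```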

Challenge 4. Text: King of the Conference (i.e., how to make text processing a bit smarter)

What Sparks Joy: The AffectVec Emotion Database by @shahabraji & @gdm3000. AffectVec is a new emotion database providing graded emotion-intensity scores for English-language words with regard to a fine-grained inventory of 239 emotions, far more fine-grained than the 8-emotion inventories most previous work relied on. It is available for download at http://emotion.nlproc.org

Don’t Let Me Be Misunderstood: Comparing Intentions and Perceptions in Online Discussions by Jonathan P. Chang, @jcccf of @fb_research, and @cristian_dnm. The central question here is “How does misalignment between intentions and perceptions play out in online conversations?” This is a hard question to answer. To simplify it, the authors smartly divided information into two types: facts and opinions. So the question becomes “How does misalignment between intentions and perceptions of a fact/an opinion play out in online conversations?” The first step is to get intention data. They did so by collecting posts on public Facebook pages. Since one can either share or seek information, posts are arranged into four types: fact sharing, opinion sharing, fact seeking, and opinion seeking. The authors designed two surveys, one for post initiators asking them what they intended with their posts, and the other for post repliers asking them how they perceived the posts they replied to, and ended up with 9K initiator responses and 9K replier responses (getting such responses is a key step and hard to do unless you have access to the initiators and the repliers, as Facebook did in this case). They then answered three questions:

  1. What kinds of misalignment exist? The results say that, as one would expect, facts are more likely to be misperceived as opinions than vice versa (I would call it “the opinion bias”).
  2. Which linguistic cues signal misalignments (of intentions and perceptions)? It turns out that the use of “maybe” (and, in general, hedging) is associated with opinion sharing, while the use of cardinal numbers is associated with fact sharing.
  3. Do misalignments affect future incivility in conversations? Yes, but only when “intended fact sharing” is misperceived as “opinion sharing”.

Overall, the results are very promising: algorithms to detect misperception could soon be deployed on social media to identify risky conversations or to warn users that they might be misperceived. A near real-time version could even help Downing Street/White House press briefings [my opinion]. On a more serious note, I really liked this work as it gets to the heart of what a conversation really means; in fact, the way a conversation is perceived is mostly a function of the intention behind it versus the impact of it.
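
As a minimal illustration of the misalignment analysis itself, one can cross-tabulate what initiators intended against what repliers perceived; the survey responses below are invented (the real study had roughly 9K of each).

```python
# Toy sketch: intention-vs-perception confusion table from paired surveys.
import pandas as pd

responses = pd.DataFrame({
    "intended":  ["fact sharing", "fact sharing", "opinion sharing", "fact seeking"],
    "perceived": ["opinion sharing", "fact sharing", "opinion sharing", "opinion seeking"],
})

# Rows where intended "fact sharing" lands in the "opinion sharing" column
# capture the kind of misalignment the paper calls out.
confusion = pd.crosstab(responses["intended"], responses["perceived"], normalize="index")
print(confusion.round(2))
```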

Leveraging Sentiment Distributions to Distinguish Figurative From Literal Health Reports on Twitter by Rhys Biddle & co. Social media could be used to monitor health events for public health surveillance. An important issue is how to identify figurative usage of disease words. The authors used sentiment information to do that.
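
As a hedged sketch of the underlying idea (disease words surrounded by mostly positive or joking sentiment hint at figurative use), here is a toy rule with a tiny made-up lexicon; the paper learns this from sentiment distributions rather than hard-coding a threshold.

```python
# Toy sketch: flag figurative disease mentions via surrounding sentiment.
POSITIVE = {"love", "great", "lol", "haha"}
NEGATIVE = {"sick", "fever", "pain", "awful"}

def positive_share(tweet):
    words = tweet.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos / max(pos + neg, 1)

def looks_figurative(tweet, threshold=0.5):
    # Health words co-occurring with mostly positive sentiment hint at figurative use.
    return positive_share(tweet) >= threshold

print(looks_figurative("haha I love this song, it gives me a fever"))  # True
print(looks_figurative("day three of an awful fever and pain"))        # False
```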

Challenge 5. Beyond text: the role of time in conversations

Bursts of Activity: Temporal Patterns of Help-Seeking and Support in Online Mental Health Forums by Taisa Kushner and @amt_shrma of @MSFTResearch. Using data from @TalkLife, a mental health platform, the researchers found that “user activity follows a distinct pattern of high activity periods with interleaving periods of no activity”. That’s why they proposed “a method for identifying such bursts & breaks in activity.” This is important because past research has analyzed the same data at the level of individual posts and over arbitrary periods of time (months or years), and this work shows that neither reflects actual usage. The researchers then analyzed under which circumstances @TalkLife was effective (that is, when users in need actually received social support). To quantify @TalkLife’s effectiveness, they designed two measures:

  1. Moments of cognitive change based on previously built classifiers.
  2. Self-reported changes in mood during a burst of activity (in @TalkLife, users must select one of 59 “moods” when posting, which are then grouped into six categories by the platform).

The results showed that, in contrast to what has been done so far, treating bursts as the natural unit of analysis makes a considerable difference in predicting when @TalkLife worked and when it didn’t.
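
A common way to operationalise bursts is to split a user’s timeline whenever the gap between consecutive posts exceeds a threshold; the sketch below does exactly that, with a threshold and timestamps that are purely illustrative (the paper derives its burst definition from the TalkLife data itself).

```python
# Minimal sketch: split a user's posts into bursts using inter-event gaps.
from datetime import datetime, timedelta

def split_into_bursts(timestamps, max_gap=timedelta(hours=24)):
    bursts, current = [], [timestamps[0]]
    for prev, curr in zip(timestamps, timestamps[1:]):
        if curr - prev <= max_gap:
            current.append(curr)
        else:
            bursts.append(current)
            current = [curr]
    bursts.append(current)
    return bursts

posts = [datetime(2020, 4, d, h) for d, h in [(1, 9), (1, 15), (2, 8), (20, 22), (21, 1)]]
for i, burst in enumerate(split_into_bursts(posts), 1):
    print(f"burst {i}: {len(burst)} posts, from {burst[0]} to {burst[-1]}")
```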

Herding a Deluge of Good Samaritans: How GitHub Projects Respond to Increased Attention by @danajat, @cerenbudak, Lionel Robert, & @danielmromero. The main question the authors went after was “How do collaborations among coders (open source software crowds on GitHub) change in response to an increase in external attention (an attention shock)?” To answer that question, they analyzed “millions of actions of thousands of contributors in over 1100 open source software projects that topped the GitHub Trending Projects page and thus experienced a large increase in attention, in comparison to a control group of projects identified through propensity score matching.” They basically resorted to two causal-inference techniques: they not only constructed a comparison group of repositories using propensity score matching but also tested their hypotheses using difference-in-differences. In testing their hypotheses, the authors found that trending GitHub crowds actually respond and adapt similarly to successful real-world organizations. The authors also pointed out two interesting research questions for future work:

  1. Outside contributors do engage, but in shallow ways; the question then is “How can core members make the most of such increased resources?”
  2. During attention shocks, a backlog of pull requests and issues piles up. The question is “How can crowds impacted by attention shocks prevent low responsiveness from driving off arriving outsiders?”
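
For readers unfamiliar with the causal setup, here is a toy difference-in-differences sketch: compare the before/after change in activity for trending repositories against that of matched controls. All numbers are made up.

```python
# Toy difference-in-differences sketch for the attention-shock analysis.
import pandas as pd

repos = pd.DataFrame({
    "group":  ["trending"] * 4 + ["control"] * 4,
    "period": ["before", "after"] * 4,
    "pull_requests": [10, 45, 8, 40, 9, 12, 11, 13],
})

means = repos.groupby(["group", "period"])["pull_requests"].mean().unstack()
did = (means.loc["trending", "after"] - means.loc["trending", "before"]) \
    - (means.loc["control", "after"] - means.loc["control", "before"])
print(means)
print("difference-in-differences estimate:", did)
```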

War of Words: The Competitive Dynamics of Legislative Processes by Victor Kristof (@victorkristof). While a law is in the making, it gets amended by parliamentarians. Victor gave a very nice talk about their work on: 1) curating a dataset of 450K legislative edits introduced by European parliamentarians over the last ten years (which is available on GitHub); and 2) predicting the success of such edits by modeling two factors: the inertia of the status quo, and the competition between overlapping edits. By looking at the fitted parameters of their model for a given law, one could determine which parliamentarians influenced the law’s formulation and the extent to which the law was controversial. It turns out that EU parliamentarians couldn’t care less about a law on transportable pressure equipment, but cared quite a bit about GDPR. Who said that we don’t care about privacy?
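
As a heavily simplified picture of the “inertia vs. competition” idea, one can imagine the chance of an edit being adopted as a logistic function of its author’s influence minus the status quo’s inertia and the strength of competing edits. The parameters below are invented; the paper fits its model on the 450K real edits.

```python
# Toy sketch of the inertia-vs-competition intuition behind edit success.
import math

def adoption_probability(author_influence, status_quo_inertia, competing_strength):
    score = author_influence - status_quo_inertia - competing_strength
    return 1 / (1 + math.exp(-score))

print(round(adoption_probability(1.2, 0.5, 0.0), 2))  # lone edit by an influential MEP
print(round(adoption_probability(1.2, 0.5, 2.0), 2))  # same edit facing strong competition
```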

What Changed Your Mind: The Roles of Dynamic Topics and Discourse in Argumentation Process by Jichuan Zeng & co. The researchers proposed a neural model that captures the topic shifting and discourse flow in an argumentation conversation and, in so doing, their model “not only identifies persuasive arguments more accurately, but also provides insights into the usefulness of topics and discourse for a successful persuasion.”

Challenge 6. Which new forms of interaction will win? From bots to viz.

Dynamic Composition for Conversational Domain Exploration. @GoogleResearch folks built a new bot for “conversational domain exploration” (CODEX for short). The use case is clear: the user enriches her knowledge of a given domain by conversing with the bot. The bot was integrated into Google Assistant. “As an example domain, the bot conversed about the NBA basketball league in a seamless experience, such that users were not aware whether they were conversing with the vanilla system or the one augmented with our CODEX bot.”

Don’t Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing. @Amazon folks proposed a new architecture to answer both simple and complex queries (unlike previous approaches, they don’t require a query to be representable as a parse tree).

ShapeVis: High-dimensional Data Visualization at Scale. Finally, @AdobeResearch folks proposed ShapeVis, a new scalable visualization technique for point-cloud data inspired by topological data analysis. It captures the underlying geometric and topological structure of the data in a compressed graphical representation. The authors empirically demonstrated how ShapeVis captures the structural characteristics of real data sets by comparing it with “Mapper” using various filter functions such as t-SNE, UMAP, and LargeVis. They showed that ShapeVis scales to millions of data points while preserving the quality of the data visualization task.
