The risks of recommendation engines, automated newsfeeds and the commercialisation of personal information

are you inside, or outside the bubble? source

I will try to raise your awareness of how the use of ranking algorithms in search engines and of recommendation engines for products and services impacts our society and our rights, directly and indirectly.

Why do online companies care so much?

Online companies are obviously in the business of making money, by selling products or services. Before these companies sell anything they need to display an offer of some kind, and this offer may or may not initiate a transaction with a customer. Online businesses often have little time to do so, because a site visit is typically very short and for every online retailer there is a plethora of alternatives. Hence, it is crucial to make the most out of that brief encounter, not least because each encounter costs money, whether through advertisement or through an actual site visit. Basically, to maximize revenue, online businesses aim to maximize the following metrics:

  • conversion rate,
  • revenue per visit,
  • return rate, and
  • click-through-rate in combination with cost-per-click.

In other words, regardless of the goals/motivations of the website visitor, the algorithm wants to maximise

  • the probability that you click on ads, preferably the most expensive ones,
  • the probability that you purchase products, preferably the most expensive ones,
  • the probability that you visit the website again, and
  • the time that you spend on the website.

These goals may conflict with your own. In fact, they may conflict with the interests of society as a whole. I hope this conflict will be clear by the end of this article.
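
To make this concrete, here is a minimal sketch (in Python, with invented numbers and field names) of how an ad server might rank candidate ads: it simply multiplies the predicted click probability by the price per click and shows whichever candidate maximizes expected revenue, regardless of what the visitor was actually looking for.

```python
# Minimal sketch: ranking ads by expected revenue per impression.
# The candidates, probabilities and prices are invented for illustration.

candidates = [
    {"ad": "budget headphones",  "predicted_ctr": 0.040, "cost_per_click": 0.10},
    {"ad": "luxury watch",       "predicted_ctr": 0.010, "cost_per_click": 2.50},
    {"ad": "clickbait listicle", "predicted_ctr": 0.080, "cost_per_click": 0.40},
]

def expected_revenue(ad):
    # revenue per impression = P(click) * price paid per click
    return ad["predicted_ctr"] * ad["cost_per_click"]

# The visitor's own goal never enters the equation:
best = max(candidates, key=expected_revenue)
print(best["ad"], round(expected_revenue(best), 4))  # -> clickbait listicle 0.032
```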

The main purpose of the rather euphemistic term personalised communication is not to improve your life but to maximize the effectiveness of advertisements or to increase your exposure to advertisements, of a commercial, ideological or political nature. The natural result is that an increasing amount of the information you are exposed to is primed to activate consumerism and is designed to be easily accepted by your subconscious, which, in combination with personalised recommendation engines that maximize the probability of user engagement, leads to a new type of information addiction.

Furthermore, the technologies and methodologies used to facilitate and control personalised-communication activities can easily be used by malevolent actors to undermine our autonomy by limiting and controlling the information that we are exposed to, to suppress the freedom of speech by automated censoring, and to degrade the pluralism and heterogeneity of democratic societies through the creation of filter bubbles or the viral spread of political propaganda, without us knowing.

Freedom of speech on the decline

Ironically, the freedom of online speech has been on the decline since the emergence of social media. The rise of social media and the ease with which dissonant societal views are spread and amplified catalyzed, if not facilitated, the Arab Spring. The recognition that citizens could be politically mobilized relatively easily by unregulated online communication has resulted in governmental interventions, from nation-wide blocks of social media (Turkey) to a 'nationalised' firewalled internet (China). Less noticeably and more gradually, increased exposure to dissonance, combined with a fear of persecution and the liability of large media platforms, has led to online hate-speech watchdogs being erected under the guise of 'safe spaces'. A disturbing example is the BBC, who have assumed the position of judge, jury and executioner when it comes to free speech on their platform. I quote from their cookie policy (2017):

If you post or send offensive, inappropriate or objectionable content anywhere on or to BBC websites or otherwise engage in any disruptive behaviour on any BBC service, the BBC may use your personal information to stop such behaviour.
Where the BBC reasonably believes that you are or may be in breach of any applicable laws (e.g. because content you have posted may be defamatory), the BBC may use your personal information to inform relevant third parties such as your employer, school email/internet provider or law enforcement agencies about the content and your behaviour.

The Swedish state has gone as far as to collect a list of online hate disseminators, featuring politicians, academics and journalists. Google has imposed implicit algorithmic censorship by restricting the display of advertisements next to 'controversial' content, clearly frustrating the freedom of information and the freedom of speech.

In Austria a man was fined for 'liking' a slanderous comment on Facebook; the judge stated that

the defendant had failed to prove that the comments he had liked on Facebook were true.

In Germany there is the so-called Netzwerkdurchsetzungsgesetz, a very recent law that obliges social media platforms to remove hate speech and fake news within a day. The effect is obvious: for practical reasons the social media platforms are forced to use a broad brush, significantly curtailing the freedom of speech in an extra-judicial manner.

Now we have arrived at a point that the free and unhindered access to, and the spread of, information is threatened from yet another angle; online media platforms and content providers that are primarily driven by economic motives have been operating in a legal and moral vacuum for the past decades. In this period a world-wide infrastructure has been constructed to store, enrich, process and trade personal data. Furthermore, it has been demonstrated numerous times that this infrastructure is vulnerable to attacks by governmental and private groups. The commercial actors that are responsible for the generation of this data have been shown to side with governmental actors sooner or later. In essence, in the last two decades the private sector created a global interconnected surveillance infrastructure that can, in practice, be utilised by governmental organisations at will.

In the meantime governmental organisations do little to curb this process of increasing commercial control over personal information, in fact they would rather facilitate it because it represents an economic opportunity.

The algorithms used to generate information feeds are as of yet primarily focused on exploiting the weaknesses of the human mind, from cognitive consonance to confirmation bias. Most importantly, the information sharing platforms are now owned by relatively few, incredibly large, entities that have turned information sharing into an economic activity: from gathering, selling and reselling personal information to disseminating news as the precursor for displaying advertisements which undermines the role of the media as a societal and political watchdog. What is more, the data that is collected is enriched by automatically inferred statistical models for the prediction of personal behavior and preferences. It is my view that these inferred models should be categorically included in privacy legislation.

I also see another scenario: the liberation of the vast amounts of data floating around online, which, combined with open-source data analysis tools and pro-rata computational resources, could serve the discovery of truth and knowledge, the protection of democracy and the enforcement and protection of fundamental human rights. This new type of information retrieval may not only be a means to empower all levels of society but could also form the breeding ground for real intellectual and entrepreneurial collaboration of people around the world. All of that with respect for your right to a personal life. I think it is time to get started with fulfilling the promise that the internet once held.

Important ideas you should be familiar with..

I think that laying out the vocabulary of this topic is a demonstration in and of itself. For each term I will explain its relevance and give examples.

Privacy

the fundamental human right of any natural person to have and to protect a personal life

There seems to be great confusion about the importance of privacy. A common response to privacy-related issues is

"I have nothing to hide.." — Nobody

When I write Nobody I mean that, logically, this is either a person who has no personal life, or a person who does not consider his or her personal life worthy of protection. In the former case this person is either enslaved by a master that does not allow anyone to be distinct, or voluntarily enslaved to conformity. In the latter case, a voluntary renouncement implies that this person sees him or herself as a nobody, which is perhaps as sad as it is tragic.

A common error is that people only look at their own situation when dismissing the importance of privacy, a situation in which breaking social and sexual taboos merely invites vocal expressions of discontent and not, for instance, physical harm, repudiation or even persecution. In the free west, we have little notion of such repercussions and are very rarely made aware that such a reality exists. However, even within the relatively safe borders of western societies there is the common risk of 'losing face', the degradation of your social reputation, or even the risk of losing your job as the repercussion of a free expression. Hence, even in a country such as the Netherlands it is crucial that you are aware of, and in control of, the manner in which your expressions are shared. Privacy is not just about protecting your personal life, it is also about protecting your ideas and how they are disseminated.

Another important reason for the protection of your personal information, and of information regarding your personal behavior, is possible abuse by malevolent actors, or misuse by incompetent actors, which I will elaborate on in this article.

When I write the term malevolent actor I am well aware that it immediately reduces the number of readers by 90%, but I feel that we should at least hypothesize the emergence of evildoers that may do bad things where perhaps the reader would do good things. Unfortunately we live in a world where such misanthropic evildoers are present in all layers of society, so, for the sake of argument:

hope for the best, assume the worst.

Privacy is not only crucial for the free exercise of the freedom of speech but also for the free dissemination of information. It is, one could say, crucial for the expression of oneself and for the exercise of collective intelligence in a democracy. Furthermore, even if it does not apply to you, or you do not feel the urgency to defend this right, then please be aware that this right most certainly applies to, and is most certainly urgently needed by: whistle-blowers, political commentators, journalistic sources, counselors, comedians, cartoonists, medical doctors and criminal witnesses, to name a few.


How to influence people

I will briefly discuss the psychological weaknesses that are innate to all human beings. These weaknesses would be harmless were it not that they are actively exploited by politicians, marketeers and ideological evangelists.

Cognitive consonance: the established theory that being exposed to information you recognise or opinions you agree with is accompanied by a positive sentiment, and an actual physiological response to that effect, and vice versa for opinions you do not agree with. This is the basic mechanism that leads to most of the cognitive biases.

Cognitive bias: a systematic pattern of deviation from rationality in judgment, leading to perceptual distortion, inaccurate judgment, illogical interpretation

There are several types of cognitive bias, especially relevant for online information consumption are

  • confirmation bias: the tendency to selectively look for evidence that supports your point of view, ignoring alternative explanations and opposing evidence. This is driven by cognitive consonance.
  • salience: the tendency to focus on the most distinct, most salient feature of an image or a text to form an opinion. This is driven by the inherent difference in energy requirements between forming an instinctive and a rational opinion. Whereas the former is produced almost immediately and without much effort, because we simply attach associations based on preconceived notions, the latter requires some level of contemplation, perhaps even introspection, and in the worst case even an alteration of our prior knowledge. We are instinctively drawn to the most salient features of any type of information.
  • conservatism bias: the tendency to give more weight to prior evidence than to new evidence. I.e. the first evidence presented to you will have a stronger effect on your opinion, all other things being equal.
  • anchoring bias: the extreme of conservatism bias, whereby people tend to rely heavily on the first evidence that is presented. This is related to priming, whereby an initial impression will influence the interpretation of the following impressions.
  • bandwagon effect: the observation that popularity/normalcy/acceptance has a self-reinforcing effect, whereby the probability of adoption increases with actual adoption. This is in part due to a network effect, where the probability of individual exposure increases with actual exposure, and in part due to the tendency of people to conform to governing opinions without considering evidence.
  • clustering illusion: the tendency to underestimate the amount of variability and overestimate the amount of clustering, i.e. false pattern recognition. This can lead to the Texas sharpshooter fallacy, whereby similarities are stressed and differences ignored.
  • selective perception: the tendency to easily forget, or not even perceive, information that is discomforting or contradicts prior beliefs. This is strongly related to confirmation bias, being its opposite in nature.
  • mere exposure effect: the tendency for mere familiarity with information to be associated with a higher likelihood of preference. This is perhaps the most important effect for propaganda of any kind. It is strongly related to cognitive consonance, as prior exposure leads to consonance when you encounter the information again.

A more complete list can be found here.

The takeaway message here is that we have cognitive blind spots that can be exploited by presenting information in a certain way and in a certain order. We cannot be aware of this continuously simply because the ground state of our brain is focused not on rational processing but on instinctive processing.

Priming: the idea that exposure to one stimulus will influence the response to another, following, stimulus

A common application would be to display an ideological or commercial advertisement immediately after evoking the required sentiment, say anger, pleasure or sadness. Indeed, the order in which I lay out my points will influence your perception of this topic. For instance, I could start with a heart-wrenching story that demonstrates an abuse of personal information and then relate back to that example throughout the text; conversely, if I wanted to persuade you of the benefits, the initial story would be exuberantly positive.

Framing effect: the idea that the way ideas are presented will change the perception of the ideas themselves

This is an example of applied cognitive bias and relates to psychological priming. Whereas priming is sequential, framing is simultaneous.

You can see framing as an artificial context which is attached to the ideas for the sole purpose of manipulating your perspective and with that, the probability that you dislike/like, accept or refute, the proposition. On a more instinctive level the artificial context triggers associations which we attach to the framed proposition.

Herd effect: the idea that we tend to believe the governing opinion or the majority vote.

The effect of collective intelligence or wisdom of the crowd improves the likelihood that a group decision is the most correct decision. However, this can only occur if a diverse group of people give independent estimates for a logical problem. The sense of collective intelligence might explain why, as individuals, we put so much trust in majority votes.

In reality, the individuals in a crowd are not independent: they are clustered in groups, and often the issues that filter through the crowd are not of a logical nature at all, but rather of a complex societal or even ideological nature. Hence, the herd effect is the false belief in a universal validity of this collective intelligence.
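
A toy simulation (Python, with made-up numbers) illustrates why independence matters: averaging many independent, noisy estimates recovers the true value well, while averaging estimates that anchor on a shared "governing opinion" simply reproduces that opinion's error.

```python
# Toy illustration: wisdom of the crowd requires independent estimates.
import random

random.seed(42)
TRUE_VALUE = 100.0
N = 1000

# Independent guesses: each person is noisy but unbiased.
independent = [random.gauss(TRUE_VALUE, 30) for _ in range(N)]

# Herded guesses: everyone anchors on one loud, biased opinion.
governing_opinion = 140.0
herded = [governing_opinion + random.gauss(0, 5) for _ in range(N)]

print(sum(independent) / N)  # close to 100
print(sum(herded) / N)       # close to 140, the error of the loud voice
```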

This false belief is easily exploited by marketeers and politicians by implying normality and commonality. This ties in closely with groupthink, but it is not confined to specific groups. Basically, by implying normality, as in 'the majority concurs with a certain stance', a general groupthink is activated and the likelihood of acceptance is increased.

Tribes: online communities of people that share common interests and/or ideas

Offline and online, people flock towards like-minded peers, for fraternisation and self-identity. You might call tribes the online equivalent of real-life clubs.

Of course, identifying to which tribe you belong can be incredibly powerful for governments, retailers or insurance companies, for targeted advertising, and profiling.

Facebook friend tribes, source

Advertisement slang

The 'Mad Men' have their own terminology, which is very enlightening. Let me elaborate on some of the terms.

Clickbait: snippets of distracting information that activate curiosity and lead to paid content

Clickbait is a pejorative term describing web content that is aimed at generating online advertising revenue, especially at the expense of quality or accuracy, relying on sensationalist headlines or eye-catching thumbnail pictures to attract click-throughs and to encourage forwarding of the material over online social networks. Clickbait headlines typically aim to exploit the “curiosity gap”, providing just enough information to make readers curious, but not enough to satisfy their curiosity without clicking through to the linked content.[1][2][3]
From a historical perspective, the techniques employed by clickbait authors can be considered derivative of yellow journalism, which presents little or no legitimate well-researched news and instead uses eye-catching headlines that include exaggerations of news events, scandal-mongering, or sensationalism.[4][5] — Wiki

On BuzzFeed, Gizmodo, theatlantic.com, bbc.com, or almost any free source of online information, you will likely see pictorial article recommendations laid out in blocks or strips, with enticing images and headlines that spark your curiosity. The curiosity cliffhanger is referred to as the 'curiosity gap'. The headline and the image are often completely unrelated to the underlying content. These blocks/strips of advertisements mixed with actual articles are preceded by a statement like 'recommended by…' or a variation thereof.

Typical phrases contain the following snippets: top lists, surprising facts, largest/smallest/fastest/prettiest/deadliest, trending now, just in, something shocking, something disturbing, you have to see this, you will not believe what happens next, etc.

Facebook announced in August 2014 that they would tackle the clickbait issue algorithmically, and again in August 2016. This, of course, is a form of algorithmic censorship. In fact, it will categorically censor any article that is written in a style similar to actual clickbait articles.
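
Facebook's actual system is not public, but a deliberately naive sketch shows why any style-based filter paints with a broad brush: a keyword rule that catches typical clickbait phrasing will also flag legitimate headlines written in the same register.

```python
# Deliberately naive clickbait filter to illustrate the 'broad brush' problem.
# This is NOT Facebook's algorithm; real systems use learned classifiers,
# but they face the same false-positive trade-off.

CLICKBAIT_MARKERS = [
    "you won't believe", "what happens next", "shocking",
    "top 10", "this one trick", "will blow your mind",
]

def looks_like_clickbait(headline: str) -> bool:
    h = headline.lower()
    return any(marker in h for marker in CLICKBAIT_MARKERS)

headlines = [
    "You won't believe what happens next in this cat video",  # real clickbait
    "Top 10 findings from the new IPCC climate report",        # legitimate, same style
    "Parliament votes on budget amendment",                     # neutral
]

for h in headlines:
    print(looks_like_clickbait(h), "-", h)
```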

Native advertisement: a type of advertisement that is embedded in the content

Described by the ad providers as a non-disruptive means to engage the customer, it basically is advertisement blended into the content.

While many brands counted on traditional display ads in the past, they’ve come to realize that native ads garner much higher CTAS. In fact, reading a native ad headline yield 308x more time of consumer attention than processing an image and banner. — Outbrain

Taboola, Outbrain, Gravity, Revcontent, Newsmax and also Google AdSense are business-centric services that aim to maximize the number of impressions and the conversion rates of articles. This particular means of advertisement relies more and more on clickbait-type announcements. I refer to these articles as advertisements for the simple reason that informative articles are presented in the form of advertisements, do not necessarily enrich the content, and are not necessarily even related to it. They are promoted in the same way as advertisements for products or services, and their placement is based on maximization of the click-through rate, not maximization of relevance. This means, for instance, that clickbait advertisement is shown relatively often, as this type of advertisement is more effective.

Of the 1 billion user milestone, Adam Singolda, Taboola CEO, told Real-Time Daily: “We believe there is a ‘winner takes it all’ market when it comes data. It’s either you know the person behind the screen, or you don’t. Knowing if someone is a video fan, or if someone tends to subscribe to things, are binary questions that enable publishers to drive true personalization on their sites.”
Singolda explained that while Facebook has amassed a huge amount of data, Taboola’s goal is to draw from its own “trove of information about how people consume content across the Web to empower publisher partners to leverage personalization technology and free, anonymous, actionable user data to build audience, engagement and revenue.” — Adam Singolda, Taboola CEO (source)

Companies like Outbrain and Gravity do more than just display ads: they also provide recommendations of the on-site content, i.e. part of the information feed is outsourced because (good) data scientists are expensive. It is easy to see how this can create an information asymmetry: if this information feed is outsourced to relatively few companies, those companies can decentrally nudge a large portion of the online populace towards specific concepts and ideas.

To feed the algorithmic suggestions of these ad providers, huge amounts of personal data are continuously being stored and processed; what is not stored can be bought from data brokers. As the click-through rate, and with that the revenue, increases with more accurate recommendations and targeting, personal information itself has monetary value, since it is the fuel for the recommendation system. This added value is the single most important reason that such data is stored, processed and traded on international markets, representing over $200 billion in 2016.

Native advertisements, or advertisements that are not distinct from the content they accompany, were frowned upon not that long ago. Google AdSense required publishers to clearly indicate that advertisements were in fact advertisements and not neutral/unbiased content. This went as far as coloring schemes that had to be distinctly different. There would be occasional checks, and repercussions if you did not comply. Now the two largest search engines, Google and Bing, apply obfuscated advertisements on their own pages, appearing inline as both the first and the last results.

Ad providers/publishers have been tempted by the higher cost-per-click and conversion rates of native advertisement to violate the basic rule that advertisements and content should be strictly separated in order to avoid confusion and deception of the visitor.

An explanation for this move towards content-based marketing may be the shift in focus of Google's ranking algorithm from keyword-based to content-based in August 2013. This new algorithm demoted websites with little original content and opened a market for automatically generated content, produced using Markov models, by simply randomly concatenating the content of existing websites, or even by automatically translating content from a foreign language. This fake content would facilitate the display of advertisements on so-called parking pages. Another shift that took place in 2013 was a range of algorithmic changes to the Facebook newsfeed that rewarded good click-through rates with higher rankings. Combine this with the rise of Upworthy.com, which was ridiculed for its clickbait headlines and then copied by competitors because such headlines were actually very effective in increasing click-through rates, and you basically have ground zero for fake news and clickbait.
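
As a rough illustration of how cheaply such filler content can be produced, here is a minimal word-level Markov chain (Python; the seed text and parameters are arbitrary): it learns which word tends to follow which from a small sample and then emits superficially plausible sentences that carry no original information.

```python
# Minimal word-level Markov chain for generating filler text.
# Seed text and parameters are arbitrary; real content farms scraped
# existing websites at scale to the same effect.
import random
from collections import defaultdict

seed_text = (
    "online advertising drives online content and online content drives "
    "online advertising because advertising pays for content production"
)

words = seed_text.split()
transitions = defaultdict(list)
for current, nxt in zip(words, words[1:]):
    transitions[current].append(nxt)

random.seed(1)
word = random.choice(words)
generated = [word]
for _ in range(15):
    followers = transitions.get(word)
    if not followers:
        break
    word = random.choice(followers)
    generated.append(word)

print(" ".join(generated))
```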

Another explanation is that consumers seem to accept native advertisement as long as the ads are somewhat relevant. This user acceptance is, however, influenced by the same psychological mechanisms that are used for marketing; gradual exposure will simply increase user acceptance over time.

Sponsored content: advertisement guised as a neutral descriptive article or an editorial

Also called an advertorial. This is very closely related to native advertising, but more focused on providing information than on persuasion; basically it is an infomercial. It is quite common in free mass media, including the paper kind; you might find such a sponsored article in the Metro newspaper, for instance. Although this type of advertisement goes further in terms of intrusion, it is (1) much less common than native advertisement, and (2) more distinct from the surrounding content, so the user is more aware that he or she is looking at an advertisement.

I already mentioned ad platforms such as Taboola and Outbrain and the way they integrate recommendations with the content. As part of their algorithmic recommendations they add advertorials to the selection. In fact, these advertorials can even dominate the selection, which only becomes obvious if you scan for the term 'sponsored content'. Again, the ad providers/platforms have gradually moved away from the principle of clearly distinguishing advertisement from content, for the obvious reason of increased revenues.

Remember, this is a profit-driven business. If showing three pictures of half-naked people, a monkey on a jetski and an advertorial for penis enlargement is demonstrated to maximize ad revenues, the algorithm will happily display exactly that, even if the surrounding content is about, say, a church renovation. The reason, again, is that these ad providers are business-centric and their algorithms maximize simple revenue-based metrics.

Subliminal messaging: displaying a message such that it is not consciously perceived by the receivers

Subliminal messages are by definition non-transparent, as the user is unaware that an advertisement is being presented. Even though the message is perceived unconsciously, the perception triggers an instinctive response.

Subliminal advertising is forbidden in some western countries, e.g. the United Kingdom, and for a very good reason: by exposing citizens to subliminal messages their perception of reality is subconsciously and involuntarily altered. Hence it is a fundamental violation of the right to self-determination and a violation of the right to a personal life.

In essence, most of the advertisements I have discussed are forms of subliminal messaging, even if the ad itself is visible, because the neural processes that determine your response are physiological in nature and occur subconsciously. For example, simply being confronted with a brand in combination with a positive sentiment will increase the likelihood that you will have a positive sentiment when confronted with that brand later on, and even the exposure itself, regardless of the association, will increase that likelihood. The reason that common advertisements, say of the billboard type, are accepted is that it is clear to anyone viewing the advertisement that it is meant to convey some kind of promotional information with the intent to persuade you. Even if there are subconscious processes that steer your preferences, at the very least you are aware that this might occur. With native advertisements, ad providers are dodging the proverbial legal bullet by mentioning the fact that an advertisement is presented, while it is clear that the way in which they present this notification is too inconspicuous to create awareness in the user. An example is shown below: in the top-right part of the advertisement reel you see an indication that these links refer to advertisements. Besides the small size of the indicator, it is well known that in graphical interfaces the top-left attracts the most attention and the top-right probably the least; moreover, the competing words 'From The Web' are boldfaced and set in a larger font size. This ties in closely with the aforementioned information addiction, a habit which develops inconspicuously.

Ad reel on theatlantic.com

Even more inconspicuous is the conflation of on-site articles with actual paid links under the umbrella term “promoted links” as is done in the following example on theguardian.com;

Ad reel on theguardian.com

This poses a moral hazard, because the uncontrolled response following this unconscious ad exposure amounts to a reduction of our autonomy. Considered per individual instance, this reduction of autonomy is benign, but when the user is repeatedly exposed to these inconspicuous advertisements it can be used to alter behavior on a large scale. Realise that these ad platforms, Revcontent, Gravity, Taboola and Outbrain, can reach billions of users. It is easy to imagine that the ad platforms can be swamped with clickbait-type fake-news headlines, where a state actor funds the initial clicks so that the ad networks push the headlines through their networks.

Location-based advertisement: targeted advertisement in the context of your current location

Using technology that triangulates the strength of your Bluetooth or WiFi signal, your location in shopping malls, streets or stores is tracked, stored and sold. This basically means that your phone's MAC address is logged and processed. The resulting behavioral data is sold to data brokers, marketeers and shop owners, for advertising and for shop layout/inventory optimization. In fact, this happens in real time so that just-in-time marketing can take place. Location-based advertisements are a form of targeted advertisements.

Targeted advertisement and behavioral targeting:

Basically the commercial or ideological application of the psychological phenomena discussed earlier, i.e. confirmation bias, framing, priming and cognitive consonance, to more effectively push or nudge a user towards certain behavior. An increasingly common practice is just-in-time marketing: marketing that is pinpointed to your very specific context at a very specific moment in time. This type of marketing relies on up-to-date information such as

  • your consumer characteristics
  • your recent consumption behavior
  • your location, direction of travel

Just-in-time marketing relies primarily on data that may be considered highly personal depending on the context. However, there is so much data available on your preferences that, combined with the available demographic data (such as your age and occupation) and your contextual data (such as your location), something like psychographic hacking becomes a possibility: the idea that, given enough of the right information, it can be inferred what type of information should be offered to nudge you in a particular direction.
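
A minimal sketch (Python, with invented profile fields and an invented rule) of what such just-in-time, context-aware targeting boils down to: stored preferences, demographics and the current location are combined into a single nudge decision. Real systems learn such rules from behavioral data rather than hard-coding them.

```python
# Minimal sketch of a just-in-time targeting rule.
# The profile fields and the rule itself are invented for illustration.

profile = {
    "age": 34,
    "recent_searches": ["running shoes", "knee pain"],
    "location": "city centre",
    "near_store": "SportsShop",       # inferred from WiFi/Bluetooth tracking
    "time_of_day": "lunch break",
}

def pick_nudge(p):
    interested = any("shoes" in q for q in p["recent_searches"])
    if interested and p["near_store"] == "SportsShop" and p["time_of_day"] == "lunch break":
        return "Push notification: 20% off running shoes, 50 metres away, next 30 minutes"
    return None

print(pick_nudge(profile))
```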

Malvertising: the injection of malware through advertisements

From the Wikipedia-description:

“In 2012, it was estimated nearly 10 billion ad impressions were compromised by malvertising.”

How does this work? Either by malicious scripts that are activated once the ad is loaded, or by using the media interpreter as the transporter of a script that is hidden in the medium. The advertisements you see are digital media formats, and each format has an interpreter that translates the binary content into the rich media displayed on your screen. Rich media formats such as .swf (i.e. Flash) are notorious for this reason, as they can also execute malicious scripts. A media format such as .GIF can be used to transport malware onto your system, and exploits in .JPEG and .TIFF have been used in the past to execute scripts.

The ad-platforms and the ad-publishers are partially to blame for displaying malicious advertisements, since they are responsible for vetting (or rather for not vetting) the ad-providers.

Leakware: a type of ransomware that hijacks your computer or software

Why is this in the list of definitions? Because this form of ransomware uses your personal information as leverage, and because the infrastructure that is built to gather your personal information is not, and cannot be, fully secure. Examples of large-scale publicized hacks are

  • Yahoo, login information for 1 billion email accounts
  • LinkedIn, passwords of 117 million users
  • JPMorgan, account information of 100 million customers
  • AdultFriendFinder, login information for 340 million accounts
  • Cloudflare, leaking encryption keys for large sites such as Uber, OkCupid, Fitbit and about 3,000 other sites, for several months

and so much more, captured in some awesome graphics online.

Another use of personal information is demonstrated by the ransomware Spora, which changes the ransom depending on whether it thinks you are a businessman or not.

Spyware: software and hardware that is designed to covertly gather personal information

We should note that adware, web beacons and tracking cookies are actually types of spyware as hardly any web beacon or tracking cookie is explicitly approved.

The following documentary by Al Jazeera illustrates why the average consumer should be wary of sharing their real identity online, or indeed any type of unprotected sensitive information. In a nutshell: not only is your personal information worth money, there are dedicated developers of spyware who steal your private communications and sell them to the highest bidder.

Leakware, malware and spyware are all exemplifications of an inevitable function creep pushed forward by the monetisation of personal information. This monetisation, and the infrastructure that facilitates this monetisation, starts with the advertisement industry and more particularly the combination of data brokering, ad publishing and real time bidding.

Cookie wall: the concept of a cookie-acceptance requirement to access a website or a service

Direct examples abound, at least if you visit websites of companies located in the EU. On the one hand the cookie wall is an annoying result of legislation aimed at protecting privacy, and on the other hand it demonstrates that such legislation is not enough to protect privacy, because the choice to either accept cookies or have no site access at all is not really a choice.

Google leaves little choice…

The European Commission wants to go one step further. Instead of the requirement for individual websites to have opt-in cookies, there will be a central browser-based cookie switch. If the user decides to reject cookies in the browser, he will simply be denied access to websites that claim to require cookies. This means that to have access to individual sites that claim to need cookies, one has to temporarily turn on cookies for all sites. Not surprisingly, the European Commission announced the regulations as a boost to the data economy. The new ePrivacy directive may still be amended or rejected following suggestions from the European Parliament; however, the European Commission is not legally bound by the advice of the EP.

A proper regulation starts with the distinction between first- and third-party cookies, where the latter are not necessary for technical reasons and may result in personal data being stored, processed and sold by said third parties. Already, first-party cookies, which are necessary for a normal user experience, can be placed without permission, and if the proposal of the EC were to go through unaltered it would de facto allow the same for third-party cookies.

Lobbyism in practice? source

A good indirect example is the often obligatory access to private information granted to cell phone apps: from your agenda and your contacts to your messages, and even access to your camera and microphone. In the case of Facebook this goes two ways: first, you need to accept full access to your private information, and second, you need the app in order to get access to the web-based interface on different machines.

Cookie syncing and Super cookies: Cookies that are persistent through the synchronisation of individual cookies

How does this work? Take, for instance, theatlantic.com, and suppose that you are reading an article. While you are reading the article there is periodic communication with the following addresses (observed February 3, 2017):

  • ping.chartbeat.net: 1x1 .img file in the response, and request-query parameters such as the source, the address, identifiers, the genre and the author.
  • googleads.g.doubleclick.net: cookie identifiers.
  • edge.simplereach.com: keep-alive response, and request-query parameters such as the title, the author, the genre and of course identifiers.
  • quantserve.com: real-time bidding for ads, content information and identifiers.
  • krxd.net: a pixel.gif in the response, and request-query parameters with my browser, my operating system, my country and province/region, article information.
  • adnxs.com: my IP address in the response as X-Proxy-Origin, user identifiers and cookies.
  • scorecardresearch.com: information about the articles and identifiers.
  • nexac.com, openx.net, jivox.com, c3tag…and many more, believe it or not.

You might refer to these communicators as beacons, all of which are third-party cookies, meant to receive and broadcast your user activity and to facilitate the publishing and auctioning of advertisement space. The basic operation is that the host page (e.g. on theatlantic.com) performs an HTTP request that contains consumer/meta/content information and then receives either an empty response or a small image, with some identifiers in the URL (probably to establish a trail of breadcrumbs).
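
The sketch below (Python, standard library only; the endpoint and parameter names are invented) shows the shape of such a beacon call: the page script bundles identifiers and article metadata into the query string of a request for a 1x1 image, so the interesting payload is the URL itself, not the response.

```python
# Sketch of a tracking-beacon request as issued by a page script.
# The endpoint and parameter names are invented; compare them with the
# real query parameters listed above (identifiers, title, genre, author).
from urllib.parse import urlencode

params = {
    "uid": "a1b2c3d4",                  # cookie / user identifier
    "url": "https://example-news.com/some-article",
    "title": "Some article title",
    "genre": "politics",
    "author": "J. Doe",
    "ref": "facebook.com",              # where the visitor came from
}

beacon_url = "https://tracker.example.net/pixel.gif?" + urlencode(params)
print(beacon_url)
# The response is just a 1x1 GIF; the personal data has already left
# in the request, to be logged server-side and resold.
```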

Connecting to 11 websites through the browser led to connections with 73 third-party sites. The tool used here is Redmorph (for Chrome); Lightbeam is a similar tool for Firefox.

In the more extreme case, special HTTP headers are injected into your HTTP requests. This can happen through (malicious) software that re-routes your traffic through a proxy (a man-in-the-middle attack), or directly at the internet service provider. The latter was basically the exploitation, by an advertisement company, of a permanent header called X-UIDH that was injected by the ISP Verizon, who themselves used this header to track customer behavior.

Let me spell this out more clearly: whenever you visit a website that contains such beacons, there is periodic communication with multiple third-party servers that register what you are watching or reading, and that potentially synchronise their cookies with other cookies based on your IP address, the email address you use to log in, or identifiers that persist over multiple websites and multiple sessions because you use, for example, a generic third-party login script (such as those from Facebook and Google). This implies that, in principle, all of your online activity on websites that contain such beacons can be centrally monitored, and it does not matter whether you delete your cookies, since persistent identifiers such as IP addresses are used.
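
A small sketch (Python, with invented log entries) of why deleting cookies does not help much: as long as a persistent key such as an IP address or a hashed login email appears in the beacon logs of different sites, the records can simply be joined into one cross-site profile.

```python
# Sketch: joining beacon logs from different sites on a persistent identifier.
# The log entries are invented; real data brokers do this at far larger scale.
from collections import defaultdict

beacon_logs = [
    {"site": "news-site.com",     "ip": "93.184.216.10", "item": "article on election"},
    {"site": "shopping-site.com", "ip": "93.184.216.10", "item": "searched for baby clothes"},
    {"site": "health-site.com",   "ip": "93.184.216.10", "item": "read about diabetes"},
    {"site": "news-site.com",     "ip": "198.51.100.7",  "item": "article on football"},
]

profiles = defaultdict(list)
for entry in beacon_logs:
    profiles[entry["ip"]].append((entry["site"], entry["item"]))

for ip, activity in profiles.items():
    print(ip, activity)
# One IP address now links political interest, purchases and health concerns,
# regardless of whether the user cleared any cookies in between.
```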

This data regarding your online activity is actively processed by data brokers and marketing companies to build a personal profile that is either sold to 3rd parties (like telco-providers) or used directly for recommendation services.

Stateless tracking ("fingerprinting"):

Supercookies can be described as stateful trackers, requiring back-and-forth communication to establish an identity. There is also stateless tracking, which establishes a "fingerprint" of a user's device and software. Where supercookies can at least be detected, stateless trackers are passive.
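
A minimal sketch of the idea (Python; the attributes are a small, invented subset of what real fingerprinting scripts collect): hashing a handful of device and browser attributes that rarely change yields a stable identifier without storing anything on the user's machine.

```python
# Minimal sketch of stateless (fingerprint) tracking: no cookie is stored,
# yet the combination of fairly stable attributes identifies the device.
# Real fingerprinting scripts use many more signals (canvas rendering,
# installed fonts, audio stack, etc.); these fields are illustrative.
import hashlib
import json

attributes = {
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/52.0",
    "accept_language": "nl-NL,nl;q=0.8,en;q=0.6",
    "screen": "1920x1080x24",
    "timezone_offset": -60,
    "installed_plugins": ["PDF Viewer", "Flash"],
}

fingerprint = hashlib.sha256(
    json.dumps(attributes, sort_keys=True).encode("utf-8")
).hexdigest()

print(fingerprint[:16])  # same device/browser -> same identifier, visit after visit
```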

According to Mayer & Mitchell, 41st Parameter/AdTruth, BlueCava and Iovation use fingerprinting to track users. This technology has found another use among malware distributors, because they can use fingerprinting to detect so-called honeypots, the machines used by malware investigators.


What are the issues with automated newsfeeds?

Filter bubble: the concept of being shielded from opinions and information that do not resonate with your own personal beliefs; see this TED talk by Eli Pariser.

The filter bubble is supported by anecdotal evidence, and increasingly by empirical data which suggests that we receive information from a decreasing variety of sources through social media and that cross-cutting information is even suppressed. Other research shows that this can generate echo chambers, wherein like-minded individuals reinforce their preconceived ideas. This Pew survey seems to suggest that people on Facebook are exposed to a diverse set of opinions. However, the fallacy here is that the mere concept of diversity is dependent on the context and the individual. If individuals state that they have received diverse opinions, they are considering diversity from their own frame of reference. A similar fallacy is at work in the paper by Boxell et al.; moreover, they consider a time period up to 2012, when the use of recommendation engines for news articles was not at all common. Most importantly, Boxell et al. consider completely different age groups, and it is well known that political inclination changes with age and in fact becomes more polarised with increasing age.

Pew

Furthermore, we need an objective diversity measure from a single frame of reference, for the simple reason that then and only then are we able to study its evolution over time and can results be compared quantitatively with other research. A recent publication denies the existence of a filter bubble merely because, compared to non-social-media users, social-media users were more likely to be exposed to news from both sides of the political spectrum. This completely ignores the evolution of those figures for the different ideological subgroups. The filter bubble hypothesis and its underlying mechanism suggest that the ideological distribution becomes more polarised, i.e. flatter in the middle and higher towards the edges; in other words, the evolution of the following distribution can make or break the filter bubble hypothesis. The figure below shows that more polarised information is shared more often, which is to be expected given the importance of salience for user response.

Alignment of opinions among information sharers on Facebook, source

Another important point that is often missed in the filter bubble discussion is that the filter bubble is the result of the personal preferences of the individual. If the individual has a broad preference for topics, his or her filter bubble will obviously be larger. This means that when you define different groups based on their use of social media, you should make sure that you are not making a pre-selection in terms of broadness of interest.

Another Pew report contradicts the assertion of an overall diverse exposure and shows that the media landscape for conservatives and liberals is qualitatively different. Another fallacy, or weakness, of these and other reports is the static nature of their results. When studying the presence of a filter bubble one should not look at the present state of the exposure and its diversity measure, but rather at the evolution of those metrics. Most importantly, the filter bubble does not constitute an information sphere with an impermeable border; it constitutes a semi-permeable border that is more likely to let through information that increases cognitive consonance.

If one is exposed to a narrow band of opinions that is consonant with one's own opinions, the individual's own measure of diversity will likely be skewed. I.e. even if someone thinks he or she is receiving a diverse set of news items, it may look very biased to an outsider.

Two distinct communities based on political retweets, the left/right leaning prediction is 87% accurate, Conover et al.

A side effect of these filter bubbles is that citizens can be easily identified as belonging to any particular political group, say a group of dissidents, by for instance the government, insurance companies or potential employers.

Moving from being surrounded by opinions that resonate with your own to being surrounded by dissenting opinions was hard to bear for the groups of left- and right-wing voters who swapped newsfeeds on Facebook in an experiment by the Guardian. It seems reasonable to suggest that this feeling of resentment when exposed to dissenting opinions becomes stronger the longer one has been inside the filter bubble. I.e. the longer your viewpoints go unopposed, the more you resist dissenting opinions.

“For too many of us, it’s become safer to retreat into our own bubbles, whether in our neighborhoods or college campuses or places of worship or our social media feeds, surrounded by people who look like us and share the same political outlook and never challenge our assumptions.
The rise of naked partisanship, increasing economic and regional stratification, the splintering of our media into a channel for every taste — all this makes this great sorting seem natural, even inevitable. And increasingly, we become so secure in our bubbles that we accept only information, whether true or not, that fits our opinions, instead of basing our opinions on the evidence that’s out there.” — Barack Obama

Further empirical research is needed to determine the prevalence and severity of the filter bubble, but the fundamental mechanism to create it is already given by the interplay of cognitive biases and the modus operandi of revenue-driven recommendation engines. The question is: how is the mechanism that enables a filter bubble counteracted by neutral news platforms, open discussion and day-to-day interaction with non-like-minded people (say, at work)? For filter bubble research it is necessary to

  • define cross-topic diversity metrics (a sketch of such a metric follows this list)
  • apply a fixed frame of reference to measure the polarity of opinions
  • track the polarities over time
  • segment the measuring/survey groups by type of news consumption, and extremity of the personal ideological inclinations
  • distinguish between active information retrieval and passive information retrieval
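
As a sketch of the first few points, one could assign each consumed item an ideological score on a fixed scale and track, per user, both the spread of sources (e.g. Shannon entropy) and the mean and variance of the scores over time. The scores and items below are invented; the point is that a fixed frame of reference makes the metric comparable over time and across studies.

```python
# Sketch of a simple diversity/polarity metric over a user's news diet.
# Ideological scores on a fixed -1..+1 scale and the items are invented.
import math
from collections import Counter

consumed = [  # (source, ideological score of the item)
    ("site_a", -0.8), ("site_a", -0.6), ("site_b", -0.7),
    ("site_a", -0.9), ("site_c", 0.2),
]

scores = [s for _, s in consumed]
mean = sum(scores) / len(scores)
variance = sum((s - mean) ** 2 for s in scores) / len(scores)

counts = Counter(src for src, _ in consumed)
total = sum(counts.values())
source_entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())

print("mean polarity:", round(mean, 2))              # how far from the centre
print("polarity variance:", round(variance, 2))      # how broad the band of opinions
print("source entropy:", round(source_entropy, 2))   # diversity of sources
```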

Also, and this is very important, researchers in this field should realise that the application of automatic recommendation engines to online news articles is fairly recent. Scalable algorithms for online (as in live) news recommendations only appeared in the literature from about 2007 and were likely adopted by industry a few years later.

The use of automatic recommendation engines can create a positive feedback loop because it is linked directly to cognitive consonance. The proposition ‘the internet will lead to more polarisation’ implies that effective personalised communication is by definition the dominant information source and this has not been the case yet.

What will happen, for instance, if working remotely becomes more prevalent, or indeed if unemployment soars due to massive automation, and more and more time is spent online? What happens if the traditional mass media, with human editors, also start to apply recommendation engines? What happens if the algorithms behind the recommendations become 100% accurate? I.e. it is absolutely crucial to, at the very least, hypothesize limit cases.

Information polarisation: the idea that gradually, a person is exclusively exposed to a specific world view through a reinforcing feedback mechanism

microscopic: if you combine the echo chamber with an algorithm that suggests new articles merely based on the increased likelihood of clicking on articles similar to those seen previously, you get an increasingly narrow information slit.

macroscopic: due to the bandwagon effect, the importance of salience and the narrowing information slit of individuals, on a macroscopic level the information slit will also become smaller. This is self-reinforced by the effect of ranking on click-through rates and the dependency of ranking on popularity.
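
A toy simulation (Python; the topic set, initial weights and reinforcement step are invented) of the microscopic loop: the recommender boosts whatever the user clicked, the user tends to click what is shown, and the topic shares in the feed drift apart until typically one topic dominates.

```python
# Toy simulation of the recommend -> click -> recommend feedback loop.
# Parameters are invented; the qualitative narrowing is the point, not the numbers.
import random

random.seed(0)
topics = ["politics", "sports", "science", "celebrities"]
weights = {t: 1.0 for t in topics}   # the recommender starts out neutral

def recommend(weights):
    # Pick a topic with probability proportional to its current weight.
    total = sum(weights.values())
    r = random.uniform(0, total)
    cumulative = 0.0
    for topic, w in weights.items():
        cumulative += w
        if r <= cumulative:
            return topic
    return topic

for step in range(200):
    shown = recommend(weights)
    # Every click makes the recommender boost that topic further.
    if random.random() < 0.6:
        weights[shown] *= 1.1

total = sum(weights.values())
shares = {t: round(w / total, 2) for t, w in weights.items()}
print(shares)  # the feed ends up heavily skewed towards a few topics
```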

Convergent thinking: the idea that the combination of the effects of wisdom of the crowd, echo chambers, confirmation bias and filter bubbles has a diminishing effect on the diversity of opinions and the effectiveness of pluralism.

As a proxy the convergence of taste is probably easier to demonstrate.

doi=10.1.1.399.6701, Lui et al.

The above picture displays a network of books sold connected to books that are suggested. The squares indicate the neutral books, the diamonds the conservative books and the circles the liberal books; the different colors indicate two communities as identified by a clustering algorithm. Clearly, the liberal and conservative groups are each inside a 'bubble'.

Function creep: the idea that a functionality enabled for the purpose of doing good can unwittingly shift to the purpose of doing bad.

A good example of this is Gmail. Google has an infrastructure in place to monitor your emails, determine the relevant products, allow for real-time bidding and place ads. Without any stretch of the imagination this infrastructure can be used by the government to monitor emails for security purposes, and it makes the argument against such monitoring much weaker. As the email user has already agreed to the use of personal information for commercial advertisements, why not use it to protect national security?

In fact, one might argue that the interests of the big information-driven technology companies align with the interests of the intelligence agencies. Shoshana Zuboff introduced the term military-informational complex, to describe this alignment of interests and the drive towards 'perfect control'.

We know from the documents released by Edward Snowden that this in fact has happened, and is most likely happening at this moment.

Another example is Facebook, who recently made a censorship tool in order to be able to enter the Chinese market. Who is to say that this exact same tool will not be used by other governments, and that Facebook will not use it for their own commercial benefit, for instance to safeguard access to markets?

The biggest potential function creep of all is the creation of an infrastructure, a set of methodologies and algorithms, to monitor, categorize, evaluate, judge and manipulate citizens.

Case in point: the large-scale, systematic use of software exploits to hack into communication devices. Most recently the Vault 7 leak from WikiLeaks exposed not only the CIA actively collecting and applying such exploits, but also a market exchange of sorts for personal information involving other intelligence agencies, such as GCHQ and the NSA, and cyber arms contractors. As more and more personal information is shared between more and more devices, it becomes more attractive to maliciously use (or even create) exploits, while at the same time the number of exploits increases due to the larger number of devices.

Case in point: the combination of behavioral targeting and machine learning technology applied to the unholy task of nudging citizens to vote for a particular candidate. Whether this had an actual effect is difficult to measure, but you should keep in mind that, particularly for the elections in the United States, there is only a margin of a few percent between the candidates, with a horse race for each state, and, as we discussed, we are also influenced subconsciously. So the claim that, for instance, fake news has had no influence on the elections, based on a survey regarding the recollection of received fake news, is at best scientifically dubious and at worst naive, since it implies that people are aware of the external influences on their state of mind.

Data analysis: you are doing it wrong

Case in point: Sesame Credit, sold as a credit scoring system that promotes transparency and honesty, is really the precursor of a social credit system. Is this really inconceivable in the west?

Technological progress poses a threat to privacy by enabling an extent of surveillance that in earlier times would have been prohibitively expensive.

— U.S. v. Garcia, 2007

No it is not. In a fragmented form it is probably already in place.

Other examples are Google DeepMind's automatic lip-reader and the public face recogniser FindFace used on VKontakte, a technique which in one form or another has been implemented on Facebook: it is easy to imagine the surveillance possibilities this gives to malevolent actors. Let me spell it out for you: this technology enables anyone with minimal technical knowledge to scan through publicly available images, recognize your face automatically and attach metadata to it. Not clear enough? Suppose I take a frontal picture of a random person on the street; suppose I am the most perverted sadist on the planet and I want to abuse lonely women; I see a woman of my fancy, take a picture, and get a feed of her personal information.

More clarity required? OK, what about Google getting into the insurance business? The use of detailed and intimate personal information to determine insurance premiums will undermine the solidarity principle. Remember that this personal information was originally shared to cater for personalised advertisements. In fact, this personalisation has been presented to you as an argument to agree with the terms and conditions.

I already mentioned the facial-recognition algorithms used by Facebook and on Vkontakte. Besides identifying who you are, image data can be used to estimate certain traits such as your sexuality or your tendency towards criminal behavior.

The Evercookie is a JavaScript-based cookie built by Samy Kamkar that was used (or at least investigated for use) by the United States’ National Security Agency for tracking internet behavior on the Tor network.

Leaked presentation slide from the NSA

It is not hard to imagine that, in the meantime, a technology like stateless tracking has ended up on one of those slides.

Perhaps the most explicit and cynical example of function creep is Palantir, which is built on technology developed in Silicon Valley for completely different purposes, by a man who became wealthy from those technologies. Palantir works for the United States government; perhaps they worked on the disposition matrix, which was used to help determine kill targets.

Machine learning model: an inferred simplified description of reality that allows for the approximation of classifications based on observational data.

Take for example Facebook's automatic face recognition software, which is able to automatically detect your face and your friends' faces. This model will not be re-trained from scratch every time you go online; hence it is persistent between training periods. This information is stored in bulk and processed in bulk. The same holds for Google, which keeps track of your emails, your search queries and your browsing history. This data is stored and analysed periodically.

In fact, this same principle holds for all trained models. In other words, even if your personal data is not stored explicitly, your model is. That is, the model from which it can be inferred who you are, and perhaps what your sexuality or political preferences are, based on minimal online information such as your 'like' behavior, is stored and available for application.
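
A minimal sketch (Python, standard library only, with fabricated features and weights) of what such an inferred model amounts to: a handful of stored weights learned from 'like' patterns is enough to score a new user for some trait, even if the raw behavioral data used to fit them was never kept. In reality such weights would be fitted on very large numbers of users.

```python
# Sketch: an inferred model as a stored artifact separate from the raw data.
# The 'like' features and weights are fabricated for illustration.
import math

# This dictionary IS the model: it can be stored, sold and applied long after
# the raw behavioral data used to fit it has been deleted.
model_weights = {
    "likes_page_A": 1.3,
    "likes_page_B": -0.7,
    "likes_page_C": 2.1,
    "bias": -1.0,
}

def predict_trait_probability(likes):
    z = model_weights["bias"] + sum(
        model_weights[f] for f in likes if f in model_weights
    )
    return 1.0 / (1.0 + math.exp(-z))  # logistic score for the trait

new_user_likes = ["likes_page_A", "likes_page_C"]
print(round(predict_trait_probability(new_user_likes), 2))
```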

Privacy legislation should cover not just raw behavioral and personal information, but also the models from which personal information can be inferred.

Echo chamber/circle jerk: the notion that one is more likely to communicate with people that are like-minded

The reason is incredibly simple but robust: by avoiding non-like-minded people you avoid questioning yourself. This ties in closely with cognitive dissonance: it requires energy at the physiological level to restructure your thought patterns, whereas you feel a positive sensation when your ideas are confirmed (consonance).

This research states that selective exposure to information (which will naturally be the case in a filter bubble) can also generate echo chambers. I would like to reiterate that classical media (like newspapers) do not self-reinforce consonance-based cognitive biases, whereas online interactive media clearly do.

Groupthink: the idea that within a group, the desire for conformity and harmony leads to dysfunctional decision-making, self-censorship and intolerance to dissidence

This ties in with the chilling effect, where the fear of expulsion from the group leads to self-censorship.

Tribes, filter bubbles, echo chambers: all of these represent communicative ecosystems tied to a group that is related to your own identity. To some extent you will already be engaged in groupthink processes. Most of these groupthink processes will be benign, from friend groups to fan pages, and will not be detrimental to the diversity of opinions you receive and accept. But it is easy to see how online forums or chat groups predicated on a particular and specific ideological stance can quickly turn into a breeding ground for extremism. A good example is the online spread of Islamic fundamentalism or nationalist extremism in closed webspheres.

Chat bots and automated sock puppets: smart agents that verbally interact with humans/respond to events

I distinguish two types at present:

  1. Customer facing chat bots that facilitate large scale, low cost, customer engagement and feedback and complaints handling.
  2. Chat bots that respond to news events and public statements to maximise the reach and impact of ideological propaganda.

The second variant can be used commercially, and it is easy to see how: whenever there is a large news event that underlines the need for your product, you start a mini-campaign on social media in relation to this news event. It can also be used politically/ideologically. Analogous to the commercial application, an ideological or political actor can activate bots whenever the ideology or politics can be positively associated with current events. Whereas the commercial application merely carries a small risk of distorting news dissemination, the political/ideological application can effectively be used as automated propaganda.

This was demonstrated by chat bots on Twitter during the last US presidential elections: a significant percentage of the tweets sent in relation to Trump and Clinton originated from chat bots.

Onur Varol: “In this visualization of the spread of the #SB277 hashtag about a California vaccination law, dots are Twitter accounts posting using that hashtag, and lines between them show retweeting of hashtagged posts. Larger dots are accounts that are retweeted more. Red dots are likely bots; blue ones are likely humans.” businessinsider.com

The power of these chat bots is that they can be deployed at scale and can be online continuously. At this time (2016/2017) there are about 50 million Twitter bots. The use of ‘fast’ media such as Twitter and Facebook avoids discussions that would expose the true artificial nature of the actor and makes viral growth more likely. Making matters worse, people often do not read past the headline before sharing the information with friends, and are influenced by it themselves. This means that injecting fake news or propaganda by using multiple chat bots is relatively easy.

There will be another variant of the chat bot, the automated sock puppet: chat bots with a fake human identity that actively engage in online discussions to further a political/ideological agenda. This will be a natural evolution of applied artificial intelligence as AI researchers strive to pass more and more advanced Turing tests. I expect that, to prevent this, absolute transparency of online users will be propagated by state actors, e.g. by strongly coupling a personal identity directly to a unique online identifier.

Self-enforced truth, or rank-based bias: the idea that through link/recommendation-based ranking, information becomes authoritative in a self-reinforcing manner

What does this mean? The more an information source is displayed through search engine results and automated feeds, the more likely it is to be shared, which increases the likelihood of it being displayed to others, and the more likely it is to be accepted as fact due to sheer commonality. This is essentially the bandwagon effect, a type of cognitive bias, in action.

News feed item ranking effect on CTR, source

In a digital era where content consumption increasingly originates from search engine results and automatic suggestions, there is a point in time for each new website or app at which relatively few people have used it and yet it needs to be found or suggested to others. This means that either the content providers and search engines have the responsibility to pre-select these possible winners, or the product owners have to invest money in promoting their products. The more dependent companies or individuals are on these rankings and automated suggestions, the more they are willing to spend on advertising and the more they rely on special persuasion techniques to attract new customers.

I.e. the mere power to generate initial selections (seeds, you might call them) of top sites or top apps can determine what is and what is not a successful service or product. This power, which is basically a form of control over the supply of information, creates a demand for behavioral targeting and presents an economic tollgate for newcomers.

The same holds to a lesser degree for ideas. If the distribution of information is primarily dependent on search engine rankings and automated feeds, idealists will have to employ the psychological persuasion and addiction tools discussed earlier and will perhaps even have to pay the information gatekeepers to attract a reasonably-sized audience.

It will no longer be sufficient to rely on the power of the idea itself if the viral spread of information requires the cooperation of large commercial entities. In fact, it flies in the face of a free, open and neutral internet.

One obvious result is the creation of online monopolies, even without the network effect (which at least explains the dominance of current social media platforms). The mechanism has already been explained, and can be stated simply as

Popularity feeds popularity until it achieves exclusivity

which means that the window of opportunity to become a major player in online social-media markets is limited to the period of infancy of those markets.

To counter this, the importance of popularity and cost-per-click on ranking should be lowered in order to enable a larger pool of initial seeds to grow virally. I would even propose to completely ban financial incentives as weights for ranking news and otherwise informative articles.

Algorithmic discrimination: the idea that the use of algorithms for product placements may lead to a reinforcement of prejudice

The obvious form of algorithmic discrimination is caused by the reinforcement of discriminatory bias present in the data used to train the algorithm. There are relatively straightforward mitigations. To start with, the sensitive features (ethnicity, gender, religion, etc.) can be distributed uniformly over the classification; this means that the sensitive features themselves have zero predictive power but can still be used to distinguish clusters. Another fix is to equalize the true positive or true negative rate across the sensitive features with a threshold classifier.

This does require the acceptance by policymakers that, on aggregate, the performance of these algorithms will decrease: the amount of usable data effectively decreases due to the forced uniform distribution, the initial distribution over the sensitive features will likely have had predictive power which is now lost, and the optimization procedure is no longer focused purely on accuracy. However, if the assumed feedback effect of such algorithmic discrimination (ethnic profiling for instance) holds true, in time the algorithmic performance will start to increase.
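For the second fix mentioned above, here is a minimal sketch of what equalising the true positive rate with per-group thresholds could look like; the data, the sensitive feature and the target rate are all synthetic assumptions.

```python
# Sketch: equalise the true positive rate across two groups by choosing a
# separate decision threshold per group (synthetic data, illustration only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))
group = rng.integers(0, 2, size=2000)                  # sensitive feature: 0 or 1
y = (X[:, 0] + 0.8 * group + rng.normal(size=2000) > 0.5).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)
scores = clf.predict_proba(X)[:, 1]

def threshold_for_tpr(scores, y, target_tpr):
    """Smallest threshold whose true positive rate is at least target_tpr."""
    pos_scores = np.sort(scores[y == 1])
    idx = int((1 - target_tpr) * len(pos_scores))      # keep the top fraction of positives
    return pos_scores[idx]

target = 0.8
thresholds = {g: threshold_for_tpr(scores[group == g], y[group == g], target)
              for g in (0, 1)}

for g in (0, 1):
    mask = (group == g) & (y == 1)
    tpr = np.mean(scores[mask] >= thresholds[g])
    print(f"group {g}: threshold {thresholds[g]:.3f}, TPR {tpr:.2f}")
```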

Another form of discrimination is price discrimination. The use of algorithms to optimize the probability of a sale or a click can also be used to offer user-specific prices to increase the average revenue per user. Not only can that lead to people of low income being offered cheaper products, which in itself can be ethnically/racially discriminatory, but people of higher income can also receive similar products at higher prices, which is not only a violation of basic consumer rights but can be ethnically/racially discriminatory as well. The taxi company Uber has blatantly stated that it applies price differentiation based on A/B testing to maximize the price per ride based on the specific routing of a ride: i.e. travelling to wealthier neighborhoods will cost you more money. Not only is this a textbook example of price discrimination, with implications for the rights of individuals, it also has societal implications as it strengthens existing socio-economic inequalities.

Other examples are: paying more because you have less access to competitors (retail price discrimination), paying more because you pose a higher risk for non-payment/devaluation based on your demographic characteristics (credit & risk based discrimination), paying a higher premium because you have a different risk profile (undermining risk pooling/solidarity).

This is closely related to function creep where the facilitation of one purpose (recommendation of products) leads to the facilitation of another purpose (price differentiation).

Growing information appetite: The idea that to maintain profitability and competitiveness in a data driven economy, continuously more data must be gathered, and with more features from which more information must be extracted

The 'data driven economy' is reliant on training data to generate models and insights that are sold to, for instance governments, commercial departments and ad publishers.

To maintain the profitability of this new commodity, more and more detailed personal data is required to extract more (accurate) personal information. As in any market maturation, there is either an increased emphasis on economies of scale, and/or more product diversification, where in this case your personal data is the product.

So to stay competitive as a data/insights provider, not only must more data be gathered, but also more features. In other words, the inevitable consequence of a data driven economy that treats data itself as a commodity is a financial push towards lower privacy standards, either through lobby groups or by propagating low privacy standards in a normative sense: for instance, by offering premium discounts on your health insurance if you start wearing health trackers, by filling out a question form for your general practitioner on the site of the health insurance company, by kindly asking you to hand over your payment history for a mortgage loan, by storing your personal health information in the cloud or by enabling location sharing among friends.

A more direct demonstration is personal data mogul Facebook buying more, and more detailed, information about its users to feed into the ad recommendation engines. Why? To facilitate an increase in revenue, as expected by the shareholders.

Information asymmetry: You know very little about the people that know a lot about you and you do not know what they know about you.

Because you do not know what personal information is (and has been) gathered by whom, you cannot defend yourself against possible misuse of that information and, what is more, you cannot demand the removal of that information. This clearly violates the right to be forgotten.

Given that large corporations have access and control over this information in combination with a government that can in principle force the handover of this information, we in principle have a society-wide imbalance in power. It is easy to see why a government would push the so-called data economy; they are basically outsourcing and enabling mass surveillance with little to no democratic scrutiny.

There is another aspect, often overlooked; a deterioration of the online negotiation position of consumers. Simply knowing who a particular consumer is will enable an estimate of the purchasing power and the individual demand. This undermines one of the fundamental principles of the free market and leads to the aforementioned price discrimination.

Chilling effect: The idea that due to a fear of persecution and social repudiation people refrain from exercising their freedom of speech.

From available online data it can be inferred who you are: even without explicitly mentioning your identity, you will have left breadcrumbs that lead back to your personal identity. If this is not commonplace now then, given the drive towards more data gathering, it will become so in the future.

This realisation/awareness leads to an inevitable chilling effect whereby online information dissemination becomes increasingly benign in terms of government critique, level of controversy and predictability and therefore easy to control. Anonymity and privacy are key in the exercise of the freedom of speech.

We allow humour, satire or social commentary related to these topics, and we believe that when people use their authentic identity, they are more responsible when they share this kind of commentary. Facebook community standards

What can this result in?

Unfiltered broadcasting of fake news

For instance, a prime minister allegedly violating a dead animal: this unconfirmed tabloid article went viral on social media without any editorial filtering. Misinformation often comes in the form of click-bait articles, aimed to trigger our curiosity. Even though reading the content might betray its falseness, people often do not read much beyond the headline. This dissemination can happen on all non-edited social media platforms, and indirectly on search engines as well. During the United States elections fake news was mostly in support of the conservative candidate, but there is no reason to believe that this is specifically related to any political ideology; its usefulness for the progressive left has already been demonstrated. The psychological mechanisms that facilitate the success of fake news are a human characteristic, not some conservative or liberal tendency. Then, once its effectiveness in changing the perception of voters is recognised, fake news becomes a tool to either obtain or maintain power.

An indirect effect is that as ‘fake news’ becomes a household term, critical journalism can easily be dismissed as such, especially since a large part of the populace obtains their news through less established news outlets. As the credibility of the different news platforms is difficult to verify, a general mistrust of mass media ensues. From that perspective, the eagerness of governments and large media corporations to create fact-checking platforms should be regarded with the utmost suspicion, as it basically forms the stepping stone to a large-scale consolidation and control of news sources. I already mentioned an example in the introduction: Germany has recently put a law in place that aims to curb not just hatespeech but also fake news by obliging social media platforms to remove such content under threat of high fines. Given the complexity of defining hatespeech and fake news it is inevitable that this will not take place with the utmost prudence. In fact, even though Facebook has several thousand employees dedicated to filtering out such content, it is impossible to thoroughly check the billions of messages that are posted every day.

I suspect that, by selectively mistrusting news sources that show dissonant information, people will tend to flock towards news sources that confirm their opinions, regardless of their actual credibility. I.e. information polarisation becomes more severe and the effectiveness of pluralism decreases.

A concrete, direct example of the risk that fake news poses is the Pakistani minister who responded in earnest to a fake tweet from the Israeli government regarding nuclear aggression, and of course there is pizzagate, where a man armed with a submachine gun responded to fake news regarding a pedophile organisation, Hillary Clinton and a pizza restaurant (sounding like the meme it is).

In 2017 there will be several important national elections that can shape the future of the European Union, namely the general elections in the Netherlands, Germany and France, all of them founding nations of the EU. Already, fake news is being directed at Angela Merkel, and it is expected that the Dutch elections will be targeted as well, and in fact all countries that are allied to the United States. If you wonder about the truthfulness of these latter references in light of the earlier discussion, you have already underlined the importance of treating this problem seriously.

There is another threat on the horizon. The fake-news toolbox will be expanded with technology that enables mimicking your voice or your face, or even you entirely.

Face2Face
VoCo

Disturbing as this may sound, and it is disturbing, the counter-effect is perhaps more alarming:

  • Automatic determination of truthfulness based on meta data and semantic characteristics: what about those false positives?
  • Truthfulness based on connectivity and so-called domain-authority: again, what about those false positives?
  • blacklisting and whitelisting of untrusted/trusted websites: dissenting opinions are not limited to approved platforms, are they?
  • Notice-and-takedown procedures, legal requirement to take down information once it is flagged as false (e.g. by the government): will the information platform have enough incentive, enough means and enough time to thoroughly check these claims?

all of which are forms of algorithmic censorship. This cannot be solved with more technology alone, certainly not if we stick to the dramatically failed paradigm that information-bites should always aim to please us. There is more at stake here than the revenue of an ad company or the average level of pleasure someone receives. There are deeper, more profound human values that cannot be encapsulated in the probability that you make a purchase or click on a banner. Enforcing the use of encrypted data streams with HTTPS and DNSSEC to prevent the hijacking of information flows mitigates only a subset of the possible abuse cases. The same applies to blockchain technology as a means to ensure information authenticity. Why only a subset? Because actors with enough means can manipulate the blockchain, can copy and recreate the original information and serve it with other HTTPS certificates, and if need be can create fake websites that contain the manipulated media with their own blockchain entries and with full DNSSEC and HTTPS support. One possible solution is to monitor news, including rich media, check for items that are highly similar, detect the most salient/distinguishable differences and then, based on media prevalence weighted by source credibility, identify the version that is most likely correct.
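As a rough illustration of that monitoring idea (ignoring rich media and prevalence weighting for brevity), something along these lines could group near-identical stories and surface the version carried by the most credible source; the article texts and credibility weights below are invented.

```python
# Rough sketch: group near-identical news items and prefer the version carried
# by the most credible source (texts and credibility weights are invented).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = {
    "site_a": "Minister announces new privacy law covering data brokers.",
    "site_b": "Minister announces new privacy law covering data brokers and ad firms.",
    "site_c": "Minister secretly cancels all privacy laws, sources say.",
}
credibility = {"site_a": 0.9, "site_b": 0.8, "site_c": 0.2}   # assumed weights

sources = list(articles.keys())
texts = list(articles.values())
tfidf = TfidfVectorizer().fit_transform(texts)
sim = cosine_similarity(tfidf)

# Greedy grouping: articles more than 60% similar are treated as one story.
groups, assigned = [], set()
for i in range(len(texts)):
    if i in assigned:
        continue
    members = [j for j in range(len(texts)) if sim[i, j] > 0.6]
    assigned.update(members)
    groups.append(members)

for members in groups:
    best = max(members, key=lambda j: credibility[sources[j]])
    print(f"story carried by {[sources[j] for j in members]}: "
          f"most credible version from {sources[best]}")
```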

An a-posteriori true-or-not check can be done using collective intelligence, by simply monitoring whether the readers think the information is true or not. However, this risks naturally filtering out information that is strongly dissenting, as dissenting opinions do not resonate with the general population, and using fact checkers may result in exactly the opposite, where the increased credibility is used to disseminate false information.

Although the above cartoon oversimplifies a dynamic reality, it does provide some inspiration for metrics that can be used to evaluate news content. For instance, it is possible to estimate the so-called polarity and the degree of subjectivity based on the text alone, and much more can be done if this is related to other articles on the same topic. Whatever the solution might be, the new media platforms have the obligation to minimize unnecessary algorithmic censorship and maximize user relevance.
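Polarity and subjectivity can indeed be estimated from text alone; here is a minimal sketch using the TextBlob library, which exposes exactly these two scores (the headlines are made up):

```python
# Minimal sketch: estimate polarity (-1..1) and subjectivity (0..1) of a
# headline from the text alone, using the TextBlob library.
from textblob import TextBlob

headlines = [
    "Government publishes quarterly budget figures.",
    "Outrageous betrayal! Corrupt elites destroy everything we love!",
]
for h in headlines:
    s = TextBlob(h).sentiment
    print(f"{h!r}: polarity={s.polarity:+.2f}, subjectivity={s.subjectivity:.2f}")
```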

The worst outcome is not that citizens are misinformed by automated political propaganda, but that they mistrust all media and are not informed at all.

Commercialisation of information dissemination

During the United States elections in 2016 it was revealed by Buzzfeed that over 100 pro-Trump sites were created and hosted in Macedonia, Europe. Why? Facebook will automatically display ads relevant to the information displayed on your timeline. Supposedly the expected click-through rate, cost-per-click and expected number of displays of pro-Trump advertisements were high enough to warrant setting up more than a hundred websites that disseminated false pro-Trump information on Facebook. This feeds on the concepts of cognitive consonance, which increases the click-through rate, and the filter bubble created by Facebook’s personalised news feed.

News itself has become a commodity because it functions as a carrier/facilitator of sponsored messages. The effectiveness of such sponsored messages increases roughly

  • the closer the news content is to the content of the sponsored messages
  • the more incentive the news content provides to consume products/services.

I.e. there is a dependency between the money earned with advertisements and the news content they accompany. This undermines the role of news media as a neutral monitor of the government, societal issues and international affairs.

In extremis this can lead to the situation that advertisements are not served with the news but the other way around.

To underline my point, the commonly used comment platform Disqus introduced sponsored comments in April 2014. Basically, this sponsored comment is placed on top of a thread that sits near a relevant article. This is the transparent variant of a more dubious practice: paid comments, i.e. a sponsored message disguised as an opinion (remember the importance of transparency?). This ties in with the sponsored content and native advertising that I discussed earlier.

Another example that underlines this point is the site theodysseyonline.com that has thousands of students writing clickbait articles with sponsored content under the guise of journalism. This 'clickbait'-factory rewards writers based on the monthly views of their articles.

Similarly, now with Facebook as the perpetrator: Facebook consciously decreased the importance of news from news outlets in favor of news from Facebook friends, to increase readership. To make matters worse, it seems people have a hard time distinguishing between real news and fake news, and may even prefer fake news over real news. The latter is easily explained by the fact that fake news is engineered for maximum effect, being more akin to targeted advertisement (and clickbait) than to actual news.

What is more, the majority of adult Americans get their news from social media, and it is no stretch of the imagination to assume this holds for all countries with a similar market penetration of social media actors.

Spammergate exposed River City Media as a professional spam distributor, but also disclosed that about 1.4 billion sets of email addresses and personal details had been obtained, either through hidden online forms or through online black markets.

Through offers such as credit checks, education opportunities, and sweepstakes, this spam operation has gathered and conglomerated a database of 1.4 billion peoples’ email accounts, full names, IP addresses, and often physical address. There is evidence that similar organizations have contributed to this collection. An active market exists for trafficking in these types of lists for illegitimate purposes. — source

A side-effect of the increased prevalence of ad-based and ad-driven information distribution is the increased influence of marketing practices over free speech, from at least two directions: 1. the requirement to invest a substantial amount of money in the successful dissemination of your opinion, and 2. the content limitations imposed by online ad publishers, e.g. Google not displaying advertisements next to 'controversial' content.

Commercialisation of personal data

The logical requirement for targeted advertisements is the incorporation of personal information in the ad platform. This personal data, obtained through cookies and online profiles, is part of a roughly $200 billion online advertisement market. As for the black market: in 2016 the market for cybercrime was $450 billion in size, which involves the exchange of malvertising software as well as the exchange of personal information.

So, your information is worth money. The information you unwittingly give is not only stored and processed but also resold to other parties, either in raw or in aggregated form. What is your personal data used for?

To facilitate this data brokerage there are over 4000 data brokers worldwide; Equifax, Towerdata, Acxiom, Experian and Epsilon are some of the largest, collectively storing information on billions of individuals with hundreds of datapoints per individual. What kind of data is stored, processed and sold?

Remember what I wrote about function creep and inference? The use of consumer data is not limited to retailers, what you buy, and what you search for online can statistically be indicative of healthy or unhealthy behavior. I.e. by inference the use of this data creeps from retailers to health insurance companies. You are most likely not even aware of the information that is extracted from the data you generate.

Also, have a look at these great articles

In Europe (the EU to be exact) the PSD2 (the revised Payment Services Directive) will come into effect in the near future. This directive aims to harmonize payment systems across the EU, horizontally and vertically. This basically means that:

Through AISP’s, third parties will be able to extract a customer’s account information data, including transaction history and balances.

Yes, third parties, e.g. Facebook, Google, Amazon, Paypal, etc. can access your transaction history if they get the right license, and no, they do not need to be banking institutions.

The 'requirement' of algorithmic censorship, long live community standards

As Facebook, or any other international social network, wants to serve automated content at a large scale, internationally, providing rich media for teenagers in Bangladesh and elderly people in Canada, it must use a broad brush when it comes to content constraints.

There is the matter of nudity, which triggers alarms whenever it is detected in images, even when its societal, historical or cultural context allows for a non-sexual interpretation: examples are the censorship of a 'naked' statue of Neptune (Facebook), the censorship of an iconic Vietnam war image (Facebook) and various examples for Instagram. What causes these false positives? Two words: safe space, which the social media platforms want to create for their users. For Instagram:

“We want Instagram to continue to be an authentic and safe place for inspiration and expression… Respect everyone on Instagram, don’t spam people or post nudity.”

and similarly for Twitter

and Facebook

We want people to feel safe when using Facebook. For that reason, we’ve developed a set of Community Standards,…

There is of course a problem with the idea of, on the one hand, creating a safe space on the individual level, and at the same time creating a platform that allows everyone to directly or indirectly interact with each other. An individual safe space for all requires continuous censorship and biased news selection. The safe space is a filter bubble.

Instagram literally wants every Instagram user to respect every other Instagram user, but based on what norms and what level of sensitivity? A safe space for whom? Generation ‘snowflake’, which has been conditioned to think it has a right not to be criticized? The religious fundamentalists who despise any criticism as blasphemy and use freedom of speech as a vehicle to abolish it? The alt-right and red-left who cannot bear the sight of each other's points of view? Are we truly too blind to see that there is no such thing as a ‘safe space’ if the truth can no longer be discussed without fear of being judged or labelled?

It is in my view naive to think that such a vague constraint will lead to anything but a chilling effect where the only true safe space is characterised by the deafening silence of opinions that are never heard because people are too afraid of being ostracised, publicly shamed or worse.

What the social media actors likely want to achieve with the ‘safe space’ objective is a maximisation of time-on-site and perhaps a lowering of your guard. Quite simply, if you truly see the main page of a social media site as your ‘safe space’ you are more willing to trust the content, and with that the article and ad suggestions.

The safe space that the large social media actors should be creating is a space in which people feel confident they can have an open discussion on any topic without the restriction of prejudice, bigotry or condemnation. This inevitably requires moderators, and moderators are expensive. It is much cheaper to work with a notice-and-takedown principle, similar to search engines when individuals want to be ‘forgotten’. This requires cheap labor all over the world, covering all time zones, reacting as soon as possible to flags raised by offended individuals or legal entities who feel that their rights have been violated. It implies hasty decisions by non-experts who probably reside in a culture and legal system completely different from those of the person who expressed the opinion and the person who was offended. In 2017 Facebook wants to have 7500 employees working on reviewing offensive content.

On the other hand this also implies that extremist, racist views are not dealt with as long as no flag is raised, and thus extremist groupthink can develop freely. At least racist views within an online community on a social media platform can be handled without censorship: it is easy to imagine that an algorithm can detect a concentration of like-minded racist individuals on, say, Facebook. If Facebook has the choice between either criminalising this community or nudging it in a different direction, then obviously the latter option is preferable as long as this community has not been flagged for hatespeech. Simply disbanding such communities will drive them into closed forums where moderation is no longer possible.

So to enforce such vague standards, human processing has to take place, triggered by the complaint of any user. Thus, the so-called ‘community standards’ of social media platforms basically mean that the lowest tolerance to dissidence, critique and vice becomes the norm. Obviously this global human-processing approach is not ideal, and it is expensive; enter the next step, machine-based filtering. The Google spin-off factory Jigsaw is developing Conversation AI, which is

..designed to use machine learning to automatically spot the language of abuse and harassment.. — source

I.e., a dedicated, automated tool to recognize a specific tone of voice and intent. Recall that the occurrence of function creep is unavoidable, and in this case quite obvious: if one can recognize ‘abusive language’, then surely the recognition of certain ideological tendencies is the next step. More than just the development of AI to recognize certain types of language, the designated use of this technology is large-scale deployment on social media platforms: i.e. ideal for any benevolent actor that wants to perform an ideological segmentation analysis on the population.

Twitter, although having fewer restrictions on the actual content, brought online censorship to a new level by purging alt-right accounts after the election of Donald Trump, supposedly because it is cracking down on hatespeech.

In the case of Facebook, the community standards are quite liberal in that they promote discussion and offer tools to avoid distasteful or offensive content. The problem here is that they provide no definition of hate speech, and there is no history of prior hate speech cases on Facebook (jurisprudence?), so from the perspective of the user censorship based on community standards is arbitrary, and the tools to avoid distasteful or offensive content will only strengthen the filter bubble.

In a recent response to the censorship and fake news controversy Zuckerberg wrote the following:

The guiding principles are that the Community Standards should reflect the cultural norms of our community, that each person should see as little objectionable content as possible, and each person should be able to share what they want while being told they cannot share something as little as possible. The approach is to combine creating a large-scale democratic process to determine standards with AI to help enforce them.
The idea is to give everyone in the community options for how they would like to set the content policy for themselves. Where is your line on nudity? On violence? On graphic content? On profanity? What you decide will be your personal settings. We will periodically ask you these questions to increase participation and so you don’t need to dig around to find them. For those who don’t make a decision, the default will be whatever the majority of people in your region selected, like a referendum. Of course you will always be free to update your personal settings anytime. — M. Zuckerberg

Which is an improvement, with two caveats: (1) the personal determination of what is, and what is not, acceptable will strengthen the filter bubble; (2) the information selection will take place automatically, so you don't know what you did not see. Let's give Zuckerberg some time to live up to these words; in the meantime he should start to realise that he can no longer claim that Facebook is just a technology company that happens to moderate the occasional racist video. Once you start to moderate, you are broadcasting to your users that you are responsible for the content, and then your 'censoring guidelines' will pile up, quickly.

My advice to Mark Zuckerberg: take lessons from Wikimedia, Reddit and StackOverflow with regard to community-built content, where moderators are not employees but site members. Appoint moderators per group/page and assign them responsibilities. Let users take ownership of their 'mini-platforms'.

I wrote about the companies that provide information sharing platforms and their ambition for 'safe spaces', but there is more. As I said in the introduction, laws are being drafted that force the platforms to indiscriminately, arbitrarily and extra-judicially decide on the acceptability of free speech. In Germany, such a law has been passed; in particular it requires social platforms to remove hate speech within 24 hours after receiving a report, with fines of up to 50 million euros.

A natural tendency to favor propaganda and hatespeech

Hatespeech tends to be spread virally by its supporters, viral within a closed community, but still: most search engines will not be aware of that. It will be a spike in interest, a trending topic, a hot page, call it what you will, but without the search engine dissecting the page in terms of its hatefulness, evaluating the context and applying algorithmic censorship, the promotion of popular hatespeech is inevitable and unavoidable. Compare this with traditional 'soft' media selection, where salient features and topics are favored over more nuanced issues.

Google might argue that it needs more information about the people that spread the contents, or that it can put websites on blacklists but then we arrive at an earlier point: this will lead to false positives, and is basically another form of algorithmic censorship when it was the algorithmic nature in the first place that enabled an artificial viral distribution.

The common belief is that viral marketing is caused by a natural cascade of increasing reach: humans providing individual advertisement to their peers, inspiring the peers of their peers to do the same, and so on. Is that still the case if users are actively steered in the direction of what their peers have watched or liked? What I have discussed so far is not just the mechanism that enables filter bubbles and echo chambers, it is also the mechanism, and indeed the infrastructure, by which ‘viral’ campaigns can be jump-started at will. The same holds e.g. for controversial topics and political scandals.

Thanks for the suggestions..

Steering of the public opinion by relatively few actors

There is a thing called the Search Engine Manipulation Effect: the ability to significantly influence the voting behavior of undecided voters by changing the ranking algorithm of search engines. It is the result of an amalgamation of the effects discussed here.

The significant effect of search engine rankings is just one example. Facebook has experimented with the relative number of positive/negative messages in its newsfeeds, and it showed that the positive/negative ratio had a significant effect on the average sentiment of the uploaded posts. More recently, it was demonstrated that minor changes in the presentation of information regarding voting led to significant changes in the number of votes.

Another example assumes a more benevolent actor in the form of a government that seeks to manipulate public opinion in a so-called spinternet. In the spinternet a large media actor, or a state, wittingly spreads false information or false opinions; the method is deceptively simple:

  • mechanical turks are hired to write propaganda on blogs and forums
  • fake news stories are created and peddled to renowned news sources for further distribution

The large media actor, or simply the actor that is powerful enough to steer media actors, can now control public opinion under the guise of social media activity. A simple example is the control that Facebook exercises over its newsfeed: young journalists who were employed as contractors for the sole purpose of curating the newsfeed reported that conservative news was suppressed during the election period.


But..we need a personalised filter, right?

Surely, without any kind of personalised filter we would be lost in the huge forest of information that can be found online. There is too much for us to handle, right?

Wrong!

We do not need a filter, we need information that is indexed properly! As for the ads being displayed, it is not in the interest of the consumer that he sees unsolicited advertisements at all.

Suppose I am searching for a particular type of product, to buy from a local store. In the personalisation paradigm you simply type the product name and it will serve you results based on your location and your preferences.

Or pick whatever example you like, in general it will be something like

{identifier of subject}

where in the personalisation paradigm, for your convenience, the following attributes are inferred from your personal data (among other data sources) that has been collected by the search engine provider, either from their own data set or from data purchase from third parties:

{description of subject} think of location, price, etc.

{subject type level 1}…{subject type level N} think of genres, topics, etc.

{information retrieval purpose} think of information, consumption, etc.

So, the price for not having to specify these labels is that the search platform needs to have (processed) your personal information. Whether that is worth it depends on the added cost of having to specify this extra information. For me personally:

I do not want a search engine to feed my cognitive consonance, I want it to facilitate my curiosity.

Whatever you search for can most likely be depicted as a tree, and you can walk through a tree very quickly, especially if you know what to look for. This does not require personalisation as much as it requires a very basic understanding of your search goals, an interactive search tree and advanced topic analysis for all indexed webpages. Combining this with an ‘intelligent’, on-demand query assistant leads to a new hybrid search engine paradigm: (1) on-demand personalised semantic search querying in combination with (2) an interactive search tree based on topics and relations.

In a collaborative effort to schematize indexed information and make it easily searchable, schema.org was established by Google, Yahoo, Microsoft and Yandex; such a schematised indexation would certainly help train this imaginary system. It goes too far to discuss the details of a new tree-based interactive search engine here, but I have no problem envisioning an engine without any personalisation, do you?
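To give a flavour of the interactive search tree, here is a toy sketch in which the user drills down through explicit topic refinements instead of being profiled; the topics and result URLs are made up.

```python
# Toy sketch of a non-personalised, interactive search tree: the user refines
# the query by drilling down through topics instead of being profiled.
from dataclasses import dataclass, field

@dataclass
class TopicNode:
    name: str
    results: list[str] = field(default_factory=list)        # indexed pages (made up)
    children: dict[str, "TopicNode"] = field(default_factory=dict)

    def add(self, child: "TopicNode") -> "TopicNode":
        self.children[child.name] = child
        return child

root = TopicNode("bicycle")
road = root.add(TopicNode("road bikes", ["shop-x.example/road", "review-y.example"]))
road.add(TopicNode("second hand", ["marketplace-z.example/road-used"]))
root.add(TopicNode("repair", ["local-repair.example", "howto.example/fix-chain"]))

def drill(node: TopicNode, path: list[str]) -> TopicNode:
    """Follow the user's explicit refinements instead of inferring them."""
    for step in path:
        node = node.children[step]
    return node

print(drill(root, ["road bikes", "second hand"]).results)
```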

Google's and Facebook's hunger for personal information primarily serves their business model, which is selling advertisement space, and only secondarily improves the quality of their search results.

Size matters?

Facebook, Google and Amazon are mentioned several times in this text; does that mean size matters? Yes, but only when it comes to individual exposure. Facebook, Google and Amazon are only three actors in a large playing field of information-based companies that strive for maximum readership, click-through rates, cost-per-click and return rate. The technologies they employ are, however, broadly used in e-commerce. These smaller companies are building in-house tooling for recommendation, personalisation and ranking. This is possible due to an influx of data analysts, the accessibility of high-level machine learning libraries and the scalability/affordability of computational capacity. This in-house tooling is likely proprietary, i.e. closed-source and thus non-transparent. The reason is simple: personalisation technology has become business-sensitive information.

It is therefore much harder to perform such analyses over a broad range of e-commerce companies, simply because there is not enough data; at the same time the combined effect of these smaller online companies might well be similar to that of Facebook, Google and Amazon.

Change, now. Hard lines that should not be crossed

What rights need to be protected? What is the minimum level of protection? What should be the penalty? I.e. what are the moral constraints at play here?

My information is mine

Personalised communication should be based on data that I control. Data that external parties can access only if I agree, when I agree and how I agree.

Suggestion: the current client-side cookies and any personal information now stored on site back-ends should be replaced by/upgraded to locally stored and fetched encrypted super cookies that can only be accessed on visiting the website, using a temporary public key. Online central data lockers only perpetuate the idea that your personal information is a marketable good and should be dismissed.
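As a highly simplified sketch of the direction I mean (using symmetric encryption for brevity rather than the temporary public key, and with invented field names): the profile stays encrypted on the user's device and only an explicitly consented subset is released per visit.

```python
# Highly simplified sketch: personal data stays encrypted on the user's device
# and only an explicitly consented, scoped subset is released per site visit.
# (Symmetric encryption used for brevity; field names are illustrative.)
import json
from cryptography.fernet import Fernet

user_key = Fernet.generate_key()          # stays on the user's device
locker = Fernet(user_key)

profile = {"name": "alice", "city": "Utrecht", "interests": ["cycling", "jazz"]}
encrypted_profile = locker.encrypt(json.dumps(profile).encode())

def release_for_visit(encrypted: bytes, allowed_fields: list[str]) -> dict:
    """Decrypt locally and hand the site only the fields the user consented to."""
    data = json.loads(locker.decrypt(encrypted))
    return {k: v for k, v in data.items() if k in allowed_fields}

# The web shop gets the city (say, for shipping estimates), nothing else.
print(release_for_visit(encrypted_profile, ["city"]))
```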

But what do I know, I am just throwing an idea out there, feel free to post your ideas in the comment section.

I should know what they know and what they think they know

To be able to make a conscious informed decision with regard to the sharing of personal information through a particular service one needs to know

  • with whom this information is shared and whether this is paid for
  • what other information is available regarding my activities
  • what aggregate conclusions they have regarding my person

Ideally, one is informed about the nature of the possible analyses that are intended; perhaps this should be categorised as e.g.

  • personal predictive
  • personal historical
  • aggregate predictive
  • aggregate historical

I should know when my personal information is being used

Think emoticons, flags, alerts, anything intrusive enough to make you realise that someone is watching you. This will undoubtedly cause an initial chilling effect, but it will do something else also. It will create a demand for technology that revolves around the protection, anonymisation and control of personal information.

Terms and conditions that make sense

Starting with a concise intelligible explanation of the most important aspects concerning my individual rights.

I should have a choice

A choice between a service and no service is not really a choice. A cookie wall is an example, being literally a digital wall standing between a customer and an online service. A deterioration is in sight in the European Union where, according to this law proposal, prior consent will be required for any kind of tracking and websites are allowed to block users that have ad blockers: hence the agreement to accept online tracking becomes a carte blanche acceptance for all websites. What about information retrieval, should we not have a choice which bubble we reside in? The start-up Refni hopes to tackle part of that question by allowing users to choose their bubbles for information discovery. The somewhat older start-up News360 offers something similar for news delivery.

In a more general sense: do we really have a consumer choice between free services that exploit, resell and distribute our personal information for advertisement revenue or non-free services that require a direct financial compensation but respect our privacy?

No reselling of information without explicit consent

The agreement regarding the information exchange should be between the service provider and the customer. Each other party that indirectly obtains this information should be explicitly mentioned in the privacy statements and the initial agreement. I.e. no laissez-faire data reselling.

Regardless of consent, shared personal information should be relevant for the service at hand. From the viewpoint of proportionality and subsidiarity, the information should be required for a legitimate aim and there should be no other less intrusive way to fulfill that aim.

Public announcement of fake news

According to Van der Linden et al., fake news can be countered by 'inoculating' the readers with an awareness of a) the fact that fake news (in their case, regarding climate change) is being circulated, and b) what the actual scientific consensus is. I.e. the readers are primed with the 'truth'.

This is relatively easy to generalise: one monitors actual user responses to news, enables fake-news flags and user-based news ratings, monitors the CTR/share rate, and cross-references the information with trusted news sources. If there is a strong indication of fake news, one performs a manual verification and then sends out an announcement to all users.
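A crude sketch of such a generalised check, where every weight, threshold and signal is an assumption chosen purely for illustration:

```python
# Crude sketch: combine reader flags, an abnormal share-to-read ratio and
# disagreement with trusted sources into a single score that triggers manual
# verification. All weights and thresholds are invented for illustration.
def fake_news_score(flags: int, views: int, shares: int,
                    confirmed_by_trusted: bool) -> float:
    flag_rate = flags / max(views, 1)
    share_rate = shares / max(views, 1)
    score = 0.0
    score += min(flag_rate * 50, 0.5)          # many reader flags
    score += 0.3 if share_rate > 0.2 else 0.0  # shared far more than read
    score += 0.4 if not confirmed_by_trusted else 0.0
    return min(score, 1.0)

article = {"flags": 120, "views": 10_000, "shares": 3_500, "confirmed_by_trusted": False}
score = fake_news_score(**article)
if score > 0.7:
    print(f"score {score:.2f}: queue for manual verification and, if confirmed, announce")
```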

No persistent storage of personal information without explicit consent

Without explicit consent the service provider may not store your information on persistent storage media (HDD, SSD, etc.). By default personal data may only be stored in volatile memory (RAM, CPU cache, etc.). This means that by default, due to restarts and power outages, your personal data will be lost, even without a legal limitation on the storage duration. I.e. over time this leads to a natural decay of online personal information.

News on social media should be identifiable and easily amendable

Suppose fake news has been spread that negatively portrays a political candidate. Suppose you are able to centrally remove, replace or edit those ads/articles as far as they are still being displayed. Suppose the social media actors have the legal obligation to remove this information after a notification. Then, upon identifying an article as fake or partially fake, it can be centrally edited. This comes with the large caveat that news can then be edited a-posteriori, potentially also by the benevolent actors.

To mitigate this we need

  • unique identifiers per article that are checked at the client-side: a hash for instance, or with the aid of blockchains,
  • transparent editing: any a-posteriori edits should be visible to the reader.

Of course this is a burden for the social media actors, but it also forces them to crack down on false news content and to improve the quality of news editing.
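A minimal sketch of the first item in the list above: the client recomputes a hash of the article body and compares it with the published identifier, so any silent a-posteriori edit is detectable (the article texts are invented).

```python
# Minimal sketch: the client recomputes the article hash and compares it to the
# published identifier; any mismatch means the text was edited after publication.
import hashlib

def article_id(body: str) -> str:
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

published_body = "Candidate X proposes new housing policy."
published_id = article_id(published_body)               # distributed with the article

# Later, the client fetches the article again and verifies it.
fetched_body = "Candidate X abandons housing policy."   # silently edited
if article_id(fetched_body) != published_id:
    print("Article was edited after publication; show the edit history to the reader.")
```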

Stay away from political news..

The negative effect of recommendation engines on pluralism is best demonstrated by the earlier mentioned Facebook/Trump example. In either case it should be clear that John Stuart Mill was not referring to ideas as an economic good when he was talking about the marketplace of ideas.

Groupthink detection

The creation of filter bubbles is the precursor to groupthink, which may result in extremist views being echoed and amplified by like-minded individuals. Such an escalation can only occur because there is little to no counter-narrative. If such groupthink is detected, the sub-forum should be exposed to dissimilar views, and interaction with these dissimilar opinions should be facilitated.

I understand that this is perhaps not yet technically feasible, but if possible it would avoid the use of censorship by, say, banning such extremist groups, which will only solidify their extremist stance.

Algorithmic transparency

By now I hope it is sufficiently demonstrated that algorithms directly affect our personal lives and even the inner workings of our democracy.

As algorithms are directly or indirectly responsible for these decisions, they themselves should become the subject of scrutiny, which can only be attained if they are transparent.

Well known is the effect of algorithmic changes to Google's ranking engine; these algorithmic changes are a vital but highly unpredictable and non-transparent traffic factor for online SMEs. A significantly lower ranking in search engine listings can cause a major loss in revenue. One of those algorithmic changes involved the activation of a supporting algorithm called RankBrain that basically uses personal information and anonymised search results to 'guess' what the user is searching for. This machine-learning-driven algorithm tends to suggest search results that are more likely to be clicked on, and we already saw that fake news has a higher click-through rate than real news, so yes, this algorithm actually favored fake news articles.

The effect of a Facebook.com algorithmic change on the Guardian's reach, source.

Another example of the impact that an algorithmic change can have is shown above. In a matter of days a Facebook update dramatically reduced online readership for the WSJ, the Washington Post, the Guardian and Mashable.

Throwing tech at it is not enough!

Automatic truth detectors: at the price of freedom of information and freedom of speech? Automatic truth detectors are bound to look at the normalcy of expressions, their rates of adoption and how people respond to them. This invariably leads to an oppression of dissident speech in favor of the status quo. The reason an online automatic truth detector cannot exist is that facts can often only be verified offline. This requires 'boots on the ground' in the form of investigative journalism. We cannot 'machine learn' our way out of this conundrum.

Diversity-increasing recommendation engines? Who controls the dials, and what is diversity? Again, algorithmic transparency is key. But, indeed, this would be a necessary step if we are to continue with the integration of recommendation engines into our lives.

We should keep in mind that the unhindered application of technology facilitated the above issues. The transition away from human-selected centrally produced news articles to machine-selected unfiltered news has overall resulted in a much higher accessibility of information, a much lower threshold to share information and yes a much higher likelihood of being exposed to an abuse of these possibilities.

To counter this abuse we need humans again, for accountability and for a verifiable version of the truth, so cancelling out humans in favor of the demi-God called machine learning is not the best approach, Facebook. What is the best approach? My guess:

A combination of digital and analog journalism with a transparent measure of credibility attached to journalists, news sources and news articles

For now, fact checkers will do. With regard to filtering rich media content for violence, pornography and whatnot, perhaps machine learning should be emphasized more here, for instance to perform automatic blurring, to spare the human reviewers from PTSD.

A coding code of conduct?

Given that programmers and machine learners are instrumental to the infrastructure laid out in the above text, empowering them to say a firm no to their employers will certainly help to create an ethically sound IT core. It is beginning to sink in with the Silicon Valley crowd that there is a thing called ethics. The evangelisation of this strange concept, whereby businesses actually have a responsibility towards society, is pushed by a few people, like Adam Alter, Joe Edelman, Tristan Harris and his organisation Time Well Spent.

Of course, the control that human programmers have will diminish with the increasing penetration of AI in coding and machine learning development.

Hence, we have to act quickly.

We should monitor the monitors

The open-source software OpenWPM was used by Princeton researchers to inventory the use of different types of tracking cookies. Such a tool can be used not only to check what types of cookies are used and whether they comply with regulations; it can also infer what personal information is being logged. This type of research is crucial. Their research demonstrated, for instance, the widespread use of stateless trackers, and at the same time demonstrated the ability to detect these trackers using advanced data analysis techniques. The integration of such technology in privacy tools is crucial for protecting your personal data and enforcing privacy legislation such as the European GDPR.

Strict enforcement of http-header protocols

Public HTTP traffic should not be allowed to carry along arbitrary, unregulated HTTP headers, and this should be hardcoded in the HTTP(S) protocols.

To start with, experimental X- headers in HTTP requests should be banned from live applications, specifically where they can be exploited for tracking.
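What 'banned from live applications' could look like in practice is sketched below: a proxy or middleware that simply drops non-standard X- request headers (such as the carrier-injected X-UIDH tracking header) before they reach the application; the allow-list is an assumption for illustration.

```python
# Small sketch: drop non-standard X- request headers before they reach the
# application, so they cannot be used as tracking side channels.
TRACKING_SAFE_EXCEPTIONS = {"x-requested-with"}   # example allow-list, assumption

def strip_x_headers(headers: dict[str, str]) -> dict[str, str]:
    """Remove X- headers (e.g. the carrier-injected X-UIDH) from a request."""
    return {
        name: value
        for name, value in headers.items()
        if not name.lower().startswith("x-") or name.lower() in TRACKING_SAFE_EXCEPTIONS
    }

incoming = {
    "Host": "example.org",
    "User-Agent": "Mozilla/5.0",
    "X-UIDH": "opaque-tracking-token",
    "X-Requested-With": "XMLHttpRequest",
}
print(strip_x_headers(incoming))
```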

An alternative..opportunities!

The technology should serve us, but how? What if the algorithms were to take care that the behavior they stimulate is in line with our goals?

So, maximisation of..

  • time well spent
  • information that is relevant for pressing societal and environmental issues
  • information that motivates and inspires
  • information that is so dissonant with our opinion that we enrich our knowledge and widen our scope but not so dissonant that it justifies our internal filter bubble
  • consumer behavior in favor of the so-called long tail products, an undelivered promise of recommendation engines
  • ?

Is a personalised communication service aimed at improving your life and protecting your privacy not worth a few investment dollars?

How?

  • Hybrid director steered machine-human recommendation systems
  • À la carte recommendations
  • Diversity maximizing recommendations, both on an individual and on an aggregate level (a sketch follows after this list)
  • User-initiated content enrichment
  • AI augmented/assisted content selection
  • Modular machine learning models
  • User awareness regarding news diversity and their filter bubble
  • diversity/position awareness tools to assist human media editors
  • newsfeed as a service: dedicated newsfeed providers that collaborate with bonafide news agencies and journalists and are solely responsible for ensuring the quality of the articles
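For the diversity-maximising item above, here is a sketch in the style of maximal marginal relevance: each next article is chosen by trading off its relevance against its similarity to what has already been selected. The relevance scores and the similarity matrix are toy values.

```python
# Sketch of diversity-maximising re-ranking in the style of maximal marginal
# relevance: trade off relevance against similarity to what is already shown.
# Relevance scores and the similarity matrix below are toy values.
import numpy as np

relevance = np.array([0.9, 0.85, 0.8, 0.4, 0.35])    # per-article relevance (toy)
similarity = np.array([                               # pairwise topical similarity (toy)
    [1.0, 0.9, 0.8, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.8, 0.9, 1.0, 0.2, 0.1],
    [0.1, 0.1, 0.2, 1.0, 0.3],
    [0.1, 0.1, 0.1, 0.3, 1.0],
])

def diverse_ranking(relevance, similarity, k=3, lam=0.6):
    """Pick k items; lam=1 is pure relevance, lam=0 is pure diversity."""
    selected, candidates = [], list(range(len(relevance)))
    while candidates and len(selected) < k:
        def mmr(i):
            redundancy = max(similarity[i, j] for j in selected) if selected else 0.0
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected

print(diverse_ranking(relevance, similarity))   # mixes the two topic clusters
```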

Let the age of privacy begin!

Concluding

The way that automated information feeds and search rankings are used is damaging to the pluralism of our democratic societies, undermines our right to privacy, takes away people’s control of the online information flow and lays the IT foundation for any government or large corporate actor to monitor and control the population. Furthermore, the increased reliance on algorithmic decision-making (from pricing to information ranking) has created mechanisms that lead to arbitrary censorship on a massive scale, as well as an overall decrease in information accessibility and an increase in automated discrimination.

Mitigation lies in the development of

  • more holistic user-centric algorithms that not only optimize conversion but also take well being and time-well-spent into account, hence recommendations need to move away from the current business-centric approach
  • ranking algorithms that emphasize relevance over popularity
  • transparent, on-demand personalised recommendations that enrich the user experience
  • online personal data protection technology to replace the use of cookies and the reselling of personal data to third parties
  • recommendation engines that maximize diversity and conversion simultaneously
  • non-discriminatory algorithms
  • awareness/education among software developers with regard to privacy-by-design and diversity-by-design principles, technically and ethically
  • legal/moral/technical frameworks that allow a concurrent worldwide development of the above items, for instance absolute metrics for diversity and machine learning algorithms to extract higher level meta-data from content
  • ..and finally, legislation to keep the media companies (yes you too Facebook) in check. For instance, legislation that enforces a clear distinction between actual content and advertisements and that curbs the use of click-bait tactics.

Another matter is the formal responsibility of social media platforms. Not so long ago a company such as Facebook would be considered an information intermediary, relaying information published by third parties, which allowed a fairly hands-off approach. This no longer holds: the large social media platforms are actively censoring, redacting and controlling the content, and in doing so they are in fact media companies that have to comply with the regulations, standards and responsibilities that come with that title. This automatically means that Facebook becomes liable for fake news, defamation and hate speech more easily, especially since the recent Delfi AS v. Estonia ECHR ruling, which basically makes media platforms liable for comments on their news items. Ironically, this will expedite and even necessitate the control of the information transmitted through Facebook even more. The alternative, that Facebook lets go of control, is unlikely due to prior commitments that determine its cashflow. This opens up space for a competitor that is similar to Facebook but allows for anonymous and uncensored debates.