Stories by Zach Scott on Medium

Monte Carlo Without the Math

Zach Scott — Fri, 15 Feb 2019 18:02:38 GMT

Okay, maybe a little math

Monte Carlo Casino and Garden, Monaco

Monte Carlo simulations are extremely common in data science and analytics. They can be used for everything from business process optimization to physics simulation. Unfortunately, the math behind Monte Carlo simulations is often unwieldy and can be intimidating for people without strong math backgrounds. More importantly, the actual implementation of Monte Carlo methods is very difficult to explain succinctly, especially in a meeting with senior leaders. The goal of this article is to explain Monte Carlo simulations using an analogy that is approachable for non-technical readers, without resorting to dense math or code.

Before we discuss Monte Carlo methods themselves, we need to do a little background work to establish a basic statistical framework, and the best way to do that is with a dart board. Darts is a game of skill (or maybe chance if you’re really bad at it). If you have a dart board with values from 1 to 5 and you throw a dart at it, you’re going to earn a certain number of points based on where the dart lands in the dart board.

Circular dart board with labeled point values

If you throw 10 darts at the dart board and add up all the scores, you’re going to have some number of points between 0 and 50. For a casual player who doesn’t play darts very often, this total will probably be around 25. If you’re really bad at darts, the total may be around 10, and if you’re good at darts, it will probably be higher, say 40.
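For readers who do want to see the idea in code, here is a minimal sketch of the 10-dart game. The probability weights for each skill level are illustrative assumptions chosen so the averages land near the 10, 25, and 40 mentioned above; they are not taken from the chart.

```python
import random

SCORES = [0, 1, 2, 3, 4, 5]  # 0 means the dart missed the board

# Hypothetical chances of landing on each score, per skill level.
# These weights are assumptions made for illustration only.
SKILL_WEIGHTS = {
    "beginner": [40, 35, 15, 6, 3, 1],
    "casual": [1, 1, 1, 1, 1, 1],
    "expert": [2, 2, 6, 12, 40, 38],
}

def throw_darts(skill, rng, n_darts=10):
    """Return the total score for n_darts throws at one board."""
    return sum(rng.choices(SCORES, weights=SKILL_WEIGHTS[skill], k=n_darts))

def average_total(skill, n_games=20_000, seed=0):
    """Average the 10-dart total over many simulated games."""
    rng = random.Random(seed)
    return sum(throw_darts(skill, rng) for _ in range(n_games)) / n_games

if __name__ == "__main__":
    for skill in SKILL_WEIGHTS:
        # Averages come out close to 10, 25, and 40 respectively.
        print(skill, round(average_total(skill), 1))
```

Any single game can still land far from these averages; only the long-run average settles down.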

Dart board and chart showing example number of darts landing in each region by player skill level

Now let’s say you’re rich and want to buy out the whole bar so you can have all the dart boards to yourself. It’s not much of a stretch to say that throwing 10 darts at one board is going to be very similar to throwing 1 dart each at 10 identical dart boards. If you score about 25 points throwing 10 darts at one board, you’ll probably also score about 25 points throwing 1 dart each at 10 identical boards.
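The claim above can be checked with a short simulation. This is only a sketch: it assumes a casual player who is equally likely to land on any score from 0 to 5, and the function names are made up for this example.

```python
import random

SCORES = [0, 1, 2, 3, 4, 5]  # 0 means the dart missed the board

# Assumption: a casual player is equally likely to land on any score.
CASUAL_BOARD = [1, 1, 1, 1, 1, 1]

def throw(board_weights, rng):
    """Throw one dart at a board and return the points scored."""
    return rng.choices(SCORES, weights=board_weights)[0]

def ten_darts_one_board(rng):
    """Ten throws at the same board."""
    return sum(throw(CASUAL_BOARD, rng) for _ in range(10))

def one_dart_each_board(rng):
    """One throw at each of ten identical boards."""
    boards = [CASUAL_BOARD] * 10
    return sum(throw(board, rng) for board in boards)

def average_total(game, n_games=20_000, seed=1):
    """Average total score over many simulated games."""
    rng = random.Random(seed)
    return sum(game(rng) for _ in range(n_games)) / n_games

if __name__ == "__main__":
    # Different seeds, so the two games use different random throws.
    print(average_total(ten_darts_one_board, seed=1))
    print(average_total(one_dart_each_board, seed=2))
```

Both averages land close to 25, because each throw follows the same rule no matter which of the identical boards it is aimed at.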

Ten identical dart boards

If you do much work in data science, you can probably see where this is going. Since you’re rich and have bought out the whole bar so you can have a party, you decide that having 10 identical dart boards is kind of boring. You decide to make each dart board a different shape with different points. Because you have different shapes with different points now, it’s unlikely that every board will have the same average score. If one dart board is a quarter of the size of the original, point totals are going to be lower. A board with a giant bullseye is going to have higher score totals.

Ten different boards of various shapes, layouts, and sizes

Now your darts party is really taking shape. The veterans will have the added challenge of more difficult boards, the beginners will have some boards that are easier to help them learn, and some of the boards are so weird that you have no idea what the distribution of scores will be. To add a bit of incentive, you’re going to have a tournament. Every person who attends gets 10 darts and gets to throw one dart at each of your weird boards. They’ll add up their scores from each of the boards, and they’ll submit their total to you.

Score sheets for 7 players including points per board, total score, and total per board

Now it’s much harder to predict the average score. Each board is different, and each person has a lot of variables contributing to their ability to hit any given board. You can’t really say what someone’s total score will be. You might have a gut feeling based on what you know about the person, but that’s all you can have.

Without Monte Carlo simulations, this is the situation many decision-makers find themselves in. They are faced with a very complex system, and they might have some gut feelings about the potential likely outcomes (e.g., better players will have higher scores on a smaller dart board), but they don’t know how to integrate their intuition about individual variables into a complex system. So they go with their gut, maybe with minimal experience related to somewhat similar situations, and it might work out or they might be in for a surprise.

Going back to the game of darts analogy, say your darts party is a huge success. Everyone loved it. There was a party planner there. A representative from some new sports television station was there. They both loved it. They want this game concept to go big, really big. Now your multi-board darts competition goes international. There are hundreds of Monte Carlo darts competitions. Millions of people play, and you have all of their score totals.

Now you’ve got some real data. Based on the scores of millions of people throwing darts at the same 10 boards, you have a pretty good idea what a “good score” is per board. You know what the average score is per board. You know the average and standard deviation for total score. You can predict the basic statistics of the game.

This is the power of Monte Carlo simulations. Each dart board represents a variable, and the values of those variables combine into an aggregate outcome. If you know the general distribution of possible scores for each variable, you can use Monte Carlo simulations to predict aggregate scores across multiple variables.
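As a rough sketch of what this looks like in practice, the code below simulates many players throwing one dart at each of 10 different boards and then summarizes the results. Each board's scoring chances are drawn at random as a stand-in for the weird boards; those weights, the function names, and the player count are all assumptions made for illustration.

```python
import random
import statistics

SCORES = [0, 1, 2, 3, 4, 5]  # 0 means the dart missed the board

def make_boards(rng, n_boards=10):
    """Hypothetical boards: each gets its own random weights over the scores."""
    return [[rng.random() for _ in SCORES] for _ in range(n_boards)]

def play(boards, rng):
    """One player throws one dart at each board; return the score sheet."""
    return [rng.choices(SCORES, weights=w)[0] for w in boards]

def simulate(n_players=50_000, seed=42):
    """Simulate many players and summarize per-board and total scores."""
    rng = random.Random(seed)
    boards = make_boards(rng)
    sheets = [play(boards, rng) for _ in range(n_players)]
    totals = [sum(sheet) for sheet in sheets]
    # zip(*sheets) regroups the scores by board instead of by player.
    per_board_avg = [statistics.fmean(col) for col in zip(*sheets)]
    return per_board_avg, statistics.fmean(totals), statistics.stdev(totals)

if __name__ == "__main__":
    per_board_avg, mean_total, sd_total = simulate()
    print("average score per board:", [round(a, 2) for a in per_board_avg])
    print("average total:", round(mean_total, 2))
    print("standard deviation of total:", round(sd_total, 2))
```

In a real analysis, the random weights would be replaced by whatever is known or assumed about each variable's distribution; the simulation loop itself stays the same.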

Monte Carlo Without the Math was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Digital Feudalism

Zach Scott — Tue, 23 Oct 2018 17:55:27 GMT

How the data ecosystem is becoming medieval

Queen Mary’s Psalter (Ms. Royal 2. B. VII), fol. 78v. Downloaded from Wikipedia.

Unfortunately, our modern digital ecosystem is rapidly organizing as a feudal power structure erected in parallel with our existing power structures and freedoms. In the common conception of feudalism, a relatively small rentier class (lords) uses a variety of subtle and overt political and economic systems to extract value from the daily lives of the producer class (peasants), who generally have little choice or influence over the system. Users participate in platforms, often with only the barest knowledge of the data they surrender, and that data is then used to generate value exclusively for the platform owners. But there are far more parallels between our current digital ecosystem and feudal societies than can be described with broad strokes, and the commonalities are so striking that it could be argued that the digital ecosystem, and especially the social media ecosystem, constitutes a de facto feudal society.

Medieval Times

Before we get started, it’s important to briefly discuss the nature of actual historical feudalism. It wasn’t all, “Knights and Lords and Ladies, huzzah!” as we might think from popular media. That’s broadly referred to as chivalry. More accurately, there were definitely knights and lords and ladies (or whatever terminology a given society used for warrior and noble classes), but they were a very small part of the population, which was mostly serfs (or whatever term the landowners used to describe “the rabble”). To clarify the definitions, chivalry mostly encompasses all the “fun” parts of feudalism we like to see in movies and books. The other half of feudalism (the forced labor and masses of unwashed peasants beholden to a noble lord) is mostly referred to as manorialism.

As might be expected of medieval political systems, most feudal societies are primarily centered around agriculture. The feudal system used to manage agricultural production and rural life in general is referred to as manorialism. In manorialism, a lord controls a certain area of land, which has usually been awarded to him by a more powerful lord or king. Peasants reside on this land, and they are generally politically and economically subject to the lord and his collection of various family members and retainers. More generally, peasants could be subclassified into freemen, who generally either owned their own parcel of land or paid some sort of monetary rent; serfs, who were obligated to work plots of land for their lord as a sort of unfree labor; and slaves, who were considered property and had few legal rights. Although there is still debate about the transition from slavery to serfdom as the dominant form of unfree labor in Europe, it is generally agreed that as slavery became less common in Europe, serfdom replaced it.

In many cases, serfs did have basic rights to things like common grazing land for livestock and small gardens for personal use, so they weren’t technically slaves. However, the serfs were also required to cultivate tracts of the lord’s land as a sort of forced labor or, under more capitalist feudal systems, pay (often exorbitant) taxes. In order to live and produce enough food to sustain themselves, the serfs had to bow to the production requirements of the lord. Similarly, because serf obligations were directly tied to the land that they occupied, any exchange of lands meant that serf obligations also transferred. In many cases, serfs were regarded as simply a productive feature of the land itself, much like a fruit-bearing tree or herd of deer.

A layout of a common manor. Mustard-colored areas are owned by the lord (the demesne) and crosshatched areas are owned by the church (the glebe). William R. Shepherd, Historical Atlas, New York, Henry Holt and Company, 1923. From Wikipedia.

Peasants, whether serfs or otherwise, were also often required to use the fee-based services provided by the lord’s manor. For example, many manors had grain mills that could be used by the peasants to grind their grain products into flour, but these mills also usually required fees or surrender of some part of the flour produced. In a way, these mills represent a sort of local monopoly, in which the only options available to the peasantry were to use the lord’s mill and pay its fees or to use no mill at all. The latter “option”, although notionally possible in some circumstances, would be functionally impossible because flour was one of the few easily stored calorie sources.

Judicially, the lord of the manor was generally in charge of all of his subjects. There were some exceptions (as one might expect from such a diverse and longstanding system as manorialism), and the judicial models of medieval feudalism varied somewhat, but generally, the lord created and maintained a court or series of courts that ruled over petty crimes and civil disputes. These feudal courts also existed to enforce whatever terms and conditions were imposed on the peasants and other tenants of the lord’s lands. Often, there were different rules and regulations governing each different type of occupant on a lord’s lands, and the courts generally managed these various obligations as well as any disputes that arose. Of course, there were also higher courts responsible for more severe crimes, and they were generally controlled by higher echelons of the nobility.

As with most systems of forced labor, swearing obligation and becoming a serf was much easier than escaping the peasantry. Generally, most serfs were locked into their social class. There were some ways to escape serfdom, but they varied based on society and even individual lord. Conversely, people with the misfortune of poverty or catastrophe often had to subject themselves to feudal obligation to survive. This typically involved some sort of ceremony binding the individual to the lord and his manor and potentially some sort of fee to take up residence. As might be expected, serfs usually lacked the mechanisms or resources to fulfill their obligations to the lord of the manor, so their families were perpetually bound to the land and the manor as unfree laborers.

Most manors were also inhabited by some number of freemen who weren’t obligated to work the land via forced labor but did pay taxes, rents, and other dues. These freemen could be considered early versions of what might be called the “lower middle class” today, but the various classes of medieval society are so fluid and convoluted that any direct comparison is difficult. One easy way to think of freemen is as people who were rich enough to rent or buy land but not rich enough to have serfs work it for them. Thus, they had more freedom in terms of the business they engaged in, but they still had to do their own work. This is a gross oversimplification, and entire books have been written about the nature of freedom and freemen in each feudal society, but it gives a decent idea of the types of people who were considered freemen.

Some readers may already have some ideas of where this discussion is going, but it is important to explicitly parallel and contrast the current digital ecosystem with that of manorialism. This discussion will primarily center around social media companies, which are some of the largest and most mature data technology companies in the modern economy. Moreover, more traditional technology companies often seek to emulate social media companies in many of their data practices, so those will be examined as well.

Classes of Digital Feudalism. Lords of the manor: companies who own and manage platforms. Major gentry: companies with business models based on the manor platforms. Minor gentry: “influencers” who drive users to spend more time on the site. Serfs: people who submit to data collection for the right to exist in digital space

The Serfs — Users

The power structures of modern social media and the surrounding data science ecosystem are highly reminiscent of feudal social and economic structure. Except instead of producing crops and livestock, the serfs of this system produce and surrender data in the course of their day-to-day lives, and the lords of this system are tech company executives, whose entire infrastructure is built upon the collection and exploitation of user data.

Before we go any further, it is necessary to address the “voluntary participation” or consent argument of social media. Anyone discussing the equity of the data science and social media ecosystem almost inevitably encounters the argument that participation in social media is voluntary. Although it may seem like a good standard, consent is broken and generally ineffective as a paradigm for modern digital data. Modern approaches to consent (e.g., “We track you, click here to surrender your data to us and we will do whatever we want with it until the end of time.”) can be seen as similar to the way medieval peasants “willingly” obliged themselves to become serfs. Although the situation is less physically brutal in modern society than in medieval society, when nearly all of a society’s personal and professional communication is digital, nonparticipation is voluntary ostracism. Nonparticipation is also nearly impossible in contemporary society, as brick-and-mortar stores give way to online orders, as any interaction between international phones demands exorbitant prices or free digital alternatives, as parties are planned and invitations are sent out via email and Facebook events, and as online job applications become the only job applications.

This formulation doesn’t even broach the fact that for many disabled and differently abled individuals, the digital option is the only option, bringing it even closer to the “submission or death” situation. The internet has given unprecedented freedom to many people who were previously isolated or dependent on caregivers. But with the modern internet, the caveat is always that anything you order or say will be tracked by someone, regardless of how necessary it is for your life.

If we accept that nonparticipation is a false or, at least, extremely burdensome option, then participation in the system is the primary state of being. In terms of participation, one key difference between medieval manorialism and digital feudalism is that no person is obligated to just one platform. Instead, every person has a web of obligations to different digital manors, each of which demands different data in exchange for use. For those who are concerned with privacy, then, the question becomes how much data they are willing to surrender to how many lords and whether their personal and professional life can accommodate those decisions.

As cloud-based services like Office 365 aggressively invade businesses, this decision becomes moot. Traditionally, your time is not your own at work, and your employer has always known your general schedule, output capacity, and work style. But now another third-party company knows all of those things as well, and they can easily pair them with data collected from other companies. Your time and work output might belong to your employer, but now they also do work for a second company, which uses them to do things that don’t belong to your employer. The invasion of cloud-based services into business environments is the corporate mediatisation of digital space, as previously sovereign digital environments, such as local corporate servers, cede the obligations and rights of data storage and use to third parties. In many cases, your employer may be in the exact same situation as you are in terms of choosing to surrender their data. A useful conceptual framework for corporate “cloud service” clients is as knights or minor lords, granted space and access by the lords of a platform and in charge of their own fiefdom of serfs.

One of the key disparities between the societal groups that participate in digital feudalism is the insistence that individual data is functionally worthless but aggregate data is sufficient to drive the valuation of the largest companies on the planet. This disparity (or, if you are feeling charitable, “scaling effect”) is at the heart of digital feudalism, and it is something that traditional serfs did not have to address. It is easy for a serf to appreciate the value of the grain or other crops surrendered to the lord of the manor, and it would be difficult for any lord to reasonably argue that food is worthless, especially during times of famine. There is a reason many revolutions were sparked by droughts, price controls, and other events that restricted food availability. Starvation is a powerful motivator in a way that privacy is not.

Data collection companies exploit these blind spots and insist that a single person’s data has no meaningful value except in aggregate. One could also argue that a single dollar in Facebook’s $600+ billion valuation is functionally worthless compared to the aggregate, so why aren’t they expected to surrender that “worthless” money to the people who generate the bedrock of their value? Of course, that question is intentionally somewhat obtuse, but it illustrates the fundamental power imbalance between the lords of the digital manors and their users. They decide on the value of your data, but they also insist that whatever benefit they provide you is of equal worth (minus whatever profit they extract). In digital manorialism, platform owners are strongly disincentivized from ever estimating the worth of a single person’s data while they gleefully value their aggregate data in astronomical terms.

Polish Gentry 1333–1434, Jan Matejko. From Wikimedia Commons.

The Gentry — Corporate Partners and Power Users

These power imbalances are further exacerbated by the fact that the legal systems of digital and medieval feudalism also have many similarities. The “low courts” of digital manors are whatever reporting systems and staff they choose to use, and the crimes they oversee on their platforms have grown up organically. Harassment is a serious and legitimate crime, but most of the investigative and punitive power over incidents of digital harassment has been ceded to platform owners. You can certainly file a report with the police, but if someone is harassing a person from another state or country, the victim has little recourse beyond those systems established by the lords of a given platform. Copyright complaints often follow a similar pattern with the added layer of extremely stringent legal structures demanding that platform owners implement policies so aggressive that Digital Millennium Copyright Act takedown requests can themselves become tools of harassment.

As might be expected, these systems are often wholly inadequate to defend the rights of the “serfs” and tend to favor the rights of “allied classes” in the digital feudal structure. Media companies pay top dollar to advertise, both overtly and covertly, on social media platforms, so if they complain that someone is using their intellectual property, then platforms will absolutely jump on it. But what if a company steals your art or music without consent? Well, you better have your own social media outlet and a big following because otherwise maybe they’ll get around to it a few weeks from now when their big client has already profited handsomely.

In between massive companies that use social media platforms and their data for their own strategies lies a curious collection of “minor gentry”, or “Influencers.” These are people who gain a variety of perks or income from social media. They are only rarely actual employees of media platforms, but they do get paid (usually via “partnership programs”) for drawing more users into the platform and thus enhancing the data flows of the platform lords. Of course, they are obligated to the platforms that host them, but there is no reciprocal obligation from the platform. Many platforms will gleefully host the most repulsive content if it gains them huge numbers of users, but they will abandon that content the moment anyone starts discussing boycotts or advertisers threaten to leave.

The most successful of the digital gentry even stand to gain their own subplatforms, networks of sites built around their core audience, or lucrative full-time jobs distributed as rewards from platform lords. These subplatforms generally serve to provide a sense of autonomy without any actual freedom from data collection and exploitation. Often this involves perks like the ability to create multiple accounts around a “brand” while most individual users are pushed to only use their true identity, if multiple accounts are even an option. Many platforms expressly prohibit multiple accounts unless, of course, you run a business on their platform. When you have a business (or claim to have a business), your access to data accrued from the platform also increases. But you are never given full access to all data, even the data your subplatforms bring in because the lords of your platform are, of course, “seriously concerned” about the data privacy of their users. Translated cynically, this means that subplatform owners simply aren’t powerful or wealthy enough to warrant full access.

Portrait of Louis XVI, King of France and Navarre (1754–1793). Joseph Duplessis. From Wikipedia.

Lords and Kings — Platform Owners

Finally we come to the much-discussed lords of our digital manorialism. These are the people who own and control massive platforms and all of the data they generate. It is often tempting to discuss ownership in terms of corporate entities or organizational structures, but that neglects the core understanding of this group: they are (mostly white, mostly male) people. They decide the privacy and sales policies of their platforms. They reap the mind-blowing income of user-generated data. They sign off on any new schemes to acquire more data or to be more aggressive in the assertion of their rights as sovereign rulers of their digital domains.

It is important to understand the nature of the power structures erected by digital manorialism, and it is wrong to assume that platform owners are universally malevolent or benevolent. Many attempt to be good people and may see themselves as good people. But they lead corporations, and the singular goal of a corporation is to generate revenue. Because user data is their core value offering, digital lords are forever driven to encroach on user privacy and enhance data collection efforts, lest their power be usurped by someone more willing to maximize data collection. Platform owners forever walk on the edge of balancing new ways to extract data against the risk of users abandoning their exploitative environments. This is much the same as manorial lords who were expected to supply tithes, taxes, and conscriptional levies to their patrons or kings while still ensuring that their own manor prospered. But in digital manorialism, the ultimate power is not the ruler or rulers of a nation, but rather the economic forces of a financial market. Ironically, the organizations that extract value from their users are, in turn, beholden to collective groups of investors, although most of their users will never be wealthy enough to acquire sufficient ownership to defend their own rights.

To legitimize these power imbalances, manorial systems almost always revert to some mechanism of contractual fealty. Historically, serfs submitted or were generationally subject to a contract or agreement, the terms of which were set by their lord. In modernity, the contracts of digital manorialism are interminable User Agreements, many of which include so many clauses and such exploitative terms that they are unenforceable in many jurisdictions. User Agreements are rarely, if ever, presented in simple terms, and they are often so lengthy that they would require multiple days of reading and probably a legal degree to fully understand. Thus, User Agreements represent another piece of the artifice symbolizing user choice.

Data sharing settings are another tool in the lordly arsenal of data collection. Ostensibly, these settings allow users to manually control what data they share with others. However, this is only true in some circumstances, and the maintenance of these settings generally requires constant vigilance. In many cases, during updates or revamps, users are defaulted to the most permissive data sharing settings, and they must manually reset their preferences after every update. The most nefarious of these resets assumes backward assent: if your preferences default to sharing a particular piece of data, then the platform immediately has permission to share all of the historical archive of the data. In this way, “privacy controls” do not prevent storage; they simply restrict the sharing of this information.

Users don’t always even have these illusory choices of assent. As data science grows, so too does a new field of employee quantification. Unfortunately, the Quantified Worker movement is simply Taylorism with a thin veneer of technology smeared over its face and adopted in a society without the labor protections of the past. But the service providers doing the quantification are almost always third-party data platforms looking for new partners to make their data collection strategies mandatory. It is no longer a single employer enacting these policies, but rather multiple powerful employers or even government entities collaborating to maximize the quantification of worker performance. Naturally, the serfs of these digital enterprises are always referred to as “users” or “participants” or some other term implying the ability to opt out, when they are actually more like subjects, whose options are to surrender their data to a third party or lose their job. Predictably, the workers subjected to these mandatory data extraction schemes can be separated into unionized workers who successfully fight off these intrusions and non-unionized workers who see their quality of life and work diminish sharply as they desperately struggle to keep their jobs while performance benchmarks ratchet skyward.

So we are left to wonder what sort of event could spur more serious regulation of user/consumer privacy. You might make a logical argument for something like large security breaches or leaks, but that’s already a daily problem, and no company seems particularly worried about them. They just add the cost of an investigation into their risk management calculations and carry on as usual. What are you going to do, stop using banks because your identity was stolen in the Equifax breach? You may think that the security of your data is extremely important, but to a data company, the only difference between a successful sale and a security breach is how much they get paid for your data.

The Way Forward

It is clear that manorial systems are not compatible with egalitarian and free societies. But the trend toward digital manorialism is relatively young, and we have an opportunity to reverse course before it becomes too entrenched. The more established manorial systems become, the more broken they become, and if they continue unabated, users may find themselves as subjects of a digital Ancien Régime. There are many options to address the issues of digital manorialism without directly attacking the significant innovations and opportunities of data science. If users are generating value for a company, then they should consider unionizing. Most of the barriers to traditional unionization are linked to employment status, but platform owners have been extremely careful to ensure that users are not in any way considered employees. Although various groups have attempted to conduct social media boycotts or blackouts, they were often poorly organized or ad hoc efforts. Continuous, stable user unions would offer significant leverage to break the most egregious exploitations of digital manorialism or even stymie them before they begin. They could also provide leverage for employees subjected to mandatory third-party data collection and could meaningfully lobby for regulation from the political system.

Any sustained equality between platform owners and users requires strong privacy laws and policies, which can only be effectively enacted and enforced by political institutions. In some cases (such as China), the platform owners and political policymakers are closely aligned, if not the same groups, which represents an extremely dangerous paradigm. But in many other cases, the lords of the data manors are distinct from the political apparatus, and the history of digital manorialism is one of political inaction. Many political systems, especially in the United States, have significant gaps in technological knowledge, as evidenced by Mark Zuckerberg’s testimony before Congress after the election interference scandal of 2016. This makes it more difficult for politicians to effectively defend the privacy and rights of their constituents, if they are even inclined to do so. The European Union recently made great strides toward defending user privacy with the General Data Protection Regulation.

The bonus of these types of user-focused political efforts is that they have knock-on effects for users they do not directly protect. Many data companies are global, and it is not feasible for them to institute different protections in each region and remain compliant with GDPR and other regulations, so they frequently adopt blanket policies that comply with the most stringent regulations. This can already be seen in the new notifications and privacy options that appear in platforms operating in the EU, even if users and corporate offices are not located in that jurisdiction. As these laws and policies proliferate across the globe, companies may eventually become more selective in their blanket implementations of various rules, but for now, these are excellent tools not just to protect the citizens of one nation, but of all users of the digital commons.

Because that is what is at stake. The internet’s promise as a global intellectual commons has been partitioned into fiefdoms without input from the users inhabiting that commons. If we can change this trend now, before it becomes too entrenched, we can enhance not only the privacy but also the quality of life of everyone who interacts with digital technology.

Digital Feudalism was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Toward Reproducibility: Balancing Privacy and Publication

Zach Scott — Thu, 31 May 2018 18:31:41 GMT

Can there ever be a Goldilocks option in the conflict between data security and research disclosure?

Photo by Shiva Smith on Pexels.

Data science is an intensely interdisciplinary field that is exploding on both sides of the academic/industry divide (and everywhere in between). This has led to a surprising amount of diversity in terms of practitioner background and philosophies, and the field’s diversity translates into a diversity of outlook and policies in data science and advanced analytics organizations. Some organizations, regardless of whether they are in academia or industry, have a very “academic” flair, with leaders encouraging the development of novel models and methods and then publishing those methods and attending conferences. Conversely, other groups are focused on creating saleable data products on time and to industry standards while under NDAs and other data restrictions without as much focus on theoretical novelty. Most groups strike some level of balance between the two, but finding that balance point often requires careful consideration of technical and organizational needs and goals.

These differences in philosophy and environment have fragmented the data science field into a kaleidoscope of subsets based on data and model availability, which can make translating discoveries between environments difficult. The variety of software licenses, intellectual property protections, and privacy and security safeguards can make replication, much less reproducibility, a significant challenge. However, all enduring professions, no matter how secretive, have one thing in common: the ability for practitioners to share information and best practices with each other in a meaningful and productive way.

The only way to build a strong professional landscape is for data scientists to be able to share problems and solutions with each other because no one ever has all of the answers. This can be extremely difficult in a field in which many cutting-edge models and large data sets have access restrictions. However, other fields have overcome similar challenges, and the sooner data science looks to its predecessors and allied fields for inspiration, the sooner it can develop rigorous standards in terms of information sharing.

But before we can get to that, we have to look at the current state of the field.

GNU, from the Free Software Foundation

FOSStering Collaboration

Free and open-source software (FOSS) is perhaps one of the most liberal approaches to software availability. In this context, open-source means that the code base for a project is openly accessible for inspection, alteration, and distribution, and free (as in “liberated,” rather than “extremely inexpensive”) means that you can do whatever you want with software you have access to. There are a variety of models for FOSS, many of which make somewhat confusing distinctions between what is free and what is open source and whether a given piece of software is one, both, or neither. The easiest way to think of this is to use Richard Stallman’s definition of the four software freedoms: the freedom to run a program however you like, the freedom to study and modify a program, the freedom to redistribute a program, and the freedom to distribute modified copies of a program.

This type of model, or something very similar, is the general default for academic software. Most published academic work is associated with some type of publicly available code repository, with GitHub being the most common. But this model says little about burdens to accessibility, and those have become a common problem in all fields of academia, not just computational ones. In other words, it is one thing to adhere to the rules of the FOSS concept, but it’s another very different thing to adhere to its principles. Factors like publication embargoes, overworked staff, server downtime, IT rules and policies, and even patches and updates can make it difficult to reproduce or replicate past experiments in a meaningful timescale. These accessibility barriers are not always malicious or even intentional, yet they still represent a significant barrier to practice.

A lot of these challenges have been reduced in data science by the drive toward cloud-based computing. Gone are the days in which you have to email a massive zip file. If you’ve been in the field long enough, you probably even remember a time in which you had to burn data or software onto a CD or load it into a hard drive and then mail the whole thing to a collaborator. But despite the growing adoption of cloud-based services, there are still burdens that remain to be addressed. Most importantly, “the cloud” is really just a new term for “other people’s hardware,” and storing data and doing work on someone else’s systems often comes at a cost, whether notional, financial, or both. Ideally, software and models produced by research would all be part of freely accessible GitHub repositories (or similar). And a lot of it is. And yet, GitHub is still a private company, and regardless of how many models or software packages may be available there, publishing data on it or any other platform has its own issues.

Because data science isn’t just about software and code. It’s also about the data used to produce a model from that software, and therein lies the rub. We can’t always use publicly available datasets to answer every data science question, and attempting to do so limits our ability to explore real-life phenomena. And we can’t always make our data public.

Photo by Steve Johnson on Pexels.

Restricted Access

Although FOSS may seem like a rosy ideal that is applicable to most research software, the FOSS concept is not always appropriate for every aspect of data science, especially when it comes to data sharing. Some of the largest repositories of data applicable to our field have access restrictions by necessity. This can be due to a wide variety of reasons, but the most common are privacy and security. These two forces are inextricably linked to the practice of data science, whether we like it or not.

Most of the largest data sets in the world are held by corporations for which data is their primary, if not sole, product or market differentiator. Similarly, one of the most well established areas of data science, medical data science/informatics, has a wide array of data privacy and consent laws, and with good reason. Other growing fields, like operational and manufacturing data science, use data that might reveal critical business practices or even have national security consequences in the case of critical infrastructure.

These challenges can extend even further to model sharing as well, especially as the field of “adversarial data science” grows. Depending on the system, it can be possible to attack machine learning and deep learning tools to extract model information or even infer data set membership if you have access to an API or similar interface. This information may be sufficient for bad actors to cause serious harm. Because of these significant security threats, many groups have a vested interest in minimizing access to their models and data. This is especially important due to the irreversible nature of disclosure. Once a dataset is publicly released, it is nearly impossible to secure it again, and no one knows what that dataset can be used for later.

On the privacy side, there are often very stringent regulations and laws regarding data privacy and availability. HIPAA, a complex set of rules and regulations protecting health information disclosures in the US, is the quintessential example of this growing body of regulations. HIPAA rules are constantly being updated as new trends are uncovered in health information and data science, and many institutions now have extensive offices and training programs dedicated to maintaining HIPAA compliance.

GDPR is a more recent example of general privacy regulations that are becoming broader and more technically aware. As an example of how significant these regulations can be, just think about how many Terms of Service update emails you received as a result of GDPR going into effect. That’s just one regulatory framework in one region.

These rules and regulations are powerful forces to protect the privacy and security of individual citizens and organizations, but their requirements also increase the need for regulatory awareness among data scientists, something that has sometimes been neglected in our young field. It is important for data scientists to understand and accept the reasons for these regulations and find ways to work within their stipulations, lest we risk becoming black sheep. Although it may be tempting to bemoan the burdens of privacy and security on our field, a strong ethical stance that integrates these safeguards as valuable will greatly facilitate our future success as a field.

Photo by Pixabay on Pexels.

Balancing reproducibility, security, and privacy

Many subfields of data science are rapidly hurtling toward serious social and regulatory hurdles. In the past few years, breaches and ethically questionable use cases have cast a negative light on both data science as a field and the nature of the data that many of us work with every day. The “Wild West” times of data science are coming to a close, and that means we need to look beyond the “new algorithm, same data” and “same algorithm, new data” paradigms. Some of our most powerful algorithms have been very publicly exploited in unexpected ways, and the broader revelation of how much personal information is contained within some datasets has made many members of the public uncomfortable.

At this point, data science has reached its “adolescence.” There are still significantly underserved topics in which new methods and new data are desperately needed to solve fundamental problems, but we also need to collectively expand our work to equitably discuss and study data availability and how to balance it with security.

Most data science research is, at its core, human subjects research. Most of us strive to de-identify the subjects of our published research as much as possible, but we must draw careful distinction between security, in which identity retrieval is impossible, and obfuscation, in which identity retrieval is simply annoying. We must also recognize that, in many cases, security measures can easily weaken due to new research and technology that turns strong security measures into weak obfuscation over time.

One promising approach to data security that deserves far more study is differential privacy. Essentially, differential privacy is a way of “doping” raw data with some level of statistical uncertainty before disclosure. However, as most of us know, with sufficiently large data sets or sufficient system access, many types of statistical uncertainty can be modeled and abstracted away. One unreliable data point about one person’s location during one day is likely relatively difficult to verify. A hundred unreliable data points about one person’s location during a hundred consecutive days is far less secure. This is just one example in which a data security measure can become a relatively weak data obfuscation measure. It is not enough to simply inject privacy into data points. We must also consider the lifetime “privacy budget” of a subject or data set, and the privacy budget conundrum of differential privacy has yet to be reliably solved.
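The averaging effect described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular organization implements differential privacy: it uses the Laplace mechanism (one common way to inject noise), a hypothetical one-dimensional "location" that stays the same every day, and made-up values for `sensitivity` and `epsilon`. The helper names `laplace_release` and `total_epsilon` are invented for this example, and the budget calculation uses basic sequential composition, which is a simple worst-case accounting rule rather than the only option.

```python
import numpy as np


def laplace_release(true_value, sensitivity, epsilon, rng):
    """Return true_value plus Laplace noise with scale sensitivity / epsilon."""
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)


def total_epsilon(per_release_epsilon, n_releases):
    """Basic sequential composition: privacy loss adds up across releases."""
    return per_release_epsilon * n_releases


rng = np.random.default_rng(seed=0)

# Hypothetical values chosen only for illustration.
true_location = 42.0  # the same underlying value on every day (an assumption)
sensitivity = 1.0
epsilon = 0.5         # privacy loss spent on each daily release

# One noisy release versus one hundred noisy releases of the same value.
one_day = laplace_release(true_location, sensitivity, epsilon, rng)
hundred_days = np.array(
    [laplace_release(true_location, sensitivity, epsilon, rng) for _ in range(100)]
)

print("error from a single release:", abs(one_day - true_location))
print("error after averaging 100:  ", abs(hundred_days.mean() - true_location))
print("cumulative epsilon spent:   ", total_epsilon(epsilon, 100))
```

Each individual release is just as noisy as the first, but an observer who averages them can usually recover the underlying value far more precisely. The cumulative epsilon is one simple way to express how much of a subject's lifetime privacy budget has been spent.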

Still, at this point, differential privacy is one of the best scalable options we have. As such, differential privacy has been adopted by several major data-based organizations, including Apple, Google, and the US Census, each of which has different approaches to both uncertainty injection and privacy budgeting. None of these approaches is perfect, but these implementations have led to significant research into their strengths and weaknesses.

The US Census is actually at the forefront of data privacy and security in both the technical and the regulatory space. They have even developed an economic model to judge how to manage the balance between privacy and accuracy. Although this approach may not be the optimal solution, it is a major step in the right direction, and methods for assessing and managing privacy budgets and data accuracy will likely continue to evolve.

Unfortunately, although there is significant effort and interest in methods to maintain data privacy during disclosure, there are few available approaches to model or algorithmic privacy in data science. This can make many groups, especially private data companies, reluctant to share their often groundbreaking work in the field. Open source and open access can obviously be effective for disclosure, but for many corporate researchers, they are often only selectively implemented, if they are allowed at all. There is an urgent need for new business (and research) ethics regarding model and methods disclosure, especially when paired with data disclosure. How do we value a new modeling approach? How can we predict novel use cases for our methods? Should we even consider valuation in disclosure decisions? These aren’t easy questions to answer, and they are made infinitely harder by the relatively small subset of data scientists working in this area.

As our field moves forward, this type of “meta data science” work will become increasingly important. Data and model availability is critical to the advancement of the field, but as our work becomes more visible and publicly accessible, we need to constantly assess and implement our policies of ethical disclosure. In data science, we have a field in which multiple ethical models for personal privacy, corporate security, and public good all collide, and it will take significant work to integrate those ethics into a single, unified whole that balances the greatest allowable levels of disclosure with the highest levels of security.

Toward Reproducibility

Toward Reproducibility is an ongoing series of articles discussing current challenges and potential solutions to reproducibility in data science and machine learning. Each article focuses on one of the major factors driving the reproducibility crisis in data science.

Toward Reproducibility: Balancing Privacy and Publication was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Data Science's Reproducibility Crisis

Zach Scott — Thu, 17 May 2018 18:23:27 GMT

Photo credit: Serg from the Swarmrobot.org project via Wikipedia

What is Reproducibility in Data Science and Why Should We Care?

Hot on the heels of Joelle Pineau’s brilliant talk on Reproducibility, Reusability, and Robustness in Deep Reinforcement Learning at this year’s International Conference on Learning Representations (ICLR), it seems like everyone in the data science world (or at least in the data science research world) is talking about replicability and reproducibility.

This problem isn’t unique to data science. In fact, according to a 2016 survey by Nature magazine, most scientific fields are facing a reproducibility crisis. Ironically, according to survey respondents, one of the most important factors driving the reproducibility crisis is insufficient statistical knowledge. This result is likely at least partially influenced by the high number of survey respondents in biology and medicine (906/1500), who often have suboptimal training in relevant statistics. One would hope that data science and machine learning practitioners have a higher general level of statistical training, given the nature of their occupations. However, data science still faces challenges in reproducibility, despite the field’s emphasis on statistics and deterministic modeling, and these challenges often highlight the structural and organizational forces that are driving the reproducibility crisis in most scientific fields.

What is Reproducibility?

Before we even get to addressing reproducibility in data science, we need to start with a firm definition. Chris Drummond argues that many discussions of “reproducibility” are actually centered around “replicability” (some refer to this latter attribute as “repeatability”). In his view, replicability is the ability of another person to produce the same results using the same tools and the same data. In a computational field like data science, this goal is frequently trivial in ways that do not hold for “real-world” research. Anyone can fork an open-access repository and run the exact same code using the same data and get the same result. Laboratory environments are rarely so perfectly replicable, which means experimental replication often involves some low-level perturbation of experimental parameters. Usually, even identically replicating someone else’s laboratory work means ordering raw materials from the same source, reformulating their reagents, finding similar equipment at your institution, and following their methods as closely as the publication allows. As bench scientists say, “I replicated this experiment for my own research, and the method also works in my hands.”

But the fidelity of experimental replication differs between laboratory and computational disciplines. The fidelity of computational replication is generally expected to be incredibly high. If another researcher applies the same code to the same data, it would be expected that any deterministic algorithm would produce the same or very similar results. Essentially, most open source projects meet this replicability requirement, so stopping at this level of experimental reproduction is likely to be trivial for most of the meaningful research in the field. However, despite its triviality, this sort of exercise may still be critically important to serve as a positive control for other practitioners rolling out a new tool or algorithm.

Conversely, in Drummond’s view, reproducibility involves more experimental variation. We can think of experimental reproduction as an activity that exists on a continuum from near-perfect similarity to complete dissimilarity. On the high-fidelity end of the scale, we have a forked project re-executed with no changes. On the other end of the scale, we have the sort of nonsense normally reserved for recipe reviews on cooking blogs. “I didn’t have any flour for this bread recipe, so I substituted ground beef, and it tasted awful!” In this view, experimental replication in a laboratory experiment looks more like reproduction in a computational experiment.
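To make the high-fidelity end of that continuum concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the toy data, the `run_experiment` function (a small bootstrap estimate of a mean), and the seed values are invented for illustration. The first two runs share code, data, and random seed, which corresponds to replication in Drummond's sense. The third run changes only the seed, a very small step toward reproduction.

```python
import random
import statistics


def run_experiment(data, seed):
    """Toy 'experiment': bootstrap estimate of the mean using a seeded RNG."""
    rng = random.Random(seed)
    resample_means = []
    for _ in range(200):
        # Draw a resample of the same size as the data, with replacement.
        resample = [rng.choice(data) for _ in data]
        resample_means.append(statistics.fmean(resample))
    return statistics.fmean(resample_means)


# Hypothetical measurements, used only for illustration.
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8]

original = run_experiment(data, seed=123)
replication = run_experiment(data, seed=123)  # same code, same data, same seed
perturbed = run_experiment(data, seed=456)    # same code and data, new seed

print("replication matches original exactly:", original == replication)
print("original estimate: ", original)
print("perturbed estimate:", perturbed)
```

The exact match is the computational "positive control": it confirms that the tooling works, but says little about the finding itself. The perturbed run is the more informative check, because a result that only holds for one particular seed is closer to an anecdote than to a reliable phenomenon.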

Good reproduction is about finding a middle ground between replication and irrelevance.

Why Should We Care?

Reproducible experiments are the foundation of every scientific field and, indeed, even the scientific method itself. Karl Popper said it best in The Logic of Scientific Discovery: “non-reproducible single occurrences are of no significance to science.” If you’re the only person in the world who can achieve a particular result, others may find it difficult to trust you, especially if they have spent time and effort attempting to reproduce your work. It is reckless and irresponsible to build a product or theory on a singular unconfirmed anecdote, and if you present an anecdote as a reliable phenomenon, it can consume time and resources that would otherwise be spent on actual productive work.

Irreproducibility isn’t always malicious or even willful, but it is rarely positive in a scientific field. The effectiveness of scientific contributions lies in their usefulness as a tool or perspective for others to apply to their own problems. We admire researchers who solve problems that we have found intractable or who produce tools to address a dilemma we have struggled with. And as scientists, we should strive to produce tools and ideas that help others accomplish their own goals. In doing so, we (hopefully) enrich our own success and professional standing.

If our standards of reproducibility are lacking or if we fall into the trap of implementing the talismans of reproducibility without regard for their true purpose, we risk wasting our own, and everyone else’s, time. Science is about continuity of thought beyond a single practitioner. When we leave, for whatever reason, someone else should be able to pick up where we left off and continue producing new knowledge. Colleagues should be able to implement our ideas without us hovering over their shoulders.

Science is a way of exerting our unique experiences and interests on the world in a way that can help someone else in their own experiences and interests. We can’t always foresee how our new knowledge applied to our own interests may help someone else, nor do we need to. We only need to do our best work to solve the problems we’re interested in with reliable methods. Knowledge gained in ways that can’t be reproduced helps no one and lacks the potential to ever do so. So without reproducible practices, we are simply wasting our own and everyone else’s time.

Photo credit: rawpixel.com on Pexels.

Barriers to Data Science Reproducibility

Now that we have a basic framework for what reproducibility is and why it matters, we can start talking about how we can work to fix it. There are several barriers driving the reproducibility crisis in data science, and some of them will be very difficult, if not impossible, to solve. Common laments include data and model availability, infrastructure, publication pressure, and industry standards, as well as a host of other less frequently discussed issues. Almost all of these issues have multiple diverse drivers, each of which requires its own solution. Because we’re data scientists talking about nebulous and complex concepts, it can help to do one of our favorite tasks: classification.

Most problems have both “hard” and “soft” factors driving them. Hard drivers represent insurmountable barriers to execution. The availability of suitable infrastructure is a good example of this. Sometimes you just don’t have enough storage or GPUs available to reproduce someone else’s work. Maybe you can’t access clinical or commercial data because you can’t get permission to do so.

Soft challenges, on the other hand, represent the class of problems in which there is a notional solution but industry or professional pressures prevent you from pursuing it. The quintessential example of this would be the academic practitioner who really would like to reproduce someone else’s work, but can’t justify spending the time to work on something that journals wouldn’t be interested in publishing.

In many cases, addressing the reproducibility challenges facing data scientists requires a nuanced understanding of multiple disparate fields. Most of these problems won’t be solved with a single rule or policy, so sometimes the best solution available is to just start discussing ways we can improve the practice of data science and related analytical fields. As this series continues, I hope to take a deep dive into each of the biggest challenges affecting the reproducibility crisis in data science and discuss potential solutions that we, as a new and unique industry, can take to address these issues.

Data Science's Reproducibility Crisis was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Death in the Data Science Age

Zach Scott — Fri, 11 May 2018 16:03:19 GMT

The data of the dead still hangs around. What is it going to be used for?
