Today’s and, well, tomorrow’s buzzword: Big Data! Its potential for research in the social sciences?

For as long as we care to remember, the world of social sciences has been trying to figure out why humans do what they do, how they do it, what influences them, what will stop them, and what that may mean for the future progress of society. Theories galore have been propounded, studies have been conducted, variables set and attempts made to understand and predict behaviour. Technology has obviously changed our lives — in so many ways. If anything, the last four years have demonstrated how, in utter and pitiless detail, our likes and dislikes, predilections and tendencies, our high points and low points are recorded, filed away and used to predict our behaviour as consumers, as citizens, and even as criminals. Predictive analysis is now a very real and vital part of crime fighting in developed countries, and it’s only a matter of time before it becomes even more widespread. So is this about all the dangers to our privacy and data protection? Sadly, it is not. While I will return to that theme in a bit, this is more about what exciting worlds we can discover within this mammoth load of information.

Effective legal and policy initiatives aimed at criminal or harmful behaviour rely on theories of human behaviour supported by empirical evidence from social studies. Social studies, however, have generally relied on information from a limited sample of people and their responses to a defined set of questions, and have been time-bound. While there are certainly many research studies that track their subjects over periods of 10–20 years and studiously record their observations, there are many that are not tracked as painstakingly and yet are released to the general public with the tag ‘studies have shown that…’. Conclusions of several social studies also suffer from the fact that confounding factors may not be taken into account, and so false correlations may be made. While scientific studies are conducted under pristine and sterilized conditions and controlled for other variables, social studies are almost never conducted in a setting where one can say “all other things being equal”. So what do we do? Do we trudge forward hoping that our conclusions are still valid? Do we shrug and admit that our insights may not be capable of uniform application across a larger group, or a longer time frame? Or do we remain reliant on the trial and error method, with its inevitable costs and consequences?

Using big data analytics to study social behaviour provides an avenue to address these shortcomings. The sheer immensity of the information now available or obtainable makes it possible to conduct larger and more accurate studies of behaviour. Dependent as it is on quantitative research and statistical conclusions, big data analytics reduces the room for subjectivity, and its methods provide a way to avoid mistaking a coincidental correlation between two social variables for causation. In big data processing, multiple scenarios can be simulated to test a hypothesis, which allows for more accurate identification of the variables and factors that do influence a social phenomenon. Let’s look at a recent and rather controversial example.

The Black Lives Matter movement, which originated in the United States and has since fired up young activists across the world, has come under sustained criticism recently. Many arguments, particularly in unmoderated settings, rely on quoting standalone statistics: that fewer black men than white men have been shot by police, that black people killing black people accounts for a far higher number, and so the accusation of racist police brutality doesn’t hold up, etc. Without delving into detail about the legitimacy of these claims, what’s glaringly obvious is what they don’t account for. They don’t, for instance, account for the fact that the numbers have to be adjusted because black people are a minority in these countries, or that the killings which provoked outrage were of unarmed men, or that many of the shootings of white men have been in drug busts, a scenario that regularly bodes ill for those involved. This is not to say that those quoting statistics against a movement purportedly for social equity are always wrong. But quantifying behaviour, or attempting to discern patterns in it, through simplistic analogies and equivalences is a grave error. Multiple, indeed innumerable, factors come into play when studying a social phenomenon, and careless associations can be traumatizingly destructive.
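The population adjustment mentioned above is simple arithmetic, and a small sketch makes the point concrete. The group names and all the figures below are invented purely for illustration (they are not real statistics), and `rate_per_100k` is just a helper name I made up.

```python
def rate_per_100k(events: int, population: int) -> float:
    """Return the number of events per 100,000 people in a group."""
    return events / population * 100_000


# Hypothetical figures, invented for illustration only.
groups = {
    "group_a": {"events": 400, "population": 200_000_000},
    "group_b": {"events": 200, "population": 40_000_000},
}

# Convert raw counts into per-capita rates so the groups are comparable.
rates = {
    name: rate_per_100k(g["events"], g["population"])
    for name, g in groups.items()
}

for name, rate in rates.items():
    print(f"{name}: {groups[name]['events']} events, {rate:.2f} per 100,000")
```

In this made-up case, group_b has half as many events as group_a in absolute terms, yet its per-capita rate is 2.5 times higher. The raw count and the rate tell opposite stories, which is why quoting the count alone proves very little.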

Another example, one that shows this gap more clearly, is research on the sexual behaviour and proclivities of human beings, which could benefit greatly from the application of big data analytics. This is particularly true of studies that purport to reflect the influence of pornography on young people, on sexual violence, and on increased sexual satisfaction among consenting adults.

Studies on the influence of pornography on sexual violence or sexually aggressive behaviour have frequently reached conclusions based on statistics of reported sexual violence during a time frame, relative to the easy availability of, increase in access to, and increase in consumption of pornography. Such studies, regularly conducted in developed countries with moderately egalitarian social norms and effective law enforcement, report that a rise in consumption of pornography has not been correlated with a rise in sexual violence. What is rarely mentioned, however, is whether the following factors could influence the number of reported acts of sexual aggression: changes in handling by law enforcement, changing access to justice, the growing influence of feminism, victories of the gender equality movement and egalitarian attitudes, media scrutiny, or the availability of sex education, all of which might also have had an effect on the numbers. Studies have also failed to take into account that sexual violence is highly under-reported, and can take place in intimate settings where sexual aggression is normalized. Moreover, pornography’s influence on sexual behaviour may manifest itself in other ways, such as greater demands among couples, sexual dissatisfaction, an increase in demands for sexting, and increased street harassment and objectification. These may or may not be indicative of increasing sexual aggression, but their evolution has to be tracked over time, and conclusions about pornography’s influence must be appropriately weighted after considering other social phenomena that may influence sexual behaviour and aggression, positively or negatively. Processing and understanding data pertaining to all of these factors is well beyond the resource capabilities of small research groups studying these subjects, especially if the study is to be large enough to be representative.
This is where, it can be argued, data science and the methods of big data analytics can help eliminate false correlations, flawed conclusions and over-estimations.
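To show what a false correlation looks like in practice, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: `z` stands for some unmeasured background factor, `x` and `y` are two indicators that both depend on `z` but not on each other, and the helper names `pearson` and `partial_corr` are my own. It compares the raw correlation between `x` and `y` with the partial correlation once `z` is controlled for.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


rng = random.Random(0)  # fixed seed so the simulation is repeatable
n = 5000

# z is a hypothetical background factor; x and y each depend on z plus
# independent noise, so they have no direct link to one another.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 0.5) for zi in z]
y = [zi + rng.gauss(0, 0.5) for zi in z]

raw = pearson(x, y)
controlled = partial_corr(x, y, z)

print(f"raw correlation of x and y:       {raw:.3f}")
print(f"correlation after controlling z:  {controlled:.3f}")
```

The raw figure comes out strongly positive even though neither indicator influences the other, and it falls to roughly zero once the shared factor is accounted for. Real social data will rarely be this tidy, and a partial correlation only helps when the confounder has actually been measured, but the mechanism is the same.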

For example, classification and class probability estimation tasks attempt to predict which of a small set of classes a particular individual belongs to. One can immediately imagine, for instance, that this would enable subjects to be preliminarily classified according to frequency of pornography usage. Clustering techniques, which are fairly basic and belong to the same level of data analytics as classification, group individuals in a population by their similarity, without being driven by any specific purpose. Clustering could therefore, for instance, group together all subjects preferring the same source of pornography. Here is where it gets interesting. Co-occurrence grouping, or association rule analysis, finds associations between entities based on transactions involving them. One of the crucial elements missing in studies about pornography, I found, is that they did not account for the fact that the reporting of sex crimes is almost entirely dependent on the reputation and perception of law enforcement. The rate of sexual crimes, and indeed the rate of reporting, is inextricably tied to the public reputation of law enforcement, and any study that seeks to arrive at a reliable conclusion has to weigh the number of offenses relative to the effectiveness of law enforcement in the area generally, their public image, and the rates of other crimes in the neighbourhood. Regression tasks attempt to estimate or predict the numerical value of a variable for an individual, indicating consistencies or patterns in their behaviour. Companies use these tasks to estimate which customer is likely to return for which product. In academic research, these algorithms could indicate how likely a frequent user of pornography, or a user of a certain type of pornography, is to request sexually explicit photographs from partners.
Profiling techniques attempt to characterize the typical behaviour of an individual, group or population, and rely on historical data in order to predict behaviour. Profiling, of course, is one of the more advanced techniques in data analytics, and one repeatedly relied upon in predictive analytics. One could envision, for example, that profiling would accurately estimate the trends in, and the average level of, sexual activity between long-time partners, and the implications of a change in that average.
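Of the techniques described above, co-occurrence grouping is perhaps the easiest to sketch in a few lines. The example below computes the three standard association rule measures (support, confidence and lift) for a single rule. The eight survey records and their attribute labels are entirely made up for illustration, and `rule_metrics` is a hypothetical helper of my own, not a function from any library.

```python
def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    Each record is a set of attribute labels observed together.
    """
    n = len(records)
    has_a = sum(1 for r in records if antecedent in r)
    has_c = sum(1 for r in records if consequent in r)
    has_both = sum(1 for r in records if antecedent in r and consequent in r)

    support = has_both / n                            # share of records with both
    confidence = has_both / has_a if has_a else 0.0   # P(consequent | antecedent)
    lift = confidence / (has_c / n) if has_c else 0.0 # confidence vs. base rate
    return support, confidence, lift


# Invented survey records, for illustration only.
records = [
    {"low_trust_in_police", "incident_unreported"},
    {"low_trust_in_police", "incident_unreported"},
    {"low_trust_in_police", "incident_unreported"},
    {"low_trust_in_police", "incident_reported"},
    {"high_trust_in_police", "incident_reported"},
    {"high_trust_in_police", "incident_reported"},
    {"high_trust_in_police", "incident_reported"},
    {"high_trust_in_police", "incident_unreported"},
]

support, confidence, lift = rule_metrics(
    records, "low_trust_in_police", "incident_unreported"
)
print(f"support={support:.3f} confidence={confidence:.2f} lift={lift:.2f}")
```

On these made-up records the rule has a confidence of 0.75 and a lift of 1.5, meaning non-reporting turns up one and a half times as often among low-trust respondents as in the sample overall. A lift above 1 signals an association worth investigating. It does not, on its own, establish that one thing causes the other.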

Such methods provide for statistical analysis using multiple variables and factors, and enable tracking of various influences. Especially in studies seeking to understand criminal or disruptive behaviour, data science has the potential to be revolutionary and to enable recommendations based on data and not guesswork. By changing the input, adding variables, noting the change in the output and retracing the algorithmic model that the system develops in order to arrive at its output, one can isolate the particular social factor that has the most definitive impact on a specified outcome. Data analytics won’t tell us the answer to our problems, but there is a much larger possibility that it will tell us what the weak links in the chain are, and where the problem lies. It can tell us what trends we can expect, and where resources should be focused. But most of all, it’s becoming very clear that studies which rely on conclusions based on big data analytics are likely to be far more reliable and accurate, and can be based on something much closer to the entire population, rather than a small sample. Hopefully, with better and more honest data, the insights gained can be used to better protect society, and especially the vulnerable in it.
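For readers who want to see what changing the input and noting the change in the output might look like, here is a rough sketch of a one-at-a-time sensitivity check. It rests on assumptions throughout: `toy_model` is a hypothetical stand-in for whatever fitted model a real study would produce, its weights and the factor names are invented, and `sensitivity` is a helper name of my own.

```python
def sensitivity(model, baseline, step=0.1):
    """Nudge each input by `step`, one at a time, and record the output change."""
    base_output = model(baseline)
    effects = {}
    for name, value in baseline.items():
        perturbed = dict(baseline)       # copy so other inputs stay fixed
        perturbed[name] = value + step
        effects[name] = model(perturbed) - base_output
    return effects


def toy_model(features):
    """Hypothetical stand-in for a fitted model; the weights are invented."""
    return (
        2.0 * features["factor_a"]
        + 0.5 * features["factor_b"]
        - 1.0 * features["factor_c"]
    )


baseline = {"factor_a": 1.0, "factor_b": 1.0, "factor_c": 1.0}
effects = sensitivity(toy_model, baseline)

# Rank the factors by the size of their effect, largest first.
ranked = sorted(effects, key=lambda name: abs(effects[name]), reverse=True)

for name in ranked:
    print(f"{name}: change in output {effects[name]:+.3f}")
```

Here factor_a moves the output the most, followed by factor_c and then factor_b. Nudging one input at a time is the crudest version of the idea, since it ignores how factors interact with one another, but it gives a first indication of where the weak links might be.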

Writer. Human Rights Lawyer. Feminist. Tech Geek. Governance Specialist. Fixing the world is not impossible.