Big data – which has become quite the buzzword these days – refers to automatically generated information about people’s behaviour. It is called “big” because a single dataset can easily include millions of observations.
Unlike traditional surveys, which are based on explicit questions, big data is created whenever people take certain actions while using an online service or system. With every click, users of Facebook, Twitter and other social media leave behind digital traces of themselves that can be used by businesses, governments and various other groups that rely on “big data”.
But while the information derived from social media networks can certainly shed some light on mass-scale social trends, some analyses based on this method of data collection are prone to biases from the get-go.
This is the conclusion of a new study from Northwestern University, titled “Is Bigger Always Better? Potential Biases of Big Data Derived from Social Network Sites”.
Published in the journal The Annals of the American Academy of Political and Social Science, the study points out that since people don’t join social media outlets randomly, the data generated by analysing their online behaviour is potentially biased in terms of demographics, socioeconomic background or Internet skills.
“The problem is that the only people whose behaviours and opinions are represented are those who decided to join the site in the first place,” said study author Eszter Hargittai, the April McClain-Delaney and John Delaney Professor in the School of Communication. “If people are analysing big data to answer certain questions, they may be leaving out entire groups of people and their voices.”
For this particular study, the Web Use Project – Hargittai’s research group that focuses on how differences in Internet use contribute to social inequality – looked at a type of big data analysis that draws wide-ranging conclusions based on data obtained from users of particular sites and services.
While there have already been a number of studies documenting the challenges of big data research, this is the first one to provide empirical evidence of potential biases.
Hargittai used two datasets – a nationally representative sample from the Pew Internet Project and her own data collected from young, educated, wired adults.
It turns out that age, gender, financial status and Internet skills all help determine which sites and services people choose to use.
“The less privileged are not on these sites [Facebook, Twitter and the like], so their opinions are not there either,” she said. “Even among young adults who are generally thought of as the most active on social network sites, we see socioeconomic differences when it comes to Twitter and Tumblr. We also see gender and skill differences on who is on what site.”
The paper concludes with some preliminary advice on study design aimed at avoiding biases, and suggests supplementing big data with other sources of information to decrease the negative impacts of leaning too hard on social networking sites alone.