World Happiness Report 2023

Generations of sampling and data aggregation methods

The following methodological review is organized by generations of data aggregation methods (Gen 1, 2, and 3), which we observed to be the primary methodological choice when working with social media data. But within these generations, the most important distinction in terms of reliability is the transition from dictionary-based (word-level) Level 1 approaches to those relying on machine learning to train language models (Level 2) and beyond.

Gen 1: Random Samples of Social Media Posts

Initially, a prototypical example of analyzing social media language for population assessments involved simply aggregating posts geographically or temporally – e.g., a random sample of tweets from the U.S. for a given day. In this approach, the aggregation of language is carried out based on a naive sampling of posts – without taking into account the people writing them (see Fig. 5.3). The language analysis was typically done using a Level 1 closed-vocabulary approach – for example, the LIWC positive emotion dictionary was applied to word counts. Later, Level 2 approaches have been used with random samples of tweets, such as open-vocabulary approaches based on

Box 5.1: Effects of bots on social media measurement

On social media, bots are accounts that automatically generate content, such as for marketing purposes, political messages, and misinformation (fake news).
Recent estimates suggest that 8–18% of Twitter accounts are bots [55] and that these accounts tend to stay active for between 6 months and 2.5 years. [56] Historically, bots were used to spread unsolicited content or malware, inflate follower numbers, and generate content via retweets. [57] More recently, bots have been found to play a large part in spreading information from low-credibility sources; for example, by targeting individuals with many followers through mentions and replies. [58] More sophisticated bots, namely social spambots, are now interacting with and mimicking humans while evading standard detection techniques. [59] There is concern that the growing sophistication of generative language models (such as GPT) may lead to a new generation of bots that are increasingly hard to distinguish from human users.

How bots impact measurement of well-being using social media

The content generated by bots should not, of course, influence the assessment of human well-being. While bots compose fewer original tweets than humans, they have been shown to express sentiment and happiness patterns that differ from the human population. [60] Applying the person-level aggregation (Gen 2) technique effectively limits the bot problem, since all of a bot's generated content is aggregated into a single "data point." Additional heuristics, such as removing retweets, should further reduce the bot problem by removing content from retweet bots. Finally, work has shown that bots exhibit extremely average human-like characteristics, such as estimated age and gender. [61] Thus, applying post-stratification techniques downweights bots in the aggregation process, since accounts with average demographics will be over-represented in the sample. With modern machine learning systems, bots can be detected and removed. [62]
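
The contrast between post-level (Gen 1) and person-level (Gen 2) aggregation can be illustrated with a minimal Python sketch. It is not the method used in the studies cited here: the word list `POSITIVE_WORDS` is a hypothetical stand-in for a positive-emotion dictionary (it is not the LIWC lexicon), and the post fields `user`, `text`, and `is_retweet`, as well as the toy data, are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical stand-in for a positive-emotion dictionary (NOT the LIWC lexicon).
POSITIVE_WORDS = {"happy", "great", "love", "good"}


def positive_share(text):
    """Fraction of whitespace-separated tokens found in POSITIVE_WORDS."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in POSITIVE_WORDS for t in tokens) / len(tokens)


def post_level_mean(posts):
    """Gen 1 style: average over posts, ignoring who wrote them."""
    scores = [positive_share(p["text"]) for p in posts]
    return sum(scores) / len(scores)


def person_level_mean(posts, drop_retweets=True):
    """Gen 2 style: average within each account, then across accounts."""
    by_user = defaultdict(list)
    for p in posts:
        if drop_retweets and p.get("is_retweet", False):
            continue  # heuristic from the text: discard retweeted content
        by_user[p["user"]].append(positive_share(p["text"]))
    if not by_user:
        raise ValueError("no posts left after filtering")
    user_means = [sum(s) / len(s) for s in by_user.values()]
    return sum(user_means) / len(user_means)


# Toy data (invented): three human accounts and one high-volume bot.
POSTS = [
    {"user": "alice", "text": "bad day"},
    {"user": "bob", "text": "good day"},
    {"user": "carol", "text": "long commute"},
] + [{"user": "bot", "text": "love love"} for _ in range(7)]

print(post_level_mean(POSTS))    # the bot contributes 7 of the 10 data points
print(person_level_mean(POSTS))  # the bot contributes 1 of the 4 data points
```

In this toy sample the bot accounts for seven of the ten posts but only one of the four accounts, so its influence on the person-level estimate is much smaller than on the post-level estimate. Post-stratification and bot detection, which the text also mentions, are not shown here.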