World Happiness Report 2023 155

and temporal stability of estimates, including improved measurement resolution across time and space (e.g., county–months). As such, it unlocks the control needed for quasi-experimental designs. However, disadvantages include higher complexity in collecting and analyzing person-level time series data (including the need for higher security and data warehousing). It may also be challenging to collect enough data for higher spatiotemporal resolutions (e.g., resolutions down to the county-day).

Summary and Future Directions

A full methodological toolkit to address biases and provide accurate measurement

Regarding the question of self-presentation biases, while they can lead keyword-based dictionary methods astray (Level 1; as discussed in the section Addressing Social Media Biases), research indicates that these biases have less impact on machine learning algorithms fit to representative samples (Level 2) that consider the entire vocabulary to learn language associations, rather than just considering pre-selected keywords out of context (Fig. 5.5).117 Instead of relying on assumptions about how words relate to well-being (which is perilous due to most words having many senses, and words generally only conveying their full meaning in context),118 Level 2 open-vocabulary and machine-learning methods derive relations between language and well-being statistically. Machine-learning-based social media estimates can show strong agreement with assessments from extra-linguistic sources, such as survey responses, and demonstrate that, at least to machine-learning models, language use is robustly related to well-being.119

Person-level approaches (Gen 2) take large steps towards addressing the problems of the potential influence of social media bots. The person-level aggregation facilitates the reliable identification and removal of bots from the dataset.
This reduces their influence on the estimates.120 Further, the post-stratified person-level-aggregation methods address the problem that selection biases dominate social media analysis. There is an important difference between non-representative data and somebody not being represented in the data “at all” (i.e., every group may be represented, but some are relatively under- or over-represented). Robust post-stratification methods can correct non-representative data towards representativeness (as long as demographic strata are sufficiently represented in the data). Lastly, the digital cohort design (Gen 3) overcomes the shortcomings of data aggregation strategies that rely on random samples of tweets from changing samples of users. Instead, ongoing research shows the possibility of following a well-characterized sample over time and “sampling” from it through unobtrusive social media data collection. This approach opens the door to the toolkit of quasi-experimental methods and to meaningful data linkage with other fine-grained population monitoring efforts in population health.

Limitations: Language evolves in space and time

Regional semantic variation. One challenge of using language across geographic regions and time periods is that words (and their various senses) vary with location and time. Geographic and temporal predictions pose different difficulties. Geographically, some words express subcultural differences (e.g., “jazz” tends to refer to music, but in Utah, it often refers to the Utah Jazz basketball team). Some words are also used in ways that are temporally dependent (e.g., “happy” is frequently invoked in “Happy New Year,” a speech act with high frequency on January 1st, while at other times it may refer to an emotion or an evaluation/judgment, as in “happy about” or “a happy life”). Language use is also demographically dependent (e.g., “sick” means different things among youths and older adults).
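The “Happy New Year” example can be made concrete with a small sketch. It is not the method used in the report. It only illustrates, on invented toy posts, how a context-free keyword count inflates on January 1st and how removing a few fixed greeting phrases changes the count. The phrase list and the function names (`count_keyword`, `count_keyword_excluding_greetings`) are hypothetical choices made for this illustration.

```python
import re

# Hypothetical list of greeting phrases treated as speech acts rather than
# reports of emotion. A real analysis would need a far more careful list.
GREETING_PATTERN = re.compile(
    r"\bhappy\s+(new\s+year|birthday|holidays|anniversary)\b", re.IGNORECASE
)
KEYWORD_PATTERN = re.compile(r"\bhappy\b", re.IGNORECASE)


def count_keyword(posts):
    """Naive dictionary-style count: every occurrence of 'happy' counts."""
    return sum(len(KEYWORD_PATTERN.findall(post)) for post in posts)


def count_keyword_excluding_greetings(posts):
    """Count 'happy' only outside the fixed greeting phrases listed above."""
    total = 0
    for post in posts:
        # Blank out greeting phrases first, then count what remains.
        without_greetings = GREETING_PATTERN.sub(" ", post)
        total += len(KEYWORD_PATTERN.findall(without_greetings))
    return total


# Toy posts, invented for illustration only.
january_first = [
    "Happy New Year everyone!",
    "happy new year!!",
    "So happy about the new job",
]
mid_march = ["I am happy about the results", "Long day at work"]

naive_jan = count_keyword(january_first)
naive_mar = count_keyword(mid_march)
filtered_jan = count_keyword_excluding_greetings(january_first)
filtered_mar = count_keyword_excluding_greetings(mid_march)

print("naive:", naive_jan, naive_mar)
print("greetings excluded:", filtered_jan, filtered_mar)
```

A hand-written exclusion list like this one does not scale to the many senses words can take, which is part of why the text points to data-driven and contextual methods instead.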
While Level 3 approaches (contextual word embeddings) can typically disambiguate word senses, there are also examples where Level 2 methods (data-driven topics) have been successfully used to model regional lexical variation.121 It is important to examine the covariance structure of the most influential words in language models with markers of cultural and socioeconomic gradients.122

Semantic drift (over time). Words in natural languages are also subject to drifts in meaning