Catalogue Search | MBRL

Studying user income through language, behaviour and affect in social media

by Lampos, Vasileios , Volkova, Svitlana , Bachrach, Yoram in Affect , Age differences , Artificial intelligence

2015

Automatically inferring user demographics from social media posts is useful for both social science research and a range of downstream applications in marketing and politics. We present the first extensive study where user behaviour on Twitter is used to build a predictive model of income. We apply non-linear methods for regression, i.e. Gaussian Processes, achieving strong correlation between predicted and actual user income. This allows us to shed light on the factors that characterise income on Twitter and analyse their interplay with user emotions and sentiment, perceived psycho-demographics and language use expressed through the topics of their posts. Our analysis uncovers correlations between different feature categories and income, some of which reflect common belief e.g. higher perceived education and intelligence indicates higher earnings, known differences e.g. gender and age differences, however, others show novel findings e.g. higher income users express more fear and anger, whereas lower income users express more of the time emotion and opinions.

Journal Article

Forecasting influenza-like illness dynamics for military populations using neural networks and social media

by Porterfield, Katherine , Volkova, Svitlana , Corley, Courtney D. in 60 APPLIED LIFE SCIENCES , Activity patterns , Artificial intelligence

2017

This work is the first to take advantage of recurrent neural networks to predict influenza-like illness (ILI) dynamics from various linguistic signals extracted from social media data. Unlike other approaches that rely on timeseries analysis of historical ILI data and the state-of-the-art machine learning models, we build and evaluate the predictive power of neural network architectures based on Long Short-Term Memory (LSTM) units capable of nowcasting (predicting in "real-time") and forecasting (predicting the future) ILI dynamics in the 2011 - 2014 influenza seasons. To build our models we integrate information people post in social media e.g., topics, embeddings, word ngrams, stylistic patterns, and communication behavior using hashtags and mentions. We then quantitatively evaluate the predictive power of different social media signals and contrast the performance of the state-of-the-art regression models with neural networks using a diverse set of evaluation metrics. Finally, we combine ILI and social media signals to build a joint neural network model for ILI dynamics prediction. Unlike the majority of the existing work, we specifically focus on developing models for local rather than national ILI surveillance, specifically for military rather than general populations in 26 U.S. and six international locations, and analyze how model performance depends on the amount of social media data available per location. Our approach demonstrates several advantages: (a) Neural network architectures that rely on LSTM units trained on social media data yield the best performance compared to previously used regression models. (b) Previously under-explored language and communication behavior features are more predictive of ILI dynamics than stylistic and topic signals expressed in social media. (c) Neural network models learned exclusively from social media signals yield comparable or better performance to the models learned from ILI historical data, thus, signals from social media can be potentially used to accurately forecast ILI dynamics for the regions where ILI historical data is not available. (d) Neural network models learned from combined ILI and social media signals significantly outperform models that rely solely on ILI historical data, which adds to a great potential of alternative public sources for ILI dynamics prediction. (e) Location-specific models outperform previously used location-independent models e.g., U.S. only. (f) Prediction results significantly vary across geolocations depending on the amount of social media data available and ILI activity patterns. (g) Model performance improves with more tweets available per geo-location e.g., the error gets lower and the Pearson score gets higher for locations with more tweets.

Journal Article

Studying information recurrence, gatekeeping, and the role of communities during internet outages in Venezuela

by Saldanha, Emily , Thomas, Pamela Bilo , Volkova, Svitlana in 639/705 , 639/705/1042 , 639/705/117

2021

Many authoritarian regimes have taken to censoring internet access in order to stop the spread of misinformation, restrict citizens from discussing certain topics, and prevent mobilization, among other reasons. There are several theories about the effectiveness of censorship. Some suggest that censorship will effectively limit the flow of information, whereas others predict that a backlash will form, resulting in ultimately more discussion about the topic. In this work, we analyze the role of communities and gatekeepers during multiple internet outages in Venezuela in January 2019. First, we measure how critical information (e.g., entities and hashtags) spreads during outages focusing on information recurrence and burstiness within and across language and location communities. We discover that information bursts tend to cross both language and location community boundaries rather than being limited to a single community during several outages. Then we identify users who play central roles and propose a novel method to detect gatekeepers—users who prevent critical information from spreading across communities during outages. We show that bilingual and English-speaking users play more central roles compared to Spanish-speaking users, but users inside and outside Venezuela have similar distribution of centrality. Finally, we measure the differences in social network structure before and after each outage event and discuss its effect on how information spreads. We find that with each outage event social connections tend to get less connected with higher mean shortest path, indicating that the effect of censorship makes it harder for information to spread.

Journal Article

Multiple social platforms reveal actionable signals for software vulnerability awareness: A study of GitHub, Twitter and Reddit

by Volkova, Svitlana , Sathanur, Arun , Shrestha, Prasha in Community structure , Computer and Information Sciences , Computer programs

2020

The awareness about software vulnerabilities is crucial to ensure effective cybersecurity practices, the development of high-quality software, and, ultimately, national security. This awareness can be better understood by studying the spread, structure and evolution of software vulnerability discussions across online communities. This work is the first to evaluate and contrast how discussions about software vulnerabilities spread on three social platforms: Twitter, GitHub, and Reddit. Moreover, we measure how user-level e.g., bot or not, and content-level characteristics e.g., vulnerability severity, post subjectivity, targeted operating systems as well as social network topology influence the rate of vulnerability discussion spread. To lay the groundwork, we present a novel fundamental framework for measuring information spread in multiple social platforms that identifies spread mechanisms and observables, units of information, and groups of measurements. We then contrast topologies for three social networks and analyze the effect of the network structure on the way discussions about vulnerabilities spread. We measure the scale and speed of the discussion spread to understand how far and how wide they go, how many users participate, and the duration of their spread. To demonstrate the awareness of more impactful vulnerabilities, a subset of our analysis focuses on vulnerabilities targeted during recent major cyber-attacks and those exploited by advanced persistent threat groups. One of our major findings is that most discussions start on GitHub not only before Twitter and Reddit, but even before a vulnerability is officially published. The severity of a vulnerability contributes to how much it spreads, especially on Twitter. Highly severe vulnerabilities have significantly deeper, broader and more viral discussion threads. When analyzing vulnerabilities in software products we found that different flavors of Linux received the highest discussion volume. We also observe that Twitter discussions started by humans have larger size, breadth, depth, adoption rate, lifetime, and structural virality compared to those started by bots. On Reddit, discussion threads of positive posts are larger, wider, and deeper than negative or neutral posts. We also found that all three networks have high modularity that encourages spread. However, the spread on GitHub is different from other networks, because GitHub is more dense, has stronger community structure and assortativity that enhances information diffusion. We anticipate the results of our analysis to not only increase the understanding of software vulnerability awareness but also inform the existing and new analytical frameworks for simulating information spread e.g., disinformation across multiple social environments online.

Journal Article

Uncovering the relationships between military community health and affects expressed in social media

by Volkova, Svitlana , Harrison, Josh , Corley, Courtney D in 60 APPLIED LIFE SCIENCES , Armed forces , biosurveillance

2017

Military populations present a small, unique community whose mental and physical health impacts the security of the nation. Recent literature has explored social media’s ability to enhance disease surveillance and characterize distinct communities with encouraging results. We present a novel analysis of the relationships between influenza-like illnesses (ILI) clinical data and affects (i.e., emotions and sentiments) extracted from social media around military facilities. Our analyses examine (1) differences in affects expressed by military and control populations, (2) affect changes over time by users, (3) differences in affects expressed during high and low ILI seasons, and (4) correlations and cross-correlations between ILI clinical visits and affects from an unprecedented scale - 171M geo-tagged tweets across 31 global geolocations. Key findings include: Military and control populations differ in the way they express affects in social media over space and time. Control populations express more positive and less negative sentiments and less sadness, fear, disgust, and anger emotions than military. However, affects expressed in social media by both populations within the same area correlate similarly with ILI visits to military health facilities. We have identified potential responsible cofactors leading to location variability, e.g., region or state locale, military service type and/or the ratio of military to civilian populations. For most locations, ILI proportions positively correlate with sadness and neutral sentiment, which are the affects most often expressed during high ILI season. The ILI proportions negatively correlate with fear, disgust, surprise, and positive sentiment. These results are similar to the low ILI season where anger, surprise, and positive sentiment are highest. Finally, cross-correlation analysis shows that most affects lead ILI clinical visits, i.e. are predictive of ILI data, with affect-ILI leading intervals dependent on geolocation and affect type. Overall, information gained in this study exemplifies a usage of social media data to understand the correlation between psychological behavior and health in the military population and the potential for use of social media affects for prediction of ILI cases.

Journal Article

Explaining and predicting human behavior and social dynamics in simulated virtual worlds: reproducibility, generalizability, and robustness of causal discovery methods

by Volkova, Svitlana , Saldanha, Emily , Aksoy, Sinan in Artificial intelligence , Behavior , Causal models

2023

The Ground Truth program was designed to evaluate social science modeling approaches using simulation test beds with ground truth intentionally and systematically embedded to understand and model complex Human Domain systems and their dynamics (Lazer et al., Science 369:1060–1062, 2020). Our multidisciplinary team of data scientists, statisticians, experts in Artificial Intelligence (AI) and visual analytics had a unique role on the program to investigate accuracy, reproducibility, generalizability, and robustness of the state-of-the-art (SOTA) causal structure learning approaches applied to fully observed and sampled simulated data across virtual worlds. In addition, we analyzed the feasibility of using machine learning models to predict future social behavior with and without causal knowledge explicitly embedded. In this paper, we first present our causal modeling approach to discover the causal structure of four virtual worlds produced by the simulation teams: Urban Life, Financial Governance, Disaster and Geopolitical Conflict. Our approach adapts the state-of-the-art causal discovery (including ensemble models), machine learning, data analytics, and visualization techniques to allow a human-machine team to reverse-engineer the true causal relations from sampled and fully observed data. We next present our reproducibility analysis of two research methods teams’ performance using a range of causal discovery models applied to both sampled and fully observed data, and analyze their effectiveness and limitations. We further investigate the generalizability and robustness to sampling of the SOTA causal discovery approaches on additional simulated datasets with known ground truth. Our results reveal the limitations of existing causal modeling approaches when applied to large-scale, noisy, high-dimensional data with unobserved variables and unknown relationships between them. We show that the SOTA causal models explored in our experiments are not designed to take advantage of vast amounts of data and have difficulty recovering ground truth when latent confounders are present; they do not generalize well across simulation scenarios and are not robust to sampling; they are vulnerable to data and modeling assumptions, and therefore, the results are hard to reproduce. Finally, when we outline lessons learned and provide recommendations to improve models for causal discovery and prediction of human social behavior from observational data, we highlight the importance of learning data to knowledge representations or transformations to improve causal discovery and describe the benefit of causal feature selection for predictive and prescriptive modeling.

Journal Article

Density-Guided Response Optimization: Community-Grounded Alignment via Implicit Acceptance Signals

by Volkova, Svitlana , Gerard, Patrick in Alignment , Annotations , Density

2026

Language models deployed in online communities must adapt to norms that vary across social, cultural, and domain-specific contexts. Prior alignment approaches rely on explicit preference supervision or predefined principles, which are effective for well-resourced settings but exclude most online communities -- particularly those without institutional backing, annotation infrastructure, or organized around sensitive topics -- where preference elicitation is costly, ethically fraught, or culturally misaligned. We observe that communities already express preferences implicitly through what content they accept, engage with, and allow to persist. We show that this acceptance behavior induces measurable geometric structure in representation space: accepted responses occupy coherent, high-density regions that reflect community-specific norms, while rejected content falls in sparser or misaligned areas. We operationalize this structure as an implicit preference signal for alignment and introduce density-guided response optimization (DGRO), a method that aligns language models to community norms without requiring explicit preference labels. Using labeled preference data, we demonstrate that local density recovers pairwise community judgments, indicating that geometric structure encodes meaningful preference signal. We then apply DGRO in annotation-scarce settings across diverse communities spanning platform, topic, and language. DGRO-aligned models consistently produce responses preferred by human annotators, domain experts, and model-based judges over supervised and prompt-based baselines. We position DGRO as a practical alignment alternative for communities where explicit preference supervision is unavailable or misaligned with situated practices, and discuss the implications and risks of learning from emergent acceptance behavior.

Paper

Predicting Demographics and Affect in Social Networks

by Volkova, Svitlana in Artificial intelligence , Computer science , Information science

2015

The recent explosion of social media services like Twitter, Facebook and Google+ has led to an interest in predicting hidden information from the large amounts of freely available public content. As compared to the earlier explosion of documents arising from the web, social media content is significantly more personalized – written in the first person, informal, and often revealing of latent attributes of users. The task of inferring latent user properties from social media data has become known as user modeling, personal analytics or user profiling task. Previous approaches treated the task of user attribute prediction as static supervised classification, applied textual features extracted from user tweets and relied on an unrealistic amount of content per user (thousands of tweets). This dissertation relies mainly on Twitter data and focuses on several important but previously unexplored aspects of the task of user attribute prediction: (1) developing novel models and practical techniques that reflect the dynamic streaming nature of social media; (2) studying predictive power and latent relationships between user demographics, emotions and interests in social media; and (3) showing that extra-linguistic features such as user demographics, personality and emotions can improve a variety of downstream applications, e.g., sentiment analysis and attribute-affect specific language modeling. (Abstract shortened by ProQuest.)

Dissertation

Community-Aligned Behavior Under Uncertainty: Evidence of Epistemic Stance Transfer in LLMs

by Chang, Aiden , Volkova, Svitlana , Gerard, Patrick in Large language models , Mimicry , Uncertainty

2025

When large language models (LLMs) are aligned to a specific online community, do they exhibit generalizable behavioral patterns that mirror that community's attitudes and responses to new uncertainty, or are they simply recalling patterns from training data? We introduce a framework to test epistemic stance transfer: targeted deletion of event knowledge, validated with multiple probes, followed by evaluation of whether models still reproduce the community's organic response patterns under ignorance. Using Russian--Ukrainian military discourse and U.S. partisan Twitter data, we find that even after aggressive fact removal, aligned LLMs maintain stable, community-specific behavioral patterns for handling uncertainty. These results provide evidence that alignment encodes structured, generalizable behaviors beyond surface mimicry. Our framework offers a systematic way to detect behavioral biases that persist under ignorance, advancing efforts toward safer and more transparent LLM deployments.

Paper

Identifying Causal Influences on Publication Trends and Behavior: A Case Study of the Computational Linguistics Community

by Volkova, Svitlana , Glenski, Maria in Linguistics , Retirement , Trends

2021

Drawing causal conclusions from observational real-world data is a very much desired but challenging task. In this paper we present mixed-method analyses to investigate causal influences of publication trends and behavior on the adoption, persistence, and retirement of certain research foci -- methodologies, materials, and tasks that are of interest to the computational linguistics (CL) community. Our key findings highlight evidence of the transition to rapidly emerging methodologies in the research community (e.g., adoption of bidirectional LSTMs influencing the retirement of LSTMs), the persistent engagement with trending tasks and techniques (e.g., deep learning, embeddings, generative, and language models), the effect of scientist location from outside the US, e.g., China on propensity of researching languages beyond English, and the potential impact of funding for large-scale research programs. We anticipate this work to provide useful insights about publication trends and behavior and raise the awareness about the potential for causal inference in the computational linguistics and a broader scientific community.

Paper
