Until 2023, Twitter provided an Application Programming Interface (API) that allowed researchers to access a wide range of data from the platform. It served as a valuable resource for academics, data scientists, and researchers interested in studying social behavior, public opinion, and communication trends. This API no longer exists in the format we used for our research.
To collect diabetes-related tweets, we defined a list of diabetes-related keywords, such as insulin, blood glucose and metformin, along with related hashtags. A more general health-related data collection was also launched in 2020, covering tweets on topics such as diseases (e.g. cancer) and mental health. Thanks to this API, we were able to collect more than a billion tweets containing at least one of these keywords, together with user metadata, to analyze topics of interest, sentiment and emotions.
All collected tweets were stored in a secure database at the Luxembourg Institute of Health.
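As an illustration of the keyword-based selection described above, here is a minimal sketch in Python. It is not the project's collection code: the function names are hypothetical, the keyword list contains only the three terms named in the text, and the hashtag is an assumed example rather than an entry from the actual list.

```python
import re

# Illustrative keywords: the first three are named in the text above.
# The hashtag is an assumed example, not taken from the project's list.
KEYWORDS = ["insulin", "blood glucose", "metformin", "#diabetes"]


def build_pattern(keywords):
    """Compile one case-insensitive pattern matching any keyword as a whole term."""
    parts = [re.escape(k) for k in keywords]
    # The lookarounds prevent matches inside longer words (e.g. "insulins").
    return re.compile(r"(?<!\w)(?:" + "|".join(parts) + r")(?!\w)", re.IGNORECASE)


PATTERN = build_pattern(KEYWORDS)


def matching_keywords(text):
    """Return the sorted, lower-cased keywords found in the text."""
    return sorted({m.group(0).lower() for m in PATTERN.finditer(text)})


def is_relevant(text):
    """True if the text contains at least one keyword."""
    return PATTERN.search(text) is not None
```

A tweet is kept when `is_relevant` returns `True`. Matching whole terms rather than substrings is a design choice made for this sketch, to limit false positives.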
We used Artificial Intelligence (AI) techniques to analyze the tweets. These techniques allowed us to detect patterns in a volume of data that no human could sift through in a reasonable time. Specifically, we used:
Natural language processing (NLP), to interpret the meaning and context of the language used in the tweets. We applied it to sentiment and emotion analysis (to determine the sentiment and emotions expressed in the text), topic modeling (to identify the main themes in a large corpus of text) and keyword extraction (to identify the most important words and phrases within a document).
We used AI models to sort tweets into categories based on their content, for instance to separate personal tweets written by people with diabetes from institutional content and jokes.
We also used these techniques to estimate the type of diabetes a person might have, based on the words they use in their tweets, and to predict the levels of emotions such as anger, fear and sadness in each tweet.
Grouping tweets by theme helped us identify the topics that most concerned people when they tweeted about diabetes.
Using these AI techniques, we identified the challenges and concerns that people with diabetes face every day and gained a better understanding of the burden of diabetes.
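To give a sense of how a text classifier can separate personal tweets from institutional content, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written with the Python standard library only. This is a swapped-in textbook technique chosen for illustration, not the model used in the project. The class name, the labels and the four training sentences are all invented for this example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter(labels)      # label -> number of documents
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)  # smoothed likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy examples, for illustration only (not real tweets).
train_texts = [
    "my blood sugar was high this morning",
    "i forgot my insulin again today",
    "new guidelines published for diabetes care",
    "join our webinar on diabetes research",
]
train_labels = ["personal", "personal", "institutional", "institutional"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("my insulin pump broke this morning"))   # personal
print(model.predict("webinar on new diabetes guidelines"))   # institutional
```

A real system would need far more labeled examples and a held-out evaluation set. The sketch only shows the principle of learning word patterns per category.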
Working with social media data, particularly when it involves sensitive health-related information, comes with substantial ethical responsibilities. Even though tweets are publicly available, we took the privacy implications of using them seriously. We adhered to legal frameworks such as the General Data Protection Regulation (GDPR) in the European Union to protect individuals' information.
Moreover, we consulted the Luxembourg Agency for Research Integrity (LARI), which provided valuable guidelines that we have integrated into our data analysis and handling processes to strengthen data privacy and ethics. Based on these guidelines, we follow a set of best practices: we have the right to use the data but do not own it; we publish only aggregated data and never individual posts, to avoid re-identification; and we refrain from direct social media interactions concerning individuals’ diabetes experiences. We also ensure that all data is stored securely. Finally, our analyses are tailored to the specific aims and objectives of our research, ensuring that the data is not misused or taken out of context.
By following these rules and advice, we aim to do our research in a clear, careful, and ethical way.
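The practice of publishing only aggregated data can be sketched as follows. This is an illustrative example, not the project's pipeline: the function name and record fields are hypothetical, and the minimum group size is an assumed value, since the text above does not state any threshold.

```python
from collections import Counter

# Assumed threshold for this sketch; the project does not state one.
MIN_GROUP_SIZE = 10


def aggregate_counts(records, key, min_group_size=MIN_GROUP_SIZE):
    """Count records per value of `key`, dropping groups that are too small.

    Only counts are returned, never the text of individual posts.
    """
    counts = Counter(record[key] for record in records)
    return {value: n for value, n in counts.items() if n >= min_group_size}


# Hypothetical records: each holds a topic label assigned during analysis.
records = [{"topic": "insulin cost"}] * 12 + [{"topic": "rare topic"}] * 3
print(aggregate_counts(records, "topic"))  # {'insulin cost': 12}
```

Dropping small groups is one common way to lower the risk that a published figure points back to a few identifiable people.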