Here you’ll find a collection of technical reports, scientific publications and our open-source software and data sets.

Textgain Technical Reports (ISSN 2684-4842)

A technical report series detailing ongoing work at Textgain

MAL NLP Lexicon: Melancholy, Anxiety & Loneliness during lockdown

We have created a new practice-based NLP resource for monitoring mental health on social media, in particular brooding. The resource is currently available for Dutch and captures 2,000+ expressions of anger, fear and sadness, along with various …

4chan & 8chan embeddings (TGTR-1)

We have collected over 30 million messages from the publicly available /pol/ message boards on 4chan and 8chan, and compiled them into a model of toxic language use. The trained word embeddings (±0.4 GB) are released for free …


Publications co-authored by Textgain team members. Some of these are available for download from this site.

Online hatred of women in the forum – Linguistic analysis and automatic detection

Automatic detection of cyberbullying in social media text

While social media offer great communication opportunities, they also increase the vulnerability of young people to threatening situations online. Recent studies report that cyberbullying constitutes a growing problem among youngsters. Successful prevention depends on the adequate detection …

Text-Based Age and Gender Prediction for Online Safety Monitoring

Automatic Detection of Online Jihadist Hate Speech

We have developed a system that automatically detects online jihadist hate speech with over 80% accuracy, by using techniques from Natural Language Processing and Machine Learning. The system is trained on a corpus of 45,000 subversive Twitter …

Multilingual Cross-domain Perspectives on Online Hate Speech


True to our roots, we have several open-source libraries available on GitHub.


Arabic Dialect Identification


GDPR Anonymization Tool


Data sets

Whenever possible, we like to make our data sets available. This is not always possible due to GDPR restrictions, but we share whatever we can.

4chan & 8chan Word Embeddings

Dutch Word Embeddings

Our latest blog posts

Covid-19: Remedies and conspiracies in the French-speaking Twittersphere

Textgain has made it its mission to detect and combat the proliferation of disinformation and polarizing discourse across various web platforms. The current coronavirus crisis is no exception, and a great deal of false information is spreading on networks …

The corona panacea and conspiracy theories: social media in times of crisis

Textgain works through various channels to detect and counter the spread of polarization and online disinformation. Fake news about the current coronavirus crisis is being spread en masse via social media. Our Data Scientist Elizabeth Cappon put together an overview. …

Detect Then ACT (DeTACT): “Taking Direct Action against Online Hate Speech by Turning Bystanders into Upstanders”

Cross-border EU initiative to help counter online hate speech. Antwerp, Belgium, February 24, 2020: Universities, tech companies, NGOs and citizens in Belgium, Germany and the Netherlands, supervised by a board of security, legal and ethics experts, are working together to …
