Publications

Browse through our publications on language processing technology and social media trend detection. We have an extensive library of technical reports and list a few of our featured appearances in the media.

Discover our (scientific) publications

Handbook of Hate Memes

If you ever wondered whether we are secretly ruled by alien reptilian overlords, when the Dark Enlightenment’s acceleration will begin, what the tailless amphibian wildlife in Kekistan looks like, or how to spot an Antifa tank – seek no further! Descend into the bizarre toxicity on 4chan, Telegram, Gab, and mainstream social media platforms we love so much.

Request access

Automatic detection of cyberbullying in social media text

While social media offer great communication opportunities, they also increase the vulnerability of young people to threatening situations online. Recent studies report that cyberbullying constitutes a growing problem among youngsters. Successful prevention depends on the adequate detection of potentially harmful messages and the information overload on the Web requires intelligent systems to identify potential risks automatically. The focus of this paper is on automatic cyberbullying detection in social media text by modelling posts written by bullies, victims, and bystanders of online bullying. We describe the collection and fine-grained annotation of a cyberbullying corpus for English and Dutch and perform a series of binary classification experiments to determine the feasibility of automatic cyberbullying detection. We make use of linear support vector machines exploiting a rich feature set and investigate which information sources contribute the most for the task. Experiments on a hold-out test set reveal promising results for the detection of cyberbullying-related posts. After optimisation of the hyperparameters, the classifier yields an F1 score of 64% and 61% for English and Dutch respectively, and considerably outperforms baseline systems.

Author

Tom De Smedt

Text-Based Age and Gender Prediction for Online Safety Monitoring

This paper explores the capabilities of text-based age and gender prediction geared towards the application of detecting harmful content and conduct on social media. More specifically, we focus on the use case of detecting sexual predators who try to “groom” children online and possibly provide false age and gender information in their user profiles. We perform age and gender classification experiments on a dataset of nearly 380,000 Dutch chat posts from a social network. We evaluate and compare binary age classifiers trained to separate younger and older authors according to different age boundaries and find that macro-averaged F-scores increase when the age boundary is raised. Furthermore, we show that use-case applicable performance levels can be achieved for the classification of minors versus adults, thereby providing a useful component in a cybersecurity monitoring tool for social network moderators.

Authors

Janneke van de Loo
Guy De Pauw
Walter Daelemans

Multilingual Cross-domain Perspectives on Online Hate Speech

In this report, we present a study of eight corpora of online hate speech, by demonstrating the NLP techniques that we used to collect and analyze the jihadist, extremist, racist, and sexist content. Analysis of the multilingual corpora shows that the different contexts share certain characteristics in their hateful rhetoric. To expose the main features, we have focused on text classification, text profiling, keyword and collocation extraction, along with manual annotation and qualitative study.

Authors

Tom De Smedt
Sylvia Jaki
Eduan Kotzé
Leïla Saoud
Maja Gwóźdź
Guy De Pauw
Walter Daelemans

QAnon: Spreading Conspiracy Theories on Twitter

From 1st October to 5th November 2020, Textgain analyzed half a million Twitter messages related to QAnon conspiracy theories, using our Natural Language Processing (NLP) technology. This report outlines the findings of our quantitative analysis, as well as the qualitative analysis by the partners of the Get The Trolls Out! project.

Authors

Tom De Smedt (Textgain)
Verica Rupar (Auckland University of Technology)

Automatic Detection of Online Jihadist Hate Speech

We have developed a system that automatically detects online jihadist hate speech with over 80% accuracy, by using techniques from Natural Language Processing and Machine Learning. The system is trained on a corpus of 45,000 subversive Twitter messages collected from October 2014 to December 2016. We present a qualitative and quantitative analysis of the jihadist rhetoric in the corpus, examine the network of Twitter users, outline the technical procedure used to train the system, and discuss examples of use.

Authors

Tom De Smedt
Guy De Pauw
Pieter Van Ostaeyen

Using a Personality-Profiling Algorithm to Investigate Political Microtargeting

Political advertisers have access to increasingly sophisticated microtargeting techniques. One such technique is tailoring ads to the personality traits of citizens. Questions have been raised about the effectiveness of this political microtargeting (PMT) technique. In two experiments, we investigate the causal effects of personality-congruent political ads. In Study 1, we first assess participants’ extraversion trait by means of their own text data (i.e., by using a personality profiling algorithm), and in a second phase, target them with either a personality-congruent or incongruent political ad. In Study 2, we followed the same protocol, but instead targeted participants with emotionally-charged congruent ads, to establish whether PMT can be effective on an affect-based level. The results show evidence that citizens are more strongly persuaded by political ads that match their own personality traits. These findings feed into relevant and timely contributions to a salient academic and societal debate.

Authors

Brahim Zarouali (University of Amsterdam)
Tom Dobber (University of Amsterdam)
Guy De Pauw (TEXTGAIN)
Claes de Vreese (University of Amsterdam)

Online hatred of women in the Incels.me forum – Linguistic analysis and automatic detection

This paper presents a study of the (now suspended) online discussion forum Incels.me and its users, involuntary celibates or incels, a virtual community of isolated men without a sexual life, who see women as the cause of their problems and often use the forum for misogynistic hate speech and other forms of incitement. Involuntary celibates have attracted media attention and concern, after a killing spree in April 2018 in Toronto, Canada. The aim of this study is to shed light on the group dynamics of the incel community, by applying mixed-methods quantitative and qualitative approaches to analyze how the users of the forum create in-group identity and how they construct major out-groups, particularly women. We investigate the vernacular used by incels, apply automatic profiling techniques to determine who they are, discuss the hate speech posted in the forum, and propose a Deep Learning system that is able to detect instances of misogyny, homophobia, and racism, with approximately 95% accuracy.

Authors

Sylvia Jaki
Tom De Smedt
Maja Gwóźdź
Rudresh Panchal
Alexander Rossa
Guy De Pauw

Discover our technical reports

4chan & 8chan embeddings (TGTR-1)

We have collected over 30 million messages from the publicly available /pol/ message boards on 4chan and 8chan, and compiled them into a model of toxic language use. The trained word embeddings (±0.4GB) are released for free and may be useful for further study on toxic discourse or to boost hate speech detection systems.

Authors

Pierre Voué
Tom De Smedt
Guy De Pauw

MAL NLP Lexicon: Melancholy, Anxiety & Loneliness during lockdown (TGTR-2)

We have created a new practice-based NLP resource for monitoring mental health on social media, in particular brooding. The resource is currently available for Dutch and captures 2,000+ expressions of anger, fear and sadness, along with various fine-grained mental states like despair, disappointment, hope, guilt, loneliness, melancholy, stress, relief and worry.

Authors

Tom De Smedt (TEXTGAIN)
Sofie Mariën (University of Antwerp)
Guy De Pauw (TEXTGAIN)

Profanity & Offensive Words (POW): Multilingual fine-grained lexicons for hate speech (TGTR-3)

The POW lexicons are a steadily growing, interpretable NLP resource for online hate speech detection. They are currently available in English, German, French, Dutch and Hungarian, capturing thousands of verbal expressions of abusive, aggressive, dehumanizing, discriminatory, offensive and toxic language use, and have been field-tested in real-life applications.

Authors

Tom De Smedt (TEXTGAIN)
Pierre Voué (TEXTGAIN)
Sylvia Jaki (UNI HILDESHEIM)
Melina Röttcher
Guy De Pauw (TEXTGAIN)

GeenStijl.nl embeddings (TGTR-4)

We collected over 8M messages from the controversial Dutch websites GeenStijl and Dumpert to train a word embedding model that captures the toxic language representations contained in the dataset. The trained word embeddings (±150MB) are released for free and may be useful for further study on toxic online discourse.

Authors

Pierre Voué
Elizabeth Cappon
Tom De Smedt

Onze Echokamers: likes onder de loep [Our Echo Chambers: Likes Under the Microscope] (TGTR-5)

Textgain investigated how echo chambers can be mapped automatically using public data from the Twitter accounts of news sites, influencers, and politicians. In this article, we describe the current state of affairs in the Dutch-speaking region.

Author

Elizabeth Cappon

The sexist narrative on alternative social media dissected (TGTR-6)

Sexist messages and ideology remain under the radar, and several studies have pointed out that online hate against women is often ignored and downplayed. In February 2021, we collected more than 100,000 posts from online fringe media platforms (4plebs, 4chan and Gab) containing keywords referring to women. On the basis of 25,000 posts containing the word “women”, we attempted to map recurring sexist narratives on fringe media platforms using NLP techniques.

Author

Elizabeth Cappon

Online anti-Semitism across platforms (TGTR-7)

We created a fine-grained AI system for the detection of anti-Semitism. This Explainable AI will identify English and German anti-Semitic expressions of dehumanization, verbal aggression and conspiracies in online social media messages across platforms, to support high-level decision making.

Author

Tom De Smedt

Discover our features

Textgain featured in “Text Analytics APIs 2018”

Textgain is featured in Text Analytics APIs 2018: A Consumer Guide, a 300-page report on the state of the art in text analytics APIs. Robert Dale of the Language Technology Group has compiled a comprehensive report on currently available technologies. A free sample is available.

Textgain featured in “Benchmark studie over artificiële intelligentie” [Benchmark study on artificial intelligence]

PwC published a study of Artificial Intelligence vendors in Flanders. Textgain is featured in this overview as a spin-off of the University of Antwerp.

Hoe wetenschappelijk is Star Wars? [How scientific is Star Wars?]

“The Last Jedi” is the eighth film in the Star Wars saga, setting aside “Rogue One”, which was released last year. But what do scientists think of the series?
