propaganda

This work develops machine learning models that use natural language processing to classify Russian-language texts as propaganda or not propaganda.

Service: Machine Learning
Industry: Data Science
Year: 2024

About the project

The project aimed to develop a machine learning model to classify Russian-language texts into two categories: propaganda and not propaganda. Data was collected from January 2022 to July 2023, with a total of 14,011 texts manually labeled by trained reviewers. The dataset included raw and translated texts.

We conducted extensive exploratory data analysis (EDA) to characterize the texts, covering word and character counts, sentiment distributions, and the most common words. Texts labeled as propaganda were found to be shorter than non-propaganda texts. Sentiment analysis using both lexicon-based and machine-learning approaches showed that "fear" was the most common emotion in propaganda texts.
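The length and common-word comparisons described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `samples` records, the label encoding (1 = propaganda, 0 = not propaganda), and the helper names `length_stats` and `top_words` are hypothetical and are not taken from the project's code or data.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy records: (text, label), with 1 = propaganda, 0 = not propaganda.
samples = [
    ("враг у ворот", 1),
    ("угроза растет каждый день", 1),
    ("сегодня в городе открылась новая библиотека для детей", 0),
    ("команда выиграла матч и вышла в финал турнира", 0),
]


def length_stats(records):
    """Return mean word and character counts for each label."""
    by_label = {}
    for text, label in records:
        by_label.setdefault(label, []).append(text)
    return {
        label: {
            "mean_words": mean(len(t.split()) for t in texts),
            "mean_chars": mean(len(t) for t in texts),
        }
        for label, texts in by_label.items()
    }


def top_words(records, label, n=3):
    """Return the n most frequent whitespace-separated words for one label."""
    counts = Counter(
        word
        for text, lab in records
        if lab == label
        for word in text.lower().split()
    )
    return counts.most_common(n)


stats = length_stats(samples)
common_in_propaganda = top_words(samples, label=1)
```

On real data the same summaries would normally be computed after tokenization and stopword removal, so that function words do not dominate the frequency lists.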

Text-processing techniques including tokenization, lemmatization, and TF-IDF vectorization were applied. The machine learning models tested included logistic regression, support vector classifiers (SVC), and XGBoost. Preprocessing steps, such as custom stopword removal and tailored TF-IDF tokenization, were tuned to improve performance. MLflow was used to track hyperparameter experiments. After evaluation, the best-performing model was selected based on F1-score, AUC, and overall predictive accuracy.
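A minimal sketch of such a pipeline, assuming scikit-learn, might look like the following. The stopword list, the toy texts, and the `build_pipeline` helper are hypothetical; lemmatization is left out because it needs a Russian morphological analyzer, and logistic regression stands in here for any of the tested classifiers. The project's actual settings are not public.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical custom stopword list; the project's real list is not published.
CUSTOM_STOPWORDS = ["и", "в", "на", "не", "что"]


def build_pipeline():
    """TF-IDF over unigrams and bigrams, followed by a linear classifier."""
    return Pipeline(
        [
            (
                "tfidf",
                TfidfVectorizer(
                    lowercase=True,
                    stop_words=CUSTOM_STOPWORDS,
                    ngram_range=(1, 2),  # unigrams and bigrams
                ),
            ),
            ("clf", LogisticRegression(max_iter=1000)),
        ]
    )


# Hypothetical toy data: 1 = propaganda, 0 = not propaganda.
texts = [
    "враг угрожает нашей стране",
    "враг угрожает нашему народу",
    "страшная угроза нависла над страной",
    "в парке открылась новая выставка",
    "погода на выходных будет теплой",
    "в библиотеке прошла встреча читателей",
]
labels = [1, 1, 1, 0, 0, 0]

model = build_pipeline().fit(texts, labels)
predictions = model.predict(["враг угрожает стране"])
vocabulary = model.named_steps["tfidf"].vocabulary_
```

Keeping the vectorizer and the classifier in one `Pipeline` means the TF-IDF vocabulary is fitted only on training data, which avoids leaking test-set terms during cross-validation or hyperparameter search.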


Results

The calibrated SVC model with full preprocessing using TF-IDF unigrams and bigrams achieved the best performance:

  • F1-Score: 0.874
  • AUC (area under the ROC curve): 0.928
  • Test Set Accuracy: 85.9%
  • Model Strengths:
    • High precision and recall for both classes.
    • Significant performance improvement with TF-IDF and bigram inclusion.
    • Robust model interpretation and calibration using Brier loss and ROC curves.
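To make the metrics above concrete, the sketch below shows one way to calibrate a support vector classifier and compute F1, AUC, and Brier loss with scikit-learn. It rests on several assumptions: `LinearSVC` wrapped in `CalibratedClassifierCV` is a stand-in for the project's calibrated SVC, and synthetic numeric features from `make_classification` take the place of the TF-IDF matrix. The scores it produces have no relation to the results reported here.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for TF-IDF features (assumption: not the project's data).
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# LinearSVC has no predict_proba, so wrap it in a calibrator that maps
# decision scores to probabilities using cross-validation.
calibrated_svc = CalibratedClassifierCV(LinearSVC(), cv=5)
calibrated_svc.fit(X_train, y_train)

proba = calibrated_svc.predict_proba(X_test)[:, 1]  # P(class 1)
pred = calibrated_svc.predict(X_test)

f1 = f1_score(y_test, pred)
auc = roc_auc_score(y_test, proba)
brier = brier_score_loss(y_test, proba)  # lower is better
```

F1 summarizes precision and recall at the decision threshold, AUC measures ranking quality across all thresholds, and Brier loss penalizes probability estimates that are far from the true labels, so the three together cover both classification quality and calibration.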

The code for this work is proprietary and unfortunately not available to the public.
