Mona Ascha, MD - propaganda

Service:

Machine Learning

Industry:

Data Science

Year:

2024

View Live:

propaganda

About the project

The project aimed to develop a machine learning model that classifies Russian-language texts into two categories: propaganda and not propaganda. Data were collected from January 2022 to July 2023, and trained reviewers manually labeled a total of 14,011 texts. The dataset included both raw and translated texts.

We conducted extensive exploratory data analysis (EDA) to uncover key text features, such as word and character counts, sentiment distributions, and common words. Propaganda texts were found to be shorter than non-propaganda texts. Sentiment analysis using both lexicon-based and machine-learning approaches showed that "fear" was the most common emotion in propaganda texts.
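The length comparison described above can be sketched in a few lines. This is only an illustration: the project code is not public, the four sample texts are invented toy data, and the `length_stats` helper is a hypothetical name.

```python
from statistics import mean

# Invented toy rows, not project data: (text, label),
# where 1 = propaganda and 0 = not propaganda.
samples = [
    ("Враг у ворот", 1),
    ("Они хотят нас уничтожить", 1),
    ("Сегодня в городе открылась новая библиотека для детей", 0),
    ("Завтра ожидается тёплая погода без осадков", 0),
]

def length_stats(rows):
    """Return mean word and character counts for each label."""
    stats = {}
    for label in sorted({lab for _, lab in rows}):
        texts = [text for text, lab in rows if lab == label]
        stats[label] = {
            # Whitespace splitting is a simplification of real tokenization.
            "mean_words": mean(len(t.split()) for t in texts),
            "mean_chars": mean(len(t) for t in texts),
        }
    return stats

stats = length_stats(samples)
for label, values in stats.items():
    print(label, values)
```

On real data the same per-class summary would usually be paired with a histogram or a statistical test before concluding that one class is shorter.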

Text-processing techniques such as tokenization, lemmatization, and TF-IDF vectorization were applied. Machine learning models including logistic regression, support vector classifiers (SVC), and XGBoost were tested. Preprocessing steps, such as custom stopword removal and tailored TF-IDF tokenization, were optimized to enhance performance. MLflow was used for hyperparameter testing. After rigorous evaluation, the best-performing model was selected based on F1-score, AUC, and overall predictive accuracy.
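As a rough illustration of TF-IDF weighting over unigrams and bigrams with custom stopword removal, here is a minimal standard-library sketch. It is not the project's pipeline: the `ngrams` and `tfidf` helpers, the English toy documents, and the stopword set are all assumptions, lemmatization is omitted, and the smoothed IDF formula is just one common variant.

```python
import math
from collections import Counter

def ngrams(tokens, n_max=2):
    """Yield all n-grams of length 1..n_max as space-joined strings."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def tfidf(docs, stopwords=frozenset(), n_max=2):
    """Return one {term: weight} dict per document."""
    # Lowercase, split on whitespace, and drop custom stopwords.
    tokenized = [
        [tok for tok in doc.lower().split() if tok not in stopwords]
        for doc in docs
    ]
    counts = [Counter(ngrams(tokens, n_max)) for tokens in tokenized]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for c in counts for term in c)
    # Smoothed IDF (an assumed variant): ln((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]

# Invented toy documents, not project data.
docs = ["the enemy is near", "the library is open", "the enemy is the enemy"]
weights = tfidf(docs, stopwords={"the", "is"})
for row in weights:
    print(row)
```

Terms that appear in fewer documents receive a larger IDF, so a rare bigram can outweigh a frequent unigram. The resulting sparse vectors are what a linear classifier such as an SVC would consume.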

Results

The calibrated SVC model with full preprocessing using TF-IDF unigrams and bigrams achieved the best performance:

- F1-Score: 0.874
- AUC (Area Under Curve): 0.928
- Test Set Accuracy: 85.9%

Model Strengths:
- High precision and recall for both classes.
- Significant performance improvement with TF-IDF and bigram inclusion.
- Robust model interpretation and calibration using Brier loss and ROC curves.
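
For readers unfamiliar with the metrics above, the sketch below shows how Brier loss and F1-score are computed from predicted probabilities. The labels and probabilities are invented toy values, not the project's results, and the 0.5 decision threshold is an assumption.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented toy values: 1 = propaganda, 0 = not propaganda.
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1]
# Assumed threshold of 0.5 to turn probabilities into class labels.
y_pred = [int(p >= 0.5) for p in y_prob]

brier = brier_score(y_true, y_prob)
f1 = f1_score(y_true, y_pred)
print(f"Brier loss: {brier:.3f}, F1: {f1:.3f}")
```

A lower Brier loss means the predicted probabilities sit closer to the observed outcomes, which is why it is a natural check on a calibrated classifier, while F1 depends only on the thresholded labels.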

The code for this work is proprietary and unfortunately not available to the public.
