I tried out three different classifiers to find the best-performing model: regularised regression, a random forest, and a support vector machine. It turned out that lasso regression performed best in my case, offering the best trade-off between accuracy and computational cost (for fairness, we had to run the models on our personal computers). I used grid search to optimise lambda, the regularisation strength of the lasso/ridge models.
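Since I cannot share the assignment code itself, here is a generic sketch of what such a lambda search looks like in scikit-learn; the toy data and grid values are my own illustrations, and scikit-learn parameterises the penalty as C = 1/lambda:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy data standing in for the assignment's comment features.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# L1-penalised (lasso-style) logistic regression; the regularisation
# strength is tuned via C = 1/lambda, so the grid runs over C.
grid = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```

Cross-validated grid search like this picks the penalty strength that generalises best rather than the one that merely fits the training data.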
As we used a bag-of-words representation, I also used grid search over the text-preprocessing choices (for example term weighting and the removal of stopwords, punctuation, and whitespace). Furthermore, I added a few new features, such as measures of latent concepts like "anger" or "disgust" and of lexical complexity. This resulted in my final model, which was the third best in my class and only marginally less accurate than the one presented by our professor.
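Treating preprocessing choices as hyperparameters can be sketched with a scikit-learn pipeline, where the vectoriser's options join the same grid as the classifier's; the documents, labels, and grid below are illustrative stand-ins, not the assignment's data:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy comments and binary labels standing in for the real data.
docs = ["this is fine", "utterly awful comment", "great stuff here",
        "terrible and rude", "perfectly pleasant", "nasty remark again"]
labels = [0, 1, 0, 1, 0, 1]

pipe = Pipeline([
    ("vec", TfidfVectorizer()),
    ("clf", LogisticRegression(penalty="l1", solver="liblinear")),
])

# Preprocessing choices become tunable parameters of the "vec" step.
param_grid = {
    "vec__use_idf": [True, False],         # tf-idf weighting vs raw tf
    "vec__stop_words": [None, "english"],  # keep or drop stopwords
    "vec__lowercase": [True, False],
}
grid = GridSearchCV(pipe, param_grid, cv=2)
grid.fit(docs, labels)
print(grid.best_params_)
```

The advantage of putting the vectoriser inside the pipeline is that each preprocessing variant is cross-validated together with the classifier, so the comparison between variants is fair.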
Unfortunately, I am prohibited from sharing my code for this assignment, as the school might want to reuse it in the future.
I was intrigued by the assignment, but a bit frustrated by the rather small dataset, which restricted the choice of model (it is not large enough for a neural network, for instance). Thus, I decided to download the full dataset and train a neural network on it, hoping to improve the classification. The original dataset contains 150'000 labelled comments and additional classes, such as threat or identity hate.
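A minimal sketch of that direction, again with illustrative toy data of my own: since each comment can carry several labels at once (e.g. both threat and identity hate), the problem is multi-label, which scikit-learn's `MLPClassifier` handles when given a binary indicator matrix as the target:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

# Toy stand-ins for the full comment dataset; the two label columns
# could represent classes such as "threat" and "identity hate".
docs = ["you are awful", "have a nice day", "i will hurt you",
        "lovely weather", "stupid idiot", "thanks a lot"]
Y = np.array([[1, 0], [0, 0], [1, 1], [0, 0], [1, 0], [0, 0]])

X = TfidfVectorizer().fit_transform(docs)

# A small feed-forward network; with 150'000 comments a deeper or
# embedding-based model would be the natural next step.
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(X, Y)
print(clf.predict(X).shape)  # one prediction per comment and per class
```

This is only a starting point; the larger dataset is what makes richer architectures worth trying at all.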