Ethical and technical challenges of AI in tackling hate speech


  • Diogo Cortiz Pontifícia Universidade Católica de São Paulo (PUC-SP)
  • Arkaitz Zubiaga Queen Mary University of London (QMUL)



Artificial Intelligence, Ethics, Online Harms, Hate Speech, Bias


In this paper, we discuss some of the ethical and technical challenges of using Artificial Intelligence for online content moderation. As a case study, we use an AI model developed to detect hate speech on social networks, a concept for which varying definitions are given in the scientific literature and consensus is lacking. We argue that while AI can play a central role in dealing with information overload on social media, it risks violating freedom of expression if the project is not well conducted. We present ethical and technical challenges that arise throughout the pipeline of an AI project, from data collection to model evaluation, and that hinder the large-scale use of hate speech detection algorithms. Finally, we argue that AI can assist with the detection of hate speech on social media, provided that the final judgment about the content is made through a process with human involvement.


Angwin, J., Larson, J., Mattu, S., & Kirchner, L. Machine Bias. ProPublica, 2016.

Barker, K., & Jurasz, O. Online Harms White Paper Consultation Response. Stirling Law School & The Open University Law School, 2019.

Beadle, S. How does the Internet facilitate radicalization? London, England: War Studies Department, King’s College, 2017.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., & Amodei, D. Language Models are Few-Shot Learners, 2020.

Cortiz, D. O Design pode ajudar na construção de Inteligência Artificial humanística?, p. 14-22. In: 17º Congresso Internacional de Ergonomia e Usabilidade de Interfaces Humano-Tecnologia e o 17º Congresso Internacional de Ergonomia e Usabilidade de Interfaces e Interação Humano-Computador. São Paulo: Blucher, 2019. ISSN 2318-6968, DOI 10.5151/ergodesign2019-1.02.

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2018.

Ellison, N. B., & Boyd, D. M. Sociality Through Social Network Sites (W. H. Dutton (ed.); Vol. 1). Oxford University Press, 2013.

Facebook. Community Standards: Objectionable Content, 2020.

Fleiss, J. Measuring nominal scale agreement among many raters. Psychological bulletin, 76(5):378, 1971.

Fortuna, P., Rocha da Silva, J., Soler-Company, J., Wanner, L., & Nunes, S. A Hierarchically-Labeled Portuguese Hate Speech Dataset. Proceedings of the Third Workshop on Abusive Language Online, 94–104, 2019.

Harbinja, E., et al. Online Harms White Paper: Consultation Response. BILETA Response to the UK Government Consultation 'Online Harms White Paper', 2019.

Gagliardone, I., Gal, D., Alves, T., & Martinez, G. Countering Online Hate Speech. UNESCO, 2015.

Kokhlikyan, N., Miglani, V., Martin, M., Wang, E., Alsallakh, B., Reynolds, J., Melnikov, A., Kliushkina, N., Araya, C., Yan, S., & Reblitz-Richardson, O. Captum: A unified and generic model interpretability library for PyTorch, 2020.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019.

Mozafari, M., Farahbakhsh, R., & Crespi, N. Hate speech detection and racial bias mitigation in social media based on BERT model. PLOS ONE, 15(8), e0237861, 2020.

MIT Technology Review. 10 Breakthrough Technologies, 2020.

Nash, V. Revise and resubmit? Reviewing the 2019 Online Harms White Paper. Journal of Media Law, 11(1), 18–27, 2019.

Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. Dissecting racial bias in an algorithm used to manage the health of populations. Science, Vol 366 (6464), 447–453, 2019.

Pari, C., Nunes, G., & Gomes, J. Avaliação de técnicas de word embedding na tarefa de detecção de discurso de ódio. In: Encontro Nacional de Inteligência Artificial e Computacional (ENIAC), 16, 2019, Salvador. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, p. 1020-103, 2019.

Sap, M., Card, D., Gabriel, S., Choi, Y., & Smith, N. A. The Risk of Racial Bias in Hate Speech Detection. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1668–1678, 2019.

Suler, J. The Online Disinhibition Effect. CyberPsychology & Behavior, 7(3), 321–326, 2004.

Sun C., Qiu X., Xu Y., Huang X. How to Fine-Tune BERT for Text Classification?. In: Sun M., Huang X., Ji H., Liu Z., Liu Y. (eds) Chinese Computational Linguistics. CCL 2019. Lecture Notes in Computer Science, vol 11856. Springer, 2019. Cham.

YouTube. Hate speech policy, 2020.

Twitter. Hateful conduct policy, 2020.

Zuckerberg, M. Mark Zuckerberg Stands for Voice and Free Expression, 2019.




How to Cite

Cortiz, Diogo, and Arkaitz Zubiaga. 2021. “Ethical and Technical Challenges of AI in Tackling Hate Speech”. The International Review of Information Ethics 29 (March). Edmonton, Canada.