A curated list of resources dedicated to Natural Language Processing (NLP) in Polish: models, tools, and datasets.
- SlavicBert - multilingual BERT model. The repository covers Bulgarian, Czech, Polish, and Russian.
- Allegro BERT - not yet published as of December 2019, but a poster is available: https://conference.mlinpl.org/pdf/CfC_AllPosters.pdf
- Word2vec Polish models - http://dsmodels.nlp.ipipan.waw.pl/w2v.html
- FastText Polish model (Facebook) - trained on Common Crawl and Wikipedia
- FastText Polish model
- Word embeddings and language models for Polish (Word2vec, fastText, GloVe, ELMo) - https://github.com/sdadas/polish-nlp-resources
- Polish Word Embeddings Review - an evaluation of Polish word embeddings prepared by various research groups, carried out on a word analogy task - https://github.com/Ermlab/polish-word-embeddings-review
- Computational Linguistics in Poland (CLiP) - http://clip.ipipan.waw.pl/ - the website contains comprehensive information about tools, resources, research centers, and projects related to Polish NLP
- AGH DSP - various projects involving the Polish language, mainly speech - http://www.dsp.agh.edu.pl/pl:research:main
- "Evaluation of Sentence Representations in Polish" - Sławomir Dadas, Michał Perełkiewicz, Rafał Poświata, 2019 - https://arxiv.org/pdf/1910.11834.pdf
- The KLEJ (Kompleksowa Lista Ewaluacji Językowych) benchmark is a set of nine evaluation tasks for Polish language understanding.
- Wroclaw Corpus of Consumer Reviews Sentiment (WCCRS)
- PolEval datasets:
- Hate speech classification - in this task, participants distinguish between normal/non-harmful tweets (class: 0) and tweets that contain any kind of harmful information (class: 1). Harmful content includes cyberbullying, hate speech, and related phenomena. [PolEval 2019 Task6] [Ermlab mirror GDrive]
- Ermlab Opineo dataset - https://github.com/Ermlab/pl-sentiment-analysis - GDrive
- HateSpeech corpus - the current version contains over 2,000 posts crawled from the public Polish web. They represent various types and degrees of offensive language directed at minorities (e.g. ethnic, racial). The data were annotated manually. http://zil.ipipan.waw.pl/HateSpeech
- Polish Speech Corpus (DSP AGH) - 55 hours of annotated Polish speech - http://www.dsp.agh.edu.pl/en:resources:korpusmowy
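
The word analogy evaluation mentioned in the Polish Word Embeddings Review entry above can be illustrated with a minimal sketch. This is not the code used by that review. The toy 2-D vectors and the helper names (`cosine`, `solve_analogy`, `analogy_accuracy`) are invented for illustration, and the vector-offset method shown here is only one common way to score analogies. With real embeddings, the `vectors` dictionary would be filled from one of the models listed above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Return the word d that best completes 'a is to b as c is to d'.

    Uses the vector-offset method: d is the word closest to b - a + c,
    excluding the three question words themselves.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, expected_d) questions answered correctly."""
    correct = sum(1 for a, b, c, d in questions
                  if solve_analogy(vectors, a, b, c) == d)
    return correct / len(questions)


# Hypothetical toy vectors, NOT taken from any real embedding model.
TOY_VECTORS = {
    "król": [1.0, 1.0],        # king
    "królowa": [1.0, -1.0],    # queen
    "mężczyzna": [0.2, 1.0],   # man
    "kobieta": [0.2, -1.0],    # woman
    "dom": [-1.0, 0.1],        # house (distractor)
}

# Each question reads: a is to b as c is to expected_d.
QUESTIONS = [("mężczyzna", "król", "kobieta", "królowa")]

if __name__ == "__main__":
    print(solve_analogy(TOY_VECTORS, "mężczyzna", "król", "kobieta"))
    print(analogy_accuracy(TOY_VECTORS, QUESTIONS))
```

Accuracy over a larger question set gives a single number for comparing embedding models, although the result depends heavily on which analogy questions are used.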
People who contribute to this project.
- Krzysztof Sopyła - https://ksopyla.com LinkedIn