[Resource Introduction]
SANAD: Single-Label Arabic News Articles Dataset for Automatic Text Categorization - Mendeley Data
The SANAD dataset is a large collection of Arabic news articles that can be used in different Arabic NLP tasks such as text classification and word embedding. The articles were collected using Python scripts written specifically for three popular news websites: AlKhaleej, AlArabiya and Akhbarona.
Each dataset has seven categories [Culture, Finance, Medical, Politics, Religion, Sports and Tech], except AlArabiya, which has no [Religion] category. SANAD contains more than 190k articles in total.
How to use it:
- Unzip compressed resources.
- Each folder contains 6 or 7 sub-folders, each named after its category.
- Each sub-folder contains a set of article files corresponding to its category.
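The layout described above (one sub-folder per category, one file per article) can be read with a short loop. The following is a minimal sketch, not the authors' own loader: the function name `iter_labeled_articles` is hypothetical, and it assumes the article files are plain text encoded as UTF-8.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_labeled_articles(dataset_dir) -> Iterator[Tuple[str, str]]:
    """Yield (label, text) pairs, taking the label from the sub-folder name."""
    root = Path(dataset_dir)
    # Each immediate sub-folder is treated as one category.
    for category_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for article_path in sorted(p for p in category_dir.iterdir() if p.is_file()):
            # Assumption: articles are UTF-8 text; undecodable bytes are replaced.
            text = article_path.read_text(encoding="utf-8", errors="replace")
            yield category_dir.name, text
```

Calling `list(iter_labeled_articles(path))` on one unzipped dataset folder would give the labeled pairs that a text classifier needs; the generator form avoids holding every article in memory when only a single pass is required.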
SANAD_SUBSET is a balanced benchmark dataset (from SANAD) that is used in our research work. It contains the training (90%) and testing (10%) sets.
How to use it:
- Unzip the compressed file.
- There are 3 main folders containing the 3 datasets: Akhbarona, Khaleej, and Arabiya.
- Each dataset-folder contains 2 sub-folders: training and testing.
- The training and testing folders each contain the balanced category sub-folders.
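Once SANAD_SUBSET is unzipped, the balance and the 90%/10% split can be checked by counting files. This is a rough sketch under stated assumptions: the helper names `count_by_category` and `summarize_split` are hypothetical, and the on-disk folder names are assumed to be `training` and `testing` as described above (pass different names if the archive uses another spelling).

```python
from pathlib import Path
from typing import Dict


def count_by_category(split_dir) -> Dict[str, int]:
    """Count article files in each category sub-folder of one split."""
    counts = {}
    for category_dir in Path(split_dir).iterdir():
        if category_dir.is_dir():
            counts[category_dir.name] = sum(
                1 for p in category_dir.iterdir() if p.is_file()
            )
    return counts


def summarize_split(dataset_dir, train_name="training", test_name="testing"):
    """Report per-category counts, balance, and the training fraction."""
    root = Path(dataset_dir)
    # Assumption: the split folders are named as in the description above.
    train = count_by_category(root / train_name)
    test = count_by_category(root / test_name)
    total = sum(train.values()) + sum(test.values())
    return {
        "train_counts": train,
        "test_counts": test,
        # Balanced here means every category has the same count within a split.
        "balanced": len(set(train.values())) <= 1 and len(set(test.values())) <= 1,
        "train_fraction": sum(train.values()) / total if total else 0.0,
    }
```

For a dataset folder that matches the description, `summarize_split` should report `balanced` as true and a `train_fraction` close to 0.9.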