Hamshahri Corpus
Hamshahri is one of the first online Persian newspapers in Iran. It has presented its archive to the public through its website [1] since 1996.
The Hamshahri Corpus was created by crawling the online news articles on Hamshahri's website and processing the HTML pages to produce a standard text corpus for modern Information Retrieval experiments.
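The HTML-processing step can be illustrated with a minimal sketch that reduces a crawled page to plain text. This is not the pipeline actually used to build the corpus, which is not described here. The names TextExtractor and html_to_text are hypothetical, and the sketch assumes only Python's standard library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect visible text, skipping script and style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []       # text chunks in document order
        self.skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Collapse internal whitespace and drop empty chunks.
        text = " ".join(data.split())
        if self.skip_depth == 0 and text:
            self.parts.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string as one line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

A real crawler would also need to fetch pages, detect their character encoding, and separate article text from navigation; those steps are omitted here.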
The collection contains 190,206 articles covering the following subject categories: politics, city news, economics, reports,
editorials, literature, sciences, society, foreign news, sports, etc.
The size of the documents varies from short news items (under 1 KB) to rather long articles (e.g. 140 KB), with an average of 1.8 KB.
The corpus is available for download in several formats [2]:
- Tagged Text: 560 MB
- SQL Server 2000 Tables: 712 MB
External Links
- The Homepage of Hamshahri Corpus (In Persian)
This entry is from Wikipedia, the leading user-contributed encyclopedia. It may not have been reviewed by professional editors.



