Wikipedia:

Hamshahri Corpus

Hamshahri is one of the first online Persian newspapers in Iran. It has made its archive publicly available through its website [1] since 1996.


The Hamshahri Corpus was created by crawling the online news articles on Hamshahri's website and processing the HTML pages into a standard text corpus for modern Information Retrieval experiments.


The collection contains 190,206 articles covering the following subject categories: politics, city news, economics, reports, editorials, literature, sciences, society, foreign news, sports, etc.


Document size varies from short news items (under 1 KB) to rather long articles (e.g. 140 KB), with an average of 1.8 KB.
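
Figures like these can be checked against a local copy of the corpus. The following is a minimal Python sketch, not the method used by the corpus authors. It assumes a hypothetical layout with one article per file under a directory, and it assumes 1 KB = 1024 bytes; neither assumption comes from the corpus documentation, and the names size_stats and corpus_size_stats are illustrative only.

```python
from pathlib import Path
from statistics import mean

KB = 1024  # bytes per kilobyte (binary convention; an assumption)


def size_stats(sizes_in_bytes):
    """Return (min, max, mean) of the given document sizes, in KB."""
    sizes = [s / KB for s in sizes_in_bytes]
    if not sizes:
        raise ValueError("no documents given")
    return min(sizes), max(sizes), mean(sizes)


def corpus_size_stats(corpus_dir):
    """Size statistics for a hypothetical one-article-per-file directory."""
    files = [p for p in Path(corpus_dir).rglob("*") if p.is_file()]
    return size_stats(p.stat().st_size for p in files)
```

Calling size_stats with a list of byte counts returns the smallest, largest, and average size in KB; corpus_size_stats does the same for every file found under the given directory.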


The corpus is available for download in several formats [2]:

  • Tagged text: 560 MB
  • SQL Server 2000 tables: 712 MB

Copyrights:

Wikipedia. This article is licensed under the GNU Free Documentation License. It uses material from the Wikipedia article "Hamshahri Corpus".
