# open-gram **Repository Path**: crhf/open-gram ## Basic Information - **Project Name**: open-gram - **Description**: No description available - **Primary Language**: Unknown - **License**: Not specified - **Default Branch**: master - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2020-03-22 - **Last Updated**: 2020-12-19 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README open-gram ========= open-gram is a project tries to collect lexicon and build n-gram dataset for NLP in Chinese. This project tries to leverage existing open source resources like crfpp and CC-CEDICT. open-gram includes 4 parts - corpus collection - segmentation - (new) word extraction - n-gram info counting corpus collection ================= 1. crawl Chinese web sites using scrapy, grab the body HTML pages of them 2. proprocess the pages - detect the encoding - remove HTML tags and other stuff we are not interested in - split the text into sentences segmentation ============ there two ways to segment tokens into words * tagging * matching word extraction =============== n-gram info counting ====================