# rssSpider

**Repository Path**: king0222/rssSpider

## Basic Information

- **Project Name**: rssSpider
- **Description**: RSS spider for Node.js: an RSS crawler with article body extraction
- **Primary Language**: JavaScript
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-02-23
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# rssSpider

Designed and coded with all the love in the world by ShaneLau.


> The simplest way to use rssspider to fetch an RSS post list and site info.  
> Fetch a post's content and get a clean view of it.  
> An RSS crawler that quickly fetches site info, article lists, and article bodies.

This project is based on [feedparser](https://github.com/kballard/feedparser) and [node-readability](https://github.com/luin/node-readability).


## Usage  

```
npm install rssspider
```
Then:

```
var spide = require('rssspider');
var url = 'http://www.bigertech.com/rss';
spide.fetchRss(url).then(function(data){
  console.log(data); // array of RSS posts
});
```

## API Documentation

### 1. <code>fetchRss(url,[options])</code>

Gets the post list of an RSS feed, for example [www.bigertech.com/rss](http://www.bigertech.com/rss).

*  **url**: the website's RSS URL
*  **options**: which fields to return. Default value:

```
	['title','description','summary','date','link','guid','author','comments','origlink','image','source','categories','enclosures']
```  
Response data: **Array**  

```
[{ title: '一个营销人员的自我修养',
  description: '<p></p>',
  summary: '</p>',
  date: Wed Oct 08 2014 17:14:26 GMT+0800 (CST),
  link: 'http://www.bigertech.com/learn-social-media-marketing/',
  guid: 'a623d78a-dae9-4915-9caa-0fd34fb3757c',
  author: '巴依老爷',
  comments: null,
  origlink: null,
  image: {},
  source: {},
  categories: [],
  enclosures: [] },
  ....  // more
	]

```
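The `options` argument can be sketched as below. This is a minimal example that assumes `options` is an array of field names, as the default value suggests. `fetchTitlesAndLinks` is a hypothetical helper, and its `spider` parameter stands for the module returned by `require('rssspider')`.

```javascript
// Hypothetical helper: ask fetchRss for only the fields we need.
// Assumption: `options` is an array of field names, like the default list.
function fetchTitlesAndLinks(spider, url) {
  return spider.fetchRss(url, ['title', 'link']).then(function (posts) {
    // Keep only the two requested fields, whatever else the module returns.
    return posts.map(function (post) {
      return { title: post.title, link: post.link };
    });
  });
}

// Usage (needs the rssspider package and network access):
// var spide = require('rssspider');
// fetchTitlesAndLinks(spide, 'http://www.bigertech.com/rss').then(console.log);
```

Passing the module in as a parameter keeps the helper easy to exercise with a stub when no network is available.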

### 2. <code>siteInfo(url,[options])</code>
Gets the website info.

* **url**: the website's RSS URL
* **options**: which fields to return. Default value:

```
['title','description','date','link','xmlurl','author','favicon','copyright','generator','image']
```
Response data: **Object**

   ```
  { title: '笔戈科技',
  description: '简单、有趣、有价值',
  date: Thu Oct 09 2014 18:15:14 GMT+0800 (CST),
  link: 'http://www.bigertech.com/',
  xmlurl: 'http://www.bigertech.com/rss/',
  author: null,
  favicon: null,
  copyright: null,
  generator: 'Ghost 0.5',
  image: {},
  feedurl: 'http://www.bigertech.com/rss' }
   ```
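Site info and the post list are often needed together. The sketch below loads both in parallel. It assumes `siteInfo` returns a promise in the same way `fetchRss` does in the Usage section, which this README does not show explicitly. `fetchSiteWithPosts` is a hypothetical helper, and `spider` stands for the module returned by `require('rssspider')`.

```javascript
// Hypothetical helper: load site metadata and the post list in parallel.
// Assumption: siteInfo(url) returns a promise, like fetchRss(url).
function fetchSiteWithPosts(spider, url) {
  return Promise.all([spider.siteInfo(url), spider.fetchRss(url)])
    .then(function (results) {
      // results[0] is the site info object, results[1] the array of posts.
      return { site: results[0], posts: results[1] };
    });
}

// Usage (needs the rssspider package and network access):
// var spide = require('rssspider');
// fetchSiteWithPosts(spide, 'http://www.bigertech.com/rss').then(function (data) {
//   console.log(data.site.title, data.posts.length);
// });
```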


### 3. `getCleanBody(url)`

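Turns any web page into a clean view. This method is based on arc90's Readability project, through node-readability. The parameters below are described as in node-readability's `read(html, [options], callback)`; the example after them calls `getCleanBody(url)` and uses the promise it returns.

  * **html**: a URL or HTML code.
  * **options**: an optional options object.
  * **callback**: the callback to run, `callback(error, article, meta)`.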


  ```
  var url = 'http://www.bigertech.com/learn-social-media-marketing/';
  spide.getCleanBody(url).then(function(article){
    console.log(article.content);   // clean view of the page
  });
  ```

##### More info: [node-readability](https://github.com/luin/node-readability)


#### `article.content` is the clean view

The article content of the web page. It is `false` if extraction failed.
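Because `article.content` can be `false`, callers may want to normalize the failure case. The following is a minimal sketch. `cleanBodyOrNull` is a hypothetical helper, and `spider` stands for the module returned by `require('rssspider')`.

```javascript
// Hypothetical helper: resolve to the clean HTML, or null when extraction failed.
function cleanBodyOrNull(spider, url) {
  return spider.getCleanBody(url).then(function (article) {
    // article.content is documented as `false` on failure.
    return article && article.content ? article.content : null;
  });
}

// Usage (needs the rssspider package and network access):
// var spide = require('rssspider');
// cleanBodyOrNull(spide, 'http://www.bigertech.com/learn-social-media-marketing/')
//   .then(function (html) {
//     console.log(html === null ? 'no clean view available' : html);
//   });
```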


### 4. <code>getAllByUrl(url,[options])</code>
This method is similar to **fetchRss**.
#### What's more, it fetches the clean page content.
Turns any web page into a clean view. This module is based on arc90's Readability project.

* **url**: the website's RSS URL

* **Array**: response data

The clean view of each post is returned in its **content** field.

```  

[{ title: '一个营销人员的自我修养',
   content:'clean code view',     // clean code view
  description: '<p></p>',
  summary: '</p>',
  date: Wed Oct 08 2014 17:14:26 GMT+0800 (CST),
  link: 'http://www.bigertech.com/learn-social-media-marketing/',
  guid: 'a623d78a-dae9-4915-9caa-0fd34fb3757c',
  author: '巴依老爷',
  comments: null,
  origlink: null,
  image: {},
  source: {},
  categories: [],
  enclosures: [] },
    ....... // more
	]

```
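Once the array is available, posts with and without a clean body can be separated. This sketch works on the response shape shown above and assumes that a missing or falsy `content` means extraction failed for that post. `partitionByContent` is a hypothetical helper, not part of rssspider.

```javascript
// Hypothetical helper: split getAllByUrl results by whether a clean body was found.
function partitionByContent(posts) {
  var withContent = [];
  var withoutContent = [];
  posts.forEach(function (post) {
    // Assumption: a missing or falsy `content` means extraction failed.
    (post.content ? withContent : withoutContent).push(post);
  });
  return { withContent: withContent, withoutContent: withoutContent };
}

// Usage (needs the rssspider package and network access):
// var spide = require('rssspider');
// spide.getAllByUrl('http://www.bigertech.com/rss').then(function (posts) {
//   var parts = partitionByContent(posts);
//   console.log(parts.withContent.length + ' posts have a clean view');
// });
```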

## Test 100%
```
nodeunit test/index.js

```


### Any questions? Contact [shanelau](http://weibo.com/kissliux)  
or  
[shanelau1021@gmail.com](mailto:shanelau1021@gmail.com)