# Job_Visual

**Repository Path**: FormatFa/Job_Visual

## Basic Information

- **Project Name**: Job_Visual
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 0
- **Created**: 2020-03-28
- **Last Updated**: 2024-08-02

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

## Big Data Visualization System for Computer Talent Recruitment

- Online preview

http://47.105.180.125:4010/ind2/main

- Project links

Visualization system: https://gitee.com/FormatFa/Job_Visual

Spark data cleaning: https://gitee.com/FormatFa/Job_Clean

Scrapy crawler: https://gitee.com/FormatFa/Job_Craw

- Screenshots

  ![](screen/screen1.png)

  ![](screen/screen2.png)

  ![](screen/screen3.png)
  ![](screen/screen4.png)


## Tech Stack

- Scrapy for crawling
- Spark for data cleaning
- Flask for the web backend
- ECharts for chart visualization

## Installation and Usage

### 1. Backend

1. Install dependencies

   `pip install -r requirements.txt`

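2. Configure the database

   The project uses a MySQL database. Connection settings are read from `database.json` (recommended); if that file does not exist, the values set in the code are used.

   The database host, user name, and other settings may differ from machine to machine, so create a `database.json` file in the project directory with content like the following, and change `host` and the other values to match your setup.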

```json
{
  "host": "localhost",
  "user": "root",
  "password": "root"
}
```
3. Start the server

   `python manage.py`
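
The fallback described in step 2 (read `database.json` if it exists, otherwise use the values in the code) could be sketched roughly as follows. This is only an illustration: the function name `load_db_config`, the `DEFAULT_DB_CONFIG` values, and the choice to merge file values over the defaults are assumptions, not the project's actual code.

```python
import json
from pathlib import Path

# Hypothetical in-code defaults; the project's real fallback values are not shown in this README.
DEFAULT_DB_CONFIG = {"host": "localhost", "user": "root", "password": "root"}


def load_db_config(path="database.json"):
    """Return settings from the JSON file, or the in-code defaults if the file is missing."""
    config_file = Path(path)
    if not config_file.exists():
        return dict(DEFAULT_DB_CONFIG)
    with config_file.open(encoding="utf-8") as f:
        file_config = json.load(f)
    # Assumption: keys missing from the file keep their default values.
    return {**DEFAULT_DB_CONFIG, **file_config}
```

Keeping machine-specific values in `database.json` means each developer can use different credentials without editing the source.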


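#### Server Deployment

There are two ways to deploy on a Linux server:

- Temporary deployment for testing.
  Run `python3 manage.py`. The process stops once the connection is closed (for example, when the Xshell session ends).


- Deployment with gunicorn.
  gunicorn runs in the background and keeps running after the connection is closed. With the commands below, the error log and access log are written to the current directory; point the log options at another directory if needed.

  `export FLASK_ENV=production`

  `gunicorn --daemon --access-logfile access.log --error-logfile errors.log -b 0.0.0.0:4010 "job:create_app()"`

- Stopping and restarting. First find the PID with `pstree -ap|grep gunicorn`. Kill the process with `kill -9 1234`, or restart it with `kill -HUP 1234`, where `1234` stands for the PID you found.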
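
As an alternative to passing every option on the command line, the same settings can live in a gunicorn configuration file, which is itself a Python module. The sketch below is an assumption about how this project could be set up: the file name `gunicorn.conf.py`, the `JOB_LOG_DIR` environment variable, and the `logs` directory are hypothetical, and the directory is assumed to exist already.

```python
# gunicorn.conf.py (hypothetical file name and paths; adjust to your server)
import os

# Assumed log directory; the command in this README writes logs to the current directory instead.
LOG_DIR = os.environ.get("JOB_LOG_DIR", "logs")

bind = "0.0.0.0:4010"                            # same address as the -b option
daemon = True                                    # same as --daemon
accesslog = os.path.join(LOG_DIR, "access.log")  # same as --access-logfile
errorlog = os.path.join(LOG_DIR, "errors.log")   # same as --error-logfile
pidfile = os.path.join(LOG_DIR, "gunicorn.pid")  # records the PID so pstree is not needed
```

With such a file, the server would be started with `gunicorn -c gunicorn.conf.py "job:create_app()"`, and the PID for `kill` could be read from the pid file.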

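#### Runtime Environment

There are two runtime environments:

- development
- production

The environment is selected through the `FLASK_ENV` environment variable. Whatever value `FLASK_ENV` holds is the value the config receives.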
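
One common way to act on the `FLASK_ENV` value is to map it to a configuration class. The sketch below shows the idea only; the class names, the `CONFIGS` mapping, and `get_config` are hypothetical and may not match the project's config module.

```python
import os


class Config:
    DEBUG = False


class DevelopmentConfig(Config):
    DEBUG = True


class ProductionConfig(Config):
    DEBUG = False


# Maps FLASK_ENV values to configuration classes (hypothetical names).
CONFIGS = {"development": DevelopmentConfig, "production": ProductionConfig}


def get_config(environ=os.environ):
    """Pick the configuration class named by FLASK_ENV."""
    # Assumption: an unset variable is treated as production.
    env_name = environ.get("FLASK_ENV", "production")
    if env_name not in CONFIGS:
        raise ValueError(f"Unknown FLASK_ENV: {env_name!r}")
    return CONFIGS[env_name]
```

Rejecting unknown values early makes a typo in the environment variable fail loudly instead of silently running with the wrong settings.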


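To select the development environment from the command line (or set the variable in your IDE):

- Linux:
   `export FLASK_ENV=development`
- Windows cmd:
   `set FLASK_ENV=development`
- Windows PowerShell:
   `$env:FLASK_ENV = "development"`

When the server starts, its output shows which environment is active: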
```
 * Serving Flask app "flaskr"
 * Environment: development
 * Debug mode: on
 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
 * Restarting with stat
```

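### 2. Data Cleaning

After setting up a Spark development environment in IDEA, run `JobCleaner.scala` to clean the recruitment data and `WordCount2.scala` to compute word frequencies.

### 3. Data Collection

To set the log level, edit `settings.py`: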

```
LOG_LEVEL="xxx"
```

#### Test crawl command

```
scrapy crawl n1
```

#### Output fields

```

cate1,cate2,city,cname,cnum,ctrade,ctype,detail,edu,exp,name,num,pubtime,salary,url,welfare

"cate1","cate2","city","cname","cnum","ctrade","ctype","detail","edu","exp","name","num","pubtime","salary","url","welfare"

```
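
The quoted header line above is what Python's `csv` module produces when every value is quoted. As a rough sketch of writing items with these fields (how the project's spider actually writes its CSV is not shown in this README, and `rows_to_csv` and the sample values are made up for illustration):

```python
import csv
import io

FIELDS = ["cate1", "cate2", "city", "cname", "cnum", "ctrade", "ctype", "detail",
          "edu", "exp", "name", "num", "pubtime", "salary", "url", "welfare"]


def rows_to_csv(rows):
    """Serialize item dicts to CSV text with every value quoted."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, quoting=csv.QUOTE_ALL,
                            restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample values, for illustration only.
sample = {"city": "Shenzhen", "name": "Data Engineer", "salary": "10-15k"}
print(rows_to_csv([sample]))
```

Fields missing from an item are written as empty quoted strings, so every row keeps the same 16 columns.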

#### Running the crawler on Windows


- Test the crawler code (crawls only a subset)

```
scrapy crawl n1 -a cate_data=test_data.json
```

- Crawl all categories in the computer industry

  `scrapy crawl n1 --loglevel=WARN`


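#### Starting the crawler from the command line


When the crawler runs from a shell script, command-line arguments are added to control where the output goes. Key-value pairs passed to Scrapy with `-a` are forwarded to the Spider's constructor.

The added command-line arguments:

- `savepath`: path where the crawled data is saved

- `cate_data`: name of the category data file


`--loglevel=WARN` sets the log level.


Command used in the server script:

`scrapy crawl n1 -a savepath=${CRAW_DATA}/${DATE_NAME}.csv -a cate_data=data.json --loglevel=WARN`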
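
To illustrate how `-a` pairs reach the spider, here is a minimal sketch. It uses a plain class as a stand-in so it runs without Scrapy installed; in the real project the class would subclass `scrapy.Spider`. The class name, the default values, and the `parse_spider_args` helper are hypothetical.

```python
class JobSpiderSketch:
    """Stand-in for a scrapy.Spider subclass; a plain class so it runs without Scrapy."""

    name = "n1"

    def __init__(self, savepath="jobs.csv", cate_data="all_cate.json", **kwargs):
        # Scrapy passes each `-a key=value` pair as a string keyword argument.
        self.savepath = savepath
        self.cate_data = cate_data


def parse_spider_args(pairs):
    """Turn ["key=value", ...] into a kwargs dict, mimicking how -a options are split."""
    return dict(pair.split("=", 1) for pair in pairs)


kwargs = parse_spider_args(["savepath=/data/2020-03-28.csv", "cate_data=data.json"])
spider = JobSpiderSketch(**kwargs)
print(spider.savepath, spider.cate_data)
```

Giving both arguments defaults lets the spider still run when it is started without any `-a` options, as in the plain `scrapy crawl n1` test command.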


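#### Category data JSON

`原始.json` (the original file) keeps every category ID across all three category levels.

`all_cate.json` removes the e-commerce and operations top-level categories from the computer class, along with the "data - other" category, because 2000 is still too many. It is used when crawling the computer industry.

`test_data.json` contains only one category and is used for testing the code.

### Developers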

https://gitee.com/FormatFa

https://gitee.com/jianjiana11

https://gitee.com/xiongbibi