# phpcrawel

**Repository Path**: simwower/phpcrawel

## Basic Information

- **Project Name**: phpcrawel
- **Description**: PHP + QueryList crawler project
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1
- **Created**: 2020-05-25
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# PHP + QueryList Crawler Demo

## Development Environment and Tools

### Install phpstudy

*[Download](http://www.phpstudy.net/a.php/211.html)*

**Installation steps**

- Download the archive
- Extract it
- Double-click the installer
- Newer PHP versions require the 32-bit VC14 runtime: [vc14_x86.zip](http://pan.baidu.com/s/1bAAEJW)

---

### **Install VS Code**

[Download](https://code.visualstudio.com/)

**Install the extensions**

- Composer
- Code Runner
- PHP Formatter
- vscode-database
- vscode Great icons

**Common settings**

---

```javascript
{
    "workbench.iconTheme": "vscode-great-icons",
    "php.validate.executablePath": "F:/php/php-7.0.12-nts/php.exe",
    "workbench.colorTheme": "Monokai_kd",
    "files.autoSave": "onFocusChange",
    "files.associations": {
        "*.wxml": "html",
        "*.wxss": "css"
    },
    "window.zoomLevel": 0,
    "editor.fontFamily": "bpmono",
    "editor.fontSize": 22,
    "editor.lineHeight": 40,
    "editor.mouseWheelZoom": true,
    "editor.cursorStyle": "line",
    "editor.wordWrap": "off",
    "extensions.autoUpdate": true,
    "terminal.integrated.cursorStyle": "line",
    "terminal.integrated.fontFamily": "ubuntu mono",
    "terminal.integrated.fontSize": 20,
    "terminal.integrated.lineHeight": 1.5,
    "editor.minimap.enabled": false,
    "terminal.integrated.shell.windows": "C:\\Windows\\Sysnative\\cmd.exe"
}
```

---

### **Configure the VS Code extensions**

**Configure Code Runner**

```javascript
"php.executablePath": "F:/php/php-7.0.12-nts/php.exe"
```

**Configure Composer**

```javascript
"composer.executablePath": "F:\\tools\\composer\\composer.bat",
```

---

## **QueryList**

**What is [QueryList](https://querylist.cc/)?**

**Install QueryList with Composer**

> Path: create `composer.json` in your directory

```javascript
{
    "require": {
        "jaeger/querylist": "3.*"
    }
}
```

**Check that QueryList works**

> Path:
create `index.php` in your directory

```php
<?php
require 'vendor/autoload.php';

use QL\QueryList;

// Collect every image on a page
$data = QueryList::Query('http://cms.querylist.cc/bizhi/453.html', array(
    // Rule set
    // 'rule name' => array('jQuery selector', 'attribute to collect'),
    'image' => array('img', 'src')
))->data;

// Print the result
print_r($data);
```

---

## **medoo**

**What is [medoo](http://medoo.lvtao.net/)?**

**Install Medoo with Composer**

> Path: `composer.json` in your project directory

```javascript
{
    "require": {
        "catfan/medoo": "1.*"
    }
}
```

**Check that medoo is installed correctly**

> Path: `index.php` in your directory

```php
<?php
require 'vendor/autoload.php';

use Medoo\Medoo;

$database = new Medoo([
    'database_type' => 'mysql',
    'database_name' => 'crawl',
    'server' => 'localhost',
    'username' => 'root',
    'password' => 'root',
    'charset' => 'utf8'
]);

// Example insert
$database->insert('article', [
    'article_id' => '666',
    'article_title' => 'test',
    'article_content' => 'test',
    'article_intro' => 'test',
    'article_views' => '33',
    'article_ctime' => '33',
    'article_thumb' => 'http://www.baidu.com/1.jpg',
    'article_uid' => '22'
]);
```

---

## **Set up the database**

**Using the vscode-database extension**

Connect to the database

1. ==ctrl+shift+p==
2. SQL: Connect to MySQL server
3. Enter the host address
4. Enter the user name
5. Enter the password
6. Select the database

Simple query

1. ==ctrl+shift+p==
2. SQL: Query
3. Type the query statement
4. ==enter== to run the query

Complex query

1. Create a query file (==it must be inside the project==)
    - ==ctrl+shift+p==
    - SQL: Query Advancer Build
    - Write the query statements
2. Run the query file
    - ==ctrl+shift+p==
    - SQL: Run Query Advancer Build

**Create the database**

```SQL
CREATE DATABASE crawl;
```

**article table**

```SQL
CREATE TABLE `article` (
  `article_id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'article id',
  `article_title` varchar(255) NOT NULL COMMENT 'article title',
  `article_content` text NOT NULL COMMENT 'article content',
  `article_intro` varchar(255) NOT NULL COMMENT 'article summary',
  `article_views` int(11) NOT NULL COMMENT 'article view count',
  `article_ctime` datetime NOT NULL COMMENT 'article creation time',
  `article_thumb` varchar(255) NOT NULL COMMENT 'article thumbnail path',
  `article_uid` int(11) NOT NULL COMMENT 'article user id',
  PRIMARY KEY (`article_id`)
) ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;
```

**user table**

```SQL
CREATE TABLE `user` (
  `user_id` int(11) NOT NULL AUTO_INCREMENT,
  `user_phone` varchar(255) NOT NULL,
  `user_nickname` varchar(255) NOT NULL,
  `user_email` varchar(255) NOT NULL,
  `user_pwd` char(32) NOT NULL,
  `user_rtime` int(11) NOT NULL,
  `user_head_img` varchar(255) NOT NULL,
  PRIMARY KEY (`user_id`)
) ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;
```

---

# Fetching the Crawl Data

## Choose the page to crawl

**Jobbole (伯乐在线) [backend PHP](http://blog.jobbole.com/category/php-programmer)**

## Writing index.php

> Path: `index.php` in the project folder

**Load QueryList and Medoo**

```php
<?php
require 'vendor/autoload.php';

use QL\QueryList;
use Medoo\Medoo;
```

**Connect to the database**

```php
/***** Connect to the database ******/
global $database;
$database = new Medoo([
    'database_type' => 'mysql',
    'database_name' => 'crawl',
    'server' => 'localhost',
    'username' => 'root',
    'password' => 'root',
    'charset' => 'utf8'
]);
```

**Write the main function ==index()==**

```php
function index(){
    echo "Crawler starting, please wait...\n";
    # Number of list pages to crawl
    $crawl_num=7;
    for($i=1;$i<=$crawl_num;$i++){
        echo "Crawling page {$i}\n";
        $url="http://blog.jobbole.com/category/php-programmer/page/{$i}/";
        echo "URL: {$url}\n";
        # Crawl rules for the list page
        $list_rule=[
            'article_title'=>['#archive .archive-title','text'],
            'content_url'=>['#archive .post-thumb > a:first-child','href'],
            'article_intro'=>['#archive .post-meta .excerpt','text'],
            'article_ctime'=>['#archive .post-meta > p:first-child','text','-a'],
            'article_thumb'=>['#archive .post-thumb > a > img','src']
        ];
        # Fetch the list data
        $list_data=crawl_data($url,$list_rule);
        # Fetch the detail page of every list entry
        foreach($list_data as $key=>$value){
            echo "Fetching content of 《{$value['article_title']}》\n";
            # Crawl rule for the detail page
            $detail_rule=[
                'article_content'=>['.entry','html']
            ];
            # Fetch the detail data
            $detail_data=crawl_data($value['content_url'],$detail_rule);
            # Assemble the row to store in the database
            $db_data['article_title']=$value['article_title'];
            $db_data['article_content']=htmlspecialchars($detail_data[0]['article_content']);
            $db_data['article_intro']=$value['article_intro'];
            $db_data['article_ctime']=find_time($value['article_ctime']);
            $db_data['article_thumb']=$value['article_thumb'];
            $db_data['article_uid']=random_uid();
            $db_data['article_views']=rand(100,100000);
            # Insert into the database
            $res=$GLOBALS['database']->insert('article',$db_data);
            if(!$res){
                echo "Writing 《{$db_data['article_title']}》 ";
                echo "failed\n";
                var_dump('Write failed!');
                die;
            }else{
                echo "Writing 《{$db_data['article_title']}》 ";
                echo "succeeded\n";
            }
        }
    }
    # Moved out of the for loop: the original called die inside the loop,
    # which stopped the crawler after the first page.
    echo "Crawler finished successfully\n";
}
```

**Fetch data from a URL**

```php
/** Fetch data from $url using the given rules */
function crawl_data($url,$rule)
{
    $data=QueryList::query($url,$rule)->data;
    return $data;
}
```

**Match the date with a regular expression**

```php
/** Extract a YYYY/MM/DD date from $string; fall back to a fixed date if none is found */
function find_time($string){
    $result=preg_match('/\d{4}\/\d{1,2}\/\d{1,2}/', $string, $matching);
    if($result){
        return $matching[0];
    }else{
        return "2017/10/26";
    }
}
```

**Pick a random UID**

```php
/** Pick a random user_id from the user table (the table must contain at least one row) */
function random_uid(){
    $uid_list=$GLOBALS['database']->select('user',["user_id"]);
    $max_key=sizeof($uid_list)-1;
    $min_key=0;
    $key=rand($min_key,$max_key);
    $uid_data=$uid_list[$key]['user_id'];
    return $uid_data;
}
```

**Run the main function**

```php
index();
```
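
The `article_ctime` column is declared as `datetime`, while `find_time()` returns strings such as `2017/10/26`. The following is a minimal sketch, assuming you would rather store an explicit `Y-m-d H:i:s` value than rely on the database to interpret the slash-separated form. `normalize_ctime()` and its fallback value are hypothetical and not part of the original project.

```php
<?php
/**
 * Hypothetical helper (not in the original project): convert a "YYYY/M/D"
 * string into "YYYY-MM-DD HH:MM:SS" text for a DATETIME column.
 * Returns $fallback when the string cannot be parsed at all.
 */
function normalize_ctime(string $date, string $fallback = '2017-10-26 00:00:00'): string
{
    // "!" resets the time fields to zero; "n" and "j" accept months and
    // days written with or without a leading zero.
    $parsed = DateTime::createFromFormat('!Y/n/j', $date);
    if ($parsed === false) {
        return $fallback;
    }
    return $parsed->format('Y-m-d H:i:s');
}

echo normalize_ctime('2017/10/26'), "\n"; // prints 2017-10-26 00:00:00
```

One possible use would be `normalize_ctime(find_time($value['article_ctime']))` when filling `$db_data['article_ctime']`.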
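
`random_uid()` picks a key from the rows returned by `select('user', ["user_id"])`, so it has no valid key to pick when the `user` table is empty. The sketch below shows one way to guard against that case. It works on a plain array so it can run without a database; `pick_random_uid()` is a hypothetical name, and returning `null` for an empty list is an assumption about how the caller wants to handle it.

```php
<?php
/**
 * Hypothetical variant of random_uid() that takes the selected rows as input.
 * Each row is expected to look like ['user_id' => ...], as in random_uid().
 * Returns null when there are no rows to choose from.
 */
function pick_random_uid(array $uid_list)
{
    if (count($uid_list) === 0) {
        return null; // no users available
    }
    // array_rand() returns one random key of a non-empty array.
    $key = array_rand($uid_list);
    return (int) $uid_list[$key]['user_id'];
}

$rows = [['user_id' => '3'], ['user_id' => '8']];
var_dump(pick_random_uid($rows)); // int(3) or int(8)
var_dump(pick_random_uid([]));    // NULL
```

In the crawler, the caller could skip the article or stop with an error message when `null` comes back, instead of inserting a row without a valid `article_uid`.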