Commonly Used PHP Functions for Fetching Pages
Published: 2015-11-02 22:26:29
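While working with the QQ login API, fetching the server's page with file_get_contents or cURL kept returning "502 Bad Gateway".
Handling the request with fsockopen avoided the problem. See the getContent function below, which works for both http and https pages.
Of course, your PHP must have the openssl extension configured first.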
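One quick way to confirm the openssl prerequisite before trying an https URL is sketched here. The helper name `canUseSslSockets` is hypothetical and not part of the original post; it only checks that the extension is loaded and that the `ssl` stream transport is registered.

```php
<?php
// Hypothetical helper (not from the original post): reports whether
// fsockopen() should be able to open ssl:// connections on this PHP build.
function canUseSslSockets(): bool
{
    // The ssl:// transport is registered by the openssl extension.
    return extension_loaded('openssl')
        && in_array('ssl', stream_get_transports(), true);
}

echo canUseSslSockets()
    ? "ssl:// transport is available\n"
    : "openssl is missing; https URLs will fail\n";
```

If the second message is printed, enable the openssl extension in php.ini before calling the socket-based function with an https URL.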
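<?php
/**
 * Fetch a page over a raw socket (http or https).
 * Returns the response body, or false on failure.
 */
function getContent($url) {
    $url_info = parse_url($url);
    if (!$url_info || empty($url_info['host'])) {
        return false;
    }
    // parse_url() omits 'scheme' for scheme-relative URLs; treat those as http.
    $url_scheme = isset($url_info['scheme']) ? strtolower($url_info['scheme']) : 'http';
    switch ($url_scheme) {
        case 'https':
            $scheme = 'ssl://';
            $port = 443;
            break;
        case 'http':
        default:
            $scheme = '';
            $port = 80;
    }
    $host_header = $url_info['host'];
    // An explicit port in the URL overrides the scheme default.
    if (isset($url_info['port'])) {
        $port = $url_info['port'];
        $host_header .= ':' . $port;
    }
    $data = "";
    // Connect with a 30-second timeout.
    $fid = fsockopen($scheme . $url_info['host'], $port, $errno, $errstr, 30);
    if ($fid) {
        // Send a minimal HTTP/1.0 GET request.
        fputs($fid, 'GET '
            . (isset($url_info['path']) ? $url_info['path'] : '/')
            . (isset($url_info['query']) ? '?' . $url_info['query'] : '')
            . " HTTP/1.0\r\n"
            . "Connection: close\r\n"
            . 'Host: ' . $host_header . "\r\n\r\n");
        // Read the whole response (headers and body) until the server closes the connection.
        while (!feof($fid)) {
            $data .= @fgets($fid, 128);
        }
        fclose($fid);
        // The body starts after the first blank line; drop the headers.
        $pos = strpos($data, "\r\n\r\n");
        if ($pos === false) {
            return false;
        }
        return (string) substr($data, $pos + 4);
    } else {
        return false;
    }
}
?>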
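The function above discards the status line and headers, so a 502 error page looks the same to the caller as a successful response. The sketch below shows one way to split a raw response so the status code can be checked first. `splitHttpResponse` is a hypothetical helper, not part of the original post, and the sample response is made up for illustration.

```php
<?php
// Hypothetical helper: split a raw HTTP response into status code,
// header lines, and body. Returns false when no blank line separates
// the headers from the body.
function splitHttpResponse(string $raw)
{
    $pos = strpos($raw, "\r\n\r\n");
    if ($pos === false) {
        return false;
    }
    $headerLines = explode("\r\n", substr($raw, 0, $pos));
    $statusLine  = array_shift($headerLines);   // e.g. "HTTP/1.0 200 OK"
    $status = 0;                                 // 0 means "could not parse"
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $statusLine, $m)) {
        $status = (int) $m[1];
    }
    return [
        'status'  => $status,
        'headers' => $headerLines,
        'body'    => (string) substr($raw, $pos + 4),
    ];
}

// Made-up sample response for illustration.
$raw = "HTTP/1.0 502 Bad Gateway\r\nContent-Type: text/html\r\n\r\n<h1>Bad Gateway</h1>";
$parts = splitHttpResponse($raw);
echo $parts['status'], "\n";   // prints 502
```

A caller could then treat any status outside the 200 range as a failure instead of returning the error page as content.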
Here is another commonly used cURL fetch function:
<?php
/**
 * cURL fetch function
 *
 * @param string       $url         URL to fetch
 * @param string       $postdata    POST data to submit; leave empty for a non-POST request
 * @param string       $pre_url     Referer URL to send (spoofed source page)
 * @param string|false $proxyip     Proxy server to use, or false for none
 * @param string       $compression Compression encodings accepted for the target URL's content
 *
 * @return string|false Content of the target URL, or false on failure
 */
function curl_getContent($url, $postdata='', $pre_url='https://www.baidu.com', $proxyip=false, $compression='gzip, deflate')
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5); // 5-second timeout
    $client_ip = rand(1,254).'.'.rand(1,254).'.'.rand(1,254).'.'.rand(1,254);
    $x_ip = rand(1,254).'.'.rand(1,254).'.'.rand(1,254).'.'.rand(1,254);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('X-FORWARDED-FOR:'.$x_ip, 'CLIENT-IP:'.$client_ip)); // spoofed client IP headers
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the response instead of printing it
    if ($postdata != '') {
        curl_setopt($ch, CURLOPT_POST, 1); // submit as POST
        curl_setopt($ch, CURLOPT_POSTFIELDS, $postdata); // all POST data as a single string
    }
    $pre_url = $pre_url ? $pre_url : "http://".$_SERVER['HTTP_HOST'].$_SERVER['REQUEST_URI'];
    curl_setopt($ch, CURLOPT_REFERER, $pre_url); // Referer URL
    if ($proxyip) {
        curl_setopt($ch, CURLOPT_PROXY, $proxyip); // proxy server
    }
    if ($compression != '') {
        curl_setopt($ch, CURLOPT_ENCODING, $compression); // compression accepted for the response
    }
    // Alternative mobile User-Agent:
    // Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; c8650 Build/GWK74) AppleWebKit/533.1
    // (KHTML, like Gecko)Version/4.0 MQQBrowser/4.5 Mobile Safari/533.1s
    // Send a "User-Agent" header with the request. It must be a single line with no line breaks.
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.47 Safari/536.11');
    curl_setopt($ch, CURLOPT_HEADER, 0); // do not include HTTP headers in the output
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow redirects
    $result = curl_exec($ch);
    curl_close($ch);
    // curl_exec() returns false on failure; do not try to convert that.
    if ($result === false) {
        return false;
    }
    // Convert GBK to UTF-8 when the response is not already valid UTF-8.
    if (! mb_check_encoding($result, 'utf-8')) {
        $result = mb_convert_encoding($result, 'utf-8', 'gbk');
    }
    return $result;
}
?>
Usage:
<?php
$date = date('Y-m-d');
$url = "http://www.xxx.com/";
$post = "pageIndex=1&pageCount=500";
$data = json_decode(curl_getContent($url, $post), true);
print_r($data);
?>
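The usage example writes the POST string by hand and passes the result straight to json_decode. As a small sketch, the POST body can instead be built from an array so that values are URL-encoded, and the response can be validated before use. `decodeJsonResponse` is a hypothetical helper, not part of the original post, and the sample inputs are made up.

```php
<?php
// Build the POST body from an array; the separator is passed explicitly
// so the result does not depend on the arg_separator.output ini setting.
$post = http_build_query(['pageIndex' => 1, 'pageCount' => 500], '', '&');
echo $post, "\n";   // prints pageIndex=1&pageCount=500

// Hypothetical helper: decode a fetched body as JSON. Returns null when the
// fetch failed (false), the body is empty, or the body is not valid JSON.
function decodeJsonResponse($body)
{
    if (!is_string($body) || $body === '') {
        return null;
    }
    $data = json_decode($body, true);
    return json_last_error() === JSON_ERROR_NONE ? $data : null;
}

// Made-up sample inputs for illustration.
var_dump(decodeJsonResponse('{"total":2}'));        // array with key "total"
var_dump(decodeJsonResponse('<html>502</html>'));   // NULL: not JSON
var_dump(decodeJsonResponse(false));                // NULL: fetch failed
```

With this in place, the call in the usage example could become `decodeJsonResponse(curl_getContent($url, $post))`, so that an HTML error page or a failed request yields null rather than a confusing decode result.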