[PHP] Scraping webpages

Soldato
Joined
12 Jun 2005
Posts
5,361
Hi there,

Does anyone know of a decent PHP function to scrape the HTML from a webpage? file_get_contents doesn't always work.
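(Worth noting: one common reason file_get_contents fails on some sites is the default user agent being rejected, and it can send a custom one via a stream context. A sketch, where $url is whatever page is being fetched:)

Code:
$context = stream_context_create(array(
    'http' => array(
        'user_agent' => 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)',
        'timeout'    => 10,
    ),
));

// Returns false on failure rather than throwing, so check the result.
$html = file_get_contents($url, false, $context);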

I'm trying this at the moment, but it just returns "Bad Request":

Code:
function GetPageHTML($URL)
{
    // Some servers reject requests without a browser-like user agent.
    $userAgent = 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)';

    $curl = curl_init($URL);
    curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($curl, CURLOPT_AUTOREFERER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);   // return the HTML rather than echoing it
    curl_setopt($curl, CURLOPT_TIMEOUT, 2);

    $html = curl_exec($curl);

    // Convert any UTF-8 characters to HTML entities.
    $html = @mb_convert_encoding($html, 'HTML-ENTITIES', 'utf-8');

    curl_close($curl);

    return $html;
}
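Called along these lines (sketch; the URL is the one that fails below):

Code:
$html = GetPageHTML('http://epguides.com/Scrubs/');
echo $html;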

Thanks.
 

Pho

Soldato
Joined
18 Oct 2002
Posts
9,324
Location
Derbyshire
The code you posted in the opening post works fine for me:


Are you hosting this locally or on a webhost? I wonder if accessing external URLs has been blocked?
 
Soldato
OP
Joined
12 Jun 2005
Posts
5,361
What is the output of:
Code:
print_r(curl_getinfo($curl));
(Place it before your curl_close() line).

Returns:

Code:
Array
(
    [url] => http://epguides.com/Scrubs/
    [content_type] => text/html
    [http_code] => 400
    [header_size] => 129
    [request_size] => 140
    [filetime] => -1
    [ssl_verify_result] => 0
    [redirect_count] => 0
    [total_time] => 0.685881
    [namelookup_time] => 0.342556
    [connect_time] => 0.502051
    [pretransfer_time] => 0.502106
    [size_upload] => 0
    [size_download] => 20
    [speed_download] => 29
    [speed_upload] => 0
    [download_content_length] => 20
    [upload_content_length] => 0
    [starttransfer_time] => 0.685844
    [redirect_time] => 0
)
1
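(The http_code of 400 and the 20-byte body match the "Bad Request" message. One way to surface that inside the function - a sketch, placed just before curl_close():)

Code:
$info = curl_getinfo($curl);
if ($info['http_code'] != 200) {
    // The server rejected the request; log the status and URL for debugging.
    echo 'HTTP ' . $info['http_code'] . ' fetching ' . $URL . "\n";
}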
 
Soldato
OP
Joined
12 Jun 2005
Posts
5,361
The code you posted in the opening post works fine for me:


Are you hosting this locally or on a webhost? I wonder if accessing external URLs has been blocked?

It sometimes works, but other times it doesn't.

I'm downloading about 7 pages sequentially and it usually doesn't get them all, but when I try to download just one page it sometimes works, sometimes doesn't. It's an intermittent problem.
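(Two things worth noting: CURLOPT_TIMEOUT is set to 2 seconds, so slow responses alone would fail intermittently, and a retry loop papers over one-off failures. A sketch, treating an empty result as a failure since curl_exec's false gets turned into an empty string by mb_convert_encoding:)

Code:
$urls = array('http://epguides.com/Scrubs/');  // ...plus the other pages

foreach ($urls as $url) {
    $html = '';
    // Retry each page up to 3 times before giving up.
    for ($attempt = 0; $attempt < 3 && $html == ''; $attempt++) {
        $html = GetPageHTML($url);
    }
    if ($html == '') {
        echo 'Failed to fetch ' . $url . "\n";
    }
}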
 
Soldato
OP
Joined
12 Jun 2005
Posts
5,361
LOL... it turns out there was some whitespace on the end of the URLs I was trying to fetch, which stopped it working - and of course when I printed the string that contained the URL, it looked fine.
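(For anyone hitting the same thing, trim() strips it - a sketch, cleaning each URL before the request:)

Code:
$url = trim($url);         // removes leading/trailing spaces, tabs and newlines
$html = GetPageHTML($url);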

The reason I didn't figure this out sooner is that for some reason my PHP setup won't print any errors from my scripts (even when I manually set the error reporting)... anyone know why that is?
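(On the missing errors: error_reporting() controls which errors are reported, but display_errors decides whether they're printed, and web hosts often have it switched off in php.ini. Forcing both at the top of the script is the usual check - a sketch:)

Code:
error_reporting(E_ALL);
ini_set('display_errors', '1');  // errors may otherwise only go to the server's log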
 
Associate
Joined
27 Jun 2006
Posts
1,473
Ah, whitespace - been where you are so many times before.
Now whenever I print anything out to debug my (dodgy) code, I put an asterisk on each side so any whitespace shows up!

So many hours lost to that problem :D
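(i.e. something along these lines, with whatever variable is being checked:)

Code:
echo '*' . $url . '*';  // stray whitespace shows up between the asterisks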
 