Scripts | Codes

All languages in three languages :-)

Extracts the links from each page listed in links.dat, then records the new links found on those pages so they can be crawled in turn.
Create a file named links.dat in the same directory as the script and put the starting links in it, one per line.

<?php
//################################################
// for more codes scripts-n-codes.blogspot.com
//################################################
//
// put the links to crawl in a links.dat file, one per line; a single site URL is enough, for example
//
$datafile = "links.dat"; // file to keep the list of links in
$regex = "/<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >]/isU";  // regex to search for hrefs

// read the links into an array, one per line, without line endings
$oldlinks = file($datafile, FILE_IGNORE_NEW_LINES);
if ($oldlinks === false) {
 die("Cannot read $datafile\n"); // nothing to crawl without the data file
}
$oldlinks = array_filter(array_map('trim', $oldlinks), 'strlen'); // drop blank lines

foreach ($oldlinks as $url) { // for every link that was in the file at start-up
 print($url . "\n"); // print it out
 $html = @file_get_contents($url); // read in the whole remote page
 if ($html === false) { // skip pages that cannot be fetched instead of aborting
  print("Could not open $url\n");
  continue;
 }
 if (preg_match_all($regex, $html, $links)) { // if we find links
  $local = fopen($datafile, "a"); // open the data file for appending
  foreach ($links[1] as $link) { // for every link on the page
   if (!in_array($link, $oldlinks)) { // if we haven't seen it before (nb - case sensitive)
    $oldlinks[] = $link; // remember it so it is written only once
    print($link . "\n"); // print it out
    fwrite($local, $link . "\n"); // and write it to file
   }
  }
  fclose($local); // close the data file
 }
 else {
  print("No links.\n"); // we didn't find any links in this page
 }
}
?>
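
The script writes every href it finds to links.dat, including relative ones such as /about, which cannot be opened as they stand on a later run. The sketch below is one possible way to keep only absolute http(s) links and drop duplicates before writing. The helper name extract_absolute_links and the sample HTML are made up for illustration and are not part of the original script; only the regex is taken from it.

```php
<?php
// Same pattern as in the script above.
$regex = "/<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >]/isU";

// Hypothetical helper (an assumption, not the author's code): returns the
// unique absolute http(s) links found in $html, in order of first appearance.
function extract_absolute_links($html, $regex) {
    if (!preg_match_all($regex, $html, $matches)) {
        return array(); // no href found at all
    }
    $links = array();
    foreach ($matches[1] as $href) {
        // keep only links that start with http:// or https:// and are new
        if (preg_match('#^https?://#i', $href) && !in_array($href, $links)) {
            $links[] = $href;
        }
    }
    return $links;
}

// Made-up sample page: one relative link and one duplicate.
$sample = '<a href="http://example.com/a">A</a> '
        . "<A HREF='/about'>About</A> "
        . '<a class="x" href="http://example.com/a">A again</a> '
        . '<a href=https://example.org/b>B</a>';

print_r(extract_absolute_links($sample, $regex));
?>
```

Relative links are simply discarded here. Resolving them against the URL of the page they came from would be a further step that this sketch does not attempt.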
