<?php // ۞// text { encoding:utf-8 ; bom:no ; linebreaks:unix ; tabs:4sp ; }
                                        $clever_404['version'] = '1.9.15';
// direct access -> demo..
//if (realpath($_SERVER['SCRIPT_FILENAME']) == realpath(__FILE__)) { sleep(2); header("Location: /check"); }
/*

    NOTE: Clever 404 has been deprecated (though still works great!).
    For more power, and a single-file solution, check out Active Error Pages:

        https://corz.org/server/tools/active-errors/


    Clever 404

    The corz.org intelligent php 404 error handler

    404 pages are more important than we generally think. Most visitors, when
    presented with a broken link, will just go elsewhere. Bye! A decent 404 page
    will give you a "second-shot", keep them hanging around, and shows you care,
    which many webmasters don't. Hit a 404 at corz.org for a wee demo.


    what it does..

        As well as the usual valid link to your site root and email address that
        folks can click to give feedback, 404.php provides an intelligent
        response to your site's missing pages..

        First, 404.php does automatic redirection for all your moved pages.
        Whenever you move a page, simply add it to 404.php's "catcher" list, and
        have all visitors and search engines automatically and permanently
        redirected, without any fuss or .htaccess hacking.

        For genuinely missing documents, 404.php goes on to do a (very*) quick
        scan of your site, looking for similar items, and returns a list of any
        matching files, as links. 404 is capable of "fuzzy" matching, so will
        usually catch typos in hand-inputted URLs.

            * usually 0.01 seconds or less.
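
        The "fuzzy" matching mentioned above could be sketched roughly like
        this. It mirrors the two-test idea used in scan_site() further down
        (similar_text() plus levenshtein()). The function name
        is_fuzzy_match() and the default fuzziness of 1 are assumptions made
        for this example; they are not part of the script..

```php
<?php
// Hypothetical helper (not part of Clever 404): compare a real file name,
// minus its extension, against the name the visitor asked for.
function is_fuzzy_match(string $short_name, string $wanted, int $fuzziness = 1): bool
{
    // test 1: all but one character of the real name is shared..
    $one_off = similar_text($short_name, $wanted) === strlen($short_name) - 1;

    // test 2: ..and the edit distance stays within the allowed fuzziness.
    return $one_off && levenshtein($short_name, $wanted) <= $fuzziness;
}

var_dump(is_fuzzy_match('g_dip', 'g-dip'));             // a one-character slip
var_dump(is_fuzzy_match('tempx_piles', 'tempz_piles')); // another one
var_dump(is_fuzzy_match('g_dip', 'gallery'));           // nothing alike
```

        Using both tests together is stricter than either one alone, which
        helps keep the list of suggested links short.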

        If there is only a single matching document, 404.php can (optionally)
        jump the visitor directly to that document, nifty. This auto-jumping can
        be either by meta-refresh, or proper 302/301 http headers. The latter is
        preferable; though 404 does look cool, its purpose is to get users to
        the correct page ASAP, if at all possible..

            https://corz.org/physical
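
        As a rough sketch of the header-based jump described above, the
        status line and Location header could be put together like this.
        jump_headers() and the example.com address are made up for
        illustration; the script itself sends its headers inline..

```php
<?php
// Hypothetical helper (not part of Clever 404): build the header lines for
// an auto-jump, so they can be inspected before anything is sent.
function jump_headers(string $method, string $url): array
{
    $status = array(
        '301' => 'HTTP/1.1 301 Moved Permanently',
        '302' => 'HTTP/1.1 302 Found',
        '307' => 'HTTP/1.1 307 Temporary Redirect',
    );
    if (!isset($status[$method])) {
        return array(); // e.g. 'meta': no http headers, use a meta-refresh instead
    }
    return array($status[$method], 'Location: ' . $url);
}

// usage: send them before any page output..
foreach (jump_headers('301', 'https://example.com/fun/weegary.php') as $line) {
    // header($line);   // uncomment on a real server
    echo $line, "\n";
}
```

        Headers only work if nothing has been output yet, which is why the
        script handles its redirects before it begins the page.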

        Finally, if no matches are found, 404.php (optionally) presents the user
        with a corzoogle search form, the important part of their query already
        inserted into the search field, enabling them to perform a full content
        search of your web site.

        TADA!


    To use..

    i.
        Edit the preferences section (inside error-settings.php) with your own
        details, name, etc..

    ii.
        Direct your site's 404 errors to this script..

        This is achieved by editing either your master httpd.conf or
        main .htaccess file, inserting a line something like..

            ErrorDocument 400 /err/400.php
            ErrorDocument 401 /err/401.php
            ErrorDocument 403 /err/403.php
            ErrorDocument 404 /err/404.php
            ErrorDocument 410 /err/410.php
            ErrorDocument 500 /err/500.php
            ErrorDocument 503 /err/503.php

        ..which would direct all 404 errors to.. http://yoursite/err/404.php
        which is where I happen to keep *my* copy of this script, that is,
        inside a folder called "err" in the top level of your site.

        You will have noticed 404 comes with a matching 403 page, 401 and so
        on. Use what you want/need.

        For more information on .htaccess files, see here..

            https://corz.org/serv/tricks/htaccess.php

        I'll likely include an example .htaccess inside the zip distribution.

        On some web hosts you can choose your error pages through the CP (control
        panel) or site admin page. As a last resort, ask your website
        hosts/sysadmin how to achieve this.

    iii.
        The optional corzoogle search form (for *very* missing documents) will
        obviously need to have corzoogle installed somewhere on your site to be
        of any use. Clearly 404.php is best instructed to use the corzoogle
        search engine in the 'root', or 'top level' of your site. you can get
        corzoogle here..

            https://corz.org/corzoogle/download.php

    iv.
        The "catchers" part does automatic redirection of your moved pages.

        The idea is, when you move a page, for whatever reason, you add it to
        the catchers list. 404.php will then permanently (301) direct visitors
        straight to that page, bypassing the 404 altogether, they won't even
        realize they *got* a 404!

        As well as catching your real moved pages, this can be useful for
        catching known mis-spelt inward links (forums!), hot-linkers, and more.

        See inside "moved.ini" for the actual catchers list. Basically a list
        of old="new" entries, something like this..

weegary.php="/fun/weegary.php"
/pj_demo="/serv/security/demo/"

        etc.
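
        To illustrate how a catchers list like the one above gets used, here
        is a minimal sketch. It assumes the old="new" entries have already
        been loaded into a plain array; find_catcher() is a hypothetical
        name, though the case-insensitive "contains" test is the same kind
        of check the script does with stristr()..

```php
<?php
// Assumption: the old="new" entries have already been read into an array.
$catchers = array(
    'weegary.php' => '/fun/weegary.php',
    '/pj_demo'    => '/serv/security/demo/',
);

// Hypothetical helper: return the new location for a request, or null.
function find_catcher(string $request_uri, array $catchers): ?string
{
    foreach ($catchers as $old_page => $new_page) {
        // case-insensitive "contains" test, like the script's stristr() check
        if (stristr($request_uri, (string) $old_page) !== false) {
            return $new_page;
        }
    }
    return null;
}

$target = find_catcher('/old/WeeGary.php', $catchers);
if ($target !== null) {
    // header('HTTP/1.1 301 Moved Permanently');
    // header('Location: http://yoursite' . $target);
    echo $target, "\n";
}
```

        Because the test is "contains", a short entry will catch every URL
        that includes it, so keep your old-page entries as specific as you
        can.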

    v.
        Lastly, if you want the spam-bot-foiling email address mashing function
        to work, you will need to include the mail-mash function somewhere. if
        you download the zip of this script, you'll find I have thoughtfully
        done this for you. if you move it somewhere, you'll need to edit the
        location in the preferences, below.


    That's it!
    You keep your lost visitors now!

    ;o)

    (c) corz.org 2004 -> tomorrow!

*/


// grab settings/prefs..
@include ('Clever-404-settings.php');


if (substr($_SERVER['REQUEST_URI'], -1) == "?") {
    die("improper request for non-existent page");
}

// first we do any "catchers", for pages that we have moved/redirected
// gotta do it first, we are sending http "headers"
// using output buffering on a 404 just *feels* wrong, somehow. :/
// (each() was removed in PHP 8, so we use foreach)
foreach ($clever_404['catchers'] as $old_page => $new_page) {
    if (stristr($_SERVER['REQUEST_URI'], $old_page)) {

        // wait for x seconds..
        usleep($clever_404['time_to_jump'] * 1000000);

        if ($clever_404['redirect_testing']) {
            header("HTTP/1.1 302 Temporary Redirect");
        } else {
            header("HTTP/1.1 301 Moved Permanently");
        }
        header('Location: http://'.$clever_404['domain'].$new_page);
        die();
    }
}

// ok, we got a real 404 here.
// probably..


/*
    let's search for the document..   */

// init..
$level = 0;
$count = 0;
$links_array = array();
$links_page = ''; // built up by do_result('out'), tested when the page is output
$full_name = '';
$meta_refresh = '';
$no_scan = false;

// transform scan_path into an array..
$clever_404['scan_path'] = array($clever_404['scan_path']);


// grab the filename parts of the URL string, to be used later..
$insert = rawurldecode(substr($_SERVER['REQUEST_URI'], (strrpos($_SERVER['REQUEST_URI'], '/')+1)));
if ($insert == '') $insert = basename($_SERVER['REQUEST_URI']);
if (strlen($insert) > 255)  $insert = substr($insert, 0, 255); // for levenshtein (i.e. some joker is having a laugh!)
$insert_no_ext = substr($insert, 0, strrpos($insert, '.'));
if ($insert_no_ext == '') $insert_no_ext = $insert; // folders, etc


// attempt a scan-lock, and begin the scan..
if(scan_lock($clever_404['lock_file'])) {
    scan_site();
    scan_unlock();
} else {
    $no_scan = true;
    $clever_404['message_found_NO_matches'] = $clever_404['still_scanning'];
}


// jump on single hit right now?
if (($count == 1) and ($clever_404['jump_on_single_hit'])) {

    switch ($clever_404['jump_method']) {

        case '301':
            sleep($clever_404['time_to_jump']);
            header("HTTP/1.1 301 Moved Permanently");
            header("Location: $full_name");
            die();

        // don't use 307 unless you know what you are doing (passes POST variables onward, and many entities don't GET it!)
        case '307':
        case '302':
            sleep($clever_404['time_to_jump']);
            header('HTTP/1.1 '.$clever_404['jump_method'].' Temporary Redirect');
            header("Location: $full_name");
            die();

        // 'meta' (or any other value) falls back to a meta-refresh..
        case 'meta':
        default:
            $meta_refresh = '<meta http-equiv="refresh" content="'.round($clever_404['time_to_jump'], 0).';URL='.$full_name.'">';
    }
}


/*
    Begin Page..
                    */

if (!$clever_404['embedded']) {
    begin_header();
    echo '
<title>another beautifully caught "page not found" by the 404.php, the intelligent error handler v',$clever_404['version'],'..</title>
<meta name="description" content="',$clever_404['domain'],' 404 page.. intelligent 404 handling with seek-and-return. The non-existent file file" />
<meta name="keywords" content="404,php,404 error,error handler,auto-scan,auto-find,source code available at corz.org" />';
    finish_header();
}
  // you may want to put your header here
echo '
<!--beautifully caught by 404, the non-existent file file, from corz.org-->
<div class="content-wide">
    <div class="two-column">

        <div class="left-column">
            <h1>',$clever_404['message_404'],'</h1>
            If you\'re certain that a page <em>should</em>&nbsp;&nbsp;be here, please <a href="',$clever_404['email_address'],'?subject=404%20-%20',rawurlencode($_SERVER['REQUEST_URI']),'" title="your valuable feedback is appreciated. thanks">tell ',$clever_404['webmaster'],'</a> about it. Alternatively, click <a href="/"
            title="up to the site root">here</a> for some real links.
        </div>

        <div class="right-column">
            <div class="error">404</div>
        </div>

    </div>
    <div class="clear">&nbsp;</div>';

if ($meta_refresh) echo $meta_refresh;

do_result('out');

if ($links_page != '') {
    echo '
    <h2 id="found_matches">',$clever_404['message_found_matches'],'</h2>
        ',$links_page,'
        <div class="tiny-space">&nbsp;</div>';
    if ($clever_404['corzoogle_always'] == true and !empty($clever_404['cz_location'])) corzoogle_box();
} else {
    echo '
    <div id="found_NO_matches">
        <h2>',$clever_404['message_found_NO_matches'],'</h2>
    </div>';
    if (!$no_scan and $clever_404['cz_location']) corzoogle_box();
}
echo'
<div class="half-space">&nbsp;</div>
</div>';


end_error_page();




// show the corzoogle search form..
function corzoogle_box() {
global $clever_404, $insert_no_ext;
    $insert_no_ext = strip_stuff(urldecode($insert_no_ext));
    echo '
<h4>',$clever_404['message_do_a_search'],'</h4>

<div class="centered">
    <a href="https://corz.org/corzoogle/" target="_blank" rel="noopener noreferrer" title="corzoogle locates! (opens in a new window - Apple|Ctrl|whatever-click for a new tab instead)">
    <img src="',$clever_404['cz_img_location'],'" alt="corzoogle locates!" /></a><br />
    <br />
    <form method="get" action="',$clever_404['cz_location'],'">
    <div class="form">
        <input type="text" name="q" size="21" maxlength="256" value="',htmlspecialchars(stripslashes($insert_no_ext), ENT_QUOTES, 'UTF-8'),'" />
        &nbsp;
        <input type="submit" value="do it!" />
    </div>
    </form>
    <div class="small-space">&nbsp;</div>
</div>';
}


// attempt to achieve a scan lock.
// return true if successful..
function scan_lock($lock_file) {

    clearstatcache();
    //$lock_age = @filectime($lock_file);
    // check existence of lock file..
    if (file_exists($lock_file)) {
        // if exists, check date/time
        $lock_age = time() - filectime($lock_file);

        if ($lock_age > 60) {
            // if older than one minute, delete it..
            // (something bad must have happened elsewhere)
            unlink($lock_file);
        } else {
            return false;
        }
    }

    // set lock file..
    $fp = fopen($lock_file, 'wb');
    if (is_writable($lock_file)) {
        if ($fp) {
            $GLOBALS['locked'] = flock($fp, LOCK_EX);
            if ($GLOBALS['locked']) {
                // clearer than fputs, same function..
                fwrite($fp, '1');        // could put their IP in here. hmm. perhaps a lock "folder" one lock for each IP, or 1 file per IP
                //flock ($fp, LOCK_UN);    // but then system /tmp/ may not allow folder creation. hmm.
            }
            fclose($fp); // this releases the file lock!
        }
    }

    // if all is well, return success..
    if (file_exists($lock_file)) {
        return true;
    } else {
        return false;
    }
}


/*
function:scan_site()
for more comments, see corzoogle.php  spider() */
function scan_site() {
global $clever_404, $insert, $insert_no_ext, $level;


    if (!$clever_404['exact_match']) $insert = $insert_no_ext;
    for ($search=0,$search_path=''; $search <= $level; $search++) {
        $search_path .= $clever_404['scan_path'][$search];
        $search_path = str_replace($clever_404['ignore_folders'], '', $search_path);
    }

    $dirhandle = opendir($search_path);
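    // test against false explicitly, or a file named "0" would end the loop early
    while (($file = readdir($dirhandle)) !== false) {

        // $file{0} style offsets were removed in PHP 8
        if ($file[0] != '.') {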

            if (is_file($search_path.$file)) {
                $fext = substr($file,strrpos($file,'.'));
                $itsname = basename($file);
                $short_name = substr($itsname, 0, 0 - strlen($fext));

                if (($clever_404['partial_match']) and (in_array($fext, $clever_404['allowed_extensions']) and (@stristr($file, $insert)))) {
                        do_result($search_path.$file);

                } elseif ($clever_404['fuzzy_match']) {
                    if (in_array($fext, $clever_404['allowed_extensions'])
                        // first we test if a single change gives a match
                        and (similar_text($short_name, $insert) == strlen($short_name)-1)
                            // and test that it's a single replacement..
                            and levenshtein($short_name, $insert) <= $clever_404['fuzziness_level']) {
                            // using two tests allows us to match for dodgy, non-letter
                            // characters and makes things more accurate.
                        do_result($search_path.$file);
                    }
                } else {

                    // non-fuzzy or partial match..
                    if (in_array($fext, $clever_404['allowed_extensions']) and (@stristr($itsname, $insert))) {
                        do_result($search_path.$file);
                    }
                }
            } elseif (is_dir($search_path.$file)) {
                if ($clever_404['match_dirs'] and (!in_array($search_path.$file, $clever_404['ignore_folders'])) and @stristr($search_path.$file, $insert)) do_result($search_path.$file);
                $clever_404['scan_path'][++$level] = ($file.'/');
                scan_site();
                $level--;
            }
        }
    }
    closedir($dirhandle);
}/*    end function:scan_site()
*/



function scan_unlock() {            // Don't lock, so that we can read it later for time info!!!!! r8? :/
global $clever_404;

    // the file lock itself was already released by the fclose() in scan_lock(),
    // (and $fp doesn't exist in this scope), so there is nothing to unlock here.
    // delete lock file
    $deleted = @unlink($clever_404['lock_file']); // @ in case (and this has happened) the system cleaned up the lock file during the scan
                        // the irony is, it manages this because, once written, we don't actually "lock" the lock file!
}                        //  this is by design.        Actually, I'm re-thinking this, testing lock hold 7-12-08



/*
function do_result()    */
function do_result($file) {
global $clever_404, $count, $full_name, $links_page, $links_array;

    if ($file == 'out') {
        // output the page
        foreach($links_array as $link) {
            $links_page .= $link;
        }
    } else {
        $count++;
        $display_name = $title = basename($file);
        $full_name = str_replace($clever_404['scan_path'][0],'http://'.$clever_404['domain'].'/',$file);
        if ($clever_404['links_are_full']) { $display_name = $full_name; }
        array_push($links_array, '<a href="'.$full_name.'" title="'.$display_name.'">'.$display_name."</a><br />\n");
    }
}/*    end function do_result()
*/


/*
function strip_stuff()     */
function strip_stuff($string) {

    $nonos = array('.','..',' .','. ',',',';','[',']','*','~','#','&','?','$','%','+','=','»','«');
    $stripped = str_replace($nonos, ' ', $string);    // remove undesirables

    return trim($stripped);
}/*
end function strip_stuff()     */




/*
    changes..

    I thought I might start keeping changes under the scripts themselves.
    it doesn't cost us anything. php will ignore this.


        1.9.15

        *    central config file: error-settings.php

        *    removed some left-over branding

        *    Added matching 400, 410 and 503 pages


        1.9.11

        *    In the event of the site scan turning up a single match, 404 can now
            redirect with a proper 301 header, just like the catchers. Most
            users wouldn't even realize they got a 404. This basically gives you
            automatic 301 permanent redirects for any pages you move. keep the
            users and spiders happy!

        *    You now can specify the catchers auto-jump method, '301', or
            old-school meta-refresh, in the preferences.

        *    Added scan locking. When 404 is scanning the site, it will place a
            temporary lock file, to prevent crazy bots and site abusers from
            running multiple file scans at once, and potentially stressing the
            server, chewing up resources.

            404 will still display, but with a message telling the user to wait
            a moment before trying again, rather than the usual search results.
            Most folk will never see this in action, but it's good to know it's
            there, preventing potential mishaps.

        *    You can now choose to have 404 return matches for directories.
            so if the user was looking for the non-existent/foo/hell they could
            get back results for /bar/shell scripts/

        *    Fixed the slashes in the corzoogle input (for '' quotes).
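
        The scan locking described above could be sketched along these
        lines. Note this is a simplified variant, not the script's own
        scan_lock(): it uses fopen()'s 'x' mode, which refuses to create a
        file that already exists, rather than checking and then writing.
        try_scan_lock() and the 60 second default are assumptions made for
        the example..

```php
<?php
// Hypothetical helper (not part of Clever 404): returns true if we got the
// lock, false if another scan appears to be running.
function try_scan_lock(string $lock_file, int $max_age = 60): bool
{
    clearstatcache();

    // a lock older than $max_age seconds is treated as stale and removed
    if (file_exists($lock_file) && (time() - filemtime($lock_file)) > $max_age) {
        @unlink($lock_file);
    }

    // 'x' mode fails if the file already exists, so only one caller wins
    $fp = @fopen($lock_file, 'x');
    if ($fp === false) {
        return false;
    }
    fwrite($fp, '1');
    fclose($fp);
    return true;
}

$lock = sys_get_temp_dir() . '/clever-404-example.lock';
if (try_scan_lock($lock)) {
    // ..scan the site here..
    unlink($lock); // release the lock when done
}
```

        A visitor who doesn't get the lock still gets the 404 page, just
        without the search results.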


        1.8

        *    fixed the corzoogle image location, and some other stuff.

        *    Cleaned up distro prefs.

        *    Improved layout, now uses a nice container like my regular pages


        1.7

        *    incorporated partial matching and fuzzy matching; produces great
            results.

        *    cleaned up some xhtml output


        1.6.5

        *    Added some fuzzy matching for the file scan. A sorta request.

        *    this is a highly specialized tweak, but works great as per request.
            you can play around with things to get different results, but as it
            stands, g-dip will match g_dip.jpg, and in my own mirror,
            tempz_piles will match tempx_piles.jpg, etc. This can be
            enabled/disabled from a preference called $clever_404['fuzzy_match'].


        1.6.2-1.6.4

        *    just minor things.


        1.6.2:

        *    Fixed some potential bugs in initialisation.


        1.6.1:

        *    XHTML 1.0 Strict compliance. Nice.


        1.6:

        *    404 will now strip characters from the input string for entry into
            the corzoogle search box. for instance, a 404 for mama.mia.php will
            now enter "mama mia" into the search box, instead of "mama.mia"
            which would likely produce a lot less hits. corzoogle, of course,
            takes the dot into account.

            Added some information to the readme up top, including important
            notes about editing the redirections. I discovered this the hard
            way.
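
        For illustration, the character stripping described in 1.6 boils
        down to something like the sketch below. search_words() is a
        hypothetical stand-in for the script's strip_stuff(), with a shorter
        character list and an extra step to collapse repeated spaces..

```php
<?php
// Hypothetical helper (not part of Clever 404): turn a missing file name
// into plain search words.
function search_words(string $name): string
{
    $nonos = array('.', ',', ';', '[', ']', '*', '~', '#', '&', '?', '$', '%', '+', '=');
    $spaced = str_replace($nonos, ' ', $name);

    // collapse any runs of spaces left behind, then trim the ends
    return trim(preg_replace('/ +/', ' ', $spaced));
}

echo search_words('mama.mia'), "\n";
```

        Whatever ends up in the search field should still be escaped for
        html before it is echoed into the form.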



        :2do..

            lost songs
            redirect lost *.mp3 (or whatever) to a special page
            like the /audio/ root.

*/


?>