Where to get WGET

WGET is free GNU software for UNIX, written by Hrvoje Niksic (E-mail: [email protected]). It is used for manual and automatic mirroring of FTP and web sites. Various versions are available via the following links:
Using WGET from behind a firewall

GUI WGET - using a Tcl/Tk Script ("Utter Coolness")

From Paul Rahme (Mr Bogus), South Africa. E-mail: [email protected]
WGET for Grabbing files off VMS FTP Sites

Patch and version by: Jan Prikryl - [email protected], http://www.cg.tuwien.ac.at/staff/JanPrikryl.html, Institute of Computer Graphics and Visualisation, Vienna University of Technology, Austria
Possible Alternatives to WGET - PAVUK, FWGET and LFTP
Optimising and Compiling WGET

By default, WGET converts the ~ (tilde) into a numeric equivalent, which can create a bit of havoc when trying to do auto-mirroring as described here. To have it leave the ~ as is, the following is done (thanks to Hrvoje for explaining how to do this):

For the obsolescent WGET 1.5.3
For WGET 1.7.1 and later
The .wgetrc Config File

After compiling WGET, you need to install a .wgetrc file in your home area. The WGET manual goes into detail about the available options and how to set up the .wgetrc file. The main things to note are the options which vary from the defaults.
(To disable robots.txt handling via the command line, add --execute robots=off.) Click here to view the simple CCP14 .wgetrc file.
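As an illustration only (this is not the actual CCP14 file, and the comments are editorial), a minimal mirroring-oriented .wgetrc might look like the following; all of the option names are documented in the WGET manual:

```
# Illustrative .wgetrc for mirroring (not the actual CCP14 file)
robots = off        # ignore robots.txt exclusions while mirroring
timestamping = on   # only re-fetch files newer than the local copy
recursive = on      # follow links/directories recursively
tries = 3           # retry flaky connections a few times
dirstruct = on      # recreate the remote directory structure locally
```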
WGET Manual Files (from wget-1.5.3 - September 1998)

Windows Help File format (in Zip format - from wget-1.5 - April 25th 1998)
Setting Up Script Files and Cron Jobs for Automirroring

Automirroring is still under construction. An alpha version of WGET that is VMS friendly is being used to automatically mirror VMS FTP servers (this affects GSAS, the Larry Finger FTP site and Profil).

Automirroring is run daily (or less frequently if requested by the remote web-master) using cron (~1am to ~5am UK time). Depending on the site, some customisation can be required: either to link the web to the FTP site, or to overcome "non-recommended" HTML in the files that WGET is not happy with.

The reason for the above mirroring times is that grabbing files over the US link (from the UK) incurs a charge (~2 pence per megabyte), but there is a free period between 1am and 5am. Automirroring is therefore rigged to occur between these hours - though even if done outside these hours, the amount of material coming over for incremental mirroring is quite trivial. Just before 5am, all wget jobs are killed and the automatically emailed summary log is quickly examined to see which jobs did not complete. These can be swapped around so they do complete within the required time.
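The scheduling described above can be sketched as a pair of crontab entries; the paths and exact minutes below are hypothetical, not the actual CCP14 setup:

```
# Hypothetical crontab: start the mirror scripts at 1:01am,
# then kill any still-running wget jobs just before the 5am cutoff
01 01 * * * /mirror/bin/run-all-mirrors
55 04 * * * /usr/bin/killall wget
```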
Defining what each WGET option does

(Credit goes to the IUCr Technical mirroring page at http://www.iucr.org/iucr-top/docs/mirror/)
Efficient Web/FTP Mirroring - using WGET

Efficient HTTP Based Web Mirroring - using WGET

Generally, HTTP based mirroring is not very efficient for incremental updates, as it requires that every web file be "pinged" to check whether it has changed. However, if you have to do some processing of these files, or use their internal links to get off-domain files, it is quite effective. On small sites, HTTP based mirroring is easy to do and not a problem, but on large sites it can take most of the day to do a daily or weekly incremental mirror.
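A typical HTTP mirroring command might look like the following; the host name and paths are placeholders for illustration, not a real CCP14 job:

```
# Hypothetical HTTP mirror job:
#   -m  = mirror mode (recursion plus timestamping)
#   -np = do not ascend to the parent directory
#   -k  = convert links in fetched HTML for local browsing
#   -o  = write a log file for the morning summary report
wget -m -np -k http://www.example.org/software/ \
     -P /mirror/web-mirrors/example \
     -o /mirror/logs/example.log
```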
Efficient FTP Based Web/FTP Mirroring - using WGET

FTP mirroring can be far more efficient: to get an incremental update of a web or FTP site, a directory listing is all WGET requires to check whether files have been updated (compared to HTTP mirroring, where it has to ping every file to do this check). While FTP based mirroring can be an order of magnitude or more efficient than HTTP based mirroring, one downside is that WGET cannot presently interrogate the HTML files grabbed by FTP to convert any absolute links to relative links, etc. This method also assumes all the files are available by grabbing the FTP tree. For web areas not accessible via public/anonymous FTP, just use password protected access.
#Armel Le Bail's WEB Site - via password protected FTP
# E-mail: [email protected] (Armel LeBail)
$PROGDIR/wget -r -nH --cut-dirs=4 -N -nr -l0 -np --cache=off \
 ftp://username:[email protected]//home/armel/doobry/webdocs \
 -P $PUTDIR/web-mirrors/armel \
 -o $LOGDIR/armel.log

Notice how --cut-dirs=4 is used to make the four-level-deep directory that houses the web pages look like the root area of the mirrored pages at ../ccp/web-mirrors/armel/. Otherwise the mirror would end up at ../ccp/web-mirrors/armel/home/armel/doobry/webdocs.

It is also possible to create an account where all that can be performed is read-only FTP, and where the web area to be mirrored looks like a root FTP area - as though it were an anonymous FTP area, but with password access restriction. On an SGI IRIX system this can initially be a fiddly job to set up, but it is manageable once you know how (preferably with access to a local UNIX/network admin guru); man ftpd explains how to set this up in quite some detail.
ULTRA Efficient netlib style Co-operative Mirroring - NOT Presently in WGET

The basis of this system is that the server being mirrored co-operates by providing information on what files have been modified over time. Refer to the following published article on this.

From [email protected] Thu Jun 11 23:41:38 1998
Date: Thu, 11 Jun 1998 18:30:54 -0400
To: [email protected]
From: "Eric Grosse" <[email protected]>
Subject: re: More Detailed Spec on your Mirroring Method?

There is the paper

@Article{Grosse:1995:RM,
  author   = "Eric Grosse",
  title    = "Repository Mirroring",
  journal  = "ACM Trans. Math. Software",
  volume   = "21",
  number   = "1",
  pages    = "89--97",
  month    = mar,
  year     = "1995",
  CODEN    = "ACMSCU",
  ISSN     = "0098-3500",
  bibdate  = "Tue Apr 25 15:42:31 1995",
  URL      = "ftp://netlib.bell-labs.com/netlib/crc/mirror.ps.gz",
  keywords = "C.2.4 [Computer-Communication Networks]: Distributed Systems -- distributed databases",
  subject  = "archives; checksum; distributed administration; electronic distribution; ftp"
}

The basic idea is to publish (as an ordinary file on the web/ftp site) a simple listing of the entire collection as

    pathname unixtime bytes checksum

and then provide utilities to compare two such files, generating commands to bring the "slave" into sync with the "master." The customary shell or perl tools can be used to filter these files for complicated master/slave relationships, or you can use it straight, for straightforward mirroring.

If I were redoing this today, I might use persistent HTTP in place of ftp, and might merge netlib's MD5 files into these "crc" files. But that would be pushing farther forward than some people want to go just yet; ftp and crc are still the right conservative choice.

You're welcome to repost this to the wget mailing list if you like.

Eric
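The listing-comparison idea can be sketched with standard shell tools. The file names, paths and checksums below are invented purely for illustration; a real implementation would generate the listings from the actual collections:

```shell
#!/bin/sh
# Sketch of netlib-style listing comparison (all data invented).
# Each listing line has the form: pathname unixtime bytes checksum

# The "master" listing, published by the server being mirrored
cat > master.list <<'EOF'
pub/a.txt 900000000 120 aaaa1111
pub/b.txt 900000500 340 bbbb2222
pub/c.txt 900001000 560 cccc3333
EOF

# The "slave" listing, generated locally from the mirror
cat > slave.list <<'EOF'
pub/a.txt 900000000 120 aaaa1111
pub/b.txt 899999000 330 bbbb0000
EOF

# comm(1) needs sorted input; comm -23 prints records present only in
# the master, i.e. files that are new or changed and must be fetched
sort master.list > master.sorted
sort slave.list  > slave.sorted
comm -23 master.sorted slave.sorted | awk '{print "get " $1}'
```

Running this prints `get pub/b.txt` and `get pub/c.txt`: b.txt has changed (different time, size and checksum) and c.txt is missing from the slave, while the unchanged a.txt is skipped entirely.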
rsync style Co-operative Mirroring - NOT Presently in WGET

While not present in WGET yet, another option for high-efficiency co-operative mirroring is the rsync software. The server you wish to mirror must have an rsync server running. These are not difficult to install and configure.
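If the remote site does run an rsync daemon, a single command performs the incremental update; the module name and paths here are placeholders, not a real mirror job:

```
# Hypothetical rsync mirror job:
#   -a       = archive mode (recurse, preserve times and permissions)
#   -v       = verbose listing of transferred files
#   --delete = remove local files that were deleted on the master
rsync -av --delete rsync://rsync.example.org/webdocs/ /mirror/web-mirrors/example/
```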
Logging Details of Nightly Incremental Updates

To keep control of the incremental nightly mirroring, the script files log the time when a backup has finished into a single file, which is then emailed each morning to the operator. This makes any problems that occurred the evening before quite obvious. A report-proforma.txt text file is kept up to date with the sites being mirrored. At the start of the cron job, a new report.txt file is created:

01 01 * * * cp web_area/mirrorbin/logs/report-proforma.txt web_area/mirrorbin/logs/report.txt

The script file run after each incremental WGET script is of the following type:

#ANSTO LHPM Rietveld FTP Area
# E-mail: [email protected] (Brett Hunter)
$PROGDIR/wget -nH -r -N -nr -l0 -np -X /cgi-bin \
 ftp://ftp.ansto.gov.au/pub/physics/neutron \
 -P $PUTDIR/ccp14/ftp-mirror/ansto \
 -o $LOGDIR/australia.log
set DATE=(`date`)
sed "/ANSTO_LHPM_FTP/s/NOT_finished/COMPLETED $DATE/" $LOGDIR/report.txt > $LOGDIR/report.txt.new
mv $LOGDIR/report.txt.new $LOGDIR/report.txt

This substitutes the relevant "NOT_finished" with "COMPLETED" plus the date-time, such that jobs left unfinished due to network/domain problems are relatively obvious.
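The report-update step can be demonstrated in isolation; the job names and report contents below are invented for the demo:

```shell
#!/bin/sh
# Build a toy report file (contents invented for this demo)
cat > report.txt <<'EOF'
ANSTO_LHPM_FTP has NOT_finished
Jana_Web has NOT_finished
EOF

DATE="Thu Aug 6 01:38:12 BST 1998"

# Mark only the line tagged ANSTO_LHPM_FTP as completed; the sed
# address restricts the substitution, so other jobs stay NOT_finished
sed "/ANSTO_LHPM_FTP/s/NOT_finished/COMPLETED $DATE/" report.txt > report.txt.new
mv report.txt.new report.txt
cat report.txt
```

After the run, the ANSTO line reads "has COMPLETED Thu Aug 6 01:38:12 BST 1998" while the Jana_Web line still reads "has NOT_finished", so incomplete jobs stand out in the mailed report.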
EUROPE_ONE
Fullprof_PLOTR_FTP has COMPLETED Thu Aug 6 01:38:12 BST 1998
PLOTR_Web has COMPLETED Thu Aug 6 01:38:28 BST 1998
Simref_Simpro_MEED has COMPLETED Thu Aug 6 01:38:47 BST 1998
FIT2D_FTP has COMPLETED Thu Aug 6 01:38:49 BST 1998
FIT2D_Web has COMPLETED Thu Aug 6 01:41:32 BST 1998
Jana_FTP has COMPLETED Thu Aug 6 01:41:55 BST 1998
Jana_Web has NOT_finished
Stefan_Krumm_FTP has COMPLETED Thu Aug 6 01:42:40 BST 1998
Stefan_Krumm_Web has COMPLETED Thu Aug 6 01:43:27 BST 1998
XND_Rietveld_FTP has COMPLETED Thu Aug 6 01:43:33 BST 1998
XND_Rietveld_Web has COMPLETED Thu Aug 6 01:44:02 BST 1998
Sirware_Web has COMPLETED Thu Aug 6 01:45:58 BST 1998
XPMA_Zortep_Web has COMPLETED Thu Aug 6 01:46:00 BST 1998
DMPLOT_FTP has COMPLETED Thu Aug 6 01:46:13 BST 1998
DIRDIF_FTP has COMPLETED Thu Aug 6 01:46:20 BST 1998
CRUNCH_Web has COMPLETED Thu Aug 6 01:47:17 BST 1998
DIRDIF_Web has COMPLETED Thu Aug 6 01:47:45 BST 1998
DRXWIN_FTP has COMPLETED Thu Aug 6 01:47:47 BST 1998
DRXWIN_Web has COMPLETED Thu Aug 6 01:48:56 BST 1998
AXES_FTP has COMPLETED Thu Aug 6 01:49:02 BST 1998
BGMN_Web has COMPLETED Thu Aug 6 01:49:53 BST 1998
ORTEX_Suite_Web has COMPLETED Thu Aug 6 01:50:03 BST 1998

All jobs are terminated before 5am (UK time), and the report is then mailed to the operator using the following command in cron:

05 05 * * * /usr/sbin/Mail -s "Mirroring_Results `date`" [email protected] < web_area/mirrorbin/logs/report.txt