LINUX.ORG.RU

wget + blocking downloads from certain URLs



Hello.

I'm pulling statistics with wget: wget -r -l 1 -p -k http://buildserver.example.com/stats

but the top of every page in stats has links to all sorts of unneeded stuff, so instead of

stats/index.html
stats/date1.html
...


I get
stats/index.html
help/index.html
setup/...

and the rest of the junk that the stats pages link to. How do I feed this wonderful wget a list of URLs it should not follow? :-?

Maybe go through some proxy with a filter?

Lumi ★★★★★

Look at the -A and -R options in the man page: they take a list of what to download or, conversely, what not to download. You can use shell-style wildcard patterns. Is that what you need?

wieker ★★
In reply to: comment by wieker

4.2 Types of Files

When downloading material from the web, you will often want to restrict the retrieval to only certain file types. For example, if you are interested in downloading
GIFs, you will not be overjoyed to get loads of PostScript documents, and vice versa.

Wget offers two options to deal with this problem. Each option description lists a short name, a long name, and the equivalent command in `.wgetrc'.

`-A ACCLIST' `--accept ACCLIST' `accept = ACCLIST' The argument to `--accept' option is a list of file suffixes or patterns that Wget will download during recursive
retrieval. A suffix is the ending part of a file, and consists of "normal" letters, e.g. `gif' or `.jpg'. A matching pattern contains shell-like wildcards, e.g.
`books*' or `zelazny*196[0-9]*'.

So, specifying `wget -A gif,jpg' will make Wget download only the files ending with `gif' or `jpg', i.e. GIFs and JPEGs. On the other hand, `wget -A
"zelazny*196[0-9]*"' will download only files beginning with `zelazny' and containing numbers from 1960 to 1969 anywhere within. Look up the manual of your shell for a
description of how pattern matching works.

Of course, any number of suffixes and patterns can be combined into a comma-separated list, and given as an argument to `-A'.

`-R REJLIST' `--reject REJLIST' `reject = REJLIST' The `--reject' option works the same way as `--accept', only its logic is the reverse; Wget will download all files
_except_ the ones matching the suffixes (or patterns) in the list.

So, if you want to download a whole page except for the cumbersome MPEGs and .AU files, you can use `wget -R mpg,mpeg,au'. Analogously, to download all files except
the ones beginning with `bjork', use `wget -R "bjork*"'. The quotes are to prevent expansion by the shell.

The `-A' and `-R' options may be combined to achieve even better fine-tuning of which files to retrieve. E.g. `wget -A "*zelazny*" -R .ps' will download all the files
having `zelazny' as a part of their name, but _not_ the PostScript files.
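
To make the suffix-versus-pattern distinction concrete, here is a rough shell sketch of the matching rule as the text above describes it. The helper name `matches` and the sample filenames are made up for illustration; this only models the rule and is not how wget itself is implemented.

```shell
# Hypothetical helper: succeeds if filename $1 matches list element $2.
# An element containing a wildcard (*, ? or [) is treated as a pattern
# for the whole name; anything else is treated as a suffix.
matches() {
    case "$2" in
        *\**|*\?*|*\[*)
            case "$1" in $2) return 0 ;; esac ;;
        *)
            case "$1" in *"$2") return 0 ;; esac ;;
    esac
    return 1
}

# Model of: wget -A "*zelazny*" -R .ps
for f in zelazny-1965.txt zelazny-1965.ps bjork.mp3 photo.jpg; do
    if matches "$f" '*zelazny*' && ! matches "$f" '.ps'; then
        echo "keep $f"
    else
        echo "skip $f"
    fi
done
```

Only `zelazny-1965.txt` is kept here: the `.ps` file is accepted by `-A` but then rejected by `-R`, and the other two never match the accept pattern.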

Note that these two options do not affect the downloading of HTML files (as determined by a `.htm' or `.html' filename suffix). This behavior may not be desirable for
all users, and may be changed for future versions of Wget.

Note, too, that query strings (strings at the end of a URL beginning with a question mark, `?') are not included as part of the filename for accept/reject rules, even
though these will actually contribute to the name chosen for the local file. It is expected that a future version of Wget will provide an option to allow matching
against query strings.

Finally, it's worth noting that the accept/reject lists are matched _twice_ against downloaded files: once against the URL's filename portion, to determine if the file
should be downloaded in the first place; then, after it has been accepted and successfully downloaded, the local file's name is also checked against the accept/reject
lists to see if it should be removed. The rationale was that, since `.htm' and `.html' files are always downloaded regardless of accept/reject rules, they should be
removed _after_ being downloaded and scanned for links, if they did match the accept/reject lists. However, this can lead to unexpected results, since the local
filenames can differ from the original URL filenames in the following ways, all of which can change whether an accept/reject rule matches:

* If the local file already exists and `--no-directories' was specified, a numeric suffix will be appended to the original name.

* If `--html-extension' was specified, the local filename will have `.html' appended to it. If Wget is invoked with `-E -A.php', a filename such as `index.php' will
be accepted, but upon download will be named `index.php.html', which no longer matches, and so the file will be deleted.

* Query strings do not contribute to URL matching, but are included in local filenames, and so _do_ contribute to filename matching.

This behavior, too, is considered less-than-desirable, and may change in a future version of Wget.
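
The `-E -A.php` case from the list above can be traced by hand. The sketch below only models the two suffix checks with a shell pattern; `has_suffix` is a made-up helper, not anything wget provides.

```shell
accept=".php"                   # as in `-A.php`
url_name="index.php"            # filename portion of the URL
local_name="${url_name}.html"   # local name after -E / --html-extension

# Hypothetical helper: succeeds if $1 ends with $2.
has_suffix() {
    case "$1" in *"$2") return 0 ;; esac
    return 1
}

# First check: against the URL's filename, before downloading.
has_suffix "$url_name" "$accept" && echo "$url_name: accepted, downloaded"

# Second check: against the local filename, after downloading.
has_suffix "$local_name" "$accept" || echo "$local_name: no longer matches, removed"
```

Both messages are printed, which is the surprising result the manual warns about: the file is fetched and then deleted.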

wieker ★★

There's also --exclude-directories, if you need to exclude specific directories. It supports wildcards too.
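
Applied to the command from the original question, that might look like the sketch below. The directory names /help and /setup are taken from the listing in the question, and the trailing slash on the URL is my addition. The second variant uses --no-parent instead, which only helps if the URL ends in a slash, because otherwise wget treats `stats` as a file in `/`. The commands are printed rather than run, so they can be checked first.

```shell
url="http://buildserver.example.com/stats/"   # trailing slash added on purpose

# Variant 1: skip the unwanted directories explicitly.
cmd_exclude="wget -r -l 1 -p -k -X /help,/setup $url"

# Variant 2: never go above the starting directory.
cmd_noparent="wget -r -l 1 -p -k --no-parent $url"

# Print the commands instead of running them.
echo "$cmd_exclude"
echo "$cmd_noparent"
```

Variant 1 needs the list of directories kept up to date; variant 2 does not, but it will also refuse anything else outside /stats/.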

drull ★☆☆☆
In reply to: comment by nikolayd

Seems to work for me:

drull@ubuntu:/var/www/localhost$ ls
drull@ubuntu:/var/www/localhost$ ls ../default/*
../default/dir1:
1

../default/dir1234:
1234

../default/dir2:
2

../default/dir3:
3
drull@ubuntu:/var/www/localhost$ wget -r -l 1 -p -k --quiet -X /dir1/,/dir2/ localhost
drull@ubuntu:/var/www/localhost$ ls localhost/*
localhost/index.html

localhost/dir1234:
index.html

localhost/dir3:
index.html
drull@ubuntu:/var/www/localhost$
It does not download the /dir1/ and /dir2/ directories. It does download /dir1234/.
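
In other words, the -X entries behave like whole directory names, not string prefixes. A simplified model that is consistent with the transcript above (the `excluded` helper is made up, and this is not wget's actual code):

```shell
# Hypothetical model: directory $1 is excluded by entry $2 if it is
# that directory or lies beneath it. Trailing slashes are ignored.
excluded() {
    dir=${1%/}
    entry=${2%/}
    case "$dir" in
        "$entry"|"$entry"/*) return 0 ;;
    esac
    return 1
}

# Model of: -X /dir1/,/dir2/
for d in /dir1 /dir2 /dir1234 /dir3; do
    if excluded "$d" /dir1/ || excluded "$d" /dir2/; then
        echo "$d: skipped"
    else
        echo "$d: downloaded"
    fi
done
```

/dir1234 is downloaded because it merely starts with the same characters as /dir1 and is a different directory.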

drull ★☆☆☆
In reply to: comment by drull

Ignore the index.html files: that is just the autoindex module being enabled.

drull ★☆☆☆