Monday, April 2, 2012

Downloading all links from a page

You can combine the best of two worlds (JavaScript and Python) to quickly download all the links from a page. As an example, I'm downloading all the files from this page: http://www.inf.ufes.br/~alberto/proc-par/index.html

First, in Chrome, open the Developer Tools (Ctrl+Shift+J) and click the Console tab.

Then run a bit of JavaScript:

var links = document.getElementsByTagName('a');
for (var i = 0; i < links.length; i++) { console.log(links[i].href); }


And you should get something like this:
http://www.inf.ufes.br/~alberto/proc-par/programa_processamento_paralelo-2012-1.doc
http://www.inf.ufes.br/~alberto/proc-par/Lista_de_Exercicios_de_Processamento_Paralelo.doc
http://www.inf.ufes.br/~alberto/proc-par/Lista_de_Exercicios_de_Processamento_Paralelo2.doc
http://www.inf.ufes.br/~alberto/proc-par/avaliacao_da_aprendizagem-mestrado.doc
http://www.inf.ufes.br/~alberto/proc-par/aula1.doc
http://www.inf.ufes.br/~alberto/proc-par/aula2.doc
http://www.inf.ufes.br/~alberto/proc-par/aula3.doc
...

Copy the output and paste it into a string in a Python file:

links = """
http://www.inf.ufes.br/~alberto/proc-par/programa_processamento_paralelo-2012-1.doc
http://www.inf.ufes.br/~alberto/proc-par/Lista_de_Exercicios_de_Processamento_Paralelo.doc
http://www.inf.ufes.br/~alberto/proc-par/Lista_de_Exercicios_de_Processamento_Paralelo2.doc
http://www.inf.ufes.br/~alberto/proc-par/avaliacao_da_aprendizagem-mestrado.doc
http://www.inf.ufes.br/~alberto/proc-par/aula1.doc
http://www.inf.ufes.br/~alberto/proc-par/aula2.doc
http://www.inf.ufes.br/~alberto/proc-par/aula3.doc
....
"""

# Finally download the files

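import urllib.request  # Python 3; Python 2 used urllib.urlretrieve

for link in links.split():
    if not link.startswith('http'):
        continue  # skip non-URL tokens such as the '....' placeholder
    filename = link.split('/')[-1]
    print('downloading', filename)
    urllib.request.urlretrieve(link, filename)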
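
The console prints every link on the page, so the pasted list may also contain entries that are not files, such as navigation links. Here is a minimal sketch of filtering the list before downloading, assuming you only want certain extensions. The function name file_links and the default extension tuple are illustrative choices, not part of the original script:

```python
def file_links(text, extensions=('.doc', '.ppt', '.pptx', '.pdf')):
    """Return the URLs in text that end with one of the given extensions."""
    selected = []
    for token in text.split():
        # Compare in lower case so that '.PDF' and '.pdf' are treated alike.
        if token.startswith('http') and token.lower().endswith(extensions):
            selected.append(token)
    return selected
```

Pass it the pasted string and loop over the returned list instead of over every whitespace-separated token.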

Result:


>>>
downloading programa_processamento_paralelo-2012-1.doc
downloading Lista_de_Exercicios_de_Processamento_Paralelo.doc
downloading Lista_de_Exercicios_de_Processamento_Paralelo2.doc
downloading avaliacao_da_aprendizagem-mestrado.doc
downloading aula1.doc
downloading aula2.doc
downloading aula3.doc
downloading aula4.doc
downloading aula5.doc
downloading aula6.doc
downloading Cs252s06-lec04-cache-alberto.ppt
downloading Cs252s06-lec07-dynamic-sched-alberto.ppt
downloading Cs252s06-lec08-dynamic-schedB-alberto.ppt
downloading Cs252s06-lec09-limitsSMT-alberto.ppt
downloading L16-Vector-alberto.pptx
downloading L17-VectorII-alberto.pptx
downloading sbac-tutorial-cuda-15.ppt
downloading Slides%20-%20Chapter%203-Alberto.ppt
downloading Slides%20-%20Chapter%204-Alberto.ppt
downloading Slides%20-%20Chapter%205-alberto.ppt
downloading Slides%20-%20Chapter%206-alberto.ppt
downloading Cs252s06-lec12-SMP-alberto.ppt
downloading Cs252s06-lec13-sharedaddress-alberto.ppt
downloading openmp-ntu-vanderpas.pdf
downloading openmp-spec25.pdf
>>>
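
Some names in the output, such as Slides%20-%20Chapter%203-Alberto.ppt, keep the URL's percent-encoding. Here is a small sketch of decoding the name before saving, using the standard urllib.parse module (Python 3). The helper name filename_from_url is hypothetical:

```python
from urllib.parse import unquote, urlsplit

def filename_from_url(link):
    """Return the last path segment of link with percent-escapes decoded."""
    path = urlsplit(link).path  # drops any ?query or #fragment part
    return unquote(path.split('/')[-1])
```

With this helper, the %20 sequences in the name become plain spaces in the saved file name.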


Remember that the files are saved in the current working directory, which is the script's own folder if you run it from there.
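
To control where the files end up, the target path can be built explicitly. Here is a minimal sketch, assuming Python 3. The names target_path and download_to and the downloads folder are hypothetical, not from the original script:

```python
import os
import urllib.request

def target_path(link, directory):
    """Build the local path for link inside directory."""
    return os.path.join(directory, link.split('/')[-1])

def download_to(links, directory='downloads'):
    """Download every URL in links into directory, creating it if needed."""
    os.makedirs(directory, exist_ok=True)
    for link in links:
        path = target_path(link, directory)
        print('downloading', path)
        urllib.request.urlretrieve(link, path)
```

Calling download_to with the list of URLs then writes everything under the chosen folder, wherever the script is started from.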