User:Wnt/Python script to grab multiple files

This is a crude Python 2.7.13 script that was useful for downloading multiple files/pages from a site. The pages are listed one per line in input.txt, with the full URL for each (including http: or https:); I was just using a spreadsheet to generate the numbered URLs. An example input.txt is shown below. I wanted to keep this around in case I lose it before I need it again, and maybe it can help someone else. I wrote it under 2.7.13; it didn't work on 2.7.9, where the stricter SSL handling adopted in the wake of Heartbleed prevented the https handshake. It also doesn't work in Python 3.x as written, because urllib2 was split into urllib.request and urllib.error there, so porting takes more than just deleting the 2; a rough Python 3 sketch appears after the script.
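
For example, input.txt might look like this (the URLs here are hypothetical placeholders; use the real pages you want, one full URL per line):

https://example.org/scans/page001.png
https://example.org/scans/page002.png
https://example.org/scans/page003.png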

# -*- coding: utf-8 -*-
from __future__ import print_function  # so the parenthesized print calls behave as intended on 2.7
import sys, os, time, urllib2

file_base = os.path.dirname(__file__)  # directory containing this script

input_loc = os.path.join(file_base, 'input.txt')
try:
    input_file = open(input_loc, 'r')
    print('reading:', input_loc, '\n')
    snarf = input_file.read()  # this should be a fairly short file!
    input_file.close()
    urls = [u.strip() for u in snarf.splitlines() if u.strip()]  # skip blank lines
except IOError:
    sys.exit('Input file input.txt not found in program directory')

for url in urls:

    time.sleep(0.5)  # be polite: pause half a second between requests
    temp = url.split('/')
    filename = temp[-1]  # the last path component becomes the local filename
    print('filename is', filename)
    del temp[-1]
    linkbase = '/'.join(temp) + '/'
    print('linkbase is', linkbase)
    output_loc = os.path.join(file_base, filename)

    try:
        print('trying to open output')
        output_file = open(output_loc, 'wb')
    except IOError:
        sys.exit('Failed to open output: ' + output_loc)

    # some sites refuse the default urllib2 user agent, so masquerade as a browser;
    # linkbase + filename just reconstructs the original url
    req = urllib2.Request(linkbase + filename,
                          headers={'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'})
    response = urllib2.urlopen(req)
    print('opened url')
    html = response.read()
    print(len(html), 'bytes read\n')
    output_file.write(html)
    output_file.close()
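
For anyone on Python 3, here is a minimal sketch of the same loop; this is an illustrative port rather than a tested replacement for the script above. The key change is that urllib2's Request and urlopen now live in urllib.request, and print is a function. It assumes the same one-URL-per-line input.txt.

# -*- coding: utf-8 -*-
import os, sys, time
import urllib.request

file_base = os.path.dirname(__file__)

try:
    with open(os.path.join(file_base, 'input.txt')) as f:
        urls = [u.strip() for u in f if u.strip()]  # skip blank lines
except IOError:
    sys.exit('Input file input.txt not found in program directory')

for url in urls:
    time.sleep(0.5)  # pause half a second between requests
    filename = url.rsplit('/', 1)[-1]  # last path component
    print('fetching', url, '->', filename)
    # same browser masquerade as the 2.x version, shortened here
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req) as response:
        data = response.read()
    with open(os.path.join(file_base, filename), 'wb') as out:
        out.write(data)
    print(len(data), 'bytes written')

As with the 2.x script, place it in the directory that holds input.txt; downloads land next to the script.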