Been a while since my last post; I haven't been as faithful as I'd like. The good news is, I've been working on a few things and getting more into doing things with the web. Recently, I wanted to get a better understanding of regular expressions, and what better way than to parse html.
This post, however, is not about regex; that will come in my next post, where I introduce a lib for playing internet radio.
So, onto the topic at hand. PyBrowse is a simplified web resource grabber. I like to get webpages and extract the info I'm interested in, and using python makes that really simple.
This first started as a way for me to complete programming challenges on hacktivism sites like HackThisSite, HellBoundHackers, and SecurityOverride, all of which are really good sites for learning everything hack related, well, most everything. They have plenty of content on web and software hacking, but very little on hardware endeavors.
Anyway, those sites offer challenges ranging from SQL injection to cracking applications with debug breakpoints and such. Those are great and all, but the ones I'm after are the programming/timed challenges. In the timed challenges, you have to write a program to log in, find the challenge, grab the needed data, and POST back the results, all within 5-30 seconds.
That might not seem too difficult to some, but you have to handle cookies properly, parse through the html, calculate the answer (anything from simple math to breaking encryption), and POST back the results, all within that little time frame.
Some of you unfamiliar with working with the web might be wondering why I keep capitalizing "post". GET and POST are request methods in HTTP. There are others, but they fall outside the scope of this article. To GET a page, you simply request the page, the server gives it to you, and the connection is dropped. When you POST something, you're giving the server some data to act on. It could be login information, data you filled out in some forms, or a search query; there could be a host of things the server wants.
Okay, now that we have a better understanding of what we want to do (and before I bore you to death) it’s time to jump into some python.
To work with the web, python provides us with some useful modules. On python 2.7 and earlier, we have urllib, urllib2, and cookielib. We will also be using BeautifulSoup, which is not part of the standard library and will need to be downloaded.
Urllib and urllib2 will get the webpages, cookielib will handle the session and cookies, and beautifulsoup will aid in sorting through all the ugly html.
An example of using urllib:
import urllib

url = "http://www.google.com/index.html"
req = urllib.urlopen(url)
print req.read()

Simple, eh? We won't be opening our webpages with urllib though; we will be using it to encode the POST data. So we will see it more like this:
data = {"user_name":"My_Name", "user_pads":"My_Guessable_Password" "login":"Login"} endata = urllib.urlencode(data)You might be wondering what that dict is, and why we chose the values that we did. Well, go to a website that requires a login like SecurityOverride.net and look at the source of the page, the keyboard shortcut is usually ctrl+u. Look for the tags and forms for the logins. You notice they usually two text boxes, one for the user name and one for the password, and one called submit that has a name of “login” and a value of “Login”. Those are where we will POST the data into and virtually click the submit button. You can find more info on how html is set up over at w3schools.
We will use urllib2 the same way we did urllib in the first example. Here are some examples:
#using endata from the previous example
import urllib2

url = "http://www.hellboundhackers.org/index.php"

#GET page
req = urllib2.urlopen(url)
with open("test1.html", "w") as f:
    f.write(req.read())

#POST to page
req = urllib2.urlopen(url, endata)
with open("test2.html", "w") as f:
    f.write(req.read())

Not too hard, right? Okay, what happens if we try to get a page only visible to members? It won't work, because the server doesn't know you're still you. This is where cookies come in.
To use cookies, we create a page opener that stores your browsing session. In order for us to stay logged in, we need to use that opener with every request. To do this we create a CookieJar to store things. Here is a quick example:
import urllib2
import cookielib

ref = "http://www.hellboundhackers.org/index.php"
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders.append(('User-agent', 'Mozilla/4.0'))
opener.addheaders.append(('Referer', ref))

You might be wondering what all that header stuff is. It turns out websites don't like it when their pages get accessed by a program that isn't a browser. So we spoof being a real browser by sending header information. This is also where we set the Referer URL.
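To tie the last few pieces together, here's a rough sketch of a complete login flow using the opener. The URLs and form field names below are placeholders I made up; check the site's actual login form (view the source, or use the getInputs method further down) before trusting them.

import urllib
import urllib2
import cookielib

#placeholder URLs, swap in the real site
login_url = "http://www.example.com/login.php"
members_url = "http://www.example.com/members.php"

#cookie-aware opener, spoofing a browser like before
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders.append(('User-agent', 'Mozilla/4.0'))
opener.addheaders.append(('Referer', login_url))

#POST the login form; the field names depend on the site's html
endata = urllib.urlencode({"user_name": "My_Name",
                           "user_pass": "My_Guessable_Password",
                           "login": "Login"})
opener.open(login_url, endata)

#the CookieJar now holds the session cookie, so a members-only
#page fetched through the same opener should come back logged in
page = opener.open(members_url).read()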
Okay, we have a way to get pages, a way to give data back to a page, and a way to handle cookies to make sure we stay in the same session. We now need an easy way of deciphering all that html we have. This is where BeautifulSoup comes in. Sure, we could use regex or old fashioned string manipulation, but there are many, many, many times where the data gets complex. And on top of that, not all sites will give you correct formatting or tagging. BeautifulSoup makes things 100 times easier because it does all the error checking and corrects malformed tags.
Here is an example of beautifulsoup getting all the links (a href tags) from a page:
from BeautifulSoup import BeautifulSoup

#assume we have the page from the previous example
soup = BeautifulSoup(page)
tags = soup.findAll('a')
sorted_tags = [i['href'] for i in tags if i.has_key('href')]
for tag in sorted_tags:
    print tag
    print "-"*30

With 4 lines of code, we have scraped all the links on a given page. You could even cut it down to 2 lines with more list comprehension, but then it would start getting ugly and unreadable.
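For the curious, the condensed version would look something like this, which is exactly why I keep it spread out:

soup = BeautifulSoup(page)
print "\n".join(i['href'] for i in soup.findAll('a') if i.has_key('href'))

It does the same thing, it just isn't anywhere near as readable.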
Welp folks, there you have it. Now that we have a better understanding, let’s put all that knowledge into a module.
This one will also find input tags, so it's easier to find POST data and see what sites are really asking for. I give an example of how to use it in the docstring, and have already given examples of how the different parts work, so I will leave you with this:
# pybrowse.py
# Author K.B. Carte
# Dec. 14, 2011
#
# Class to handle simple web requests and parsing.
# Can be used as a base class for web crawlers
# or finding login fields and expected values.
#
# TODO:
#   add more tag parsing, such as embedded video/flash,
#   images, mp3s/audio, direct download links, etc.

from BeautifulSoup import BeautifulSoup
import urllib
import urllib2
import cookielib


class Browser:
    '''Simple class to fetch internet resources.
    Can get pages, post to pages, grab all external
    or local links on a page, grab all input forms
    hidden or not, and handle cookie sessions.

    Usage:
        bro = Browser()
        #great site btw
        url = "http://www.newgrounds.com/index.html"
        bro.openerSession(url)
        page = bro.getPage(url)
        raw, inp = bro.getInputs(page)
        for item in inp:
            for k, v in zip(item.keys(), item.values()):
                print k, v
            print "-"*30
    '''

    def openerSession(self, ref):
        '''Aids in cookie handling for browsing sessions,
        not entirely needed, but some sites need cookies,
        such as forums/bbs sites.'''
        cj = cookielib.CookieJar()
        self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        self.opener.addheaders.append(('User-agent', 'Mozilla/4.0'))
        self.opener.addheaders.append(('Referer', ref))

    def getPage(self, url):
        '''Self explanatory.'''
        get_req = self.opener.open(url)
        return get_req.read()

    def postPage(self, url, data):
        '''POST data to a page. data is a dict containing
        the values you need to encode, i.e. username,
        password, search query, etc. You can use the
        getInputs method to look for the tag names and
        values needed for POSTing.'''
        endata = urllib.urlencode(data)
        get_req = self.opener.open(url, endata)
        return get_req.read()

    def getInputs(self, page):
        '''Get all the input tags from the page.
        Returns a tuple with the raw tags and the
        common attributes of the tags:
        ([raw tags], [{common attrib}])
        common attrib: Type, Name, Value of tag'''
        soup = BeautifulSoup(page)
        inpts = soup.findAll('input')
        raw_tags = [str(i) for i in inpts]
        common_attrib = []
        for attrib in inpts:
            buff = {}
            if attrib.has_key("type"):
                buff["Type"] = attrib["type"]
            else:
                buff["Type"] = "none"
            if attrib.has_key("name"):
                buff["Name"] = attrib["name"]
            else:
                buff["Name"] = "none"
            if attrib.has_key("value"):
                buff["Value"] = attrib["value"]
            else:
                buff["Value"] = "none"
            common_attrib.append(buff)
        return (raw_tags, common_attrib)

    def getLinks(self, page, base):
        '''Grabs all the links in the page. base helps
        check whether a link is local or points to an
        external site. Note that local links returned
        don't have the base url prepended unless they
        do on the page itself; you will have to add it
        explicitly. Returns a tuple of local and
        external urls: (local, external)'''
        soup = BeautifulSoup(page)
        tags = soup.findAll('a')
        tags = [tag['href'] for tag in tags if tag.has_key('href')]
        local = []
        external = []
        for tag in tags:
            if base in tag:
                local.append(tag)
            elif "http" not in tag:
                local.append(tag)
            else:
                external.append(tag)
        return (local, external)

It could use a little work and optimization, but it works for a quick hack. BTW, this article, along with the module, was written on my Android phone and tested with SL4A (aka ASE). Phew, boy do my thumbs hurt lol. Also tested on Windows and Linux.