Html Parsing In Python |
Bidding Time: 18/04/2006 20:56 - 09/05/2006 20:56
Budget: $2500-5000
Status: Closed
Description: |
For now, I need HTML parsers in Python. I'm collecting data from many different websites and storing it in a ZODB database. I'm still in the data collection phase, as there are 50+ sites that need to be parsed. The data I'm collecting is for analyzing the betting odds of various sports events. The sites are sports betting websites. The following is a spec, with an attached file as an example of a site parsed properly.

*****

Each parser will extract info from sports betting sites. The data extracted is supposed to fill several instances of the Bet class:

    class Bet(object):
        def __init__(self):
            self.sportsbook = ''
            self.awayTeam = ''
            self.awayGameNumber = -1
            self.homeTeam = ''
            self.homeGameNumber = -1
            self.betType = ''
            self.gamedate = None
            self.sport = ''
            self.league = ''
            self.gamepart = ''
            self.side = ''
            self.spread = 0.0
            self.overunder = 0.0
            self.moneyadj = 0.0
            self.ways = 2
            self.maxbet = 1000.0

The attributes:

self.sportsbook is the site the parsed data comes from.
self.awayTeam and self.homeTeam are the participants. Away is the visiting team. For instance, in a line like "NY Knicks at Chicago Bulls" (NY Knicks plays at Chicago Bulls), NY Knicks is the awayTeam and Chicago Bulls is the homeTeam.
self.awayGameNumber and self.homeGameNumber are calculated beforehand and should be ignored.
self.betType can be either spread, over/under or moneyline.
self.gamedate is the game date. It should be a datetime.datetime instance.
self.sport is the sport name, in lowercase.
self.league is the league name, when available, in lowercase.
self.gamepart is the gamepart name ('game', '1st quarter', 'half', etc.), in lowercase.
self.side is the bet side, either 'home', 'away' or 'draw'.
self.spread, self.overunder and self.moneyadj are float values with the odds for each bet. Which one is available (!= 0.0) depends on the betType.
self.ways refers to 2 or 3-line bets. On 2-line bets, you get your money back in case of a draw.
On 3-line bets, you lose your money on a draw, but you can bet on the draw too, so the parser should generate a Bet object for the draw as well.
self.maxbet is the max bet value allowed. The default value is 1000.00.

The parsers should follow these specs:

A parser module should have Connection and BetFactory classes with the following interface:

    import urllib2

    # SPORTS (the default list of sports) is expected to be defined at
    # module level; its value is not given in this spec.

    class Connection(object):
        def __init__(self):
            self.online = False
            self.build_opener()

        def reset(self):
            self.opener.close()
            self.online = False
            self.build_opener()

        def close(self):
            self.opener.close()
            self.online = False

        def build_opener(self):
            self.opener = urllib2.build_opener()

        def login(self):
            raise NotImplementedError

        def get_data(self, sports=SPORTS):
            raise NotImplementedError

    class BetFactory(object):
        def extract_bets(self, data, sports=SPORTS):
            raise NotImplementedError

Cookie management and everything else related to the connection should be done in the Connection.build_opener() and Connection.login() methods. It's OK to have some parsing code for the login page in Connection.login(), but otherwise parsing and networking code should be completely isolated from each other.

Connection.get_data() should connect to the site, extract all the data needed, and return it in the format expected by BetFactory.extract_bets(). A single string is preferred, but any format is OK as long as it's exactly what BetFactory.extract_bets() expects.

BetFactory should create all the HTMLParser subclass instances needed and use them to parse the raw data. BetFactory.extract_bets() should return a sequence (or a generator if possible) with a Bet instance for each line parsed.

HTML parsing should be done using the Python standard library HTMLParser module, unless the HTML pages are buggy and something else is needed. HTTP access should be done using the urllib2, urllib, and clientcookie modules. External modules should be avoided as much as possible.
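As a rough illustration of the Connection interface described above, here is a minimal sketch of a site-specific connection that keeps cookie handling inside build_opener() and returns a single string from get_data(). It is written for Python 3, where urllib2 became urllib.request and cookie handling lives in http.cookiejar; the posting itself targets Python 2. The class name ExampleConnection, the URLs, the form field names, the credentials and the SPORTS value are all hypothetical placeholders, not part of the spec.

```python
# Python 3 sketch; the spec itself targets Python 2 (urllib2 / clientcookie).
import http.cookiejar
import urllib.parse
import urllib.request

SPORTS = ("basketball", "football")  # hypothetical default, not from the spec


class ExampleConnection(object):
    """Connection for a made-up site; every URL and field name is a placeholder."""

    LOGIN_URL = "https://sportsbook.example/login"
    LINES_URL = "https://sportsbook.example/lines"
    USERNAME = "user"      # placeholder credentials
    PASSWORD = "secret"

    def __init__(self):
        self.online = False
        self.build_opener()

    def build_opener(self):
        # Cookie management is set up here, as the spec asks.
        self.cookies = http.cookiejar.CookieJar()
        self.opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(self.cookies))

    def close(self):
        self.opener.close()
        self.online = False

    def reset(self):
        self.close()
        self.build_opener()

    def login(self):
        # POST the login form; the session cookie ends up in self.cookies.
        form = urllib.parse.urlencode(
            {"user": self.USERNAME, "pass": self.PASSWORD})
        with self.opener.open(self.LOGIN_URL, form.encode("ascii")) as resp:
            resp.read()
        self.online = True

    def get_data(self, sports=SPORTS):
        # Fetch one page per sport and return everything as a single string.
        if not self.online:
            self.login()
        pages = []
        for sport in sports:
            query = urllib.parse.urlencode({"sport": sport})
            with self.opener.open(self.LINES_URL + "?" + query) as resp:
                pages.append(resp.read().decode("utf-8", "replace"))
        return "\n".join(pages)
```

No parsing happens in this class: the string returned by get_data() would be handed to BetFactory.extract_bets() unchanged, which keeps networking and parsing code isolated as the spec requires.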
Some styling rules:
each HTML page should have its own parser, implemented as an HTMLParser.HTMLParser subclass
don't use from x import *

Paying anywhere from $30 to $70 a site, depending on the site's difficulty and how fast the code can be written.
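To make the "one HTMLParser subclass per page" rule and the generator-returning BetFactory.extract_bets() more concrete, here is a small sketch. It assumes Python 3, where the HTMLParser module is named html.parser. The page layout (a tr with class "line" containing td cells whose class names the field), the SAMPLE markup, the date format, the MoneylinePageParser name and the sportsbook name are all invented for illustration; real sites will need their own parsers. The Bet class is a trimmed copy of the one in the spec.

```python
import datetime
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module

SPORTS = ("soccer",)  # hypothetical default, not from the spec


class Bet(object):
    """Trimmed copy of the spec's Bet class (only the fields used here)."""
    def __init__(self):
        self.sportsbook = ''
        self.awayTeam = ''
        self.homeTeam = ''
        self.betType = ''
        self.gamedate = None
        self.sport = ''
        self.gamepart = ''
        self.side = ''
        self.moneyadj = 0.0
        self.ways = 2


class MoneylinePageParser(HTMLParser):
    """Parser for one invented page layout; collects one dict per line row."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.rows = []
        self._row = None     # dict for the row being read, or None
        self._field = None   # class of the <td> being read, or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr" and attrs.get("class") == "line":
            self._row = {}
        elif tag == "td" and self._row is not None:
            self._field = attrs.get("class")

    def handle_data(self, data):
        if self._row is not None and self._field:
            self._row[self._field] = self._row.get(self._field, "") + data

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr" and self._row is not None:
            self.rows.append({k: v.strip() for k, v in self._row.items()})
            self._row = None


class BetFactory(object):
    def extract_bets(self, data, sports=SPORTS):
        parser = MoneylinePageParser()
        parser.feed(data)
        parser.close()
        for row in parser.rows:
            sport = row.get("sport", "").lower()
            if sport not in sports:
                continue
            ways = 3 if "draw" in row else 2
            # A 3-line row yields a third Bet for the draw.
            for side in ("away", "home", "draw"):
                if side not in row:
                    continue
                bet = Bet()
                bet.sportsbook = "sportsbook.example"  # placeholder name
                bet.awayTeam = row["awayTeam"]
                bet.homeTeam = row["homeTeam"]
                bet.betType = "moneyline"
                bet.gamedate = datetime.datetime.strptime(
                    row["date"], "%Y-%m-%d %H:%M")
                bet.sport = sport
                bet.gamepart = "game"
                bet.side = side
                bet.moneyadj = float(row[side])
                bet.ways = ways
                yield bet


SAMPLE = """
<table>
<tr class="line">
  <td class="sport">Soccer</td><td class="date">2006-04-18 20:00</td>
  <td class="awayTeam">Team A</td><td class="homeTeam">Team B</td>
  <td class="away">+250</td><td class="home">-120</td><td class="draw">+230</td>
</tr>
</table>
"""

for bet in BetFactory().extract_bets(SAMPLE):
    print(bet.side, bet.moneyadj, bet.ways)
```

Because extract_bets() is a generator, Bet objects are produced one at a time as the caller iterates, and a row with a draw price yields three bets instead of two.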
|
Operating System: Linux
Database System: (None)
© 2004-2008 ProjectsList.biz