
HTML Parsing in Python

Bidding Time:
18/04/2006 20:56 - 09/05/2006 20:56
Budget:
$2500-5000
Status:
Closed


Job Type:
Python
Description:



For now, I need HTML parsers written in Python. I'm collecting data from many
different websites and storing it in a ZODB database. The project is still in
the data collection phase, as there are 50+ sites that need to be parsed. The
data is collected in order to analyze the betting odds of various sports events;
the sites are sports betting websites. The following is a spec, with an attached
file as an example of a site parsed properly.
*****
Each parser will extract info from sports betting sites. Data extracted is
supposed to fill several instances of the Bet class:
class Bet(object):
    def __init__(self):
        self.sportsbook = ''
        self.awayTeam = ''
        self.awayGameNumber = -1
        self.homeTeam = ''
        self.homeGameNumber = -1
        self.betType = ''
        self.gamedate = None
        self.sport = ''
        self.league = ''
        self.gamepart = ''
        self.side = ''
        self.spread = 0.0
        self.overunder = 0.0
        self.moneyadj = 0.0
        self.ways = 2
        self.maxbet = 1000.0
The attributes:
self.sportsbook is the site the parsed data comes from.
self.awayTeam and self.homeTeam are the participants. The away team is the
visiting one: in a line like "NY Knicks at Chicago Bulls" (NY Knicks plays at
Chicago Bulls), NY Knicks is the awayTeam and Chicago Bulls is the homeTeam.
self.awayGameNumber and self.homeGameNumber are calculated beforehand and
should be ignored.
self.betType can be either spread, over/under or moneyline.
self.gamedate is the game date. It should be a datetime.datetime instance.
self.sport is the sport name, in lowercase.
self.league is the league name, when available, in lowercase.
self.gamepart is the gamepart name ('game', '1st quarter', 'half', etc.), in
lowercase.
self.side is the bet side: 'home', 'away' or 'draw'.
self.spread, self.overunder and self.moneyadj are float values with the odds
for each bet. Which one is available (!= 0.0) depends on the betType.
self.ways refers to 2-line or 3-line bets. On 2-line bets, you get your money
back in case of a draw. On 3-line bets, you lose your money on a draw, but you
can also bet on the draw, so the parser should generate a Bet object for the
draw as well.
self.maxbet is the maximum bet value allowed. The default value is 1000.00.
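
As a rough illustration of the self.ways rule, the sketch below builds the Bet objects for a single 3-line moneyline. The helper moneyline_bets, the sportsbook name, the team names and the odds are all made up for the example, and the Bet class is trimmed to the fields used here. It is written for Python 3, which is an assumption; the spec itself targets the Python 2 standard library.

```python
import datetime


class Bet(object):
    """Trimmed copy of the spec's Bet class (only the fields used below)."""

    def __init__(self):
        self.sportsbook = ''
        self.awayTeam = ''
        self.homeTeam = ''
        self.betType = ''
        self.gamedate = None
        self.sport = ''
        self.gamepart = ''
        self.side = ''
        self.moneyadj = 0.0
        self.ways = 2


def moneyline_bets(sportsbook, away, home, gamedate, odds):
    """Hypothetical helper: yield one Bet per side present in `odds`.

    `odds` maps a side ('away', 'home', 'draw') to its moneyline value.
    A 'draw' entry is taken to mean a 3-line bet.
    """
    ways = 3 if 'draw' in odds else 2
    for side in ('away', 'home', 'draw'):
        if side not in odds:
            continue
        bet = Bet()
        bet.sportsbook = sportsbook
        bet.awayTeam = away
        bet.homeTeam = home
        bet.betType = 'moneyline'
        bet.gamedate = gamedate
        bet.sport = 'soccer'
        bet.gamepart = 'game'
        bet.side = side
        bet.moneyadj = float(odds[side])
        bet.ways = ways
        yield bet


# Made-up sample values, not taken from any real site.
bets = list(moneyline_bets(
    'examplebook', 'Away FC', 'Home FC',
    datetime.datetime(2006, 5, 1, 19, 30),
    {'away': 250.0, 'home': -120.0, 'draw': 230.0},
))
```

With a 'draw' entry in the odds, three Bet objects come out (away, home, draw), each with ways set to 3; without it, only the away and home bets are produced and ways stays at 2.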

The parsers should follow these specs:
A parser module should have Connection and BetFactory classes with the
following interface:
import urllib2

# SPORTS is referenced below but not defined in this spec.

class Connection(object):
    def __init__(self):
        self.online = False
        self.build_opener()

    def reset(self):
        self.opener.close()
        self.online = False
        self.build_opener()

    def close(self):
        self.opener.close()
        self.online = False

    def build_opener(self):
        self.opener = urllib2.build_opener()

    def login(self):
        raise NotImplementedError

    def get_data(self, sports=SPORTS):
        raise NotImplementedError


class BetFactory(object):
    def extract_bets(self, data, sports=SPORTS):
        raise NotImplementedError

Cookie management and everything else related to the connection should be
done in the Connection.build_opener() and Connection.login() methods. It's
ok to have some parsing code for the login page in Connection.login(), but
otherwise parsing and networking code should be completely isolated.
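
One possible shape for the cookie handling is sketched below with the Python 3 standard library. Here urllib.request and http.cookiejar stand in for urllib2 and clientcookie, which is an assumption about how the spec maps to a current interpreter. The subclass name ExampleConnection is hypothetical, the base class is trimmed to what the example needs, and no request is sent.

```python
import http.cookiejar
import urllib.request


class Connection(object):
    """Trimmed version of the spec's base class."""

    def __init__(self):
        self.online = False
        self.build_opener()

    def build_opener(self):
        self.opener = urllib.request.build_opener()

    def close(self):
        self.opener.close()
        self.online = False


class ExampleConnection(Connection):
    """Hypothetical site connection that keeps cookies across requests."""

    def build_opener(self):
        # The cookie jar lives on the connection, so a later login() and
        # get_data() would share the same session cookies.
        self.cookies = http.cookiejar.CookieJar()
        self.opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(self.cookies))


# Building the connection only creates the opener; nothing is fetched.
conn = ExampleConnection()
```

Because reset() in the spec calls build_opener() again, overriding only build_opener() would also give a fresh cookie jar on every reset.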
Connection.get_data() should connect to the site, extract all the data
needed, and return it in the format expected by
BetFactory.extract_bets(). A single string is preferred, but any format
is ok as long as it's exactly what BetFactory.extract_bets() expects.
BetFactory should create all the HTMLParser subclass instances needed and
use them to parse the raw data.
BetFactory.extract_bets() should return a sequence (or a generator, if
possible) with one Bet instance for each line parsed.
HTML parsing should be done using the Python standard library HTMLParser
module, unless the HTML pages are buggy and something else is needed.
HTTP access should be done using the urllib2, urllib, and clientcookie
modules. External modules should be avoided as much as possible.
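
A minimal sketch of how a BetFactory could drive an HTMLParser subclass and return a generator follows. The table layout (away team, home team, away moneyline, home moneyline in one row), the class name OddsTableParser and the sample HTML are assumptions made up for illustration; real sites will differ. It uses html.parser, the Python 3 name of the HTMLParser module, and leaves out the sports argument for brevity.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class Bet(object):
    """Trimmed copy of the spec's Bet class."""

    def __init__(self):
        self.awayTeam = ''
        self.homeTeam = ''
        self.betType = ''
        self.side = ''
        self.moneyadj = 0.0


class OddsTableParser(HTMLParser):
    """Hypothetical parser: collects the text of each <td>, grouped by <tr>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._in_cell = True
            self._row.append('')

    def handle_endtag(self, tag):
        if tag == 'td' and self._in_cell:
            self._row[-1] = self._row[-1].strip()
            self._in_cell = False
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data


class BetFactory(object):
    def extract_bets(self, data):
        """Yield one away Bet and one home Bet per well-formed table row."""
        parser = OddsTableParser()
        parser.feed(data)
        parser.close()
        for row in parser.rows:
            if len(row) != 4:
                continue  # skip header or malformed rows
            away, home, away_line, home_line = row
            for side, line in (('away', away_line), ('home', home_line)):
                bet = Bet()
                bet.awayTeam = away
                bet.homeTeam = home
                bet.betType = 'moneyline'
                bet.side = side
                bet.moneyadj = float(line)
                yield bet


# Made-up sample page, standing in for what Connection.get_data() returns.
SAMPLE = ('<table>'
          '<tr><td>Away FC</td><td>Home FC</td><td>+150</td><td>-170</td></tr>'
          '</table>')
bets = list(BetFactory().extract_bets(SAMPLE))
```

The factory only ever sees a string, so it can be tested against saved pages without touching the network, which is the isolation the spec asks for.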
Some styling rules: each HTML page should have its own parser, implemented
as an HTMLParser.HTMLParser subclass, and don't use from x import *.
Paying anywhere from $30 to $70 per site, depending on the site's difficulty
and how fast the code can be written.

Operating System:
Linux
Database System:
(None)
© 2004-2008 ProjectsList.biz