BeautifulSoup::BeautifulSoup Class Reference

Inheritance diagram for BeautifulSoup::BeautifulSoup:

Ancestors: BeautifulSoup::BeautifulStoneSoup, BeautifulSoup::Tag, BeautifulSoup::PageElement
Inherited by: BeautifulSoup::ICantBelieveItsBeautifulSoup, BeautifulSoup::MinimalSoup


Detailed Description

This parser knows the following facts about HTML:

* Some tags have no closing tag and should be interpreted as being
  closed as soon as they are encountered.

* The text inside some tags (e.g. 'script') may contain tags which
  are not really part of the document and which should be parsed
  as text, not tags. If you want to parse the text as tags, you can
  always fetch it and parse it explicitly.

* Tag nesting rules:

  Most tags can't be nested at all. For instance, the occurrence of
  a <p> tag should implicitly close the previous <p> tag.

   <p>Para1<p>Para2
    should be transformed into:
   <p>Para1</p><p>Para2

  Some tags can be nested arbitrarily. For instance, the occurrence
  of a <blockquote> tag should _not_ implicitly close the previous
  <blockquote> tag.

   Alice said: <blockquote>Bob said: <blockquote>Blah
    should NOT be transformed into:
   Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah

  Some tags can be nested, but the nesting is reset by the
  interposition of other tags. For instance, a <tr> tag should
  implicitly close the previous <tr> tag within the same <table>,
  but not close a <tr> tag in another table.

   <table><tr>Blah<tr>Blah
    should be transformed into:
   <table><tr>Blah</tr><tr>Blah
    but,
   <tr>Blah<table><tr>Blah
    should NOT be transformed into:
   <tr>Blah<table></tr><tr>Blah
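
The three nesting rules above can be sketched with the standard library's
html.parser. This is not BeautifulSoup's own code: the NestingNormalizer
class, the normalize() helper, and the trimmed NESTABLE table are
hypothetical names chosen for illustration, and tag attributes are dropped
to keep the sketch short. It only shows one way a stack of open tags can
produce the transformations listed in the examples.

```python
from html.parser import HTMLParser

# Hypothetical, heavily trimmed nesting table (an assumption, not the real
# NESTABLE_TAGS): tag -> tags that reset its nesting.
# An empty list means "nestable without limit"; a tag missing from the
# table is treated as non-nestable.
NESTABLE = {
    "blockquote": [],
    "table": [],
    "tr": ["table"],
}


class NestingNormalizer(HTMLParser):
    """Re-emit markup, inserting the closing tags implied by NESTABLE."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []  # names of currently open tags, outermost first
        self.out = []    # pieces of the normalized markup

    def _pop_to(self, index):
        # Close open tags from the top of the stack until only `index` remain.
        while len(self.stack) > index:
            self.out.append("</%s>" % self.stack.pop())

    def handle_starttag(self, tag, attrs):
        triggers = NESTABLE.get(tag)
        for i in range(len(self.stack) - 1, -1, -1):
            open_tag = self.stack[i]
            if triggers is None and open_tag == tag:
                # Non-nestable: close the previous occurrence of this tag.
                self._pop_to(i)
                break
            if triggers is not None and open_tag in triggers:
                # Nesting reset: close what was opened since the trigger,
                # but leave the trigger itself open.
                self._pop_to(i + 1)
                break
        self.stack.append(tag)
        self.out.append("<%s>" % tag)  # attributes are dropped in this sketch

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Close through the innermost open tag with this name.
            index = len(self.stack) - 1 - self.stack[::-1].index(tag)
            self._pop_to(index)

    def handle_data(self, data):
        self.out.append(data)


def normalize(markup):
    parser = NestingNormalizer()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)


print(normalize("<p>Para1<p>Para2"))
print(normalize("<table><tr>Blah<tr>Blah"))
print(normalize("<tr>Blah<table><tr>Blah"))
```

Tags still open at the end of the input are left open, as in the examples
above. Changing an entry in the table changes the result, which is why
differing assumptions about nesting lead to differing parse trees.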

Differing assumptions about tag nesting rules are a major source
of problems with the BeautifulSoup class. If BeautifulSoup is not
treating as nestable a tag your page author treats as nestable,
try ICantBelieveItsBeautifulSoup, MinimalSoup, or
BeautifulStoneSoup before writing your own subclass.

Definition at line 1275 of file BeautifulSoup.py.


Public Member Functions

def __getattr__
def __init__
def __init__
def endData
def extract
def findAllNext
def findAllPrevious
def findNext
def findNextSibling
def findNextSiblings
def findParent
def findParents
def findPrevious
def findPreviousSibling
def findPreviousSiblings
def handle_charref
def handle_comment
def handle_data
def handle_decl
def handle_entityref
def handle_pi
def insert
def isSelfClosingTag
def nextGenerator
def nextSiblingGenerator
def parentGenerator
def parse_declaration
def popTag
def previousGenerator
def previousSiblingGenerator
def pushTag
def replaceWith
def reset
def setup
def start_meta
def substituteEncoding
def toEncoding
def unknown_endtag
def unknown_starttag

Public Attributes

 convertHTMLEntities
 convertXMLEntities
 currentData
 currentTag
 declaredHTMLEncoding
 fromEncoding
 hidden
 HTML_ENTITIES
 instanceSelfClosingTags
 literal
 markup
 markupMassage
 next
 nextSibling
 originalEncoding
 parent
 parseOnlyThese
 previous
 previousSibling
 quoteStack
 smartQuotesTo
 tagStack
 XML_ENTITIES

Static Public Attributes

list ALL_ENTITIES = [HTML_ENTITIES, XML_ENTITIES]
tuple CHARSET_RE = re.compile("((^|;)\s*charset=)([^;]*)")
 fetchNextSiblings = findNextSiblings
 fetchParents = findParents
 fetchPrevious = findAllPrevious
 fetchPreviousSiblings = findPreviousSiblings
string HTML_ENTITIES = "html"
list MARKUP_MASSAGE
list NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del']
list NESTABLE_INLINE_TAGS
dictionary NESTABLE_LIST_TAGS
dictionary NESTABLE_TABLE_TAGS
tuple NESTABLE_TAGS
list NON_NESTABLE_BLOCK_TAGS = ['address', 'form', 'p', 'pre']
dictionary QUOTE_TAGS = {'script': None}
tuple RESET_NESTING_TAGS
string ROOT_TAG_NAME = u'[document]'
tuple SELF_CLOSING_TAGS
string XML_ENTITIES = "xml"
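
As an illustration of the CHARSET_RE pattern listed above, the following
sketch applies it to the content value of a Content-Type meta tag. The
helper names declared_charset and rewrite_charset are hypothetical, and
how the class itself uses the pattern (for example in start_meta) is an
assumption here; only the regular expression is taken from the listing.

```python
import re

# Same pattern as in the listing, written as a raw string.
CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)")


def declared_charset(content):
    """Return the charset named in a Content-Type value, or None."""
    match = CHARSET_RE.search(content)
    return match.group(3) if match else None


def rewrite_charset(content, new_encoding):
    """Swap the declared charset for new_encoding, keeping the rest intact."""
    # group(1) is the "; charset=" prefix; only the value after it changes.
    return CHARSET_RE.sub(lambda m: m.group(1) + new_encoding, content)


content = "text/html; charset=ISO-8859-1"
print(declared_charset(content))
print(rewrite_charset(content, "utf-8"))
```

Group 3 captures everything up to the next semicolon, so parameters that
follow the charset are left untouched by the substitution.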

The documentation for this class was generated from the following file:

* BeautifulSoup.py

Generated by Doxygen 1.6.0