def BeautifulSoup::BeautifulStoneSoup::__init__(self,
  markup = "",
  parseOnlyThese = None,
  fromEncoding = None,
  markupMassage = True,
  smartQuotesTo = XML_ENTITIES,
  convertEntities = None,
  selfClosingTags = None 
) [inherited]

The Soup object is initialized as the 'root tag', and the
provided markup (which can be a string or a file-like object)
is fed into the underlying parser. 
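
As a minimal sketch of the "string or file-like object" handling described above: the helper name `read_markup` is hypothetical and not part of Beautiful Soup; only the `read` attribute check mirrors what the constructor body does.

```python
import io

def read_markup(markup):
    """Return markup as a string, reading it first if it is file-like."""
    # Hypothetical helper: anything exposing .read() is treated as a file.
    if hasattr(markup, 'read'):
        markup = markup.read()
    return markup

# Both call styles yield the same string.
from_string = read_markup("<p>text</p>")
from_file = read_markup(io.StringIO("<p>text</p>"))
```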

sgmllib will process most bad HTML, and the BeautifulSoup
class has some tricks for dealing with some HTML that kills
sgmllib, but Beautiful Soup can nonetheless choke or lose data
if your data uses self-closing tags or declarations
incorrectly.

By default, Beautiful Soup uses regexes to sanitize input,
avoiding the vast majority of these problems. If the problems
don't apply to you, pass in False for markupMassage, and
you'll get better performance.

The default parser massage techniques fix the two most common
instances of invalid HTML that choke sgmllib:

 <br/> (No space between name of closing tag and tag close)
 <! --Comment--> (Extraneous whitespace in declaration)

Instead of True or False, you can pass in a custom list of
(RE object, replace method) tuples for markupMassage to get
Beautiful Soup to scrub your input the way you want.
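
The massage step can be sketched independently of the library. The patterns below are illustrative and only assumed to resemble the defaults; `EXAMPLE_MASSAGE`, `massage`, and the `<blink>` rule are hypothetical names introduced for this example.

```python
import re

# Illustrative (RE object, replace method) pairs, assumed to resemble
# the two default fixes listed above rather than reproduce them.
EXAMPLE_MASSAGE = [
    # "<br/>" -> "<br />": insert a space before the closing "/>".
    (re.compile(r'(<[^<>]*)/>'), lambda m: m.group(1) + ' />'),
    # "<! --Comment-->" -> "<!--Comment-->": drop whitespace after "<!".
    (re.compile(r'<!\s+([^<>]*)>'), lambda m: '<!' + m.group(1) + '>'),
]

def massage(markup, fixes=EXAMPLE_MASSAGE):
    """Apply each (pattern, replacement) pair to the markup in order."""
    for pattern, replace in fixes:
        markup = pattern.sub(replace, markup)
    return markup

# A custom list extends the example fixes with one extra rule.
custom_fixes = EXAMPLE_MASSAGE + [
    (re.compile(r'<(/?)blink>'), lambda m: '<' + m.group(1) + 'span>'),
]

cleaned = massage('<br/><! --Comment-->')
```

A list shaped like `custom_fixes` is the kind of value the text above describes passing in place of the default massage.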

Definition at line 920 of file BeautifulSoup.py.

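    def __init__(self, markup="", parseOnlyThese=None, fromEncoding=None,
                 markupMassage=True, smartQuotesTo=XML_ENTITIES,
                 convertEntities=None, selfClosingTags=None):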
        """The Soup object is initialized as the 'root tag', and the
        provided markup (which can be a string or a file-like object)
        is fed into the underlying parser. 

        sgmllib will process most bad HTML, and the BeautifulSoup
        class has some tricks for dealing with some HTML that kills
        sgmllib, but Beautiful Soup can nonetheless choke or lose data
        if your data uses self-closing tags or declarations
        incorrectly.

        By default, Beautiful Soup uses regexes to sanitize input,
        avoiding the vast majority of these problems. If the problems
        don't apply to you, pass in False for markupMassage, and
        you'll get better performance.

        The default parser massage techniques fix the two most common
        instances of invalid HTML that choke sgmllib:

         <br/> (No space between name of closing tag and tag close)
         <! --Comment--> (Extraneous whitespace in declaration)

        You can pass in a custom list of (RE object, replace method)
        tuples to get Beautiful Soup to scrub your input the way you
        want."""

        self.parseOnlyThese = parseOnlyThese
        self.fromEncoding = fromEncoding
        self.smartQuotesTo = smartQuotesTo

        if convertEntities:
            # It doesn't make sense to convert encoded characters to
            # entities even while you're converting entities to Unicode.
            # Just convert it all to Unicode.
            self.smartQuotesTo = None

        if isList(convertEntities):
            self.convertHTMLEntities = self.HTML_ENTITIES in convertEntities
            self.convertXMLEntities = self.XML_ENTITIES in convertEntities
        else:
            self.convertHTMLEntities = self.HTML_ENTITIES == convertEntities
            self.convertXMLEntities = self.XML_ENTITIES == convertEntities

        self.instanceSelfClosingTags = buildTagMap(None, selfClosingTags)
        SGMLParser.__init__(self)
            
        if hasattr(markup, 'read'):        # It's a file-type object.
            markup = markup.read()
        self.markup = markup
        self.markupMassage = markupMassage
        try:
            self._feed()
        except StopParsing:
            pass
        self.markup = None                 # The markup can now be GCed


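The way the constructor resolves convertEntities and smartQuotesTo can be sketched as a standalone function. This is an illustration under stated assumptions: `resolve_entity_options` is a hypothetical name, the constant values are placeholders for the class attributes, and `isinstance` stands in for the library's `isList` helper.

```python
# Placeholder constants; the real ones are class attributes and their
# values are assumed here for illustration.
HTML_ENTITIES = "html"
XML_ENTITIES = "xml"

def resolve_entity_options(convert_entities=None,
                           smart_quotes_to=XML_ENTITIES):
    """Return (convert_html, convert_xml, smart_quotes_to)."""
    if convert_entities:
        # Converting entities to Unicode overrides smart-quote
        # conversion, so that option is switched off.
        smart_quotes_to = None
    if isinstance(convert_entities, (list, tuple)):
        # A list may request HTML and XML conversion together.
        convert_html = HTML_ENTITIES in convert_entities
        convert_xml = XML_ENTITIES in convert_entities
    else:
        # A single value selects at most one kind of conversion.
        convert_html = convert_entities == HTML_ENTITIES
        convert_xml = convert_entities == XML_ENTITIES
    return convert_html, convert_xml, smart_quotes_to
```

Passing a list enables several conversions at once, while a single value enables only the matching one; in both cases a truthy convertEntities disables smartQuotesTo.
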