Diffstat (limited to 'import-layers/yocto-poky/bitbake/lib/bs4/NEWS.txt')
-rw-r--r-- | import-layers/yocto-poky/bitbake/lib/bs4/NEWS.txt | 1066 |
1 files changed, 0 insertions, 1066 deletions
diff --git a/import-layers/yocto-poky/bitbake/lib/bs4/NEWS.txt b/import-layers/yocto-poky/bitbake/lib/bs4/NEWS.txt deleted file mode 100644 index 88a60a245..000000000 --- a/import-layers/yocto-poky/bitbake/lib/bs4/NEWS.txt +++ /dev/null @@ -1,1066 +0,0 @@ -= 4.3.2 (20131002) = - -* Fixed a bug in which short Unicode input was improperly encoded to - ASCII when checking whether or not it was the name of a file on - disk. [bug=1227016] - -* Fixed a crash when a short input contains data not valid in - filenames. [bug=1232604] - -* Fixed a bug that caused Unicode data put into UnicodeDammit to - return None instead of the original data. [bug=1214983] - -* Combined two tests to stop a spurious test failure when tests are - run by nosetests. [bug=1212445] - -= 4.3.1 (20130815) = - -* Fixed yet another problem with the html5lib tree builder, caused by - html5lib's tendency to rearrange the tree during - parsing. [bug=1189267] - -* Fixed a bug that caused the optimized version of find_all() to - return nothing. [bug=1212655] - -= 4.3.0 (20130812) = - -* Instead of converting incoming data to Unicode and feeding it to the - lxml tree builder in chunks, Beautiful Soup now makes successive - guesses at the encoding of the incoming data, and tells lxml to - parse the data as that encoding. Giving lxml more control over the - parsing process improves performance and avoids a number of bugs and - issues with the lxml parser which had previously required elaborate - workarounds: - - - An issue in which lxml refuses to parse Unicode strings on some - systems. [bug=1180527] - - - A returning bug that truncated documents longer than a (very - small) size. [bug=963880] - - - A returning bug in which extra spaces were added to a document if - the document defined a charset other than UTF-8. [bug=972466] - - This required a major overhaul of the tree builder architecture. If - you wrote your own tree builder and didn't tell me, you'll need to - modify your prepare_markup() method. 
- -* The UnicodeDammit code that makes guesses at encodings has been - split into its own class, EncodingDetector. A lot of apparently - redundant code has been removed from Unicode, Dammit, and some - undocumented features have also been removed. - -* Beautiful Soup will issue a warning if instead of markup you pass it - a URL or the name of a file on disk (a common beginner's mistake). - -* A number of optimizations improve the performance of the lxml tree - builder by about 33%, the html.parser tree builder by about 20%, and - the html5lib tree builder by about 15%. - -* All find_all calls should now return a ResultSet object. Patch by - Aaron DeVore. [bug=1194034] - -= 4.2.1 (20130531) = - -* The default XML formatter will now replace ampersands even if they - appear to be part of entities. That is, "&lt;" will become - "&amp;lt;". The old code was left over from Beautiful Soup 3, which - didn't always turn entities into Unicode characters. - - If you really want the old behavior (maybe because you add new - strings to the tree, those strings include entities, and you want - the formatter to leave them alone on output), it can be found in - EntitySubstitution.substitute_xml_containing_entities(). [bug=1182183] - -* Gave new_string() the ability to create subclasses of - NavigableString. [bug=1181986] - -* Fixed another bug by which the html5lib tree builder could create a - disconnected tree. [bug=1182089] - -* The .previous_element of a BeautifulSoup object is now always None, - not the last element to be parsed. [bug=1182089] - -* Fixed test failures when lxml is not installed. [bug=1181589] - -* html5lib now supports Python 3. Fixed some Python 2-specific - code in the html5lib test suite. [bug=1181624] - -* The html.parser treebuilder can now handle numeric attributes in - text when the hexidecimal name of the attribute starts with a - capital X. Patch by Tim Shirley. 
[bug=1186242] - -= 4.2.0 (20130514) = - -* The Tag.select() method now supports a much wider variety of CSS - selectors. - - - Added support for the adjacent sibling combinator (+) and the - general sibling combinator (~). Tests by "liquider". [bug=1082144] - - - The combinators (>, +, and ~) can now combine with any supported - selector, not just one that selects based on tag name. - - - Added limited support for the "nth-of-type" pseudo-class. Code - by Sven Slootweg. [bug=1109952] - -* The BeautifulSoup class is now aliased to "_s" and "_soup", making - it quicker to type the import statement in an interactive session: - - from bs4 import _s - or - from bs4 import _soup - - The alias may change in the future, so don't use this in code you're - going to run more than once. - -* Added the 'diagnose' submodule, which includes several useful - functions for reporting problems and doing tech support. - - - diagnose(data) tries the given markup on every installed parser, - reporting exceptions and displaying successes. If a parser is not - installed, diagnose() mentions this fact. - - - lxml_trace(data, html=True) runs the given markup through lxml's - XML parser or HTML parser, and prints out the parser events as - they happen. This helps you quickly determine whether a given - problem occurs in lxml code or Beautiful Soup code. - - - htmlparser_trace(data) is the same thing, but for Python's - built-in HTMLParser class. - -* In an HTML document, the contents of a <script> or <style> tag will - no longer undergo entity substitution by default. XML documents work - the same way they did before. [bug=1085953] - -* Methods like get_text() and properties like .strings now only give - you strings that are visible in the document--no comments or - processing commands. [bug=1050164] - -* The prettify() method now leaves the contents of <pre> tags - alone. [bug=1095654] - -* Fix a bug in the html5lib treebuilder which sometimes created - disconnected trees. 
[bug=1039527] - -* Fix a bug in the lxml treebuilder which crashed when a tag included - an attribute from the predefined "xml:" namespace. [bug=1065617] - -* Fix a bug by which keyword arguments to find_parent() were not - being passed on. [bug=1126734] - -* Stop a crash when unwisely messing with a tag that's been - decomposed. [bug=1097699] - -* Now that lxml's segfault on invalid doctype has been fixed, fixed a - corresponding problem on the Beautiful Soup end that was previously - invisible. [bug=984936] - -* Fixed an exception when an overspecified CSS selector didn't match - anything. Code by Stefaan Lippens. [bug=1168167] - -= 4.1.3 (20120820) = - -* Skipped a test under Python 2.6 and Python 3.1 to avoid a spurious - test failure caused by the lousy HTMLParser in those - versions. [bug=1038503] - -* Raise a more specific error (FeatureNotFound) when a requested - parser or parser feature is not installed. Raise NotImplementedError - instead of ValueError when the user calls insert_before() or - insert_after() on the BeautifulSoup object itself. Patch by Aaron - Devore. [bug=1038301] - -= 4.1.2 (20120817) = - -* As per PEP-8, allow searching by CSS class using the 'class_' - keyword argument. [bug=1037624] - -* Display namespace prefixes for namespaced attribute names, instead of - the fully-qualified names given by the lxml parser. [bug=1037597] - -* Fixed a crash on encoding when an attribute name contained - non-ASCII characters. - -* When sniffing encodings, if the cchardet library is installed, - Beautiful Soup uses it instead of chardet. cchardet is much - faster. [bug=1020748] - -* Use logging.warning() instead of warning.warn() to notify the user - that characters were replaced with REPLACEMENT - CHARACTER. [bug=1013862] - -= 4.1.1 (20120703) = - -* Fixed an html5lib tree builder crash which happened when html5lib - moved a tag with a multivalued attribute from one part of the tree - to another. 
[bug=1019603] - -* Correctly display closing tags with an XML namespace declared. Patch - by Andreas Kostyrka. [bug=1019635] - -* Fixed a typo that made parsing significantly slower than it should - have been, and also waited too long to close tags with XML - namespaces. [bug=1020268] - -* get_text() now returns an empty Unicode string if there is no text, - rather than an empty bytestring. [bug=1020387] - -= 4.1.0 (20120529) = - -* Added experimental support for fixing Windows-1252 characters - embedded in UTF-8 documents. (UnicodeDammit.detwingle()) - -* Fixed the handling of &quot; with the built-in parser. [bug=993871] - -* Comments, processing instructions, document type declarations, and - markup declarations are now treated as preformatted strings, the way - CData blocks are. [bug=1001025] - -* Fixed a bug with the lxml treebuilder that prevented the user from - adding attributes to a tag that didn't originally have - attributes. [bug=1002378] Thanks to Oliver Beattie for the patch. - -* Fixed some edge-case bugs having to do with inserting an element - into a tag it's already inside, and replacing one of a tag's - children with another. [bug=997529] - -* Added the ability to search for attribute values specified in UTF-8. [bug=1003974] - - This caused a major refactoring of the search code. All the tests - pass, but it's possible that some searches will behave differently. - -= 4.0.5 (20120427) = - -* Added a new method, wrap(), which wraps an element in a tag. - -* Renamed replace_with_children() to unwrap(), which is easier to - understand and also the jQuery name of the function. - -* Made encoding substitution in <meta> tags completely transparent (no - more %SOUP-ENCODING%). - -* Fixed a bug in decoding data that contained a byte-order mark, such - as data encoded in UTF-16LE. [bug=988980] - -* Fixed a bug that made the HTMLParser treebuilder generate XML - definitions ending with two question marks instead of - one. 
[bug=984258] - -* Upon document generation, CData objects are no longer run through - the formatter. [bug=988905] - -* The test suite now passes when lxml is not installed, whether or not - html5lib is installed. [bug=987004] - -* Print a warning on HTMLParseErrors to let people know they should - install a better parser library. - -= 4.0.4 (20120416) = - -* Fixed a bug that sometimes created disconnected trees. - -* Fixed a bug with the string setter that moved a string around the - tree instead of copying it. [bug=983050] - -* Attribute values are now run through the provided output formatter. - Previously they were always run through the 'minimal' formatter. In - the future I may make it possible to specify different formatters - for attribute values and strings, but for now, consistent behavior - is better than inconsistent behavior. [bug=980237] - -* Added the missing renderContents method from Beautiful Soup 3. Also - added an encode_contents() method to go along with decode_contents(). - -* Give a more useful error when the user tries to run the Python 2 - version of BS under Python 3. - -* UnicodeDammit can now convert Microsoft smart quotes to ASCII with - UnicodeDammit(markup, smart_quotes_to="ascii"). - -= 4.0.3 (20120403) = - -* Fixed a typo that caused some versions of Python 3 to convert the - Beautiful Soup codebase incorrectly. - -* Got rid of the 4.0.2 workaround for HTML documents--it was - unnecessary and the workaround was triggering a (possibly different, - but related) bug in lxml. [bug=972466] - -= 4.0.2 (20120326) = - -* Worked around a possible bug in lxml that prevents non-tiny XML - documents from being parsed. [bug=963880, bug=963936] - -* Fixed a bug where specifying `text` while also searching for a tag - only worked if `text` wanted an exact string match. [bug=955942] - -= 4.0.1 (20120314) = - -* This is the first official release of Beautiful Soup 4. 
There is no - 4.0.0 release, to eliminate any possibility that packaging software - might treat "4.0.0" as being an earlier version than "4.0.0b10". - -* Brought BS up to date with the latest release of soupselect, adding - CSS selector support for direct descendant matches and multiple CSS - class matches. - -= 4.0.0b10 (20120302) = - -* Added support for simple CSS selectors, taken from the soupselect project. - -* Fixed a crash when using html5lib. [bug=943246] - -* In HTML5-style <meta charset="foo"> tags, the value of the "charset" - attribute is now replaced with the appropriate encoding on - output. [bug=942714] - -* Fixed a bug that caused calling a tag to sometimes call find_all() - with the wrong arguments. [bug=944426] - -* For backwards compatibility, brought back the BeautifulStoneSoup - class as a deprecated wrapper around BeautifulSoup. - -= 4.0.0b9 (20120228) = - -* Fixed the string representation of DOCTYPEs that have both a public - ID and a system ID. - -* Fixed the generated XML declaration. - -* Renamed Tag.nsprefix to Tag.prefix, for consistency with - NamespacedAttribute. - -* Fixed a test failure that occured on Python 3.x when chardet was - installed. - -* Made prettify() return Unicode by default, so it will look nice on - Python 3 when passed into print(). - -= 4.0.0b8 (20120224) = - -* All tree builders now preserve namespace information in the - documents they parse. If you use the html5lib parser or lxml's XML - parser, you can access the namespace URL for a tag as tag.namespace. - - However, there is no special support for namespace-oriented - searching or tree manipulation. When you search the tree, you need - to use namespace prefixes exactly as they're used in the original - document. - -* The string representation of a DOCTYPE always ends in a newline. - -* Issue a warning if the user tries to use a SoupStrainer in - conjunction with the html5lib tree builder, which doesn't support - them. 
- -= 4.0.0b7 (20120223) = - -* Upon decoding to string, any characters that can't be represented in - your chosen encoding will be converted into numeric XML entity - references. - -* Issue a warning if characters were replaced with REPLACEMENT - CHARACTER during Unicode conversion. - -* Restored compatibility with Python 2.6. - -* The install process no longer installs docs or auxillary text files. - -* It's now possible to deepcopy a BeautifulSoup object created with - Python's built-in HTML parser. - -* About 100 unit tests that "test" the behavior of various parsers on - invalid markup have been removed. Legitimate changes to those - parsers caused these tests to fail, indicating that perhaps - Beautiful Soup should not test the behavior of foreign - libraries. - - The problematic unit tests have been reformulated as informational - comparisons generated by the script - scripts/demonstrate_parser_differences.py. - - This makes Beautiful Soup compatible with html5lib version 0.95 and - future versions of HTMLParser. - -= 4.0.0b6 (20120216) = - -* Multi-valued attributes like "class" always have a list of values, - even if there's only one value in the list. - -* Added a number of multi-valued attributes defined in HTML5. - -* Stopped generating a space before the slash that closes an - empty-element tag. This may come back if I add a special XHTML mode - (http://www.w3.org/TR/xhtml1/#C_2), but right now it's pretty - useless. - -* Passing text along with tag-specific arguments to a find* method: - - find("a", text="Click here") - - will find tags that contain the given text as their - .string. Previously, the tag-specific arguments were ignored and - only strings were searched. - -* Fixed a bug that caused the html5lib tree builder to build a - partially disconnected tree. Generally cleaned up the html5lib tree - builder. 
- -* If you restrict a multi-valued attribute like "class" to a string - that contains spaces, Beautiful Soup will only consider it a match - if the values correspond to that specific string. - -= 4.0.0b5 (20120209) = - -* Rationalized Beautiful Soup's treatment of CSS class. A tag - belonging to multiple CSS classes is treated as having a list of - values for the 'class' attribute. Searching for a CSS class will - match *any* of the CSS classes. - - This actually affects all attributes that the HTML standard defines - as taking multiple values (class, rel, rev, archive, accept-charset, - and headers), but 'class' is by far the most common. [bug=41034] - -* If you pass anything other than a dictionary as the second argument - to one of the find* methods, it'll assume you want to use that - object to search against a tag's CSS classes. Previously this only - worked if you passed in a string. - -* Fixed a bug that caused a crash when you passed a dictionary as an - attribute value (possibly because you mistyped "attrs"). [bug=842419] - -* Unicode, Dammit now detects the encoding in HTML 5-style <meta> tags - like <meta charset="utf-8" />. [bug=837268] - -* If Unicode, Dammit can't figure out a consistent encoding for a - page, it will try each of its guesses again, with errors="replace" - instead of errors="strict". This may mean that some data gets - replaced with REPLACEMENT CHARACTER, but at least most of it will - get turned into Unicode. [bug=754903] - -* Patched over a bug in html5lib (?) that was crashing Beautiful Soup - on certain kinds of markup. [bug=838800] - -* Fixed a bug that wrecked the tree if you replaced an element with an - empty string. [bug=728697] - -* Improved Unicode, Dammit's behavior when you give it Unicode to - begin with. 
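The errors="replace" fallback described in the 4.0.0b5 notes above can be sketched with the standard library alone. This is only an illustration of the idea, not Unicode, Dammit's actual code: the function name decode_with_fallback and its guesses parameter are hypothetical, and the real library's guessing order is not reproduced here.

```python
def decode_with_fallback(data, guesses):
    """Decode bytes by trying each guessed encoding strictly, then leniently.

    Hypothetical helper: the first pass uses errors="strict"; only if every
    guess fails does the second pass retry with errors="replace", which
    swaps undecodable bytes for U+FFFD REPLACEMENT CHARACTER.
    """
    for errors in ("strict", "replace"):
        for encoding in guesses:
            try:
                return data.decode(encoding, errors), encoding, errors
            except (UnicodeDecodeError, LookupError):
                # Wrong guess (or unknown codec name): move on to the next one.
                continue
    raise ValueError("none of the guessed encodings could be used")


text, encoding, mode = decode_with_fallback(b"caf\xe9", ["ascii", "utf-8"])
print(repr(text), encoding, mode)
```

Running both strict passes first means a lossy decode is used only when no guess fits the data cleanly.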
- -= 4.0.0b4 (20120208) = - -* Added BeautifulSoup.new_string() to go along with BeautifulSoup.new_tag() - -* BeautifulSoup.new_tag() will follow the rules of whatever - tree-builder was used to create the original BeautifulSoup object. A - new <p> tag will look like "<p />" if the soup object was created to - parse XML, but it will look like "<p></p>" if the soup object was - created to parse HTML. - -* We pass in strict=False to html.parser on Python 3, greatly - improving html.parser's ability to handle bad HTML. - -* We also monkeypatch a serious bug in html.parser that made - strict=False disastrous on Python 3.2.2. - -* Replaced the "substitute_html_entities" argument with the - more general "formatter" argument. - -* Bare ampersands and angle brackets are always converted to XML - entities unless the user prevents it. - -* Added PageElement.insert_before() and PageElement.insert_after(), - which let you put an element into the parse tree with respect to - some other element. - -* Raise an exception when the user tries to do something nonsensical - like insert a tag into itself. - - -= 4.0.0b3 (20120203) = - -Beautiful Soup 4 is a nearly-complete rewrite that removes Beautiful -Soup's custom HTML parser in favor of a system that lets you write a -little glue code and plug in any HTML or XML parser you want. - -Beautiful Soup 4.0 comes with glue code for four parsers: - - * Python's standard HTMLParser (html.parser in Python 3) - * lxml's HTML and XML parsers - * html5lib's HTML parser - -HTMLParser is the default, but I recommend you install lxml if you -can. - -For complete documentation, see the Sphinx documentation in -bs4/doc/source/. What follows is a summary of the changes from -Beautiful Soup 3. - -=== The module name has changed === - -Previously you imported the BeautifulSoup class from a module also -called BeautifulSoup. 
To save keystrokes and make it clear which -version of the API is in use, the module is now called 'bs4': - - >>> from bs4 import BeautifulSoup - -=== It works with Python 3 === - -Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was -so bad that it barely worked at all. Beautiful Soup 4 works with -Python 3, and since its parser is pluggable, you don't sacrifice -quality. - -Special thanks to Thomas Kluyver and Ezio Melotti for getting Python 3 -support to the finish line. Ezio Melotti is also to thank for greatly -improving the HTML parser that comes with Python 3.2. - -=== CDATA sections are normal text, if they're understood at all. === - -Currently, the lxml and html5lib HTML parsers ignore CDATA sections in -markup: - - <p><![CDATA[foo]]></p> => <p></p> - -A future version of html5lib will turn CDATA sections into text nodes, -but only within tags like <svg> and <math>: - - <svg><![CDATA[foo]]></svg> => <p>foo</p> - -The default XML parser (which uses lxml behind the scenes) turns CDATA -sections into ordinary text elements: - - <p><![CDATA[foo]]></p> => <p>foo</p> - -In theory it's possible to preserve the CDATA sections when using the -XML parser, but I don't see how to get it to work in practice. - -=== Miscellaneous other stuff === - -If the BeautifulSoup instance has .is_xml set to True, an appropriate -XML declaration will be emitted when the tree is transformed into a -string: - - <?xml version="1.0" encoding="utf-8"> - <markup> - ... - </markup> - -The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree -builders set it to False. If you want to parse XHTML with an HTML -parser, you can set it manually. - - -= 3.2.0 = - -The 3.1 series wasn't very useful, so I renamed the 3.0 series to 3.2 -to make it obvious which one you should use. - -= 3.1.0 = - -A hybrid version that supports 2.4 and can be automatically converted -to run under Python 3.0. 
There are three backwards-incompatible -changes you should be aware of, but no new features or deliberate -behavior changes. - -1. str() may no longer do what you want. This is because the meaning -of str() inverts between Python 2 and 3; in Python 2 it gives you a -byte string, in Python 3 it gives you a Unicode string. - -The effect of this is that you can't pass an encoding to .__str__ -anymore. Use encode() to get a string and decode() to get Unicode, and -you'll be ready (well, readier) for Python 3. - -2. Beautiful Soup is now based on HTMLParser rather than SGMLParser, -which is gone in Python 3. There's some bad HTML that SGMLParser -handled but HTMLParser doesn't, usually to do with attribute values -that aren't closed or have brackets inside them: - - <a href="foo</a>, </a><a href="bar">baz</a> - <a b="<a>">', '<a b="<a>"></a><a>"></a> - -A later version of Beautiful Soup will allow you to plug in different -parsers to make tradeoffs between speed and the ability to handle bad -HTML. - -3. In Python 3 (but not Python 2), HTMLParser converts entities within -attributes to the corresponding Unicode characters. In Python 2 it's -possible to parse this string and leave the &eacute; intact. - - <a href="http://crummy.com?sacr&eacute;&amp;bleu"> - -In Python 3, the &eacute; is always converted to \xe9 during -parsing. - - -= 3.0.7a = - -Added an import that makes BS work in Python 2.3. - - -= 3.0.7 = - -Fixed a UnicodeDecodeError when unpickling documents that contain -non-ASCII characters. - -Fixed a TypeError that occured in some circumstances when a tag -contained no text. - -Jump through hoops to avoid the use of chardet, which can be extremely -slow in some circumstances. UTF-8 documents should never trigger the -use of chardet. - -Whitespace is preserved inside <pre> and <textarea> tags that contain -nothing but whitespace. - -Beautiful Soup can now parse a doctype that's scoped to an XML namespace. 
- - -= 3.0.6 = - -Got rid of a very old debug line that prevented chardet from working. - -Added a Tag.decompose() method that completely disconnects a tree or a -subset of a tree, breaking it up into bite-sized pieces that are -easy for the garbage collecter to collect. - -Tag.extract() now returns the tag that was extracted. - -Tag.findNext() now does something with the keyword arguments you pass -it instead of dropping them on the floor. - -Fixed a Unicode conversion bug. - -Fixed a bug that garbled some <meta> tags when rewriting them. - - -= 3.0.5 = - -Soup objects can now be pickled, and copied with copy.deepcopy. - -Tag.append now works properly on existing BS objects. (It wasn't -originally intended for outside use, but it can be now.) (Giles -Radford) - -Passing in a nonexistent encoding will no longer crash the parser on -Python 2.4 (John Nagle). - -Fixed an underlying bug in SGMLParser that thinks ASCII has 255 -characters instead of 127 (John Nagle). - -Entities are converted more consistently to Unicode characters. - -Entity references in attribute values are now converted to Unicode -characters when appropriate. Numeric entities are always converted, -because SGMLParser always converts them outside of attribute values. - -ALL_ENTITIES happens to just be the XHTML entities, so I renamed it to -XHTML_ENTITIES. - -The regular expression for bare ampersands was too loose. In some -cases ampersands were not being escaped. (Sam Ruby?) - -Non-breaking spaces and other special Unicode space characters are no -longer folded to ASCII spaces. (Robert Leftwich) - -Information inside a TEXTAREA tag is now parsed literally, not as HTML -tags. TEXTAREA now works exactly the same way as SCRIPT. (Zephyr Fang) - -= 3.0.4 = - -Fixed a bug that crashed Unicode conversion in some cases. - -Fixed a bug that prevented UnicodeDammit from being used as a -general-purpose data scrubber. - -Fixed some unit test failures when running against Python 2.5. 
- -When considering whether to convert smart quotes, UnicodeDammit now -looks at the original encoding in a case-insensitive way. - -= 3.0.3 (20060606) = - -Beautiful Soup is now usable as a way to clean up invalid XML/HTML (be -sure to pass in an appropriate value for convertEntities, or XML/HTML -entities might stick around that aren't valid in HTML/XML). The result -may not validate, but it should be good enough to not choke a -real-world XML parser. Specifically, the output of a properly -constructed soup object should always be valid as part of an XML -document, but parts may be missing if they were missing in the -original. As always, if the input is valid XML, the output will also -be valid. - -= 3.0.2 (20060602) = - -Previously, Beautiful Soup correctly handled attribute values that -contained embedded quotes (sometimes by escaping), but not other kinds -of XML character. Now, it correctly handles or escapes all special XML -characters in attribute values. - -I aliased methods to the 2.x names (fetch, find, findText, etc.) for -backwards compatibility purposes. Those names are deprecated and if I -ever do a 4.0 I will remove them. I will, I tell you! - -Fixed a bug where the findAll method wasn't passing along any keyword -arguments. - -When run from the command line, Beautiful Soup now acts as an HTML -pretty-printer, not an XML pretty-printer. - -= 3.0.1 (20060530) = - -Reintroduced the "fetch by CSS class" shortcut. I thought keyword -arguments would replace it, but they don't. You can't call soup('a', -class='foo') because class is a Python keyword. - -If Beautiful Soup encounters a meta tag that declares the encoding, -but a SoupStrainer tells it not to parse that tag, Beautiful Soup will -no longer try to rewrite the meta tag to mention the new -encoding. Basically, this makes SoupStrainers work in real-world -applications instead of crashing the parser. 
- -= 3.0.0 "Who would not give all else for two p" (20060528) = - -This release is not backward-compatible with previous releases. If -you've got code written with a previous version of the library, go -ahead and keep using it, unless one of the features mentioned here -really makes your life easier. Since the library is self-contained, -you can include an old copy of the library in your old applications, -and use the new version for everything else. - -The documentation has been rewritten and greatly expanded with many -more examples. - -Beautiful Soup autodetects the encoding of a document (or uses the one -you specify), and converts it from its native encoding to -Unicode. Internally, it only deals with Unicode strings. When you -print out the document, it converts to UTF-8 (or another encoding you -specify). [Doc reference] - -It's now easy to make large-scale changes to the parse tree without -screwing up the navigation members. The methods are extract, -replaceWith, and insert. [Doc reference. See also Improving Memory -Usage with extract] - -Passing True in as an attribute value gives you tags that have any -value for that attribute. You don't have to create a regular -expression. Passing None for an attribute value gives you tags that -don't have that attribute at all. - -Tag objects now know whether or not they're self-closing. This avoids -the problem where Beautiful Soup thought that tags like <BR /> were -self-closing even in XML documents. You can customize the self-closing -tags for a parser object by passing them in as a list of -selfClosingTags: you don't have to subclass anymore. - -There's a new built-in parser, MinimalSoup, which has most of -BeautifulSoup's HTML-specific rules, but no tag nesting rules. [Doc -reference] - -You can use a SoupStrainer to tell Beautiful Soup to parse only part -of a document. This saves time and memory, often making Beautiful Soup -about as fast as a custom-built SGMLParser subclass. 
[Doc reference, -SoupStrainer reference] - -You can (usually) use keyword arguments instead of passing a -dictionary of attributes to a search method. That is, you can replace -soup(args={"id" : "5"}) with soup(id="5"). You can still use args if -(for instance) you need to find an attribute whose name clashes with -the name of an argument to findAll. [Doc reference: **kwargs attrs] - -The method names have changed to the better method names used in -Rubyful Soup. Instead of find methods and fetch methods, there are -only find methods. Instead of a scheme where you can't remember which -method finds one element and which one finds them all, we have find -and findAll. In general, if the method name mentions All or a plural -noun (eg. findNextSiblings), then it finds many elements -method. Otherwise, it only finds one element. [Doc reference] - -Some of the argument names have been renamed for clarity. For instance -avoidParserProblems is now parserMassage. - -Beautiful Soup no longer implements a feed method. You need to pass a -string or a filehandle into the soup constructor, not with feed after -the soup has been created. There is still a feed method, but it's the -feed method implemented by SGMLParser and calling it will bypass -Beautiful Soup and cause problems. - -The NavigableText class has been renamed to NavigableString. There is -no NavigableUnicodeString anymore, because every string inside a -Beautiful Soup parse tree is a Unicode string. - -findText and fetchText are gone. Just pass a text argument into find -or findAll. - -Null was more trouble than it was worth, so I got rid of it. Anything -that used to return Null now returns None. - -Special XML constructs like comments and CDATA now have their own -NavigableString subclasses, instead of being treated as oddly-formed -data. If you parse a document that contains CDATA and write it back -out, the CDATA will still be there. 
- -When you're parsing a document, you can get Beautiful Soup to convert -XML or HTML entities into the corresponding Unicode characters. [Doc -reference] - -= 2.1.1 (20050918) = - -Fixed a serious performance bug in BeautifulStoneSoup which was -causing parsing to be incredibly slow. - -Corrected several entities that were previously being incorrectly -translated from Microsoft smart-quote-like characters. - -Fixed a bug that was breaking text fetch. - -Fixed a bug that crashed the parser when text chunks that look like -HTML tag names showed up within a SCRIPT tag. - -THEAD, TBODY, and TFOOT tags are now nestable within TABLE -tags. Nested tables should parse more sensibly now. - -BASE is now considered a self-closing tag. - -= 2.1.0 "Game, or any other dish?" (20050504) = - -Added a wide variety of new search methods which, given a starting -point inside the tree, follow a particular navigation member (like -nextSibling) over and over again, looking for Tag and NavigableText -objects that match certain criteria. The new methods are findNext, -fetchNext, findPrevious, fetchPrevious, findNextSibling, -fetchNextSiblings, findPreviousSibling, fetchPreviousSiblings, -findParent, and fetchParents. All of these use the same basic code -used by first and fetch, so you can pass your weird ways of matching -things into these methods. - -The fetch method and its derivatives now accept a limit argument. - -You can now pass keyword arguments when calling a Tag object as though -it were a method. - -Fixed a bug that caused all hand-created tags to share a single set of -attributes. - -= 2.0.3 (20050501) = - -Fixed Python 2.2 support for iterators. - -Fixed a bug that gave the wrong representation to tags within quote -tags like <script>. - -Took some code from Mark Pilgrim that treats CDATA declarations as -data instead of ignoring them. - -Beautiful Soup's setup.py will now do an install even if the unit -tests fail. 
It won't build a source distribution if the unit tests
-fail, so I can't release a new version unless they pass.
-
-= 2.0.2 (20050416) =
-
-Added the unit tests in a separate module, and packaged it with
-distutils.
-
-Fixed a bug that sometimes caused renderContents() to return a Unicode
-string even if there was no Unicode in the original string.
-
-Added the done() method, which closes all of the parser's open
-tags. It gets called automatically when you pass in some text to the
-constructor of a parser class; otherwise you must call it yourself.
-
-Reinstated some backwards compatibility with 1.x versions: referencing
-the string member of a NavigableText object returns the NavigableText
-object instead of throwing an error.
-
-= 2.0.1 (20050412) =
-
-Fixed a bug that caused bad results when you tried to reference a tag
-name shorter than 3 characters as a member of a Tag, eg. tag.table.td.
-
-Made sure all Tags have the 'hidden' attribute so that an attempt to
-access tag.hidden doesn't spawn an attempt to find a tag named
-'hidden'.
-
-Fixed a bug in the comparison operator.
-
-= 2.0.0 "Who cares for fish?" (20050410) =
-
-Beautiful Soup version 1 was very useful but also pretty stupid. I
-originally wrote it without noticing any of the problems inherent in
-trying to build a parse tree out of ambiguous HTML tags. This version
-solves all of those problems to my satisfaction. It also adds many new
-clever things to make up for the removal of the stupid things.
-
-== Parsing ==
-
-The parser logic has been greatly improved, and the BeautifulSoup
-class should much more reliably yield a parse tree that looks like
-what the page author intended. For a particular class of odd edge
-cases that now causes problems, there is a new class,
-ICantBelieveItsBeautifulSoup.
-
-By default, Beautiful Soup now performs some cleanup operations on
-text before parsing it. This is to avoid common problems with bad
-definitions and self-closing tags that crash SGMLParser.
You can
-provide your own set of cleanup operations, or turn it off
-altogether. The cleanup operations include fixing self-closing tags
-that don't close, and replacing Microsoft smart quotes and similar
-characters with their HTML entity equivalents.
-
-You can now get a pretty-print version of parsed HTML to get a visual
-picture of how Beautiful Soup parses it, with the Tag.prettify()
-method.
-
-== Strings and Unicode ==
-
-There are separate NavigableText subclasses for ASCII and Unicode
-strings. These classes directly subclass the corresponding base data
-types. This means you can treat NavigableText objects as strings
-instead of having to call methods on them to get the strings.
-
-str() on a Tag always returns a string, and unicode() always returns
-Unicode. Previously it was inconsistent.
-
-== Tree traversal ==
-
-In a first() or fetch() call, the tag name or the desired value of an
-attribute can now be any of the following:
-
- * A string (matches that specific tag or that specific attribute value)
- * A list of strings (matches any tag or attribute value in the list)
- * A compiled regular expression object (matches any tag or attribute
-   value that matches the regular expression)
- * A callable object that takes the Tag object or attribute value as a
-   string. It returns None/false/empty string if the given string
-   doesn't match, and any other value if it does.
-
-This is much easier to use than SQL-style wildcards (see, regular
-expressions are good for something). Because of this, I took out
-SQL-style wildcards. I'll put them back if someone complains, but
-their removal simplifies the code a lot.
-
-You can use fetch() and first() to search for text in the parse tree,
-not just tags. There are new alias methods fetchText() and firstText()
-designed for this purpose. As with searching for tags, you can pass in
-a string, a regular expression object, or a method to match your text.
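The four kinds of match criteria listed for first() and fetch() can be sketched in plain Python. This is an illustrative sketch only, not Beautiful Soup's actual code: the helper name `matches` is hypothetical, and it assumes regular expressions are applied with `search()`, which the release notes do not specify.

```python
import re


def matches(criterion, value):
    """Return True if a tag name or attribute value satisfies the criterion.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    if callable(criterion):
        # A callable signals a match by returning any truthy value.
        return bool(criterion(value))
    if hasattr(criterion, "search"):
        # A compiled regular expression (assumption: search(), not match()).
        return criterion.search(value) is not None
    if isinstance(criterion, (list, tuple)):
        # A list of strings matches if the value equals any one of them.
        return value in criterion
    # A plain string must match exactly.
    return criterion == value


# One check per kind of criterion.
checks = [
    matches("td", "td"),
    matches(["td", "th"], "th"),
    matches(re.compile("^t[dh]$"), "th"),
    matches(lambda v: len(v) == 2, "td"),
]
print(all(checks))
```

The callable case is tested first because it is the most general; a compiled pattern object is not itself callable, so the order does not hide the regular expression branch.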
-
-If you pass in something besides a map to the attrs argument of
-fetch() or first(), Beautiful Soup will assume you want to match that
-thing against the "class" attribute. When you're scraping
-well-structured HTML, this makes your code a lot cleaner.
-
-1.x and 2.x both let you call a Tag object as a shorthand for
-fetch(). For instance, foo("bar") is a shorthand for
-foo.fetch("bar"). In 2.x, you can also access a specially-named member
-of a Tag object as a shorthand for first(). For instance, foo.barTag
-is a shorthand for foo.first("bar"). By chaining these shortcuts you
-traverse a tree in very little code: for header in
-soup.bodyTag.pTag.tableTag('th'):
-
-If an element relationship (like parent or next) doesn't apply to a
-tag, it'll now show up Null instead of None. first() will also return
-Null if you ask it for a nonexistent tag. Null is an object that's
-just like None, except you can do whatever you want to it and it'll
-give you Null instead of throwing an error.
-
-This lets you do tree traversals like soup.htmlTag.headTag.titleTag
-without having to worry if the intermediate stages are actually
-there. Previously, if there was no 'head' tag in the document, headTag
-in that instance would have been None, and accessing its 'titleTag'
-member would have thrown an AttributeError. Now, you can get what you
-want when it exists, and get Null when it doesn't, without having to
-do a lot of conditionals checking to see if every stage is None.
-
-There are two new relations between page elements: previousSibling and
-nextSibling. They reference the previous and next element at the same
-level of the parse tree. For instance, if you have HTML like this:
-
-  <p><ul><li>Foo<br /><li>Bar</ul>
-
-The first 'li' tag has a previousSibling of Null and its nextSibling
-is the second 'li' tag. The second 'li' tag has a nextSibling of Null
-and its previousSibling is the first 'li' tag. The previousSibling of
-the 'ul' tag is the first 'p' tag.
The nextSibling of 'Foo' is the
-'br' tag.
-
-I took out the ability to use fetch() to find tags that have a
-specific list of contents. See, I can't even explain it well. It was
-really difficult to use, I never used it, and I don't think anyone
-else ever used it. To the extent anyone did, they can probably use
-fetchText() instead. If it turns out someone needs it I'll think of
-another solution.
-
-== Tree manipulation ==
-
-You can add new attributes to a tag, and delete attributes from a
-tag. In 1.x you could only change a tag's existing attributes.
-
-== Porting Considerations ==
-
-There are three changes in 2.0 that break old code:
-
-In the post-1.2 release you could pass a function into fetch(). The
-function took a string, the tag name. In 2.0, the function takes the
-actual Tag object.
-
-It's no longer possible to pass in SQL-style wildcards to fetch(). Use
-a regular expression instead.
-
-The different parsing algorithm means the parse tree may not be shaped
-like you expect. This will only actually affect you if your code uses
-one of the affected parts. I haven't run into this problem yet while
-porting my code.
-
-= Between 1.2 and 2.0 =
-
-This is the release to get if you want Python 1.5 compatibility.
-
-The desired value of an attribute can now be any of the following:
-
- * A string
- * A string with SQL-style wildcards
- * A compiled RE object
- * A callable that returns None/false/empty string if the given value
-   doesn't match, and any other value otherwise.
-
-This is much easier to use than SQL-style wildcards (see, regular
-expressions are good for something). Because of this, I no longer
-recommend you use SQL-style wildcards. They may go away in a future
-release to clean up the code.
-
-Made Beautiful Soup handle processing instructions as text instead of
-ignoring them.
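For code that still relies on SQL-style wildcards, a pattern such as "%foo" can be rewritten as a compiled regular expression. The following is a minimal sketch under stated assumptions: it treats "%" as the only wildcard character and lets it match any run of characters, which these notes do not spell out, and the helper name `wildcard_to_regex` is hypothetical.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a '%'-style wildcard pattern into a compiled regex.

    Hypothetical helper; assumes '%' is the only wildcard character.
    """
    # Escape each literal piece, then let every '%' match any run of text.
    pieces = [re.escape(piece) for piece in pattern.split("%")]
    return re.compile("^" + ".*".join(pieces) + "$", re.DOTALL)


ends_with_foo = wildcard_to_regex("%foo")
print(bool(ends_with_foo.match("barfoo")))
print(bool(ends_with_foo.match("fo")))
```

Escaping the literal pieces keeps characters such as "." from being read as regular expression syntax, so only "%" behaves as a wildcard.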
-
-Applied patch from Richie Hindle (richie at entrian dot com) that
-makes tag.string a shorthand for tag.contents[0].string when the tag
-has only one string-owning child.
-
-Added still more nestable tags. The nestable tags thing won't work in
-a lot of cases and needs to be rethought.
-
-Fixed an edge case where searching for "%foo" would match any string
-shorter than "foo".
-
-= 1.2 "Who for such dainties would not stoop?" (20040708) =
-
-Applied patch from Ben Last (ben at benlast dot com) that made
-Tag.renderContents() correctly handle Unicode.
-
-Made BeautifulStoneSoup even dumber by making it not implicitly close
-a tag when another tag of the same type is encountered; only when an
-actual closing tag is encountered. This change courtesy of Fuzzy (mike
-at pcblokes dot com). BeautifulSoup still works as before.
-
-= 1.1 "Swimming in a hot tureen" =
-
-Added more 'nestable' tags. Changed popping semantics so that when a
-nestable tag is encountered, tags are popped up to the previously
-encountered nestable tag (of whatever kind). I will revert this if
-enough people complain, but it should make more people's lives easier
-than harder. This enhancement was suggested by Anthony Baxter (anthony
-at interlink dot com dot au).
-
-= 1.0 "So rich and green" (20040420) =
-
-Initial release.