Soupstars 🍲 ⭐ 💥
Soupstars makes it easier than ever to build web parsers in Python.
Install it with pip.
pip install soupstars
Let’s go!
Quickstart
You need two objects to get started.
>>> from soupstars import Parser, serialize
We’ll build a parser to extract data from a GitHub page.
>>> class GithubParser(Parser):
...     "Parse data from a github page"
...
...     @serialize
...     def title(self):
...         return str(self.h1.text.strip())
Now all we need is a GitHub web page to parse.
>>> parser = GithubParser("https://github.com/tjwaterman99/soupstars")
Let’s see what we’ve got!
>>> parser.to_dict()
{'title': 'tjwaterman99/soupstars'}
You’re now ready to start building your own web parsers with soupstars. Nice job. 🍻
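The `title` method above reads the text of the page's `<h1>` element and strips the surrounding whitespace. To try that step offline, here is a minimal sketch that uses only the standard library's `html.parser`. It is not how soupstars works internally; the `FirstH1Text` class and the `SAMPLE_HTML` string are hypothetical stand-ins for the parser and the downloaded page.

```python
from html.parser import HTMLParser


class FirstH1Text(HTMLParser):
    """Collect the text inside the first <h1> element of a document."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # greater than 0 while inside the first <h1>
        self._done = False   # True once the first <h1> has closed
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self._done:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "h1" and self._depth:
            self._depth -= 1
            self._done = self._depth == 0

    def handle_data(self, data):
        # Text inside nested tags (such as <a>) is collected too.
        if self._depth:
            self._chunks.append(data)

    @property
    def text(self):
        return "".join(self._chunks)


# Hypothetical page content, standing in for a downloaded response body.
SAMPLE_HTML = """
<html><body>
  <h1>
    <a href="/tjwaterman99">tjwaterman99</a>/<a href="/tjwaterman99/soupstars">soupstars</a>
  </h1>
  <h1>ignored</h1>
</body></html>
"""

extractor = FirstH1Text()
extractor.feed(SAMPLE_HTML)
title = extractor.text.strip()
print(title)
```

The `.strip()` call matters here: the raw heading text includes the newlines and indentation from the markup, which is why the quickstart parser strips it as well.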
Going further
We recommend starting with the examples.
Examples
Soupstars includes various examples that you can reference or use directly.
NYTimes
Extract article links and article metadata from nytimes.com.
class soupstars.examples.nytimes.NytimesArticleParser(url)
Parse attributes from a NY Times article.
>>> from soupstars.examples.nytimes import NytimesArticleParser
The author(s) of the article.
For more specific questions about using the library, refer to the API documentation.
API
The primary Parser class and serialize decorator are available from the models and serializers modules.
>>> from soupstars.models import Parser
>>> from soupstars.serializers import serialize
Those objects are also available from the top-level API.
>>> from soupstars import Parser, serialize
Models
The primary model provided by soupstars is the Parser class. It should generally be subclassed when building your own parsers.
When you initialize a parser with a URL, it automatically downloads the webpage at that URL and stores both the request and the response as attributes.
>>> from soupstars import Parser, serialize
>>> class MyParser(Parser):
...     @serialize
...     def item(self):
...         return 'An item!'
>>> parser = MyParser('https://jsonplaceholder.typicode.com/todos/1')
>>> print(parser.response)
<Response [200]>
>>> print(parser.request)
<PreparedRequest [GET]>
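The printed values above look like the `Response` and `PreparedRequest` objects from the requests library. To illustrate the pattern of fetching at construction time and keeping both sides of the exchange, here is a rough sketch that runs without a network connection. It is not soupstars' implementation; `EagerParser`, `FakeRequest`, `FakeResponse`, and `fake_fetch` are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a prepared HTTP request."""
    method: str
    url: str


@dataclass
class FakeResponse:
    """Hypothetical stand-in for an HTTP response."""
    status_code: int
    content: bytes


def fake_fetch(request):
    """Hypothetical transport: returns a canned response, no network used."""
    return FakeResponse(status_code=200, content=b'{"id": 1}')


class EagerParser:
    """Fetches in __init__ and stores the request and the response."""

    def __init__(self, url, fetch=fake_fetch):
        self.url = url
        self.request = FakeRequest(method="GET", url=url)
        # The download happens here, so the attributes are ready
        # as soon as the object exists.
        self.response = fetch(self.request)


parser = EagerParser("https://jsonplaceholder.typicode.com/todos/1")
print(parser.request.method, parser.response.status_code)
```

Passing the transport in as an argument is a design choice for this sketch only; it makes the class easy to exercise in tests without touching the network.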
Serializers
Serializers help convert parsers into storable objects. The functions defined in this module are used to instruct soupstars about how to perform the serialization.
soupstars.serializers.serialize(function)
Decorating a function defined on a parser with serialize instructs soupstars to include that function’s return value when building its own serialization.
>>> from soupstars import Parser, serialize
>>> class MyParser(Parser):
...     @serialize
...     def length(self):
...         return len(self.response.content)
...
>>> parser = MyParser('https://jsonplaceholder.typicode.com/todos/1')
>>> parser.serializer_names()
['length']
>>> 'length' in parser.to_dict()
True
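One common way to build a decorator like this is to tag the function with a marker attribute and have the base class collect every tagged method. The sketch below shows that general pattern in plain Python. It is an assumption about how such a mechanism could work, not a description of soupstars' internals: the `_is_serialized` attribute and the `SketchParser` class are hypothetical, and this toy `serialize` only mirrors the name of the real decorator.

```python
def serialize(function):
    """Mark a method so that to_dict() includes its return value."""
    function._is_serialized = True  # hypothetical marker attribute
    return function


class SketchParser:
    """Toy stand-in for a parser base class; it does no downloading."""

    def serializer_names(self):
        # Inspect the class rather than the instance, so nothing is
        # called while we are only collecting names.
        cls = type(self)
        return sorted(
            name
            for name in dir(cls)
            if getattr(getattr(cls, name), "_is_serialized", False)
        )

    def to_dict(self):
        # Call each marked method and key the result by method name.
        return {name: getattr(self, name)() for name in self.serializer_names()}


class MyParser(SketchParser):
    def __init__(self, content):
        self.content = content

    @serialize
    def length(self):
        return len(self.content)

    @serialize
    def first_word(self):
        return self.content.split()[0]

    def helper(self):  # not decorated, so it is left out
        return "ignored"


parser = MyParser("hello soupstars")
names = parser.serializer_names()
result = parser.to_dict()
print(names)
print(result)
```

Because the decorator returns the original function unchanged, decorated methods can still be called directly, for example `parser.length()`.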