Soupstars
Soupstars makes it easier than ever to build web parsers in Python.
Install it with pip.
pip install soupstars
Let's go!
Quickstart
You need two objects to get started.
>>> from soupstars import Parser, serialize
We'll build a parser to extract data from a GitHub page.
>>> class GithubParser(Parser):
...     "Parse data from a github page"
...
...     @serialize
...     def title(self):
...         return str(self.h1.text.strip())
Now all we need is a GitHub web page to parse.
>>> parser = GithubParser("https://github.com/tjwaterman99/soupstars")
Let's see what we've got!
>>> parser.to_dict()
{'title': 'tjwaterman99/soupstars'}
You're now ready to start building your own web parsers with soupstars. Nice job.
Going further
We recommend starting with the examples.
Examples
Soupstars includes various examples that you can reference or use directly.
NYTimes
Extract article links and article metadata from nytimes.com
class soupstars.examples.nytimes.NytimesArticleParser(url)
Parse attributes from a NY times article.
>>> from soupstars.examples.nytimes import NytimesArticleParser
The author(s) of the article.
Economist
Extract metadata from economist index and article pages
class soupstars.examples.economist.WeeklyIndexPages(**kwargs)
Example model for storing the results of the parser
class soupstars.examples.economist.WeeklyIndexPageParser(url)
Parse metadata from the weekly updated index pages
Model: alias of WeeklyIndexPages
For more specific questions about using the library, refer to the API documentation.
API
The primary Parser class and serialize decorator are available from the models and serializers modules.
>>> from soupstars.models import Parser
>>> from soupstars.serializers import serialize
Those objects are also available from the top-level API.
>>> from soupstars import Parser, serialize
Models
The primary model provided by soupstars is the Parser class. It should generally be subclassed when building your own parsers.
When you initialize a parser with a URL, it automatically downloads the web page at that URL and stores both the request and the response as attributes.
>>> from soupstars import Parser, serialize
>>> class MyParser(Parser):
...     @serialize
...     def item(self):
...         return 'An item!'
>>> parser = MyParser('https://jsonplaceholder.typicode.com/todos/1')
>>> print(parser.response)
<Response [200]>
>>> print(parser.request)
<PreparedRequest [GET]>
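The download-on-initialization behavior described above can be pictured with a small standard-library sketch. This is not how soupstars is implemented: the `FetchingParser` class and its `content` attribute are hypothetical names, and `urllib` stands in for whatever HTTP client the library uses. A `data:` URL keeps the demo offline.

```python
import urllib.request


class FetchingParser:
    """Hypothetical sketch: fetch the page as soon as the parser is created."""

    def __init__(self, url):
        # Build the request first so it stays available for inspection.
        self.request = urllib.request.Request(url)
        # Download immediately and keep the raw body on the instance.
        with urllib.request.urlopen(self.request) as response:
            self.content = response.read()


# A data: URL avoids network access; an http(s) URL would be fetched the same way.
parser = FetchingParser("data:text/html,<h1>tjwaterman99/soupstars</h1>")
print(parser.request.full_url)
print(parser.content)
```

Doing the download in the constructor means every method on the parser can assume the page is already available, at the cost of a network call each time a parser is created.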
Serializers
Serializers help convert parsers into storable objects. The functions defined in this module are used to instruct soupstars about how to perform the serialization.
soupstars.serializers.serialize(function)
Decorating a function defined on a parser with serialize instructs soupstars to include that function's return value when building its own serialization.
>>> from soupstars import Parser, serialize
>>> class MyParser(Parser):
...     @serialize
...     def length(self):
...         return len(self.response.content)
...
>>> parser = MyParser('https://jsonplaceholder.typicode.com/todos/1')
>>> parser.serializer_names()
['length']
>>> 'length' in parser.to_dict()
True
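To see how a marker decorator of this kind could drive serialization, here is a minimal standalone sketch. It is an assumption about the general technique, not the soupstars source: the `_is_serialized` attribute, the `SketchParser` base class, and its constructor taking raw content are all hypothetical, and the `serialize` defined here only imitates the library's decorator.

```python
def serialize(function):
    """Tag a method so the sketch parser includes its return value."""
    function._is_serialized = True  # hypothetical marker attribute
    return function


class SketchParser:
    """Stand-in base class; it does no downloading or HTML parsing."""

    def __init__(self, content):
        self.content = content

    def serializer_names(self):
        # Inspect the class, not the instance, and keep only tagged methods.
        cls = type(self)
        return [
            name
            for name in dir(cls)
            if getattr(getattr(cls, name, None), "_is_serialized", False)
        ]

    def to_dict(self):
        # Call each tagged method and collect the results by method name.
        return {name: getattr(self, name)() for name in self.serializer_names()}


class LengthParser(SketchParser):
    @serialize
    def length(self):
        return len(self.content)

    def helper(self):
        # Not decorated, so it is left out of the serialization.
        return "ignored"


parser = LengthParser(b"hello")
print(parser.serializer_names())
print(parser.to_dict())
```

Tagging the function rather than registering it in a global table keeps each parser class self-describing: subclasses inherit the serialized methods of their parents without extra bookkeeping.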