Soupstars 🍲 ⭐ 💥

Soupstars makes it easier than ever to build web parsers in Python.

Install it with pip.

pip install soupstars

Let’s go!

Quickstart

You need two objects to get started.

>>> from soupstars import Parser, serialize

We’ll build a parser to extract data from a github page.

>>> class GithubParser(Parser):
...    "Parse data from a github page"
...
...    @serialize
...    def title(self):
...        return str(self.h1.text.strip())

Now all we need is a github web page to parse.

>>> parser = GithubParser("https://github.com/tjwaterman99/soupstars")

Let’s see what we’ve got!

>>> parser.to_dict()
{'title': 'tjwaterman99/soupstars'}

You’re now ready to start building your own web parsers with soupstars. Nice job. 🍻

Going further

We recommend starting with the examples.

Examples

Soupstars includes various examples that you can reference or use directly.

NYTimes

Extract article links and article metadata from nytimes.com

class soupstars.examples.nytimes.NytimesArticleParser(url)

Parse attributes from a NY times article.

>>> from soupstars.examples.nytimes import NytimesArticleParser
title()

The title of the article.

author()

The author(s) of the article.

class soupstars.examples.nytimes.NytimesLinkParser(url)

Parse the links from a NY times webpage.

Parameters: url (str) – The webpage to parse
>>> from soupstars.examples.nytimes import NytimesLinkParser

A list of links that point to NYTimes articles

non_article_links()

A list of links that point to NYTimes pages that are not articles.

Economist

Extract metadata from Economist index and article pages

class soupstars.examples.economist.WeeklyIndexPages(**kwargs)

Example model for storing the results of the parser

class soupstars.examples.economist.WeeklyIndexPageParser(url)

Parse metadata from the weekly-updated index pages

Model

alias of WeeklyIndexPages

base_url()

The url used for the request

article_date()

The date of the article

status_code()

Status code of the request

num_articles()

The number of articles found on the page

For more specific questions about using the library, refer to the api documentation.

API

The primary Parser class and serialize decorator are available from the models and serializers modules.

>>> from soupstars.models import Parser
>>> from soupstars.serializers import serialize

Those objects are also on the top-level api.

>>> from soupstars import Parser, serialize

Models

The primary model provided by soupstars is the Parser class. It should generally be subclassed when building your own parsers.

When you initialize a parser with a url, it automatically downloads the webpage at that url and stores both the request and response as attributes.

>>> from soupstars import Parser, serialize
>>> class MyParser(Parser):
...     @serialize
...     def item(self):
...         return 'An item!'
>>> parser = MyParser('https://jsonplaceholder.typicode.com/todos/1')
>>> print(parser.response)
<Response [200]>
>>> print(parser.request)
<PreparedRequest [GET]>
class soupstars.models.Parser(url)

Primary class for building parsers.

Parameters: url (str) – The url to parse
serializer_names()

Returns a list of the names of the functions to be serialized.

serializer_functions()

Returns a list of the functions to be serialized.

to_tuples()

Returns a list of (name, value) tuples of each function to be serialized.

to_dict()

Convert the parser to a dictionary whose keys are the serializer names and whose values are each serializer's return value.

to_json()

Convert the parser to a JSON object
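
The three serialization methods build on each other: to_dict behaves like dict(to_tuples()), and to_json is that dictionary rendered as JSON. The relationship can be sketched in plain Python (the tuple values below are made up for illustration, not real parser output):

```python
import json

# Hypothetical output of to_tuples() for a parser with two serializers
tuples = [("title", "tjwaterman99/soupstars"), ("status_code", 200)]

# to_dict() is effectively dict(to_tuples())
as_dict = dict(tuples)

# to_json() is effectively the dictionary dumped to JSON
as_json = json.dumps(as_dict)

print(as_dict)  # {'title': 'tjwaterman99/soupstars', 'status_code': 200}
```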

Serializers

Serializers help convert parsers into storable objects. The functions defined in this module are used to instruct soupstars about how to perform the serialization.

soupstars.serializers.serialize(function)

Decorating a function defined on a parser with serialize instructs soupstars to include that function’s return value when building its own serialization.

>>> from soupstars import Parser, serialize
>>> class MyParser(Parser):
...     @serialize
...     def length(self):
...         return len(self.response.content)
...
>>> parser = MyParser('https://jsonplaceholder.typicode.com/todos/1')
>>> parser.serializer_names()
['length']
>>> 'length' in parser.to_dict()
True
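
A decorator like serialize only needs to mark a method so the parser can discover it later by inspection. Here is a minimal sketch of that pattern (a simplified illustration, not soupstars' actual implementation; MiniParser and Example are hypothetical names):

```python
def serialize(function):
    # Flag the function so the parser can find it by attribute
    function._is_serializer = True
    return function

class MiniParser:
    def serializer_names(self):
        # Collect the methods flagged by the decorator
        return [name for name in dir(self)
                if getattr(getattr(self, name), "_is_serializer", False)]

    def to_dict(self):
        # Call each flagged method and map its name to its return value
        return {name: getattr(self, name)() for name in self.serializer_names()}

class Example(MiniParser):
    @serialize
    def greeting(self):
        return "hello"

print(Example().to_dict())  # {'greeting': 'hello'}
```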

Mixins

Soupstars offers mixins for saving Parser objects to databases.

SQLAlchemy Mixins

Mixins for saving parsers via SQLAlchemy

class soupstars.mixins.sqlalchemy_mixins.SqlalchemyMixin

Use as a mixin on a Parser to save its data to a SQLAlchemy model

load_session()

Loads a SQLAlchemy session object.

save()

Saves the parser to the SQLAlchemy Model defined on the class
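
The general shape of such a mixin can be sketched with the standard library's sqlite3 standing in for SQLAlchemy (a simplified illustration of the pattern, not soupstars' actual implementation; all class names below are hypothetical):

```python
import json
import sqlite3

class SqliteSaveMixin:
    """Adds a save() method that persists to_dict() output as JSON."""

    def load_session(self):
        # Stand-in for loading a SQLAlchemy session
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE IF NOT EXISTS results (data TEXT)")
        return conn

    def save(self):
        # Serialize the parser and insert it into the table
        conn = self.load_session()
        conn.execute("INSERT INTO results (data) VALUES (?)",
                     (json.dumps(self.to_dict()),))
        conn.commit()
        return conn

class FakeParser:
    # Stand-in for a real Parser subclass; no network request is made
    def to_dict(self):
        return {"title": "example"}

class SavedParser(SqliteSaveMixin, FakeParser):
    pass

conn = SavedParser().save()
rows = conn.execute("SELECT data FROM results").fetchall()
print(rows)  # [('{"title": "example"}',)]
```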