Custom Python Logging For Scrapy

Scrapy is a great web scraping framework, but it lacks a good logging setup. In this short blog post, I'll show you how to use Structlog with Scrapy.

Note: I have an in-depth blog post on Django development and production logging with Structlog, which you can check out if you want to learn more. The next section is only a brief introduction and example configuration; for a detailed explanation, see that post. If you already have logging configured, skip ahead to the Configuring Scrapy section.

Configuring Structlog and Logging

Install Structlog using pip/poetry:

poetry add structlog
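
Or, if you use pip:

pip install structlog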

Configure Structlog to your liking; here is an example configuration:

import structlog

shared_structlog_processors = [
    structlog.contextvars.merge_contextvars,
    structlog.stdlib.add_logger_name,
    structlog.stdlib.add_log_level,
    add_open_telemetry_spans,  # custom processor; see the sketch after this block
    # Perform %-style formatting.
    structlog.stdlib.PositionalArgumentsFormatter(),
    # Add a timestamp in ISO 8601 format.
    structlog.processors.TimeStamper(fmt="iso"),
    structlog.processors.StackInfoRenderer(),
    # If some value is in bytes, decode it to a unicode str.
    structlog.processors.UnicodeDecoder(),
    # Add callsite parameters.
    structlog.processors.CallsiteParameterAdder(
        {
            structlog.processors.CallsiteParameter.FILENAME,
            structlog.processors.CallsiteParameter.FUNC_NAME,
            structlog.processors.CallsiteParameter.LINENO,
        }
    ),
]

base_structlog_formatter = [structlog.stdlib.ProcessorFormatter.wrap_for_formatter]

structlog.configure(
    processors=shared_structlog_processors + base_structlog_formatter,  # type: ignore
    logger_factory=structlog.stdlib.LoggerFactory(),
    wrapper_class=structlog.stdlib.BoundLogger,
    cache_logger_on_first_use=True,
)
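
The add_open_telemetry_spans entry is a custom processor that isn't defined here. A Structlog processor is just a callable taking (logger, method_name, event_dict) and returning the event dict, so you can simply drop that line if you don't use OpenTelemetry. For reference, here is a rough sketch of one way to implement it:

from opentelemetry import trace


def add_open_telemetry_spans(logger, method_name, event_dict):
    """Attach the current OpenTelemetry trace/span ids to every log entry."""
    span = trace.get_current_span()
    ctx = span.get_span_context()
    if ctx.is_valid:
        event_dict["trace_id"] = format(ctx.trace_id, "032x")
        event_dict["span_id"] = format(ctx.span_id, "016x")
    return event_dict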

Configuring Python logging with Structlog:

DJANGO_LOG_LEVEL = os.getenv("DJANGO_LOG_LEVEL", "INFO")
DJANGO_SCRAPY_LOG_LEVEL = os.getenv("DJANGO_SCRAPY_LOG_LEVEL", "INFO")
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "colored_console": {
            "()": structlog.stdlib.ProcessorFormatter,
            "processor": structlog.dev.ConsoleRenderer(colors=True),
            "foreign_pre_chain": shared_structlog_processors,
        },
        "json_formatter": {
            "()": structlog.stdlib.ProcessorFormatter,
            "processor": structlog.processors.JSONRenderer(),
            "foreign_pre_chain": shared_structlog_processors,
        },
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "colored_console",
        },
        "json": {
            "class": "logging.StreamHandler",
            "formatter": "json_formatter",
        },
        "null": {
            "class": "logging.NullHandler",
        },
    },
    "root": {
        "handlers": ["console"],
        "level": "WARNING",
    },
    "loggers": {
        # ...
        # Your project
        "your_project": {
            "level": DJANGO_LOG_LEVEL,
        },
        # Scrapy
        "scrapy": {
            "level": DJANGO_SCRAPY_LOG_LEVEL,
        },
    },
}
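
Django applies the LOGGING dict automatically on startup, so nothing else is needed to activate it. Both Structlog loggers and plain stdlib loggers (such as Scrapy's) then render through the same formatters thanks to foreign_pre_chain. A quick sanity check you can run from a Django shell:

import logging

import structlog

# A Structlog logger: key-value pairs are rendered as structured fields.
structlog.get_logger("your_project.checks").info("structlog_works", answer=42)

# A plain stdlib logger (like Scrapy's) goes through foreign_pre_chain
# and ends up in the same colored console output.
logging.getLogger("scrapy.core.engine").warning("stdlib logging works too")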

Now we want to use the above LOGGING config with Scrapy.

Configuring Scrapy

First, we disable Scrapy's root log handler and route everything through our Structlog setup instead. The root handler that Scrapy installs is a problem because it adds a second logging output using Scrapy's own configuration, while we want our Structlog configuration to be the only one. We use a generic function called run_scraper which creates a CrawlerProcess with our Scrapy settings and runs the process. In this function we pass install_root_handler=False and call dictConfig so that the previously configured Python LOGGING is applied. Check the example below:

from logging.config import dictConfig

from django.conf import settings
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


@celery_app.task(time_limit=time_limit, soft_time_limit=soft_time_limit)
def scrape_page_xyz():
    run_scraper(MySpider)


def run_scraper(scraper_cls):
    # install_root_handler=False stops Scrapy from installing its own root handler.
    process = CrawlerProcess(get_project_settings(), install_root_handler=False)
    # Apply our own LOGGING config (the dict from the previous section).
    dictConfig(settings.LOGGING)
    process.crawl(scraper_cls)
    process.start()
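
If you don't run your spiders through Celery, the same run_scraper helper can be called from, for example, a Django management command. A minimal sketch, where the myproject.scrapers import paths and MySpider are placeholders for your own project layout:

from django.core.management.base import BaseCommand

from myproject.scrapers.runner import run_scraper  # placeholder module path
from myproject.scrapers.spiders import MySpider  # placeholder spider


class Command(BaseCommand):
    help = "Run MySpider with the Structlog-based logging setup."

    def handle(self, *args, **options):
        run_scraper(MySpider)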

Now we want to override the logger for each spider using a base class.

import structlog
from scrapy.spiders import Spider as BaseScrapySpider


class BaseSpider(BaseScrapySpider):

    @property
    def logger(self):
        # Return a Structlog logger bound to this spider instead of Scrapy's default.
        spider_logger = structlog.get_logger(self.name)
        return spider_logger.bind(spider=self)

Each spider should inherit from the BaseSpider class.

class Spider(BaseSpider):  # pylint: disable=abstract-method
    ...

Now you can use the logger in your spider like this:

class Spider(BaseSpider):
    def my_function(self, response):
        self.logger.info("Scraping page", url=response.url)

The above will use the Structlog logger instead of the default Scrapy logger.

This lets you use the same logging setup across all your Python projects, which is a great advantage. Scrapy's custom logging annoyed me for quite a long time, and this finally solved it. The same pattern also works in pipelines, middlewares, and extensions; for example:
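
A minimal item pipeline sketch (PricePipeline is just an illustrative name):

import structlog


class PricePipeline:
    def __init__(self):
        self.logger = structlog.get_logger(self.__class__.__name__)

    def process_item(self, item, spider):
        # The spider name becomes structured context on every log line.
        self.logger.info("item_processed", spider=spider.name)
        return item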

