Scrapy is a great web scraping framework, but it lacks a good logging setup. In this short blog post, I'll show you how to use Structlog with Scrapy.
Note: I have an in-depth blog post on Django development and production logging with Structlog, which you can check out if you want to learn more. The next section is only a brief introduction and configuration example; for a thorough explanation, see that post. If you already have logging configured, skip ahead to the Scrapy section.
Configuring Structlog and Logging
Install Structlog using pip/poetry:
poetry add structlog
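Or with plain pip:
pip install structlog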
Configure Structlog to your liking; here is an example configuration:
import structlog
shared_structlog_processors = [
structlog.contextvars.merge_contextvars,
structlog.stdlib.add_logger_name,
structlog.stdlib.add_log_level,
    # Custom processor (defined elsewhere in the project) that attaches OpenTelemetry
    # span info; drop it if you don't use OpenTelemetry.
    add_open_telemetry_spans,
# Perform %-style formatting.
structlog.stdlib.PositionalArgumentsFormatter(),
# Add a timestamp in ISO 8601 format.
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.StackInfoRenderer(),
# If some value is in bytes, decode it to a unicode str.
structlog.processors.UnicodeDecoder(),
# Add callsite parameters.
structlog.processors.CallsiteParameterAdder(
{
structlog.processors.CallsiteParameter.FILENAME,
structlog.processors.CallsiteParameter.FUNC_NAME,
structlog.processors.CallsiteParameter.LINENO,
}
),
]
base_structlog_formatter = [structlog.stdlib.ProcessorFormatter.wrap_for_formatter]
structlog.configure(
processors=shared_structlog_processors + base_structlog_formatter, # type: ignore
logger_factory=structlog.stdlib.LoggerFactory(),
wrapper_class=structlog.stdlib.BoundLogger,
cache_logger_on_first_use=True,
)
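With this in place you can grab a logger anywhere and log with key-value pairs. A minimal sketch (the logger name and fields are just examples); the output is rendered by the formatters configured in the next snippet:
import structlog

logger = structlog.get_logger("your_project.example")
logger = logger.bind(request_id="abc123")  # bound values are attached to every event
logger.info("user_signed_up", user_id=42)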
Then configure Python's standard logging to use the Structlog formatters:
import os

DJANGO_LOG_LEVEL = os.getenv("DJANGO_LOG_LEVEL", "INFO")
DJANGO_SCRAPY_LOG_LEVEL = os.getenv("DJANGO_SCRAPY_LOG_LEVEL", "INFO")
LOGGING = {
"version": 1,
"disable_existing_loggers": False,
"formatters": {
"colored_console": {
"()": structlog.stdlib.ProcessorFormatter,
"processor": structlog.dev.ConsoleRenderer(colors=True),
"foreign_pre_chain": shared_structlog_processors,
},
"json_formatter": {
"()": structlog.stdlib.ProcessorFormatter,
"processor": structlog.processors.JSONRenderer(),
"foreign_pre_chain": shared_structlog_processors,
},
},
"handlers": {
"console": {
"class": "logging.StreamHandler",
"formatter": "colored_console",
},
"json": {
"class": "logging.StreamHandler",
"formatter": "json_formatter",
},
"null": {
"class": "logging.NullHandler",
},
},
"root": {
"handlers": ["console"],
"level": "WARNING",
},
"loggers": {
...
# Your project
"your_project": {
"level": DJANGO_LOG_LEVEL,
},
# Scrapy
"scrapy": {
"level": DJANGO_SCRAPY_LOG_LEVEL,
},
},
}
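If this lives in your Django settings, Django applies it automatically at startup. In a standalone script you would apply the dict yourself, roughly like this:
from logging.config import dictConfig

dictConfig(LOGGING)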
Now we want to use the above LOGGING config with Scrapy.
Configuring Scrapy
First, we'll disable Scrapy's root log handler and configure the Scrapy logger to go through Structlog. The root handler is a problem because it adds a second logging output that uses Scrapy's own configuration, while we want everything to flow through our Structlog setup. We use a generic function called run_scraper, which creates a CrawlerProcess with our Scrapy settings and runs the process. In this function we pass install_root_handler=False and call dictConfig so that the previously configured Python LOGGING is applied. Check the example below:
from logging.config import dictConfig
from django.conf import settings
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

@celery_app.task(time_limit=time_limit, soft_time_limit=soft_time_limit)
def scrape_page_xyz():
    run_scraper(MySpider)

def run_scraper(scraper_cls):
    # install_root_handler=False stops Scrapy from adding its own root log handler.
    process = CrawlerProcess(get_project_settings(), install_root_handler=False)  # Important
    # Re-apply our LOGGING dict so the Structlog formatters handle Scrapy's output.
    dictConfig(settings.LOGGING)  # Important
    process.crawl(scraper_cls)
    process.start()
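If you don't trigger your scrapers from Celery, the task decorator isn't needed; the important parts are install_root_handler=False and the dictConfig call, which work the same when run_scraper is called from a management command or a plain script.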
Now we want to override the logger property for each spider using a base class.
import structlog
from scrapy.spiders import Spider as BaseScrapySpider

class BaseSpider(BaseScrapySpider):
    @property
    def logger(self):
        # Return a Structlog logger bound to this spider instead of Scrapy's default adapter.
        spider_logger = structlog.get_logger(self.name)
        return spider_logger.bind(spider=self)
Each spider should inherit from the BaseSpider class.
class Spider(BaseSpider): # pylint: disable=abstract-method
...
Now you can use the logger in your spider like this:
class Spider(BaseSpider):
def my_function(self, response):
self.logger.info("Scraping page", url=response.url)
The above will use the Structlog logger instead of the default Scrapy logger.
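Because BaseSpider binds spider=self, every event logged from the spider also carries the spider context alongside fields like url.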
This allows you to use the same logger across all your Python projects, which is a great advantage. Scrapy's custom logging annoyed me for quite a long time, and I'm glad to have finally found a solution. You can also use the logger in your pipelines, middlewares, and so on.
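For example, a pipeline can get its own Structlog logger the same way; here is a minimal sketch (the pipeline name and the logged fields are hypothetical):
import structlog

class ItemLoggingPipeline:
    def __init__(self):
        self.logger = structlog.get_logger(self.__class__.__name__)

    def process_item(self, item, spider):
        # Key-value pairs show up as structured fields in the console/JSON output.
        self.logger.info("item_scraped", spider=spider.name, item_type=type(item).__name__)
        return item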