Scrapy is a great web scraping framework, but it lacks a good logging setup. In this short blog post, I'll show you how to use Structlog with Scrapy.
Note: I have an in-depth blog post on Django development and production logging with Structlog, which you can check out if you want to learn more. The next section is only a brief introduction and configuration example; for a thorough explanation, see that post. If you already have logging configured, skip ahead to the Scrapy section.
Configuring Structlog and Logging
Install Structlog using pip/poetry:
poetry add structlog
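Or with plain pip:
pip install structlog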
Configure Structlog to your liking; here is an example configuration:
import structlog
shared_structlog_processors = [
structlog.contextvars.merge_contextvars,
structlog.stdlib.add_logger_name,
structlog.stdlib.add_log_level,
    # Custom processor (defined elsewhere in the project) that attaches OpenTelemetry
    # span info; drop it if you don't use OpenTelemetry.
    add_open_telemetry_spans,
# Perform %-style formatting.
structlog.stdlib.PositionalArgumentsFormatter(),
# Add a timestamp in ISO 8601 format.
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.StackInfoRenderer(),
# If some value is in bytes, decode it to a unicode str.
structlog.processors.UnicodeDecoder(),
# Add callsite parameters.
structlog.processors.CallsiteParameterAdder(
{
structlog.processors.CallsiteParameter.FILENAME,
structlog.processors.CallsiteParameter.FUNC_NAME,
structlog.processors.CallsiteParameter.LINENO,
}
),
]
base_structlog_formatter = [structlog.stdlib.ProcessorFormatter.wrap_for_formatter]
structlog.configure(
processors=shared_structlog_processors + base_structlog_formatter, # type: ignore
logger_factory=structlog.stdlib.LoggerFactory(),
wrapper_class=structlog.stdlib.BoundLogger,
cache_logger_on_first_use=True,
)
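With this in place you can grab a logger anywhere and log with key-value pairs. A minimal sketch (the logger name and fields are just examples); the output is rendered by the formatters configured in the next snippet:
import structlog

logger = structlog.get_logger("your_project.example")
logger = logger.bind(request_id="abc123")  # bound values are attached to every event
logger.info("user_signed_up", user_id=42)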
Then configure Python's standard logging to use the Structlog formatters:
import os

DJANGO_LOG_LEVEL = os.getenv("DJANGO_LOG_LEVEL", "INFO")
DJANGO_SCRAPY_LOG_LEVEL = os.getenv("DJANGO_SCRAPY_LOG_LEVEL", "INFO")
LOGGING = {
"version": 1,
"disable_existing_loggers": False,
"formatters": {
"colored_console": {
"()": structlog.stdlib.ProcessorFormatter,
"processor": structlog.dev.ConsoleRenderer(colors=True),
"foreign_pre_chain": shared_structlog_processors,
},
"json_formatter": {
"()": structlog.stdlib.ProcessorFormatter,
"processor": structlog.processors.JSONRenderer(),
"foreign_pre_chain": shared_structlog_processors,
},
},
"handlers": {
"console": {
"class": "logging.StreamHandler",
"formatter": "colored_console",
},
"json": {
"class": "logging.StreamHandler",
"formatter": "json_formatter",
},
"null": {
"class": "logging.NullHandler",
},
},
"root": {
"handlers": ["console"],
"level": "WARNING",
},
"loggers": {
...
# Your project
"your_project": {
"level": DJANGO_LOG_LEVEL,
},
# Scrapy
"scrapy": {
"level": DJANGO_SCRAPY_LOG_LEVEL,
},
},
}
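If this lives in your Django settings, Django applies it automatically at startup. In a standalone script you would apply the dict yourself, roughly like this:
from logging.config import dictConfig

dictConfig(LOGGING)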
Now we want to use the above LOGGING config with Scrapy.
Configuring Scrapy
First, we'll disable Scrapy's root log handler and configure the Scrapy logger to go through Structlog. The root handler is a problem because it adds a second logging output that uses Scrapy's own configuration, while we want everything to flow through our Structlog setup. We use a generic function called run_scraper, which creates a CrawlerProcess with our Scrapy settings and runs the process. In this function we pass install_root_handler=False and call dictConfig so that the previously configured Python LOGGING is applied. Check the example below:
from logging.config import dictConfig
from django.conf import settings
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

@celery_app.task(time_limit=time_limit, soft_time_limit=soft_time_limit)
def scrape_page_xyz():
    run_scraper(MySpider)

def run_scraper(scraper_cls):
    # install_root_handler=False stops Scrapy from adding its own root log handler.
    process = CrawlerProcess(get_project_settings(), install_root_handler=False)  # Important
    # Re-apply our LOGGING dict so the Structlog formatters handle Scrapy's output.
    dictConfig(settings.LOGGING)  # Important
    process.crawl(scraper_cls)
    process.start()
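If you don't trigger your scrapers from Celery, the task decorator isn't needed; the important parts are install_root_handler=False and the dictConfig call, which work the same when run_scraper is called from a management command or a plain script.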
Now we want to override the logger property for each spider using a base class.
import structlog
from scrapy.spiders import Spider as BaseScrapySpider

class BaseSpider(BaseScrapySpider):
    @property
    def logger(self):
        # Return a Structlog logger bound to this spider instead of Scrapy's default adapter.
        spider_logger = structlog.get_logger(self.name)
        return spider_logger.bind(spider=self)
Each spider should inherit from the BaseSpider class.
class Spider(BaseSpider): # pylint: disable=abstract-method
...
Now you can use the logger in your spider like this:
class Spider(BaseSpider):
def my_function(self, response):
self.logger.info("Scraping page", url=response.url)
The above will use the Structlog logger instead of the default Scrapy logger.
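Because BaseSpider binds spider=self, every event logged from the spider also carries the spider context alongside fields like url.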
This allows you to use the same logger across all your Python projects, which is a great advantage. Scrapy's custom logging annoyed me for quite a long time, and I'm glad to have finally found a solution. You can also use the logger in your pipelines, middlewares, and so on.
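For example, a pipeline can get its own Structlog logger the same way; here is a minimal sketch (the pipeline name and the logged fields are hypothetical):
import structlog

class ItemLoggingPipeline:
    def __init__(self):
        self.logger = structlog.get_logger(self.__class__.__name__)

    def process_item(self, item, spider):
        # Key-value pairs show up as structured fields in the console/JSON output.
        self.logger.info("item_scraped", spider=spider.name, item_type=type(item).__name__)
        return item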