chaining error-prone steps: webscraping and monads
a lot of code starts out pretty clean, and then ends up looking like a soup of defensive programming and try/except blocks.
i’m going to try to illustrate how monads fall out as a clean solution to this problem, without any category theory or other math jargon
here’s a web-scraper for a data pipeline:
import requests
from bs4 import BeautifulSoup

def scrape_movie_data(movie_id):
    # get the movie page
    try:
        response = requests.get(f"https://moviedb.com/movie/{movie_id}")
        if response.status_code != 200:
            return None, f"HTTP {response.status_code}"
    except requests.exceptions.ConnectionError:
        return None, "No internet connection"
    # parse the title
    try:
        soup = BeautifulSoup(response.text, 'html.parser')  # BeautifulSoup rarely fails, but i'm trying to make a point here
        title = soup.find('h1', class_='movie-title').text.strip()
    except:
        return None, "Failed to parse title"
    # get the rating page
    try:
        rating_response = requests.get(f"https://moviedb.com/movie/{movie_id}/ratings")
        if rating_response.status_code != 200:
            return None, f"Rating page HTTP {rating_response.status_code}"
    except requests.exceptions.ConnectionError:
        return None, "No internet connection for ratings"
    # parse the rating
    try:
        rating_soup = BeautifulSoup(rating_response.text, 'html.parser')
        rating = float(rating_soup.find('span', class_='avg-rating').text)
    except:
        return None, "Failed to parse rating"
    return {'title': title, 'rating': rating}, None
there are a ton of different potential error cases we’re defending against: failing at the network level because i’m not connected to wifi, failing authentication, page-structure variance breaking parsing/extraction, etc.
this makes the code pretty messy for what should be a relatively simple/readable data pipeline
notice the pattern i keep having to write:
- try to do something
- check if it failed
- return an error if it did
- thread into the next step, or short-circuit on failure
try:
    result = some_operation()
    if something_went_wrong:
        return None, "error message"
except Exception:
    return None, "different error message"
i don’t want all these try/excepts clogging up the readability of my code.
i’d like to write something like
def scrape_movie_data(movie_id: str):
    movie_url = f"https://moviedb.com/movie/{movie_id}"
    rating_url = f"https://moviedb.com/movie/{movie_id}/ratings"
    response = fetch_url(movie_url)
    soup = parse_html(response.text)
    ...
way cleaner! but there’s a problem. what happens if fetch_url returns an error string?
response = fetch_url("https://moviedb.com/movie/12345")
# response = "No internet connection"
soup = parse_html(response.text)  # fails

now parse_html(response.text) fails, because response is an error string rather than a response object.
i’d like a nice way to chain these operations together, maybe something like
def maybe_chain(fn, arg):
    if arg is None:  # treat None as "error"
        return None
    try:
        return fn(arg)
    except:
        return None
now i can write
def scrape_movie_data(movie_id: str):
    movie_url = f"https://moviedb.com/movie/{movie_id}"

    def safe_get(url):
        try:
            response = requests.get(url)
            if response.status_code != 200:
                return None
            return response.text
        except:
            return None

    def safe_parse(html):
        return BeautifulSoup(html, 'html.parser')

    def safe_extract_title(soup):
        return soup.find('h1', class_='movie-title').text.strip()

    html = safe_get(movie_url)
    soup = maybe_chain(safe_parse, html)
    title = maybe_chain(safe_extract_title, soup)
    return title
but now i lose information about what actually went wrong at each step! was it a network error? a missing element in a page? with None, we can’t easily find out.
perhaps there’s a better way to represent “failure” than using none. we need a way to carry both success values AND error messages through our pipeline.
here’s an idea: what if we could tag a result as “success” or “fail”?
so instead of returning something like '<html>', we’d return Success('<html>').
if we do this, we could write the function so that it automatically short-circuits and passes through on failure.
from typing import Union, Callable
from dataclasses import dataclass

@dataclass
class Success:
    value: object

@dataclass
class Error:
    message: str
now we can return either success or error with each function:
def fetch_url(url: str) -> Union[Success, Error]:
    try:
        response = requests.get(url)
        if response.status_code != 200:
            return Error(f"HTTP {response.status_code}")
        return Success(response.text)
    except requests.exceptions.ConnectionError:
        return Error("No internet connection")
    except Exception as e:
        return Error(f"Request failed: {e}")

def parse_html(html: str) -> Union[Success, Error]:
    try:
        soup = BeautifulSoup(html, 'html.parser')
        return Success(soup)
    except Exception as e:
        return Error(f"HTML parsing failed: {e}")

def extract_title(soup) -> Union[Success, Error]:
    try:
        title = soup.find('h1', class_='movie-title').text.strip()
        return Success(title)
    except Exception as e:
        return Error(f"Title extraction failed: {e}")
cleaner, and we can see which step failed and why.
but i still have to pepper isinstance checks through my code, and we’re kind of still doing the same thing as before:
- check if the previous result was an error
- if yes, return the error
- else, unwrap the value from the Success and pass it to the next function
what we’re really trying to abstract out is something like

if isinstance(result, Error):
    return result
else:
    # extract the value and do something with it

which can chain operations like parse_html and extract_title. so we’ll write a chain function to do it.
this might look like
fetch_url(movie_url).chain(parse_html).chain(extract_title)
but then we’d have to give our Success/Error classes methods.
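for comparison, here's a hedged sketch of what that method-based version might look like (this isn't the direction we'll take; the toy double step is just for illustration):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Success:
    value: object
    def chain(self, func: Callable) -> "Success | Error":
        # apply the next step to the unwrapped value
        return func(self.value)

@dataclass
class Error:
    message: str
    def chain(self, func: Callable) -> "Success | Error":
        # short-circuit: ignore func, pass the error through
        return self

# a toy step that returns a wrapped result
double = lambda x: Success(x * 2)

result = Success(5).chain(double).chain(double)
# result == Success(20)
err = Error("boom").chain(double)
# err == Error("boom"), double was never called
```

this reads nicely, but it couples the chaining logic to the wrapper classes themselves.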
what if we could write a pipeline function that could automatically build our chain of functions?
def pipeline(initial_result: Union[Success, Error], *functions: Callable) -> Union[Success, Error]:
    result = initial_result
    for func in functions:
        result = chain(func, result)
    return result
def scrape_movie_data(movie_id: str) -> Union[Success, Error]:
    movie_url = f"https://moviedb.com/movie/{movie_id}"
    return pipeline(
        fetch_url(movie_url),
        parse_html,
        extract_title
    )
but wait, if chain is the one peeling off the Success/Error wrapper, what signature should the next function actually have?
we need to decide on a convention: will our data pipeline functions all operate in terms of Union[Success, Error], or be agnostic to the concept?
it makes sense to let the programmer decide if a given function succeeded or failed. as a trivial example, a divide function knows that dividing by zero should be an Error. but it shouldn’t have to do the manual job of unwrapping its inputs.
so each of our work functions will take unwrapped values (like a plain string) and return wrapped values. since we’ll write this union everywhere, we may as well give it a name (you might recognize this pattern from Rust and other languages):

Result = Union[Success, Error]

and refactor our work functions accordingly:
def parse_html(html: str) -> Result:
    try:
        soup = BeautifulSoup(html, 'html.parser')
        return Success(soup)
    except Exception as e:
        return Error(f"HTML parsing failed: {e}")
now chain can just unwrap values, and each function has the responsibility of deciding how to wrap its result:

def chain(func: Callable, wrapped_value: Result) -> Result:
    if isinstance(wrapped_value, Error):
        return wrapped_value  # short-circuit: pass error through
    else:
        # unwrap the value and apply the function
        # the function will return a new Result (Success or Error)
        return func(wrapped_value.value)
which allows us to write something like:
def scrape_movie_data(movie_id: str) -> Result:
    movie_url = f"https://moviedb.com/movie/{movie_id}"
    return pipeline(
        fetch_url(movie_url),
        parse_html,
        extract_title
    )
let’s trace through the code:
# if everything works:
html_result = fetch_url("https://moviedb.com/movie/12345")
# html_result = Success("<html>...")
soup_result = chain(parse_html, html_result)
# chain() unwraps html_result.value = "<html>..."
# calls parse_html("<html>...")
# parse_html returns Success(BeautifulSoup object)
# soup_result = Success(<soup>)
title_result = chain(extract_title, soup_result)
# chain() unwraps soup_result.value = <soup>
# calls extract_title(<soup>)
# extract_title returns Success("The Matrix")
# title_result = Success("The Matrix")
and if there’s an error:
# If network fails:
html_result = fetch_url("https://moviedb.com/movie/12345")
# html_result = Error("No internet connection")
soup_result = chain(parse_html, html_result)
# chain() sees html_result is Error
# returns Error("No internet connection") without calling parse_html
# soup_result = Error("No internet connection")
title_result = chain(extract_title, soup_result)
# chain() sees soup_result is Error
# returns Error("No internet connection") without calling extract_title
# title_result = Error("No internet connection")
this pattern we’ve discovered - wrapping values in Success/Error and chaining operations that might fail - is actually a well-studied concept in functional programming.
being able to wrap values, and chain them in a predictable way (following associativity/identity laws), means our error handling composes well no matter how we nest or combine operations. this pattern is a monad!
what we called chain is traditionally called bind, and this Success/Error pattern is known as the Result monad (or the Either monad in some languages).
(knowing these names helps when you want to learn more, or use existing libraries that implement these patterns)
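to make the associativity/identity claim concrete, here's a quick self-contained check of the three monad laws for our chain function (the inc and dbl steps are just toy examples):

```python
from dataclasses import dataclass

@dataclass
class Success:
    value: object

@dataclass
class Error:
    message: str

def chain(func, wrapped):
    # short-circuit on Error, otherwise unwrap and apply
    if isinstance(wrapped, Error):
        return wrapped
    return func(wrapped.value)

inc = lambda x: Success(x + 1)
dbl = lambda x: Success(x * 2)
m = Success(10)

# left identity: wrapping a value then chaining is the same as just calling the function
assert chain(inc, Success(3)) == inc(3)
# right identity: chaining the wrapper itself changes nothing
assert chain(Success, m) == m
# associativity: how we group chained steps doesn't matter
lhs = chain(dbl, chain(inc, m))
rhs = chain(lambda x: chain(dbl, inc(x)), m)
assert lhs == rhs  # both are Success(22)
```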
let’s rename our function to match that convention:
def bind(func: Callable, result: Result) -> Result:
    match result:  # if you have python 3.10+, this is nicer than isinstance()
        case Error():
            return result
        case Success(value):
            return func(value)
and our scraper becomes:
def scrape_movie_data(movie_id: str) -> tuple[Result, Result]:
    movie_url = f"https://moviedb.com/movie/{movie_id}"
    rating_url = f"https://moviedb.com/movie/{movie_id}/ratings"
    title_result = pipeline(
        fetch_url(movie_url),
        parse_html,
        extract_title
    )
    rating_result = pipeline(
        fetch_url(rating_url),
        parse_html,
        extract_rating
    )
    return (title_result, rating_result)
and our caller can unwrap the values as needed
match scrape_movie_data("tt0133093")[0]:
    case Success(title):
        print(f"Got movie: {title}")
    case Error(msg):
        print(f"Scraping failed: {msg}")
we started with error-handling code drowning in repetitive try/except blocks, making it hard to read.
but by separating the concerns of core business logic (fetch/parse/extract) from error-propagation logic (check for errors and short-circuit), we realized we could factor out the boilerplate of unwrapping and checking with bind and Result.
now our error handling happens invisibly in the background, still propagating Errors when they occur, while our business logic stays clean.
this Result pattern we’ve built is just one example of a monad - a general pattern for chaining operations that have some “context” (in our case, the context is “might fail”).
the same bind pattern works for other contexts too:
- List monad: operations that might return multiple values
- Maybe/Option monad: operations that might return nothing (simpler than Result: no error message)
- IO monad: operations which might have side effects
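to show the shape really does carry over, here's a hedged sketch of bind for the List monad, where the "context" is "zero or more results" instead of "might fail" (list_bind and variants are names of our own):

```python
from typing import Callable, List

def list_bind(func: Callable[[object], List[object]], wrapped: List[object]) -> List[object]:
    # apply func to every value in the list-context and flatten the results
    return [out for value in wrapped for out in func(value)]

# a step that returns multiple values: a word and its reverse
variants = lambda word: [word, word[::-1]]

result = list_bind(variants, ["ab", "cd"])
# result == ["ab", "ba", "cd", "dc"]
```

note the same structure as our Result bind: unwrap the context, apply the function, and let the function decide how to re-wrap.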
note that you don’t actually have to reimplement this in python the way described here; that was just to illustrate the concept. the Maybe monad (might return nothing) is effectively already available in python through Optional/None, and languages like Rust and Haskell make monads first-class citizens in their standard libraries.