Scraping data with agents
Coding agents are remarkably good at web scraping. They can fetch pages, parse HTML, figure out pagination, and handle the messy edge cases that make scraping tedious. They can also drive a real browser to handle JavaScript-rendered pages.
Browser automation with rodney
Many modern websites render their content with JavaScript, which means simple HTTP requests won’t see the data. For these sites, we need browser automation.
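One rough way to tell whether a page needs a browser is to check whether a phrase you can see on the rendered page also appears in the visible text of the raw HTML. The sketch below works on an HTML string you have already downloaded. The helper names (`visible_text`, `needs_browser`) and the sample HTML are made up for illustration and are not part of rodney or any site discussed here.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def visible_text(html):
    """Return the text outside scripts and styles, with whitespace collapsed."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def needs_browser(html, expected_phrase):
    """True if a phrase seen in the browser is absent from the raw HTML text."""
    return expected_phrase.lower() not in visible_text(html).lower()


# Hypothetical sample pages: one server-rendered, one an empty JavaScript shell.
static_html = "<html><body><h1>Assets</h1><p>Fund A</p></body></html>"
js_shell = (
    '<html><body><div id="app"></div>'
    '<script>render({"asset": "Fund A"})</script></body></html>'
)

print(needs_browser(static_html, "Fund A"))
print(needs_browser(js_shell, "Fund A"))
```

This is only a heuristic: a phrase can be missing for other reasons (login walls, different wording), so treat a `True` result as a hint to try browser automation rather than proof.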
rodney is a tool that lets coding agents control a Chrome browser: navigating to pages, reading their content, clicking buttons, and extracting data.
Start by telling your agent:
Run `uvx rodney --help` to learn how rodney works.
Exercise: ProPublica financial disclosures
ProPublica maintains a database of financial disclosures from President Trump and 1,500 of his appointees. Each appointee has a page showing their assets, employment history, investments, and outside positions.
Let’s try scraping an individual page. Pick an appointee, for example:
https://projects.propublica.org/trump-team-financial-disclosures/appointees/ursprung-sarah/
Tell your agent:
Use rodney to load https://projects.propublica.org/trump-team-financial-disclosures/appointees/ursprung-sarah/ and explore the structure of the page.
Watch what the agent does. It should launch a browser, navigate to the page, and start inspecting the DOM to understand how the data is structured.
Deciphering the underlying data
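This site is built with SvelteKit, which fetches its data as JSON. If you look at the network requests in browser devtools, you’ll see it loads a URL like this:
https://projects.propublica.org/trump-team-financial-disclosures/appointees/ursprung-sarah/__data.json?x-sveltekit-invalidated=01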
The catch: this JSON doesn’t use obvious keys for the data. It’s a compact SvelteKit format that’s not immediately human-readable.
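To give a feel for what "compact" can mean here, the sketch below decodes a flat, index-referenced array. It assumes the layout used by the devalue serializer that SvelteKit relies on: slot 0 is the root, and objects and arrays hold integer indices into the same array instead of nested values. The sample payload is invented for illustration and the real file may differ, so treat this as a starting point to check against the saved copy. Special values are simplified: negative indices become `None`, and tagged entries such as dates are left undecoded.

```python
import json


def unflatten(values, index=0, cache=None):
    """Resolve a flat, index-referenced array into nested Python objects."""
    if cache is None:
        cache = {}
    if index < 0:
        # Assumption: negative indices are sentinels for values such as
        # undefined or NaN; this sketch maps them all to None.
        return None
    if index in cache:
        return cache[index]
    item = values[index]
    if isinstance(item, dict):
        # Register the container before recursing so repeated or circular
        # references resolve to the same object.
        resolved = cache[index] = {}
        for key, ref in item.items():
            resolved[key] = unflatten(values, ref, cache)
    elif isinstance(item, list) and item and isinstance(item[0], str):
        # Assumption: a list starting with a string is a tagged special type
        # (for example a date); it is returned as-is here.
        resolved = cache[index] = item
    elif isinstance(item, list):
        resolved = cache[index] = []
        for ref in item:
            resolved.append(unflatten(values, ref, cache))
    else:
        # Strings, numbers, booleans, and null are stored directly.
        resolved = cache[index] = item
    return resolved


# Invented sample in the assumed layout, not real disclosure data.
sample = json.loads("""
[
  {"name": 1, "assets": 2},
  "Example Appointee",
  [3, 4],
  {"description": 5, "value": 6},
  {"description": 7, "value": 6},
  "Fund A",
  "$1,001 - $15,000",
  "Fund B"
]
""")

decoded = unflatten(sample)
print(json.dumps(decoded, indent=2))
```

Note how both assets point at slot 6 for their value: repeated strings are stored once and referenced by index, which is part of why the raw file is hard to read by eye.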
Tell your agent:
Fetch https://projects.propublica.org/trump-team-financial-disclosures/appointees/ursprung-sarah/__data.json?x-sveltekit-invalidated=01 using curl and save a copy, then look at it, then consider how that data maps to the information on the page - can you decipher the JSON data despite it lacking obvious keys?
This is a great demonstration of something agents excel at: reverse-engineering data formats. The agent can compare what it sees on the rendered page (via rodney) with the raw JSON and figure out which values map to which fields.
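The comparison step can be sketched in a few lines: take a string you can see on the rendered page and search the parsed JSON for it, recording the path of keys and list positions that leads to each match. The function name `find_paths` and the sample data below are hypothetical, and the sample is invented, not taken from the ProPublica site.

```python
def find_paths(obj, needle, path=()):
    """Return the path to every string value that contains `needle`."""
    matches = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            matches.extend(find_paths(value, needle, path + (key,)))
    elif isinstance(obj, list):
        for position, value in enumerate(obj):
            matches.extend(find_paths(value, needle, path + (position,)))
    elif isinstance(obj, str) and needle.lower() in obj.lower():
        matches.append(path)
    return matches


# Invented sample standing in for parsed JSON from a saved copy.
parsed = {
    "name": "Example Appointee",
    "assets": [
        {"description": "Fund A", "value": "$1,001 - $15,000"},
        {"description": "Fund B", "value": "$15,001 - $50,000"},
    ],
}

# Text copied from the rendered page, used as search terms.
for phrase in ["Fund B", "$1,001"]:
    print(phrase, find_paths(parsed, phrase))
```

Running this for a handful of values from each section of the page gives a quick map from visible fields to locations in the data, which is roughly the matching an agent does when it deciphers the format.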
Here’s an example of what happens when you run these prompts.