Cleaning data: decoding neighborhood codes¶
Let’s go back to our San Francisco trees database. If you’ve been exploring the data, you’ll have noticed that the neighborhood columns - Analysis Neighborhoods and Neighborhoods (old) - contain mysterious numeric codes instead of actual neighborhood names.
For example, you might see 35 instead of “Sunset/Parkside” or 20 instead of “Mission”. The Supervisor Districts column has the same problem - and to make things worse, the numeric codes don’t even match the actual district numbers.
This is a common data cleaning problem: the dataset uses internal identifiers that only make sense if you have access to a separate lookup table. Let’s see if a coding agent can figure this out.
Research the codes¶
Give your agent this prompt:
The trees.db database has columns called "Analysis Neighborhoods" and "Supervisor Districts" that contain numeric codes instead of names. This data comes from the San Francisco Open Data portal. Can you figure out what these codes mean and add the human-readable names to the database?
Watch what the agent does. This is a great example of a task where coding agents really shine - a human would need to dig through data portal documentation, find the right boundary datasets, and figure out the mapping. An agent can do all of that research and then apply the fix in one go.
How the codes work (spoiler)
The numeric codes are Socrata _feature_id values. When the dataset was published on the SF Open Data portal, Socrata performed a spatial join against boundary datasets, assigning each tree an ID based on its latitude/longitude.
The lookup tables can be fetched from the Socrata API:
Analysis Neighborhoods:
https://data.sfgov.org/resource/ajp5-b2md.json?$select=_feature_id,nhood&$limit=50Supervisor Districts:
https://data.sfgov.org/resource/rxqg-mtj9.json?$select=_feature_id,supervisor,supname&$limit=15
The Supervisor Districts codes are especially tricky - _feature_id 1 is actually District 2, _feature_id 10 is District 3, etc. The IDs don’t match the district numbers at all.
Verify the results¶
Once your agent has added neighborhood names, browse the updated database:
uvx datasette trees.db
You should now be able to see actual neighborhood names like “Sunset/Parkside”, “Mission”, and “Bayview Hunters Point” instead of opaque numeric codes.
Try re-running some of the analysis questions from earlier - they’ll be much more meaningful now:
Which neighborhood has the most trees per block?
Which neighborhoods get the most city (DPW) maintained trees vs. relying on private caretakers?
Which neighborhoods have had the most recent tree plantings? Which seem neglected?
Do some neighborhoods have greater species diversity than others?