Data sources
Provenance & refresh cadence
Every public source Field Risk Atlas ingests, with vintage, license / attribution requirements, and known issues. All sources are public; no proprietary or restricted data is used.
Each source has an EXPECTED_SHA256 constant pinned in its ingest module — if a source's bytes change upstream, ingest fails loudly and a maintainer must review before updating the constant.
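The pin check can be sketched as follows. This is a minimal illustration, not the project's actual ingest code: `verify_pinned_sha256` and `SourceChangedError` are hypothetical names, and only the `EXPECTED_SHA256` constant comes from the description above.

```python
import hashlib


class SourceChangedError(RuntimeError):
    """Raised when downloaded bytes no longer match the pinned digest."""


def verify_pinned_sha256(payload: bytes, expected_sha256: str) -> str:
    """Return the hex digest of payload, or raise if it differs from the pin.

    expected_sha256 plays the role of a module's EXPECTED_SHA256 constant.
    """
    actual = hashlib.sha256(payload).hexdigest()
    if actual != expected_sha256.lower():
        # Fail loudly: a maintainer should inspect the upstream change
        # before the pinned constant is updated.
        raise SourceChangedError(
            f"SHA-256 mismatch: expected {expected_sha256}, got {actual}"
        )
    return actual
```

An ingest function would call this on the raw download before parsing, so nothing downstream ever sees unreviewed bytes.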
Phase 1 — Basin and regulatory layers
B118 Groundwater Basins (i08)
- Module: src/sgma_risk/ingest/b118.py
- Source: California Natural Resources Agency Open Data, i08 B118 CA GroundwaterBasins. ArcGIS Hub item 49807a1fbc584631bdf88d9ca71dd083.
- Vintage: Most-recent DWR release (2025-10 schema; carries basin_id natively).
- Format: Shapefile ZIP via gis.data.cnra.ca.gov/api/download/v1/items/.../shapefile.
- License: Public (California state government data on the CNRA Open Data portal).
- Notes: 515 features. Schema changed in the 2025 release: basin_id is now a native column instead of having to be derived from Basin_Subb.
SGMA Basin Prioritization (2019)
- Module: src/sgma_risk/ingest/priority.py
- Source: CNRA Open Data, SGMA Basin Prioritization Dashboard CSV.
- Vintage: Final 2019 prioritization, released 2019-12-06.
- Format: CSV.
- License: Public CNRA data.
- Notes: 515 rows, joins to B118 by Basin_Subbasin_Number. Distribution: Very Low 410 / Medium 48 / High 46 / Low 11.
Critically Overdrafted Basins (i08 COD)
- Module: src/sgma_risk/ingest/cod.py
- Source: CNRA Open Data, i08 CriticallyOverdraftedBasins. Item 15129538aec84617ba066d1fb14d4fd1.
- Vintage: Updated 2022-12.
- Format: Shapefile ZIP.
- License: Public CNRA data.
- Notes: 21 critically overdrafted basins.
GSP Areas (i03)
- Module: src/sgma_risk/ingest/gsp.py
- Source: CNRA Open Data, i03 Groundwater Sustainability Plan Areas. Item 6a14bba494544d37b5c032ca9826435a.
- Vintage: DWR-maintained, monthly refresh.
- Format: Shapefile ZIP.
- License: Public CNRA data.
- Notes: 121 GSP polygons across 92 unique basins. Joined at runtime to data/reference/gsp_status_crosswalk.csv (manually maintained, see below) to surface SWRCB / DWR enforcement statuses that the i03 layer doesn't reflect.
GSP Status Crosswalk (manually maintained)
- File: data/reference/gsp_status_crosswalk.csv
- Source: Manually curated, reflecting May 2026 enforcement state.
- Vintage: 2026-05-08.
- Notes: Encodes Tule (probationary), Tulare Lake (probationary), Kaweah / Kern County / Chowchilla (returned to DWR), and Pleasant Valley (inadequate). Refresh monthly as enforcement actions move.
Phase 2 — Crop, parcel, well, water-district layers
Water Districts (i03 WaterDistricts)
- Module: src/sgma_risk/ingest/water_districts.py
- Source: CNRA Open Data, i03 WaterDistricts. Item 538bfc2e4ad64ff78ff18fbb8ca36033.
- Vintage: Maintained by DWR / Atlas.
- Format: Shapefile ZIP.
- License: Public CNRA data.
- Notes: 4,021 polygons covering CVP contractors, SWP contractors, agricultural water districts, urban districts, and wholesalers. The schema does NOT include a district-type field; categorization (CVP vs SWP vs local vs urban) lives in data/reference/water_tier_crosswalk.csv (built downstream after parcels narrow the scope).
Land IQ Statewide Crop Mapping (2023 final)
- Module: src/sgma_risk/ingest/landiq.py
- Source: CNRA Open Data, Statewide Crop Mapping, 2023 finalized release.
- Vintage: WY 2023, finalized; 2024 still provisional as of 2026-05-08.
- Format: Shapefile ZIP (i15_Crop_Mapping_Final_2023.zip).
- License: Public DWR data.
- Notes: 446K statewide ag fields → 134K filtered to v1 counties. Schema: CLASS1 = winter crop, CLASS2 = summer crop (perennials populate slot 2 only). SYMB_CLASS is the always-populated single-letter category. Phase 3 etl/crop_classifier.py defines the canonical crop_class enum mapping.
USDA NASS Cropland Data Layer (CDL 2024)
- Module: src/sgma_risk/ingest/cdl.py
- Source: USDA NASS via the GMU CropScape state-FIPS mirror, CDL_2024_06.tif (CA-only, 30m).
- Vintage: 2024 release.
- Format: GeoTIFF (state-clipped; the national 1.6 GB / 9 GB releases would also work).
- License: Public USDA data.
- Notes: Used as fallback for the ~20% of parcels Land IQ doesn't cover. Stored at data/interim/cdl.tif, clipped to a 5 km buffered bbox of v1 ag activity and reprojected from EPSG:5070 to EPSG:3310.
DWR i07 Well Completion Reports
- Module: src/sgma_risk/ingest/wells.py
- Source: CNRA Open Data, i07 WellCompletionReports CSV (OSWCR-derived). Item c074ca40fd684e41babd776eebefd009.
- Vintage: Continuously updated; pinned by SHA at ingest time.
- Format: CSV.
- License: Public DWR data.
- Notes: 1.1M statewide rows → 196K filtered to v1 counties → 11,952 PLSS sections after aggregation. Aggregated by (BaselineMeridian, Township, Range, Section). Median depth in v1 counties: ~340 ft. DWR's own metadata flags known issues with missing/duplicate records and incorrect values; aggregation to section median is robust to outliers.
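The per-section aggregation can be illustrated with a small standard-library sketch. The grouping key comes from the notes above; the `depth_ft` field name and the `section_median_depth` helper are assumptions for illustration, not the real column or function names in wells.py.

```python
from collections import defaultdict
from statistics import median


def section_median_depth(rows):
    """Group well rows by PLSS section and return the median depth per section.

    rows: iterable of dicts carrying the four key fields plus a depth value.
    "depth_ft" is a hypothetical field name; rows with no depth are skipped.
    """
    depths = defaultdict(list)
    for row in rows:
        depth = row.get("depth_ft")
        if depth is None:
            continue
        key = (
            row["BaselineMeridian"],
            row["Township"],
            row["Range"],
            row["Section"],
        )
        depths[key].append(float(depth))
    # The median keeps one bad record from dominating a section's value.
    return {key: median(values) for key, values in depths.items()}
```

With depths of 300, 340, and an erroneous 9000 in one section, the median stays at 340 where a mean would be pulled above 3,000.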
Parcels (per-county)
The build plan called for trying a statewide composite first. The only candidate (CA_State_Parcels FeatureServer at services2.arcgis.com/zr3KAIbsRSUyARHG) is purpose-built for earthquake hazard disclosure and has Carahsoft-licensed underlying data — not fit for an open-source release. All 6 v1 counties fall back to per-county sources:
| County | URL | APN field | Count | License notes |
|---|---|---|---|---|
| Sonoma | socogis.sonomacounty.ca.gov/.../Sonoma_County_Parcels/FeatureServer/0 | APN | ~189K total | CC BY-SA 3.0 |
| Tulare | services2.arcgis.com/bYBANhmQGwSSLC0l/.../Public_Parcel_Search/FeatureServer/2 | PARCELID (the integer APN column drops leading zeros) | 164K | Public domain — capabilities Query,Extract |
| Kern | services5.arcgis.com/Y8jwjGUWbRjuqpG5/.../Assessor_Parcels_Land_2025/FeatureServer/0 | APN_LABEL (dashed) | 422K | Attribution required: "Kern County Assessor's Office, Mapping Section; Kern Council of Governments; MCAG; City of Bakersfield; City of Shafter." Disclaimer requires fetch-at-runtime use, no committed redistribution |
| Madera | gis.maderacounty.com/server/rest/services/CountyParcels/FeatureServer/0 | Parcel (dashed) | 68K | Public record (Cal. Gov. Code §6253). Server has a broken intermediate-cert chain; ingest uses verify_ssl=False |
| Fresno | services3.arcgis.com/ibgDyuD2DLBge82s/.../REGIONAL_PARCELS_VW/FeatureServer/11 | APN | 314K | Public — no formal license declared |
| Kings | services3.arcgis.com/24gLq1DBBzDfd0cZ/.../Parcels/FeatureServer/0 | APN | 50K | "Departmental and Community use" — public via the Community Development Agency tenant |
Key conventions:
- All county servers expose outSR=3310, so we get California Albers polygons regardless of the source projection (Sonoma 4326, Tulare 2228, Kern 3857, Madera 2227, Fresno 2228, Kings 3857).
- All have maxRecordCount=2000, so the same paginated query loop works.
- Raw bytes are cached under data/raw/parcels/{county}.geojson (gitignored). Final canonical store is data/interim/parcels.parquet filtered to ≥1 acre.
- We never commit raw parcel data to the repo, only the SHA-pinned cache metadata. Per Kern's attribution disclaimer this is the correct posture; the other counties also work fine with fetch-at-runtime.
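A paginated query loop along these lines would work against any of the six servers. This is a sketch under assumptions: `iter_features` and the injected `fetch_page` callable are hypothetical, and the HTTP layer is left out so the loop can be exercised offline. The `resultOffset` and `resultRecordCount` parameters are standard ArcGIS REST query parameters.

```python
from typing import Callable, Iterator

PAGE_SIZE = 2000  # matches the maxRecordCount noted above
OUT_SR = 3310     # California Albers


def iter_features(
    fetch_page: Callable[[dict], dict], page_size: int = PAGE_SIZE
) -> Iterator[dict]:
    """Yield every feature from a FeatureServer layer, one page at a time.

    fetch_page is a hypothetical callable that sends the query parameters
    to a layer's /query endpoint and returns the decoded JSON response.
    """
    offset = 0
    while True:
        params = {
            "where": "1=1",
            "outFields": "*",
            "outSR": OUT_SR,
            "f": "geojson",
            "resultOffset": offset,
            "resultRecordCount": page_size,
        }
        features = fetch_page(params).get("features", [])
        yield from features
        # A short (or empty) page means the layer is exhausted.
        if len(features) < page_size:
            break
        offset += page_size
```

Injecting the fetcher keeps per-county quirks (for example Madera's certificate handling) outside the loop itself.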
Phase 3.5 — USDM drought weeks
US Drought Monitor (per-county weekly time series)
- Module: src/sgma_risk/ingest/usdm.py
- Source: USDM data services API, usdmdataservices.unl.edu/api/CountyStatistics/GetDroughtSeverityStatisticsByAreaPercent. Per-county weekly CSV of % area at each severity (None / D0 / D1 / D2 / D3 / D4), cumulative format (statisticsType=1, where D2 = % area at D2 OR worse).
- Vintage: Snapshot date pinned at module level (SNAPSHOT_DATE = 2026-05-05). USDM publishes Thursdays for the prior Tuesday's analysis. Bump the constant to refresh.
- Format: CSV via API.
- License: Public (National Drought Mitigation Center, USDA, NOAA).
- Notes: v1 is per-county (drought is fairly homogeneous within these counties for SGMA-screening use; per-parcel gridded sampling is a future upgrade if intra-county variation matters). Threshold: a "drought week" is one where ≥50% of county area was at D2+ severity. Output data/interim/usdm.parquet is keyed on county with usdm_d2_weeks_52w (int count over trailing 52-week window). All 6 v1 counties are at 0 weeks for the May-2026 snapshot: California is in a wet period after the 2020-22 drought, so the column is non-discriminating in v1 but accurate. Refresh ahead of any meaningful score regeneration when conditions change.
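The drought-week rule reduces to a threshold count over the trailing window. The sketch below assumes the weekly cumulative D2 percentages for one county are already in a list ordered oldest to newest; `count_d2_weeks` is a hypothetical helper name, not necessarily what usdm.py defines.

```python
DROUGHT_AREA_THRESHOLD = 50.0  # percent of county area at D2 or worse
WINDOW_WEEKS = 52


def count_d2_weeks(
    weekly_d2_pct,
    window: int = WINDOW_WEEKS,
    threshold: float = DROUGHT_AREA_THRESHOLD,
) -> int:
    """Count drought weeks in the trailing window for one county.

    weekly_d2_pct: cumulative D2 percentages (D2 or worse), oldest first.
    A week counts when its value is at or above the threshold.
    """
    recent = weekly_d2_pct[-window:]
    return sum(1 for pct in recent if pct >= threshold)
```

Because the API's cumulative format already folds D3 and D4 into the D2 value, no per-severity summing is needed before applying the threshold.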
Phase 3 — PLSS section polygons (overlay precursor)
BLM California CadNSDI — First Division Section Polygon
- Module: src/sgma_risk/ingest/plss.py
- Source: BLM California State Office, CadNSDI v3.1, gis.blm.gov/caarcgis/rest/services/lands/BLM_CA_CADNSDI/FeatureServer/2.
- Vintage: Continuously maintained by BLM (per-feature SOURCEDATE available).
- Format: ArcGIS REST FeatureServer; bbox-filtered to the v1 6-county envelope at fetch time.
- License: Public (BLM National PLSS Cadastral data).
- Notes: 141,873 statewide → ~57K returned for the v1 bbox → 52,567 parsed sections after filtering to FRSTDIVTYP='SN' and dropping rancho-grant polygons (meridian code 27, Sonoma area, ~6.5%; these don't have T/R/S coords by design and won't have well stats). Required for Phase 3 to spatial-join the per-section wells aggregates onto parcels. Field schema: PLSSID is a positional code we parse into normalized meridian / township / range keys; FRSTDIVNO is the section number. Both meridian and T/R encodings normalize to canonical short codes (M/S/H + unpadded number + direction) so they match the wells-side keys after etl/overlay.py normalization.
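The township/range half of that normalization might look like the sketch below. The raw input shapes it accepts (an optional T or R prefix, zero padding, any letter case) are assumptions for illustration; the actual PLSSID layout and wells-side encodings may differ, and `normalize_tr` is a hypothetical name. Meridian mapping is left out here.

```python
import re

# Optional T/R prefix, a (possibly zero-padded) number, then a direction.
_TR_PATTERN = re.compile(r"^[TR]?(\d+)\s*([NSEW])$")


def normalize_tr(token: str) -> str:
    """Normalize a township or range token to unpadded number + direction.

    Assumed inputs look like "T05S", "010e", or " 23N ".
    """
    match = _TR_PATTERN.match(token.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized township/range token: {token!r}")
    number, direction = match.groups()
    # int() drops the zero padding so both sides of the join agree.
    return f"{int(number)}{direction}"
```

Raising on unrecognized tokens, rather than passing them through, keeps silent key mismatches out of the section join.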
Phase 3 outputs
Canonical parcel store
- Path: data/processed/parcels.parquet (gitignored).
- Built by: src/sgma_risk/etl/overlay.py (run via uv run sgma-risk overlay or .\tasks.ps1 overlay).
- Schema: 33 columns; the contract is CANONICAL_COLUMNS in src/sgma_risk/etl/overlay.py. Includes a non-canonical crop_source audit field (landiq / cdl / none) used by the Land IQ provenance test. Score-related fields are filled by sgma-risk score (Option A + B).
- Rows: ~300K parcels post-dedup (one row per parcel_id after collapsing the 109 Fresno/Kings APN collisions to the largest geometry).
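Collapsing collisions to the largest geometry can be sketched without any GIS dependency, assuming each record already carries a precomputed area. The `area_m2` field and the `dedupe_largest` helper are hypothetical; only `parcel_id` comes from the schema described above.

```python
def dedupe_largest(parcels):
    """Keep one record per parcel_id, preferring the largest geometry.

    parcels: iterable of dicts with "parcel_id" and a precomputed
    "area_m2" (hypothetical field name). Ties keep the first record seen.
    """
    best = {}
    for parcel in parcels:
        current = best.get(parcel["parcel_id"])
        if current is None or parcel["area_m2"] > current["area_m2"]:
            best[parcel["parcel_id"]] = parcel
    return list(best.values())
```

Keeping the first record on ties makes the result deterministic for a fixed input order.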
Reference CSVs (manually maintained, in data/reference/)
- gsp_status_crosswalk.csv: May 2026 GSP enforcement state (Phase 1, in repo).
- water_tier_crosswalk.csv: district → ASFMRA Tier 1–4 (Phase 3, skeleton auto-generated by the first overlay run with one row per touched district; user fills in tier values manually). 505 districts in the v1 area as of 2026-05-09.
- crop_class_crosswalk.csv: Land IQ MAIN_CROP / SYMB_CLASS code → canonical crop_class enum (Phase 3, in repo). Includes SYMB_CLASS-level fallback rows so unmapped MAIN_CROP codes still classify defensibly.
- cdl_class_lookup.csv: CDL integer code → crop name → crop_class (Phase 3, in repo).
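The two-level lookup described for crop_class_crosswalk.csv can be sketched as a dictionary lookup with a fallback. Everything below is illustrative: `classify_crop`, the `"unknown"` default, and the mappings passed in are assumptions, not the real crosswalk contents or the crop_class enum values.

```python
def classify_crop(
    main_crop,
    symb_class,
    by_main_crop,
    by_symb_class,
    default="unknown",
):
    """Resolve a crop_class, preferring the specific MAIN_CROP mapping.

    by_main_crop / by_symb_class: dicts loaded from the crosswalk CSV
    (hypothetical shapes). Falls back to the SYMB_CLASS-level row when
    the MAIN_CROP code is unmapped, then to a hypothetical default.
    """
    if main_crop in by_main_crop:
        return by_main_crop[main_crop]
    return by_symb_class.get(symb_class, default)
```

For example, with a made-up mapping `{"X1": "nut_tree"}` at the MAIN_CROP level and `{"X": "orchard_other"}` at the SYMB_CLASS level, an unmapped code `"X9"` in category `"X"` still resolves to `"orchard_other"` instead of dropping out.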
Refresh cadence
| Source | Refresh | Why |
|---|---|---|
| GSP status crosswalk | Monthly | Enforcement state changes (SWRCB orders, DWR determinations) |
| Land IQ | Annually | New finalized year ~Q1; until then prior year's final |
| CDL | Annually | USDA releases ~Feb |
| All others | Re-validate SHAs quarterly | Catches source-side schema drift early |
What is NOT in this project
- No proprietary, restricted, or organization-internal data, ever.
- No commercial parcel aggregators (ParcelQuest, Regrid, Acres GIS) — these have license restrictions incompatible with an open-source repo.
- No appraisal-grade data.