US Business
All currently-registered US business entities: SEC EDGAR public companies (10K with ticker, 1M+ historical CIKs), IRS exempt orgs (1.95M nonprofits), state SOS filings (NY: 20.5M filings, free), GLEIF LEI (349K US, CC0), SAM.gov vendors, PPP recipients (5M small businesses).
Live mix at 60K total: 9K SEC public companies (with ticker + exchange) + 9K NY DOS recent formations (Socrata live) + 9K SBA PPP recipients + 3K GLEIF LEI + 30K nonprofits across all 14 downloaded IRS EO states (CA/TX/NY/FL/IL/PA/OH/GA/MI/NC/NJ/MA/VA/WA). Detail page at `/business/entities/[id]` cross-links via clusters when available.
Named officers, directors and key employees from public US filings — SEC EDGAR Form 4 (insider transaction filings, public companies) + IRS Form 990 Part VII Section A (compensation table for nonprofits, top-paid first). Names + titles + companies + compensation. No emails — pair with company website / Apollo for outreach.
Linked-entity clusters where the same legal entity appears across 2+ sources — joined via shared CIK / EIN / LEI / UEI or (normalized name + state) fallback. 1,000 clusters retained (member_count ≥ 2) from union-find over 6 datasets. Click a cluster to see all source records side-by-side.
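The join described above can be sketched with a small union-find. This is a minimal illustration, not the production code: the record field names (`cik`, `ein`, `lei`, `uei`, `name`, `state`), the `normalize_name` helper and its suffix list are all assumptions, and the sketch assumes the (normalized name + state) fallback applies only to records that carry none of the four identifiers.

```python
import re
from collections import defaultdict

# Hypothetical suffix list; the real normalizer may differ.
_SUFFIXES = {"inc", "llc", "corp", "co", "ltd", "lp",
             "incorporated", "corporation", "company"}


def normalize_name(name):
    """Lowercase, strip punctuation, and drop trailing legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    while tokens and tokens[-1] in _SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def cluster_records(records):
    """Return groups of 2+ records that share any join key."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_seen = {}  # join key -> index of the first record carrying it
    for idx, rec in enumerate(records):
        keys = [(k, rec[k]) for k in ("cik", "ein", "lei", "uei") if rec.get(k)]
        # Assumed fallback: only used when no hard identifier is present.
        if not keys and rec.get("name") and rec.get("state"):
            keys = [("name_state",
                     (normalize_name(rec["name"]), rec["state"].upper()))]
        for key in keys:
            if key in first_seen:
                union(first_seen[key], idx)
            else:
                first_seen[key] = idx

    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[find(idx)].append(rec)
    # Keep only clusters with member_count >= 2.
    return [members for members in groups.values() if len(members) >= 2]
```

Because identifiers chain transitively, a record sharing a CIK with one source and an LEI with another pulls all three into the same cluster.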
Active NY State corporations / LLCs / LPs / nonprofits via Socrata `n9v6-gdp6` (data.ny.gov, free, anonymous, near-realtime). 50K rows of the canonical one-row-per-entity registry across 5 entity_kinds. The full corpus (20.5M filings) is available via the per-event `63wc-4exh` dataset.
Florida corp/LLC/LP filings via free SFTP (Public/PubAccess1845!). 60-row synthetic fixture spanning Miami/Orlando/Tampa/Jacksonville/Tallahassee. Full ingest path stubbed in `ingest/fl_sunbiz.py` with the 1440-byte fixed-width field-offset map encoded.
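A fixed-width ingest of this kind usually reduces to slicing each record by a field-offset map. The sketch below shows only the pattern: the field names and byte offsets are placeholders and do not reflect the real Sunbiz layout or the map in `ingest/fl_sunbiz.py`. It also assumes records are packed back to back with no line separators and decode as Latin-1.

```python
RECORD_LEN = 1440  # record width stated in the description above

# Placeholder offsets (start, end) for illustration only.
FIELD_OFFSETS = {
    "doc_number": (0, 12),
    "entity_name": (12, 204),
    "status": (204, 205),
}


def parse_record(raw: bytes) -> dict:
    """Slice one fixed-width record into a dict of stripped strings."""
    if len(raw) != RECORD_LEN:
        raise ValueError(f"expected {RECORD_LEN} bytes, got {len(raw)}")
    return {
        name: raw[start:end].decode("latin-1").strip()
        for name, (start, end) in FIELD_OFFSETS.items()
    }


def iter_records(blob: bytes):
    """Yield parsed records from a buffer of back-to-back records."""
    for offset in range(0, len(blob) - RECORD_LEN + 1, RECORD_LEN):
        yield parse_record(blob[offset:offset + RECORD_LEN])
```

A trailing partial record is ignored by `iter_records`; a stricter ingest might raise instead.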
10K unique borrowers sampled from SBA `public_150k_plus` FOIA (968K loans / 863K unique). Top 4K by initial amount + 6K evenly-spaced mid-tier. 54 states/territories. Full deduped corpus (863,501 rows) on NAS at `staging/ppp_unique.parquet` (24 MB zstd).
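One way to read the "top N plus evenly-spaced mid-tier" sampling is sketched below. The `initial_amount` field name is an assumption, and "mid-tier" is taken here to mean everything ranked below the top slice; the real sampler may define it differently.

```python
def sample_borrowers(rows, top_n, mid_n, amount_key="initial_amount"):
    """Take the top_n largest loans, then mid_n evenly spaced rows from the rest."""
    ranked = sorted(rows, key=lambda r: r[amount_key], reverse=True)
    top, rest = ranked[:top_n], ranked[top_n:]
    if mid_n <= 0:
        return top
    if mid_n >= len(rest):
        return top + rest
    # A step greater than 1 guarantees distinct, in-range indices.
    step = len(rest) / mid_n
    mid = [rest[int(i * step)] for i in range(mid_n)]
    return top + mid
```

For the sample described above this would be called as `sample_borrowers(rows, 4000, 6000)` on the deduplicated borrower rows.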
500 US LEI records sampled across 174 distinct EntityLegalForm codes and all 50 states + DC, from the GLEIF Golden Copy (3.3M global / 349K US, CC0). Streamed from the lei2 zip; relationship records (parent LEI) live in the separate rr_latest.zip and are not yet joined.
60-row synthetic fixture of federal vendors with realistic UEI / CAGE / NAICS across DC/VA/MD/CA/TX/FL/GA. Full ingest path stubbed in `ingest/sam_gov.py` for the SAM_PUBLIC_MONTHLY_V2 monthly extract — gated on a free api.data.gov key.
1,351 top federal-contract recipients aggregated from USAspending.gov spending_by_award API (5K award rows → 1.35K unique UEI). Lockheed Martin $322B, Electric Boat $141B at the top. DoD-dominated; 47 states represented. Free, no auth.
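Collapsing award rows into one row per recipient is a group-and-sum on UEI. The sketch below assumes each row is a dict with hypothetical keys `recipient_uei`, `recipient_name` and `award_amount`; the actual field names returned by the API may differ.

```python
from collections import defaultdict


def top_recipients(award_rows):
    """Sum award amounts per UEI and return recipients, largest total first."""
    totals = defaultdict(float)
    names = {}
    for row in award_rows:
        uei = row.get("recipient_uei")
        if not uei:
            continue  # rows without a UEI cannot be attributed
        totals[uei] += float(row.get("award_amount") or 0)
        names.setdefault(uei, row.get("recipient_name"))
    recipients = [
        {"uei": uei, "name": names[uei], "total": total}
        for uei, total in totals.items()
    ]
    return sorted(recipients, key=lambda r: r["total"], reverse=True)
```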
220 latest-FY financial snapshots (revenue / net income / assets / employees) for popular tickers from `data.sec.gov/api/xbrl/companyfacts/`. Revenue captured for Apple ($416B), Amazon ($717B) and MSFT ($282B). Full nightly companyfacts.zip (~10GB) wired in `ingest/sec_companyfacts.py`.
200-row synthetic preview of trademark + patent holders with realistic class distribution. IBM 1180 TMs / 110K patents at the top. Full bulk path stubbed in `ingest/uspto_tm.py` + `ingest/uspto_patent.py` against USPTO Open Data Portal CSV archives.
Verified bulk-download paths
- SEC EDGAR (live): `company_tickers_exchange.json` (10K active public companies w/ ticker + exchange) and `cik-lookup-data.txt` (1M+ historical CIKs). Free, daily, no auth.
- IRS EO BMF (live): per-state CSV at `irs.gov/pub/irs-soi/eo_<state>.csv` (1.95M nonprofits total, monthly).
- NY State Open Data (live): Socrata API `data.ny.gov/api/views/63wc-4exh/rows.csv`, 20.5M corp filings, free, near-realtime. Best free state SOS bulk in the country.
- FL Sunbiz (stub): public SFTP `sftp.floridados.gov` (Public/PubAccess1845!), 10M+ entities, daily delta + quarterly full.
- SBA PPP (live): 13 CSVs at `data.sba.gov/dataset/ppp-foia`, ~5GB, 11.4M loans / ~5M unique businesses.
- GLEIF LEI (live): `goldencopy.gleif.org/api/v2/golden-copies/publishes/lei2/latest.csv`, 3.3M global / 349K US, CC0, daily.
- SAM.gov (stub): monthly `SAM_PUBLIC_MONTHLY_V2_*.ZIP`, ~700K registered vendors with UEI/CAGE/NAICS. Needs free api.data.gov key.
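The unauthenticated paths above can be gathered into a simple download plan. This sketch only builds URLs and makes no network calls. The `https://` scheme, the `www.` host prefix for the IRS and the lowercase state code in the file name are assumptions layered on the patterns listed above; the function and constant names are illustrative.

```python
# The 14 IRS EO states named in the live mix.
IRS_EO_STATES = ["CA", "TX", "NY", "FL", "IL", "PA", "OH",
                 "GA", "MI", "NC", "NJ", "MA", "VA", "WA"]

GLEIF_LEI2_URL = ("https://goldencopy.gleif.org/api/v2/"
                  "golden-copies/publishes/lei2/latest.csv")
NY_FILINGS_URL = "https://data.ny.gov/api/views/63wc-4exh/rows.csv"


def irs_eo_url(state: str) -> str:
    """Build the per-state IRS EO BMF CSV URL (assumes lowercase state code)."""
    code = state.strip().lower()
    if len(code) != 2 or not code.isalpha():
        raise ValueError(f"not a two-letter state code: {state!r}")
    return f"https://www.irs.gov/pub/irs-soi/eo_{code}.csv"


def download_plan() -> dict:
    """Map a short dataset label to the URL it would be fetched from."""
    plan = {f"irs_eo_{s.lower()}": irs_eo_url(s) for s in IRS_EO_STATES}
    plan["gleif_lei2"] = GLEIF_LEI2_URL
    plan["ny_filings"] = NY_FILINGS_URL
    return plan
```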
Skipped (paid / locked): OpenCorporates (£12K+/yr), CA SOS bulk ($100/snapshot), TX SOS ($1,350+), DE Division of Corporations (no bulk).