ThePengLaw · Data explorer

Multi-source staging for the law-firm CRM pipeline. Each source is a data vertical (LinkedIn profiles & jobs, US-wide CPA licensees, …); each card under it is a dataset with its own browser.

LinkedIn

3 live · 0 stub · 0 planned

Profiles, companies, and job postings from public datasets (HuggingFace, Bright Data samples, research sets).

CPA · US

2 live · 0 stub · 2 planned

US-wide Certified Public Accountant data across 55 jurisdictions — state-board licensee rosters, NASBA CPAverify, PCAOB-registered firms, IRS PTIN holders, discipline actions. Target ~600–800K records by Phase 1.

Licensees
live
20,000

Active CPAs sampled from the Florida Board of Accountancy (weekly bulk xlsx, ~71K total) + IRS PTIN FOIA (~208K CPA-bearing nationwide). Phase 2: NASBA ALD per-name enrichment, TX TSBPA, CA CBA, NY NYSED.

FL Board of Accountancy + IRS PTIN
Browse →
Firms
live
2,325

697 PCAOB-registered audit firms (with engagement-partner + issuer counts) merged with 1,628 AICPA GAQC governmental audit member firms + 8 State Audit Organizations.

PCAOB AuditorSearch + AICPA GAQC
Browse →
Tax preparers
planned

IRS PTIN holders (858K total, biannual FOIA CSV) — superset that also includes EAs / Attorneys / uncredentialed preparers. CPAs are already in /licensees.

IRS PTIN FOIA
Discipline
planned

PCAOB enforcement actions + inspection reports (4.3K, CSV/XML/JSON), SEC accountant suspensions, state-board disciplinary listings (NC, MN, TX, CA, …).

PCAOB / SEC / state boards

US Business

8 live · 3 stub · 0 planned

All currently-registered US business entities — SEC EDGAR public companies (10K w/ ticker, 1M+ historical CIKs), IRS exempt orgs (1.95M nonprofits), state SOS filings (NY 20.5M free), GLEIF LEI (349K US, CC0), SAM.gov vendors, PPP recipients (5M small biz).

Entities
live
59,988

Live mix at 60K total: 9K SEC public companies (with ticker + exchange) + 9K NY DOS recent formations (Socrata live) + 9K SBA PPP recipients + 3K GLEIF LEI + 30K nonprofits across all 14 downloaded IRS EO states (CA/TX/NY/FL/IL/PA/OH/GA/MI/NC/NJ/MA/VA/WA). Detail page at `/business/entities/[id]` cross-links via clusters when available.

SEC EDGAR + NY DOS + IRS EO + PPP + GLEIF
Browse →
Officers & directors
live
100,365

Named officers, directors and key employees from public US filings — SEC EDGAR Form 4 (insider transaction filings, public companies) + IRS Form 990 Part VII Section A (compensation table for nonprofits, top-paid first). Names + titles + companies + compensation. No emails — pair with company website / Apollo for outreach.

SEC EDGAR Form 4 + IRS 990 e-file XML
Browse →
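
The nonprofit half of the officers dataset (Form 990 Part VII Section A, top-paid first) could be extracted roughly as below. This is a minimal sketch, not the project's ingest code. It assumes the e-file XML uses the `http://www.irs.gov/efile` namespace and the `Form990PartVIISectionAGrp` / `PersonNm` / `TitleTxt` / `ReportableCompFromOrgAmt` element names found in recent schema versions (older years differ). The sample document, the people in it, and the `parse_part_vii` name are made up for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"irs": "http://www.irs.gov/efile"}

# Fictitious two-person return, trimmed to the elements used below.
SAMPLE = """<Return xmlns="http://www.irs.gov/efile">
  <ReturnData>
    <IRS990>
      <Form990PartVIISectionAGrp>
        <PersonNm>Jane Doe</PersonNm>
        <TitleTxt>Treasurer</TitleTxt>
        <ReportableCompFromOrgAmt>90000</ReportableCompFromOrgAmt>
      </Form990PartVIISectionAGrp>
      <Form990PartVIISectionAGrp>
        <PersonNm>John Roe</PersonNm>
        <TitleTxt>Executive Director</TitleTxt>
        <ReportableCompFromOrgAmt>210000</ReportableCompFromOrgAmt>
      </Form990PartVIISectionAGrp>
    </IRS990>
  </ReturnData>
</Return>"""


def parse_part_vii(xml_text):
    """Return Part VII Section A people, highest reportable compensation first."""
    root = ET.fromstring(xml_text)
    people = []
    for grp in root.iterfind(".//irs:Form990PartVIISectionAGrp", NS):
        comp = grp.findtext("irs:ReportableCompFromOrgAmt", default="0", namespaces=NS)
        people.append({
            "name": grp.findtext("irs:PersonNm", default="", namespaces=NS),
            "title": grp.findtext("irs:TitleTxt", default="", namespaces=NS),
            "compensation": int(comp or 0),  # treat an empty element as zero
        })
    return sorted(people, key=lambda p: p["compensation"], reverse=True)


officers = parse_part_vii(SAMPLE)
```
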
Cross-source clusters
live
1,000

Linked-entity clusters where the same legal entity appears across 2+ sources — joined via shared CIK / EIN / LEI / UEI or (normalized name + state) fallback. 1,000 clusters retained (member_count ≥ 2) from union-find over 6 datasets. Click a cluster to see all source records side-by-side.

ingest/entity_link union-find
Browse →
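
The linking rule on the clusters card (shared CIK / EIN / LEI / UEI, else normalized name + state, keep clusters with 2+ members) can be sketched with a small union-find. This is an illustration under assumptions, not the code in `ingest/entity_link`: the record shape, the `cluster` function, and the simple lowercase/whitespace name normalization are all hypothetical, and the sample records are fictitious.

```python
from collections import defaultdict

ID_KEYS = ("cik", "ein", "lei", "uei")


class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:  # path compression
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def cluster(records):
    """Group records that share an identifier, or name + state when none exists."""
    uf = UnionFind()
    first_seen = {}  # join key -> first record that carried it
    for rec in records:
        rid = (rec["source"], rec["id"])
        uf.find(rid)  # register the record even if it never links
        keys = [(k, rec[k]) for k in ID_KEYS if rec.get(k)]
        if not keys and rec.get("name"):
            # Fallback: assumed normalization is lowercase + collapsed whitespace.
            name = " ".join(rec["name"].lower().split())
            keys = [("name_state", (name, rec.get("state")))]
        for key in keys:
            if key in first_seen:
                uf.union(first_seen[key], rid)
            else:
                first_seen[key] = rid
    groups = defaultdict(list)
    for rid in list(uf.parent):
        groups[uf.find(rid)].append(rid)
    return [sorted(members) for members in groups.values() if len(members) >= 2]


records = [
    {"source": "sec", "id": "1", "cik": "0000000111", "ein": "12-3456789"},
    {"source": "irs_eo", "id": "2", "ein": "12-3456789"},
    {"source": "sam", "id": "3", "uei": "UEI000000003", "name": "Acme Widgets LLC", "state": "NY"},
    {"source": "ny_dos", "id": "4", "name": "Acme  Widgets LLC", "state": "NY"},
    {"source": "ppp", "id": "5", "name": "ACME WIDGETS LLC", "state": "NY"},
]
clusters = cluster(records)
```

In this sketch the name fallback applies only to records with no identifier at all, so the SAM record stays a singleton and is dropped by the 2+ member filter.
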
NY corporations
live
50,000

Active NY State corporations / LLCs / LPs / nonprofits via Socrata `n9v6-gdp6` (data.ny.gov, free, anonymous, near-realtime). 50K rows of the canonical one-row-per-entity registry across 5 entity_kinds. Full corpus: 20.5M filings via the per-event `63wc-4exh` dataset.

NY Open Data
Browse →
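
Pulling 50K rows from the `n9v6-gdp6` dataset anonymously would typically mean paging the Socrata SODA endpoint. The sketch below only builds the page URLs (no network calls). It assumes the standard `/resource/<id>.json` endpoint with `$limit`, `$offset`, and `$order`; the `page_urls` helper and the 10,000-row page size are illustrative choices, not taken from the project.

```python
from urllib.parse import urlencode

BASE = "https://data.ny.gov/resource/n9v6-gdp6.json"


def page_urls(total_rows, page_size=10_000, order=":id"):
    """Build one URL per page; ordering by the system :id column keeps pages stable."""
    urls = []
    for offset in range(0, total_rows, page_size):
        params = {
            "$limit": min(page_size, total_rows - offset),
            "$offset": offset,
            "$order": order,
        }
        # safe="$:" keeps the SODA parameter names readable instead of %24limit etc.
        urls.append(f"{BASE}?{urlencode(params, safe='$:')}")
    return urls


urls = page_urls(25_000)
```
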
FL Sunbiz
stub
60

Florida corp/LLC/LP filings via free SFTP (Public/PubAccess1845!). 60-row synthetic fixture spanning Miami/Orlando/Tampa/Jacksonville/Tallahassee. Full ingest path stubbed in `ingest/fl_sunbiz.py` with the 1440-byte fixed-width field-offset map encoded.

FL DOS Sunbiz SFTP
Browse →
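
A fixed-width field-offset map like the one the FL Sunbiz card mentions usually boils down to slicing each record by (start, end) pairs. The sketch below shows the idea only. The 1440 width comes from the card; the three field names and their offsets are HYPOTHETICAL placeholders, not the real Sunbiz layout, and the sample record is synthetic.

```python
RECORD_LEN = 1440  # record width quoted on the card

# HYPOTHETICAL (start, end) offsets for illustration; the real map lives in ingest/fl_sunbiz.py.
FIELDS = {
    "doc_number": (0, 12),
    "entity_name": (12, 204),
    "status": (204, 205),
}


def parse_record(record):
    """Slice one already-decoded fixed-width record into stripped string fields."""
    if len(record) != RECORD_LEN:
        raise ValueError(f"expected {RECORD_LEN} chars, got {len(record)}")
    return {name: record[start:end].strip() for name, (start, end) in FIELDS.items()}


# Synthetic record padded out to the full width.
sample = ("X00000000001".ljust(12) + "EXAMPLE HOLDINGS LLC".ljust(192) + "A").ljust(RECORD_LEN)
parsed = parse_record(sample)
```

Rejecting records of the wrong length up front makes a shifted or truncated file fail loudly instead of silently producing misaligned fields.
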
PPP loan recipients
live
10,000

10K unique borrowers sampled from SBA `public_150k_plus` FOIA (968K loans / 863K unique). Top 4K by initial amount + 6K evenly-spaced mid-tier. 54 states/territories. Full deduped corpus (863,501 rows) on NAS at `staging/ppp_unique.parquet` (24 MB zstd).

SBA PPP FOIA
Browse →
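
The PPP sampling rule (top slice by initial amount plus an evenly spaced pick from the rest) might look like the sketch below. It assumes "mid-tier" means everything ranked below the top slice, which the card does not spell out; the `sample_borrowers` name and the `initial_amount` key are hypothetical.

```python
def sample_borrowers(rows, top_k, spaced_k, key="initial_amount"):
    """Take the top_k largest rows, then spaced_k rows evenly spaced over the remainder."""
    ranked = sorted(rows, key=lambda r: r[key], reverse=True)
    top, rest = ranked[:top_k], ranked[top_k:]
    if spaced_k <= 0:
        return top
    if spaced_k >= len(rest):
        return top + rest
    step = len(rest) / spaced_k  # > 1 here, so the floored indices never repeat
    spaced = [rest[int(i * step)] for i in range(spaced_k)]
    return top + spaced


# Tiny stand-in for the 863K-row deduped corpus: amounts 0..99.
rows = [{"borrower": f"b{i}", "initial_amount": i} for i in range(100)]
picked = sample_borrowers(rows, top_k=4, spaced_k=6)
amounts = [r["initial_amount"] for r in picked]
```

With the card's numbers the call would be `sample_borrowers(rows, 4_000, 6_000)`.
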
GLEIF LEI
live
500

500 US LEI records sampled across 174 distinct EntityLegalForm codes and all 50 states + DC, from the GLEIF Golden Copy (3.3M global / 349K US, CC0). Streamed from the lei2 zip; relationship records (parent LEI) live in the separate rr_latest.zip and are not yet joined.

GLEIF golden copy
Browse →
SAM.gov vendors
stub
60

60-row synthetic fixture of federal vendors with realistic UEI / CAGE / NAICS across DC/VA/MD/CA/TX/FL/GA. Full ingest path stubbed in `ingest/sam_gov.py` for the SAM_PUBLIC_MONTHLY_V2 monthly extract — gated on a free api.data.gov key.

SAM.gov / api.data.gov
Browse →
USAspending recipients
live
1,351

1,351 top federal-contract recipients aggregated from USAspending.gov spending_by_award API (5K award rows → 1.35K unique UEI). Lockheed Martin $322B, Electric Boat $141B at the top. DoD-dominated; 47 states represented. Free, no auth.

api.usaspending.gov
Browse →
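
Collapsing award rows into one row per UEI, as the USAspending card describes (5K award rows to 1.35K recipients), is a plain group-and-sum. A minimal sketch follows; the field names `recipient_uei`, `recipient_name`, and `award_amount` are hypothetical stand-ins for whatever the API response is mapped to, and the rows are fictitious.

```python
from collections import defaultdict


def top_recipients(award_rows):
    """Sum award amounts per UEI and return recipients, largest total first."""
    totals = defaultdict(float)
    names = {}
    for row in award_rows:
        uei = row.get("recipient_uei")
        if not uei:
            continue  # rows without a UEI cannot be attributed to a recipient
        totals[uei] += row["award_amount"]
        names.setdefault(uei, row["recipient_name"])  # keep the first name seen
    ranked = [{"uei": u, "name": names[u], "total": t} for u, t in totals.items()]
    return sorted(ranked, key=lambda r: r["total"], reverse=True)


award_rows = [
    {"recipient_uei": "UEI-A", "recipient_name": "Example Aerospace", "award_amount": 500},
    {"recipient_uei": "UEI-B", "recipient_name": "Example Shipyard", "award_amount": 900},
    {"recipient_uei": "UEI-A", "recipient_name": "Example Aerospace Inc", "award_amount": 700},
    {"recipient_uei": None, "recipient_name": "Unknown", "award_amount": 100},
]
recipients = top_recipients(award_rows)
```
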
Financials (SEC XBRL)
live
220

220 latest-FY financial snapshots (revenue / net income / assets / employees) for popular tickers from `data.sec.gov/api/xbrl/companyfacts/`. Apple $416B, Amazon $717B, MSFT $282B captured. Full nightly companyfacts.zip (~10GB) wired in `ingest/sec_companyfacts.py`.

SEC XBRL companyfacts API
Browse →
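
A latest-FY snapshot can be read out of a companyfacts document by walking `facts -> taxonomy -> concept -> units -> unit` and keeping the most recent annual fact. The sketch below assumes that JSON layout and the `form` / `fp` / `end` / `val` keys as served by the companyfacts API; `latest_fy_value` is a hypothetical helper and the sample figures are fictitious. It picks by period end date only, so a fuller version would also check each fact's duration.

```python
def latest_fy_value(companyfacts, concept, unit="USD", taxonomy="us-gaap"):
    """Return {'end', 'val'} for the latest 10-K full-year fact, or None if absent."""
    try:
        facts = companyfacts["facts"][taxonomy][concept]["units"][unit]
    except KeyError:
        return None
    annual = [f for f in facts if f.get("form") == "10-K" and f.get("fp") == "FY"]
    if not annual:
        return None
    # ISO dates sort correctly as strings; filing date breaks ties (restatements).
    best = max(annual, key=lambda f: (f["end"], f.get("filed", "")))
    return {"end": best["end"], "val": best["val"]}


# Fictitious company, trimmed to the keys used above.
sample_facts = {
    "entityName": "Example Corp",
    "facts": {"us-gaap": {"Revenues": {"units": {"USD": [
        {"end": "2022-12-31", "val": 100, "fp": "FY", "form": "10-K", "filed": "2023-02-01"},
        {"end": "2023-12-31", "val": 120, "fp": "FY", "form": "10-K", "filed": "2024-02-01"},
        {"end": "2024-03-31", "val": 30, "fp": "Q1", "form": "10-Q", "filed": "2024-05-01"},
    ]}}}},
}
revenue = latest_fy_value(sample_facts, "Revenues")
```
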
IP assets (USPTO TM + Patent)
stub
200

200-row synthetic preview of trademark + patent holders with realistic class distribution. IBM 1180 TMs / 110K patents at the top. Full bulk path stubbed in `ingest/uspto_tm.py` + `ingest/uspto_patent.py` against USPTO Open Data Portal CSV archives.

USPTO Open Data Portal
Browse →

Chinese · Overseas

1 live · 0 stub · 3 planned

Chinese companies that have signaled intent to expand overseas and would plausibly need US/cross-border legal counsel (corporate setup, IP, trade compliance, employment, immigration). Seeded by ~30 parallel web-search agents across distinct verticals: NEV/EV, batteries & solar, cross-border ecom (3C, apparel, home, beauty, appliances), short-drama apps, mobile games, AI/LLM, SaaS/B2B, biopharma & medical devices, payments & logistics, F&B/consumer brands, infrastructure/EPC, mining, semiconductors, unicorns, listed companies with overseas revenue.

Companies
live
1,233

Chinese companies with overseas-expansion signals (already operating abroad, hiring overseas, ODI/M&A, WIPO/USPTO trademarks, listed in chuhai (overseas-expansion) media, attending international trade shows). Each row carries name (zh+en), website(s), industry, HQ, overseas status, target markets, and any contact (email/phone/WeChat/LinkedIn) we could surface.

30 web-search agents (overseas-expansion media / Crunchbase / company websites / public rankings)
Browse →
Decision-maker contacts
planned

Per-company decision makers (founder / overseas BD head / GC / international counsel). Sourced from LinkedIn, company About pages, news mentions. WeChat IDs and personal mobile numbers are rarely public; we collect only what is publicly indexable.

LinkedIn + company About pages
WIPO Madrid · CN applicants
planned

Chinese applicants in the WIPO Madrid international trademark registry, a high-confidence overseas-intent signal. CSV bulk available; 50K+ Chinese applicants 2020-2026.

WIPO Madrid Monitor
MOFCOM ODI filings
planned

China Ministry of Commerce (MOFCOM) outbound direct investment (ODI) filings: registered overseas subsidiaries of Chinese parents. Aggregated from public MOFCOM disclosures + provincial commerce department listings.

MOFCOM + provincial commerce departments

Google Maps · Places

2 live · 0 stub · 2 planned

Operating businesses with websites, globally. Overture Maps Foundation Places (Meta + Microsoft + AWS + TomTom joint dataset, CDLA Permissive 2.0). 75.5M POI worldwide; 46.4M with at least one website. Foundation for outbound: scrape contact pages for emails, pull Google reviews, AI-distill pain points → personalized cold outreach.

Businesses (50K sample)
live
50,000

Stratified 50K sample (80 categories × 700) of US POI with websites, served from SSR JSON for instant browsing. Full 46.4M filtered parquet on NAS + S3 — see the live search below to query the entire dataset.

Overture Maps · 2026-04-15.0
Browse →
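
A per-category stratified sample like the one behind the 50K sample card can be sketched as below. This is an illustration, not the project's sampler: it assumes a fixed cap per category with smaller categories taken whole, and the `stratified_sample` name, the `category` key, and the fixed seed are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(rows, per_category, seed=0, key="category"):
    """Pick up to per_category rows from each category, reproducibly."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[key]].append(row)
    rng = random.Random(seed)  # fixed seed so the served sample is stable across builds
    sample = []
    for category in sorted(buckets):
        members = buckets[category]
        k = min(per_category, len(members))  # small categories contribute all their rows
        sample.extend(rng.sample(members, k))
    return sample


rows = [{"id": i, "category": "cafe"} for i in range(10)]
rows += [{"id": 100 + i, "category": "law_firm"} for i in range(2)]
sample = stratified_sample(rows, per_category=3)
```
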
Live search (all 46.4M)
live

DuckDB queries the 4.5GB filtered parquet on S3 directly via /api/places. Cold start ~15s, warm ~2s. Search every Overture place with a website worldwide — no SSR cap.

DuckDB-Node + S3 httpfs
Browse →
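
The live search runs on DuckDB-Node; the same idea in Python is sketched below as a query builder, so it can be read without S3 access. Everything specific here is an assumption: the `name` / `category` / `website` columns, the bucket URL, the 200-row cap, and the `build_places_query` helper are hypothetical, and the filtered parquet's real schema may differ.

```python
def build_places_query(parquet_url, term, limit=50, max_limit=200):
    """Return (sql, params) for a case-insensitive name search over a parquet file."""
    limit = max(1, min(int(limit), max_limit))    # clamp so one request cannot return everything
    url_literal = parquet_url.replace("'", "''")  # escape quotes inside the SQL string literal
    sql = (
        "SELECT name, category, website "
        f"FROM read_parquet('{url_literal}') "
        "WHERE name ILIKE ? "
        f"ORDER BY name LIMIT {limit}"
    )
    return sql, [f"%{term}%"]  # the search term is bound as a parameter, never inlined


sql, params = build_places_query("s3://example-bucket/places.parquet", "bakery", limit=10_000)

# Usage (needs the duckdb package and S3 credentials, so it is not run here):
#   import duckdb
#   con = duckdb.connect()
#   con.execute("INSTALL httpfs")
#   con.execute("LOAD httpfs")
#   rows = con.execute(sql, params).fetchall()
```
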
Google reviews + AI pain points
planned

Per-business: 15 most-recent Google reviews via Outscraper API, then Claude Haiku distills the top 3 customer pain points. Powers personalized cold-email generation. Budget ~$200 for 3K leads.

Outscraper + Claude
Verified emails
planned

Contact-page scrape + MillionVerifier SMTP verification. Augments Overture's sparse `emails[]` field (most rows have 0). Cost ~$0.001/verification.

site scrape + MillionVerifier
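
The contact-page scrape step could start from a plain regex pass over the fetched HTML, before any verification call. A minimal sketch follows, assuming the page is already downloaded as a string; `extract_emails` is a hypothetical helper, the regex is deliberately loose rather than RFC-complete, and the MillionVerifier request itself is left out.

```python
import re
from html import unescape

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}")
ASSET_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp")


def extract_emails(html):
    """Return unique lowercase email candidates in page order."""
    text = unescape(html)  # decode entity-obfuscated addresses such as &#64;
    seen, emails = set(), []
    for match in EMAIL_RE.findall(text):
        email = match.lower()
        if email.endswith(ASSET_SUFFIXES):
            continue  # retina asset names like logo@2x.png look like emails
        if email not in seen:
            seen.add(email)
            emails.append(email)
    return emails


page = (
    '<a href="mailto:Info@Example.com">Info@Example.com</a> '
    '<img src="logo@2x.png"> Sales: sales&#64;example.com'
)
found = extract_emails(page)
```

Deduplicating before verification matters for cost, since each address sent to the verifier is billed separately.
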
storage: NAS · /Volumes/data/thepenglaw-linkedin/data/
pipeline: Python ingest/ (Pydantic v2) → JSON → Next.js 16 SSR
deploys: AWS Amplify compatible
downstream: thepenglaw main repo · UnifiedPerson / CrmClient / Company