ThePengLaw · Data explorer
Multi-source staging for the law-firm CRM pipeline. Each source is a data vertical (LinkedIn profiles & jobs, US-wide CPA licensees, …); each card under it is a dataset with its own browser.
Profiles, companies, and job postings from public datasets (HuggingFace, Bright Data samples, research sets).
Real LinkedIn profiles (Label=1) from the navid-aub research set.
LinkedIn company pages from the Bright Data sample set.
Combined ~32K postings from 4 public LinkedIn job datasets — lukebarousse 220MB (data jobs), datastax 493MB (general LinkedIn), xanderios 129MB (snapshot), mlawrence 73MB (tech jobs). Per-source quota of 8K each.
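A minimal sketch of how a per-source quota merge like this could work, assuming each dataset is already loaded as an iterable of dict rows. The function name `combine_with_quota` and the `source` tag field are illustrative, not taken from the pipeline.

```python
from itertools import islice


def combine_with_quota(sources, quota=8000):
    """Take at most `quota` rows from each source and tag each row with its origin.

    `sources` maps a source name to an iterable of dict rows (hypothetical shape).
    """
    combined = []
    for name, rows in sources.items():
        for row in islice(rows, quota):
            combined.append({**row, "source": name})
    return combined
```

With four sources and a quota of 8K, the result holds at most 32K rows; a source with fewer rows than the quota simply contributes everything it has.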
CPA · US
2 live · 0 stub · 2 planned
US-wide Certified Public Accountant data across 55 jurisdictions: state-board licensee rosters, NASBA CPAverify, PCAOB-registered firms, IRS PTIN holders, discipline actions. Target ~600–800K records by Phase 1.
Active CPAs sampled from the Florida Board of Accountancy (weekly bulk xlsx, ~71K total) + IRS PTIN FOIA (~208K CPA-bearing nationwide). Phase 2: NASBA ALD per-name enrichment, TX TSBPA, CA CBA, NY NYSED.
697 PCAOB-registered audit firms (with engagement-partner + issuer counts) merged with 1,628 AICPA GAQC governmental audit member firms + 8 State Audit Organizations.
IRS PTIN holders (858K total, biannual FOIA CSV) — superset that also includes EAs / Attorneys / uncredentialed preparers. CPAs are already in /licensees.
PCAOB enforcement actions + inspection reports (4.3K, CSV/XML/JSON), SEC accountant suspensions, state-board disciplinary listings (NC, MN, TX, CA, …).
US Business
8 live · 3 stub · 0 planned
All currently-registered US business entities: SEC EDGAR public companies (10K with ticker, 1M+ historical CIKs), IRS exempt orgs (1.95M nonprofits), state SOS filings (NY 20.5M free), GLEIF LEI (349K US, CC0), SAM.gov vendors, PPP recipients (5M small businesses).
Live mix at 60K total: 9K SEC public companies (with ticker + exchange) + 9K NY DOS recent formations (Socrata live) + 9K SBA PPP recipients + 3K GLEIF LEI + 30K nonprofits across all 14 downloaded IRS EO states (CA/TX/NY/FL/IL/PA/OH/GA/MI/NC/NJ/MA/VA/WA). Detail page at `/business/entities/[id]` cross-links via clusters when available.
Named officers, directors and key employees from public US filings — SEC EDGAR Form 4 (insider transaction filings, public companies) + IRS Form 990 Part VII Section A (compensation table for nonprofits, top-paid first). Names + titles + companies + compensation. No emails — pair with company website / Apollo for outreach.
Linked-entity clusters where the same legal entity appears across 2+ sources — joined via shared CIK / EIN / LEI / UEI or (normalized name + state) fallback. 1,000 clusters retained (member_count ≥ 2) from union-find over 6 datasets. Click a cluster to see all source records side-by-side.
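The clustering described above can be sketched with a small union-find. This is an illustration under stated assumptions, not the pipeline's actual code: the record field names (`cik`, `ein`, `lei`, `uei`, `name`, `state`) and the rule that the name + state key is used only when a record carries no identifier at all are both guesses.

```python
from collections import defaultdict

ID_KEYS = ("cik", "ein", "lei", "uei")  # assumed field names


def normalize_name(name):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in name)
    return " ".join(cleaned.split())


def find(parent, i):
    """Return the root of i, halving the path as we walk up."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def union(parent, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra


def cluster_records(records):
    """Group record indexes that share an identifier (or the fallback key)."""
    parent = list(range(len(records)))
    first_seen = {}  # join key -> index of the first record carrying it
    for idx, rec in enumerate(records):
        keys = [(k, rec[k]) for k in ID_KEYS if rec.get(k)]
        # Fallback (assumption): only records without any identifier
        # are matched on normalized name + state.
        if not keys and rec.get("name") and rec.get("state"):
            keys = [("name_state", (normalize_name(rec["name"]), rec["state"].upper()))]
        for key in keys:
            if key in first_seen:
                union(parent, first_seen[key], idx)
            else:
                first_seen[key] = idx
    groups = defaultdict(list)
    for idx in range(len(records)):
        groups[find(parent, idx)].append(idx)
    # Keep only clusters with member_count >= 2.
    return [members for members in groups.values() if len(members) >= 2]
```

Because unions are transitive, a record that shares a CIK with one source and an LEI with another pulls all three into the same cluster.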
Active NY State corporations / LLCs / LPs / nonprofits via Socrata `n9v6-gdp6` (data.ny.gov, free, anonymous, near-realtime). 50K rows of the canonical one-row-per-entity registry across 5 entity_kinds. Full corpus 20.5M filings via the per-event 63wc-4exh dataset.
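As a rough sketch, paging through the dataset could use the standard Socrata SODA `$limit` / `$offset` parameters. The resource URL below follows the usual `/resource/<id>.json` pattern and is assumed here rather than confirmed from the ingest code; `page_urls` is a hypothetical helper that only builds URLs and makes no network calls.

```python
from urllib.parse import urlencode

# Assumed endpoint, following the common SODA URL pattern for data.ny.gov.
BASE_URL = "https://data.ny.gov/resource/n9v6-gdp6.json"


def page_urls(total_rows, page_size=1000):
    """Yield one request URL per page using $limit/$offset paging."""
    for offset in range(0, total_rows, page_size):
        params = {
            "$limit": min(page_size, total_rows - offset),
            "$offset": offset,
        }
        # safe="$" keeps the SODA parameter names readable in the URL.
        yield f"{BASE_URL}?{urlencode(params, safe='$')}"
```

For a 50K-row pull at 1,000 rows per page this produces 50 URLs. A real ingest would likely also add a stable sort order so pages do not shift between requests.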
Florida corp/LLC/LP filings via free SFTP (Public/PubAccess1845!). 60-row synthetic fixture spanning Miami/Orlando/Tampa/Jacksonville/Tallahassee. Full ingest path stubbed in `ingest/fl_sunbiz.py` with the 1440-byte fixed-width field-offset map encoded.
10K unique borrowers sampled from SBA `public_150k_plus` FOIA (968K loans / 863K unique). Top 4K by initial amount + 6K evenly-spaced mid-tier. 54 states/territories. Full deduped corpus (863,501 rows) on NAS at `staging/ppp_unique.parquet` (24 MB zstd).
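One way to read the "top 4K + 6K evenly-spaced mid-tier" split is sketched below. It assumes the evenly-spaced picks are taken by rank from the borrowers that remain after the top slice; the field name `initial_amount` and the function name are illustrative.

```python
def sample_borrowers(rows, top_n=4000, mid_n=6000, amount_key="initial_amount"):
    """Return the top_n largest rows plus mid_n rows evenly spaced by rank.

    Assumption: "evenly-spaced" means equal rank intervals over the rows
    left once the top slice has been removed.
    """
    ranked = sorted(rows, key=lambda r: r[amount_key], reverse=True)
    top, rest = ranked[:top_n], ranked[top_n:]
    if mid_n <= 0 or not rest:
        return top
    if len(rest) <= mid_n:
        return top + rest
    step = len(rest) / mid_n  # > 1, so the picked indexes never repeat
    mid = [rest[int(i * step)] for i in range(mid_n)]
    return top + mid
```

Sampling by rank rather than at random keeps the preview deterministic and spreads it across the whole loan-size range.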
500 US LEI records sampled across 174 distinct EntityLegalForm codes and all 50 states + DC, from the GLEIF Golden Copy (3.3M global / 349K US, CC0). Streamed from the lei2 zip; relationship records (parent LEI) live in the separate rr_latest.zip and are not yet joined.
60-row synthetic fixture of federal vendors with realistic UEI / CAGE / NAICS across DC/VA/MD/CA/TX/FL/GA. Full ingest path stubbed in `ingest/sam_gov.py` for the SAM_PUBLIC_MONTHLY_V2 monthly extract — gated on a free api.data.gov key.
1,351 top federal-contract recipients aggregated from USAspending.gov spending_by_award API (5K award rows → 1.35K unique UEI). Lockheed Martin $322B, Electric Boat $141B at the top. DoD-dominated; 47 states represented. Free, no auth.
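The roll-up from award rows to unique recipients can be sketched as a group-by on UEI. The row fields used here (`uei`, `recipient_name`, `amount`) are placeholder names for illustration and may not match the actual API response keys.

```python
from collections import defaultdict


def aggregate_by_uei(award_rows):
    """Sum award amounts per UEI and return recipients sorted by total, descending."""
    totals = defaultdict(lambda: {"name": None, "total": 0.0, "awards": 0})
    for row in award_rows:
        uei = row.get("uei")
        if not uei:
            continue  # rows without a UEI cannot be attributed to a recipient
        agg = totals[uei]
        agg["name"] = agg["name"] or row.get("recipient_name")
        agg["total"] += float(row.get("amount") or 0)
        agg["awards"] += 1
    return sorted(
        ({"uei": uei, **agg} for uei, agg in totals.items()),
        key=lambda r: r["total"],
        reverse=True,
    )
```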
220 latest-FY financial snapshots (revenue / net income / assets / employees) for popular tickers from `data.sec.gov/api/xbrl/companyfacts/`. Apple $416B, Amazon $717B, MSFT $282B captured. Full nightly companyfacts.zip (~10GB) wired in `ingest/sec_companyfacts.py`.
200-row synthetic preview of trademark + patent holders with realistic class distribution. IBM 1180 TMs / 110K patents at the top. Full bulk path stubbed in `ingest/uspto_tm.py` + `ingest/uspto_patent.py` against USPTO Open Data Portal CSV archives.
Chinese · Overseas
1 live · 0 stub · 3 planned
Chinese companies that have signaled intent to expand overseas and would plausibly need US/cross-border legal counsel (corporate setup, IP, trade compliance, employment, immigration). Seeded by ~30 parallel web-search agents across distinct verticals: NEV/EV, batteries & solar, cross-border e-commerce (3C, apparel, home, beauty, appliances), short-drama apps, mobile games, AI/LLM, SaaS/B2B, biopharma & medical devices, payments & logistics, F&B/consumer brands, infrastructure/EPC, mining, semiconductors, unicorns, listed companies with overseas revenue.
Chinese companies with overseas-expansion signals (already operating abroad, hiring overseas, ODI/M&A, WIPO/USPTO trademarks, listed in 出海 (overseas-expansion) media, attending international trade shows). Each row carries name (zh+en), website(s), industry, HQ, overseas status, target markets, and any contact (email/phone/WeChat/LinkedIn) we could surface.
Per-company decision makers (founder / overseas BD head / GC / international counsel). Sourced from LinkedIn, company About pages, news mentions. WeChat IDs and personal mobile numbers are rarely public; we collect only what is publicly indexable.
Chinese applicants in the WIPO Madrid international trademark registry, a high-confidence overseas-intent signal. CSV bulk available; 50K+ Chinese applicants, 2020-2026.
China Ministry of Commerce outbound direct investment (境外投资) filings: registered overseas subsidiaries of Chinese parents. Aggregated from public Ministry of Commerce (商务部) disclosures + provincial Department of Commerce (商务厅) listings.
Google Maps · Places
2 live · 0 stub · 2 planned
Operating businesses with websites, globally. Overture Maps Foundation Places (Meta + Microsoft + AWS + TomTom joint dataset, CDLA Permissive 2.0). 75.5M POI worldwide; 46.4M with at least one website. Foundation for outbound: scrape contact pages for emails, pull Google reviews, AI-distill pain points → personalized cold outreach.
Stratified 50K sample (80 categories × 700) of US POI with websites, served from SSR JSON for instant browsing. Full 46.4M filtered parquet on NAS + S3 — see the live search below to query the entire dataset.
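A stratified sample of this kind can be sketched as a per-category cap. This assumes each place row carries a single `category` field and that rows are drawn at random within a category; both are illustrative choices, not confirmed details of the pipeline.

```python
import random
from collections import defaultdict


def stratified_sample(places, per_category=700, seed=0):
    """Draw up to `per_category` random rows from each category."""
    by_category = defaultdict(list)
    for place in places:
        by_category[place["category"]].append(place)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = []
    for category in sorted(by_category):
        rows = by_category[category]
        sample.extend(rng.sample(rows, min(per_category, len(rows))))
    return sample
```

Categories with fewer than `per_category` rows contribute everything they have, so the total can fall below the number of categories times the cap.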
DuckDB queries the 4.5GB filtered parquet on S3 directly via /api/places. Cold start ~15s, warm ~2s. Search every Overture place with a website worldwide — no SSR cap.
Per-business: 15 most-recent Google reviews via Outscraper API, then Claude Haiku distills the top 3 customer pain points. Powers personalized cold-email generation. Budget ~$200 for 3K leads.
Contact-page scrape + MillionVerifier SMTP verification. Augments Overture's sparse `emails[]` field (most rows have 0). Cost ~$0.001/verification.