AI-Bio
Research
The verdict
The only AI-bio company building a proprietary, legally-clean training dataset 100x deeper than UniProt — own the data layer of TechBio and it is a generational compounder; if foundation labs (ESM, AlphaFold) reach "good enough" on public data first, it is a beautifully-sourced commodity. Watch for an EDEN-derived therapeutic deal with real economics and a priced Series C.
Basecamp Research is a London- and Boston-based AI-bio company building the world's largest proprietary, ethically-sourced database of natural biology and the foundation models trained on top of it. It was founded in 2019/2020 by two Oxford biology PhDs, Glen Gowers (CEO) and Oliver Vince, after Gowers ran the first fully off-grid DNA-sequencing expedition in the polar regions.
The business model in plain terms: Basecamp sends scientists (and pays local "guardian" partners) to collect microbial and environmental samples from extreme, biodiverse environments — Amazon rainforest, Antarctic deserts, hydrothermal vents — across 150+ locations in 28 countries. It sequences that material into a proprietary dataset (BaseData™), structures it as a knowledge graph (BaseGraph™, built on Neo4j, 5B+ biological relationships, +500M every 4 weeks), and trains AI models (BaseFold for structure, EDEN for generative design) on it. It then licenses access / co-designs proteins for industrial and therapeutic partners.
Two revenue motions, two stages:
Contract structure / key terms: B2B partnerships (no consumer product). Explicitly not a "ChatGPT for biology" public product — Vince says no consumer-facing model is planned. Revenue mix is licensing + co-development + royalties. A distinctive term: a benefit-sharing royalty flows back to source-country "guardians" — since 2023 Basecamp has paid royalties to 60 organizations across 21 countries based on use of the digital sequence information. Customer concentration is unknown but likely high (≈15 active commercial partners — see Lens 4).
For a data-and-models company the "supply chain" is its data-acquisition pipeline → compute → models → partner deployment. Named stakeholders along it:
Upstream — sample sourcing (the differentiator):
Midstream — sequencing & compute:
Downstream — deployment / buyers:
Chokepoints / single-source dependencies: (1) Sequencing throughput & cost: the Trillion Gene Atlas's feasibility rests on Ultima/PacBio economics; (2) GPU access: deeply NVIDIA-dependent for both compute and capital; (3) Permitting / Nagoya-Protocol access rights per country: the moat is itself a chokepoint, one Basecamp currently passes through with privileged access (see Lens 3).
The thesis-defining moat is the data, not the model. Three reinforcing layers:
Proprietary, deeper data. Basecamp claims its dataset holds ~10B+ novel-to-science protein sequences / 10T+ tokens of evolutionary DNA from 1M+ newly-discovered species, and is ~100x richer in "advanced biological systems" than the public databases (UniProt) most pharma uses. Everyone else (EvolutionaryScale, DeepMind/AlphaFold, Profluent) trains on the same public UniProt/PDB corpus. If model quality is increasingly data-bound (Profluent itself published "scaling laws" for protein-design data), a proprietary corpus an order of magnitude deeper is the single most durable edge in the category. This is the whole bull case in one sentence.
Legal / provenance moat (the underrated one). Every sample is Nagoya-Protocol compliant, with tracked provenance metadata and benefit-sharing contracts. As regulators and pharma legal teams tighten on the lineage of training data (and the new BBNJ high-seas treaty lands), a competitor cannot retroactively make a scraped public dataset compliant. Pharma partners get indemnified, clean-title biology — a procurement requirement competitors can't cheaply replicate. The flip side is also the headline risk (Lens 10/13).
Compounding data flywheel + tooling. BaseGraph adds 500M relationships every 4 weeks; the tag-and-track system measures each sample's contribution to downstream outputs, enabling the royalty model and active-learning prioritization of what to sample next.
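To put the stated growth rate in annual terms, here is a minimal sketch that extrapolates BaseGraph's size. It assumes the 5B figure is a starting baseline (the source says "5B+", so this is a floor) and that the 500M-per-4-weeks rate stays constant. Neither assumption is confirmed by Basecamp, and the function name and parameters are illustrative only.

```python
def relationships_after(weeks: int,
                        baseline: int = 5_000_000_000,
                        added_per_period: int = 500_000_000,
                        period_weeks: int = 4) -> int:
    """Linear extrapolation of graph size (hypothetical helper).

    Assumes a fixed number of relationships is added every `period_weeks`
    weeks; only completed periods are counted.
    """
    completed_periods = weeks // period_weeks
    return baseline + completed_periods * added_per_period


one_year = relationships_after(52)  # 13 completed four-week periods
added_in_year = one_year - relationships_after(0)

print(f"Added over 52 weeks: {added_in_year / 1e9:.1f}B")
print(f"Size after 52 weeks: {one_year / 1e9:.1f}B")
```

Under these assumptions the graph would add 6.5B relationships over 52 weeks (13 four-week periods), reaching about 11.5B, so the stated cadence more than doubles the stated baseline within a year.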
Bargaining power: Strong over industrial customers (unique enzymes, fast turnaround: a 2-year directed-evolution campaign compressed to ~1 month). Weaker over pharma therapeutics partners, who hold the clinical/regulatory/commercialization capability and can multi-source design tools. Over suppliers: high dependence on NVIDIA (mitigated by NVentures equity alignment) and on sovereign access rights.
Moat durability — the honest read: The data moat is real today and hard to copy fast. It is not permanent — if foundation models on public data reach "good enough" for most design tasks before Basecamp monetizes the depth premium, the proprietary edge compresses to niche/hard-target value. The legal moat is more durable than the data-depth moat.
No audited segment data exists (segments.csv empty; private). Qualitatively, revenue spans five application verticals with an unknown split:
| Vertical | Status | Named anchors | Trend |
|---|---|---|---|
| Industrial enzymes / manufacturing | Live, largest near-term | $16B chemicals co.; cold-water detergent (P&G) | Core revenue today |
| Cosmetics / consumer | Live | Colorifix (dyes) | Growing |
| Food & nutrition | Live | unnamed | Growing |
| Bioremediation / sustainability | Live | unnamed | Early |
| Pharma / therapeutics | Strategic pivot | Broad/Liu Lab; aiPGI gene insertion | The future weight |
Trend & cause: ~15 commercial partnerships were the base at Series B (Oct 2024); the company is narrowing breadth and concentrating on therapeutics. The Liu collaboration (2024), EDEN/aiPGI (Jan 2026), and the Trillion Gene Atlas (Mar 2026) all point the same way: industrial cash flow today, genetic-medicine optionality as the value driver. Geographic split: n/a (private, not disclosed).
No earnings print exists. The scoreboard is the financing trajectory:
| Round | Date | Amount | Lead | Post-money | Source |
|---|---|---|---|---|---|
| Seed (cumulative) | pre-2022 | ~$10M | True Ventures, Hummingbird | n/a | |
| Series A | Dec 2022 | $20M | Systemiq Ventures (Valo, Blue Horizon, True, Hummingbird) | $71M | |
| Series B | Oct 2024 | $60M (£45.9M) | Singular (S32, redalpine; angels André Hoffmann/Roche, Feike Sijbesma/Philips, Paul Polman/Unilever; True, Hummingbird) | undisclosed "up-round" | |
| Pre-Series C | 2025–26 | undisclosed | NVentures (NVIDIA VC) | undisclosed | |
| Total raised | — | ~$85M | — | — |
Revenue (the one hard number): ~$2.7M revenue with 37 employees, FY2023 per getlatka/Latka. Treat as a low-confidence third-party estimate (unaudited), but it is the only revenue figure on record and it tells the real story: this is a pre-scale, R&D-and-data-heavy company whose value is in the dataset and optionality, not current cash flow.
Valuation flag (handle with care): Third-party aggregators show secondary-share-sale references at a $5B valuation. This is unverified, likely conflated/erroneous secondary data, and inconsistent with a $71M post-A (2022) and an undisclosed up-round in 2024. Do not anchor on $5B. Best defensible read: a low-hundreds-of-millions post-B, with the NVentures pre-C and Trillion Gene Atlas momentum pushing the next priced round materially higher. Mark: n/a (not reliably sourced).
Burn signal: Raising ~$85M to fund a planetary sampling + GPU-cluster program against ~$2.7M revenue ⇒ deeply capital-consumptive; runway and burn are not disclosed but the pre-Series C raise implies the Series B is largely deployed.
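The capital intensity described above can be made concrete with two back-of-envelope ratios. This is a sketch using only the figures quoted in this note (~$85M raised, ~$2.7M FY2023 revenue, 37 employees). The revenue and headcount numbers are the unaudited third-party estimate, so the outputs inherit that low confidence, and the variable names are illustrative.

```python
# Figures as quoted in this note; revenue and headcount are unaudited estimates.
total_raised_usd = 85_000_000
revenue_fy2023_usd = 2_700_000
employees = 37

# Dollars of capital raised per dollar of annual revenue.
raised_to_revenue = total_raised_usd / revenue_fy2023_usd

# Annual revenue generated per employee.
revenue_per_employee = revenue_fy2023_usd / employees

print(f"Capital raised / revenue: {raised_to_revenue:.1f}x")
print(f"Revenue per employee: ${revenue_per_employee:,.0f}")
```

On these inputs the company has raised roughly 31x its only reported annual revenue, with revenue per employee of roughly $73K, which is consistent with reading it as a data-and-R&D build rather than a cash-flow business.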
No earnings calls. Proxy = founder cadence and message consistency across press/conference appearances (JPM 2026, SXSW, Forbes, VIB). Tone trajectory has shifted from "biodiversity data company" (2022 Series A: "design protein products reflecting the world's biodiversity") → "GPT for biology" / foundation-model framing (2024 Series B) → "design therapeutics from a disease prompt" (2026 EDEN/Trillion Gene Atlas). The message has consistently escalated in ambition and moved up the value chain toward medicine: a deliberate repositioning from industrial-enzyme vendor to TechBio foundation-model platform. Recurring phrases: "beyond known biology," "ethically sourced," "scaling laws for biology," "100x deeper than public data." What's notable is the consistency of the data-moat story across four years: they have not pivoted the core thesis, only raised the ceiling on what it enables.
Syndicate quality (the +private IPO-proximity tell): Mixed-but-strengthening. Early backers True Ventures + Hummingbird are credible but not crossover funds. The 2024 round added strategic gravity rather than crossover capital: Singular (top European VC), S32 (deep-tech), and an unusually heavyweight angel bench of André Hoffmann (Roche vice-chair), Feike Sijbesma (ex-DSM CEO, Philips chair), and Paul Polman (ex-Unilever CEO), i.e. the people who buy this technology at industrial scale. The pre-Series C NVentures entry is a strategic/compute alignment, not a Fidelity/T. Rowe crossover. Verdict: no tier-1 crossover (Fidelity/Coatue/T. Rowe) on the cap table yet → IPO is not imminent; this is a private compounding story for now.
Peer / mechanism comps (no P/E possible — all private or development-stage):
| Company | Approach | Funding | Status | Source |
|---|---|---|---|---|
| EvolutionaryScale | ESM3 protein LLM (public-data) | $142M seed | Acquired by CZ Biohub, Nov 2025 | |
| Profluent | Protein-design LLMs, "scaling laws," OpenCRISPR | $150M (Series B $106M) | Private | |
| Cradle | Protein-engineering SaaS | $103M | Private | |
| Latent Labs | De-novo design, web tool (ex-AlphaFold) | ~$50M | Private | |
| Generate Biomedicines | Generative protein therapeutics | $65M Novartis upfront + equity | Private | |
| Isomorphic Labs | AlphaFold3 drug discovery | Alphabet-backed | Subsidiary | |
| DeepMind / AlphaFold | Structure prediction (public) | — | Alphabet | — |
| Basecamp Research | Proprietary data + foundation models | ~$85M | Private |
Differentiated positioning: Basecamp is the only one of these whose primary asset is a proprietary dataset rather than a model architecture trained on shared public data. EvolutionaryScale's acquisition by CZ Biohub (Nov 2025) is a double-edged comp — it validates the category's strategic value and signals that a model-only player struggled to build a standalone commercial business (it became an internal capability for a non-profit). That asymmetry is the bull case for owning data instead.
AI-protein-design TAM: $1.5B (2025) → $6.98B (2033), 21.2% CAGR.
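As a consistency check on the TAM line, the sketch below recomputes the growth rate implied by the two endpoints and projects the 2025 figure forward at the quoted rate. It assumes annual compounding over the eight years from 2025 to 2033; the helper names are illustrative, and the inputs are the figures quoted above.

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by two endpoint values."""
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value: float, annual_rate: float, years: int) -> float:
    """Value after compounding `annual_rate` once per year for `years` years."""
    return start_value * (1 + annual_rate) ** years


tam_2025_usd_bn = 1.5
tam_2033_usd_bn = 6.98
years = 2033 - 2025  # eight compounding periods

cagr = implied_cagr(tam_2025_usd_bn, tam_2033_usd_bn, years)
projected_2033 = project(tam_2025_usd_bn, 0.212, years)

print(f"Implied CAGR: {cagr:.1%}")
print(f"2033 TAM at 21.2%: ${projected_2033:.2f}B")
```

Both directions agree to rounding: the endpoints imply about 21.2% a year, and compounding $1.5B at 21.2% for eight years gives about $6.98B, so the three quoted numbers are internally consistent.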
No traded stock; "catalysts" = milestone events that re-rated the private narrative:
Pattern: The story re-rates on (a) data-scale proof points and (b) credibility-by-association with marquee partners (NVIDIA, Anthropic, Broad). What the "market" (investors) reacts to is evidence the data depth converts into model capability — exactly what EDEN's gene-insertion results were designed to show. The next re-rating trigger is a therapeutic asset with real economics (a partnered drug program with milestones/royalties), not another dataset record.
Track record: Built, from a polar expedition, the largest proprietary biodiversity-genomics dataset in the world and raised ~$85M from credible investors in ~5 years — strong execution for a deep-science startup. No prior exits on record (first-time founders).
Skin in the game: Founder-led, both co-founders active; insider ownership not disclosed but presumptively high (private, two priced rounds). n/a — not disclosed.
Capital allocation: Heavy, deliberate reinvestment into the data moat and compute (the single biggest, most defensible bet they could make) plus the benefit-sharing royalty program (reputational/legal capital). No buybacks/dividends (irrelevant at stage). The Trillion Gene Atlas is a large capital commitment — execution risk is real but it's the right thing to spend on if the data thesis is correct.
Red flags (management-level): First-time operators scaling a capital-intensive planetary program against thin revenue — classic vision-vs-commercialization tension. The repeated escalation of ambition (industrial → "GPT for biology" → cure disease on demand) is either genuine capability expansion or narrative inflation to support up-rounds; the EDEN results lean toward the former, but watch the gap between press-release capability and partnered-program economics.
Founder vs. professional manager: Pure scientist-founder archetype — right for the data/science-building phase; the open question is whether therapeutics commercialization eventually needs a pharma-seasoned operator alongside.
No CIK; not an SEC registrant; no EDGAR Litigation Releases or AAERs possible. Web search for "Basecamp Research" (FTC OR DOJ OR FDA OR consent decree OR settlement OR penalty) surfaced no enforcement actions as of 2026-06-23; the only material public criticism is the FT biocolonialism reporting (reputational, not enforcement).
Summary: No regulatory or legal enforcement findings, verified via SEC EDGAR EFTS (LR, AAER: n/a, no CIK), web search, and public reporting as of 2026-06-23. The one material risk is the Nagoya/benefit-sharing/biocolonialism axis, which is reputational-regulatory, not accounting.
No EPS projection (no P&L, pre-scale). Two forward questions matter:
(A) +private: path to tradeable.
(B) +clinical: optionality value. The therapeutic upside is genuine option value, not modellable rNPV yet (no named clinical-stage asset with peak-sales/PoS inputs; aiPGI is platform/preclinical). The value driver is whether the data depth + EDEN convert into partnered programs that pay milestones and royalties before cash runs out and before public-data models close the gap. That is the binary to track.
No forecast.ts create — per --watchlist rules and because there is no committable EPS/binary-readout base case (no named asset, no fiscal P&L). The honest scoreable forecast here would be financing/partnership-event-based, which the tracker isn't shaped for.
Bull case. Basecamp owns the one input every rival lacks: a proprietary, legally-clean biological dataset ~100x deeper than the public corpus everyone else trains on. If protein/genome model quality is data-bound (and the field's own scaling-law work says it increasingly is), then in a world where model architectures commoditize, the data layer captures the rents. Basecamp becomes the "Bloomberg/Foundry of biology," licensing clean, deep, indemnified biology to every pharma and industrial player, while compounding the moat by 500M relationships every 4 weeks. EDEN's gene-insertion results show the depth converts to capability. NVIDIA (also an investor) and Anthropic as compute/model partners de-risk the build. Optionality: any one EDEN-designed therapeutic that reaches the clinic re-rates the whole company. It's a generational data-monopoly bet in the highest-value AI vertical.
Bear case (permanent-impairment risks).
Pre-mortem (18 months out, thesis broke): Series C either doesn't price or prices flat; EDEN's headline results don't reproduce independently / no therapeutic partner signs real economics; a public-data model matches BaseFold on the tasks customers actually pay for; and a sovereign dispute makes one flagship dataset legally radioactive. Industrial revenue stays sub-$10M and the company is acqui-hired by a sequencing or cloud strategic at a flat-to-down mark.
Are multiples too high? Unmeasurable (private, undisclosed). The $5B secondary reference, if real, would be wildly ahead of ~$2.7M revenue — a pure data-optionality bet. Contrarian view of what the market refuses to see: the consensus debate is "AI protein design model wars" (ESM vs. AlphaFold vs. Profluent); the real question almost no one is pricing is whether proprietary biological data is a defensible product or merely a temporary input that public corpora + better architectures eventually subsume. Basecamp is the purest long on "data is the moat." If that's right it's a monopoly; if it's wrong it's a beautifully-sourced commodity.
Dismantling the bull case:
Single scenario that permanently impairs: a credible independent demonstration that a public-data foundation model matches Basecamp's designs on the commercially-relevant targets — that would reveal the proprietary dataset as a cost, not a moat, and there is no recovery from "you built a wall nobody needed."