TL;DR
- The prompt set is the foundation. Every AI visibility metric — share of voice, citation rate, sentiment — is downstream of which prompts you track. Imagined prompts produce imagined results.
- Two modes meet brands where they are. Cold start takes only a domain and produces 25 starter prompts across 5 topics in under a minute. Deep research takes a seed topic and runs a full multi-source pipeline producing 12–20 canonical clusters in minutes.
- Four sources sampled in parallel. Reddit (raw user phrasing), Google PAA (algorithmic intent), Quora (professional voice), and web-grounded LLM fan-outs (Claude with live web search to anchor synthetic prompts in current discussion).
- Two-axis tagging on every prompt. Awareness stage (Problem Unaware / Problem Aware / Solution Aware) plus one of nine intent buckets — so you can see where the brand is missing in the funnel and what content gap that implies.
- Web search closes the cutoff gap. Claude issues real searches during generation, observing current forum threads, news, and competitor framings instead of relying on training data alone.
- Output is a measurement plane, not a wish-list. CSVs designed for direct import into a tracker, with canonical prompts selected by coverage so one prompt represents many.
- Faster than manual, more transparent than proprietary. Compresses Omniscient's analyst workflow and narrows the gap with Profound's closed dataset, with methodology you can audit end to end.
If you're tracking how your brand shows up in ChatGPT, Claude, Gemini, or Perplexity, the most important decision you'll make isn't which platform to use. It's which prompts to track.
Every AI visibility number — share of voice, citation rate, sentiment, competitive ranking — is downstream of one input: the list of prompts being measured. Get the prompt set wrong and every dashboard built on top of it tells you something untrue with confidence. Get it right and you have a defensible baseline for every decision that follows.
At Indexly, we believe prompt research is the most under-engineered layer of the AI visibility stack. The industry has spent the last eighteen months building beautiful interfaces on top of prompt sets that are still, in most cases, hand-curated by marketers guessing at what their customers are typing into LLMs. That gap is what we set out to close.
This is how we think about it.
The problem with how prompt sets are built today
Most AI visibility platforms ask you to bring your own prompts. You sit down, brainstorm twenty or thirty queries you think your customers might ask, and import them. The platform then runs those across LLMs and shows you results.
The flaw is obvious once you name it: you're measuring how the brand performs against questions you imagined, not questions your market is actually asking.
Two competitors have started to address this. Profound ships Prompt Research Reports built on a proprietary dataset of 1.5 billion real user prompts captured from answer engines — genuinely impressive, but locked to their own data pipeline and pricing tier. Omniscient Digital published a thoughtful framework grounded in the Eugene Schwartz five-stage awareness model, but the workflow itself is manual: a human analyst spending days assembling Ahrefs exports, Reddit threads, Quora questions, and Perplexity follow-ups into a balanced spreadsheet.
Both approaches produce excellent prompt sets. Neither scales to the speed at which brands actually need to operate.
We took a different position: the methodology Omniscient documented is sound, the data sources Profound aggregates are real, and the right move is to compress that entire workflow — research design, data gathering, classification, clustering, canonical selection — into a single automated pipeline that any brand can run on their own domain. Then go one step further: ground every step in current real-world conversation by giving the language model live web access during generation.
Two modes, two starting points
A brand approaching AI visibility research is in one of two situations. Either they know exactly what topic they want to investigate — a specific product line, a competitive frontier, a category they're trying to break into — or they don't, and they need a balanced starting point that spans their entire market footprint.
We built a mode for each.
Cold start mode takes a single input: the domain. Nothing else. The system crawls the brand's site, identifies five distinct topics that represent the brand's strategic positioning, and generates five starter prompts per topic distributed across the buyer journey. Twenty-five prompts, ready to import as the brand's initial AI visibility tracking set, in under a minute. This is what runs at onboarding.
Deep research mode takes a domain plus a seed topic, and runs the full multi-source pipeline against that topic — pulling real-user prompts from Reddit, Quora, Google's People Also Ask, and structured LLM fan-outs, then filtering, embedding, deduplicating, classifying, clustering, and selecting canonical representatives. This is what runs when a brand wants to go deep on a specific market.
The two modes share infrastructure but answer different questions. Cold start answers "what should I be measuring across my entire surface area?" Deep research answers "for this specific topic, what are the trackable canonical prompts and which clusters dominate share of voice?" Most brands run cold start first, pick the most strategic topic from the output, then run deep research on it. Both produce structured outputs ready for a tracker.
What grounds the output in reality
Both modes use Claude with the live web search tool enabled at the most strategically loaded generation steps. This matters more than it might at first seem.
A language model without web access generates prompts based entirely on patterns in its training data — which has a cutoff. For a category like AEO/GEO, where competitive landscapes shift monthly and new tools launch weekly, that cutoff is a real problem. Training-grounded fan-outs feel plausible but reflect the world as it was, not as it is.
When we enable web search, the model issues real searches during generation: "Indexly vs," "best AI visibility tools 2026," "how do people track LLM visibility on Reddit." It reads current forum threads, current news, current competitor framings, and uses what it observes to ground topic selection and prompt phrasing. A user query that became dominant in the last three months will appear in the output. A retired competitor or stale framing won't.
This is what closes the gap with proprietary real-conversation datasets. We don't have 1.5 billion captured prompts. We have something different: a model that can read the current public web during generation, anchored by a methodology that ensures structural balance and stage coverage, fed into a clustering and canonical-selection pipeline that produces measurement-ready output.
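To make the mechanics concrete, here is a minimal sketch of a web-grounded generation call using the Anthropic Python SDK. The model ID, tool version string, search cap, and prompt wording are illustrative assumptions, not our production configuration.

```python
# Minimal sketch of a web-grounded fan-out generation call (illustrative
# assumptions throughout; not production configuration).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",   # assumption: any web-search-capable model
    max_tokens=4000,
    tools=[{
        "type": "web_search_20250305",  # Anthropic's server-side web search tool
        "name": "web_search",
        "max_uses": 5,                  # cap live searches per generation call
    }],
    messages=[{
        "role": "user",
        "content": (
            "Research how people currently ask about AI visibility tracking. "
            "Search for recent discussions, then generate 20 prompts balanced "
            "across Problem Unaware, Problem Aware, and Solution Aware stages. "
            "Return a JSON array of {prompt, stage} objects."
        ),
    }],
)

# Search-result blocks and text blocks are interleaved; the generated prompts
# live in the text blocks.
for block in response.content:
    if block.type == "text":
        print(block.text)
```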
What the deep research pipeline actually does
When a brand provides a seed topic, the pipeline runs in five stages.
Stage 1: Brand context derivation. We crawl the brand's marketing site — homepage, product pages, pricing, about — and use Claude to extract three things: brand name, a precise category description, and eight category keywords weighted toward the seed topic provided. The seed topic disambiguates multi-product brands. A company running both an SEO platform and a content workflow tool gets very different keyword extractions depending on which angle the operator cares about today.
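As a hedged sketch of what the Stage 1 extraction can look like, the snippet below shows the seed topic being injected into the extraction so keyword weighting follows the operator's intent. Field names, the model ID, and the prompt wording are illustrative assumptions.

```python
import json
import anthropic

def derive_brand_context(site_text: str, seed_topic: str) -> dict:
    """Extract brand name, category description, and seed-weighted keywords
    from crawled marketing-site copy. Field names are illustrative."""
    client = anthropic.Anthropic()
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumption: any capable Claude model
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": (
                "From the website copy below, return JSON with keys "
                "'brand_name', 'category_description', and 'keywords' "
                f"(exactly 8, weighted toward the topic '{seed_topic}').\n\n"
                f"{site_text[:20000]}\n\nRespond with JSON only."
            ),
        }],
    )
    return json.loads(resp.content[0].text)

# The same site yields different keyword sets depending on the seed topic:
# derive_brand_context(site_text, "AI visibility tracking")
# derive_brand_context(site_text, "content workflow automation")
```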
Stage 2: Multi-source retrieval.
- Reddit — question-shaped thread titles searched by keyword via the official OAuth API. Reddit captures the rawest user phrasing in any category, especially in technical and professional communities.
- Google People Also Ask — fetched through DataForSEO with people_also_ask_click_depth set to expand follow-up questions. PAA reflects what Google itself believes are the natural follow-on questions users ask.
- Quora — surfaced via DataForSEO site:quora.com SERPs, filtered to question-shaped titles. Quora carries a different demographic and tone than Reddit; together they give us coverage across both consumer and professional voice.
- Stage-balanced LLM fan-outs — Claude generates additional prompts following a strict three-stage distribution (Problem Unaware, Problem Aware, Solution Aware) with controlled qualifier density at each stage. This last source is where we explicitly draw on the Omniscient framework, and it's structured rather than improvisational.
The reason for four sources isn't redundancy. It's diversity of voice. A prompt set built only on PAA reflects how Google's algorithm summarizes intent. A set built only on Reddit reflects how technical users phrase frustration. A set built only on training-data fan-outs reflects only what the model already knows. Real LLM users span all four voices, and our pipeline samples all four.
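As one example of what a retrieval adapter can look like, here is a minimal Reddit sketch using the application-only OAuth flow. The endpoints are Reddit's public API; the credential handling and the question-shape heuristic are simplifying assumptions.

```python
import requests

def fetch_reddit_questions(keyword: str, client_id: str, client_secret: str,
                           user_agent: str, limit: int = 100) -> list[str]:
    """Search Reddit for question-shaped thread titles matching a keyword."""
    # Application-only OAuth: exchange client credentials for a bearer token.
    token = requests.post(
        "https://www.reddit.com/api/v1/access_token",
        auth=(client_id, client_secret),
        data={"grant_type": "client_credentials"},
        headers={"User-Agent": user_agent},
        timeout=30,
    ).json()["access_token"]

    resp = requests.get(
        "https://oauth.reddit.com/search",
        params={"q": keyword, "sort": "relevance", "limit": limit},
        headers={"Authorization": f"bearer {token}", "User-Agent": user_agent},
        timeout=30,
    )
    titles = [child["data"]["title"] for child in resp.json()["data"]["children"]]

    # Crude question-shape filter, for illustration only.
    starters = ("how", "what", "why", "which", "is ", "are ", "can ", "should ")
    return [t for t in titles if t.endswith("?") or t.lower().startswith(starters)]
```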
Stage 3: Filtering, embedding, and deduplication. A first-pass relevance filter scores every retrieved prompt 0.0–1.0 against the brand and seed topic. Anything below threshold is dropped. Surviving prompts are embedded with OpenAI's text-embedding-3-small model, and near-duplicates are removed greedily, with higher-relevance prompts winning ties. This typically cuts the pool by 30–50%, which is correct — Reddit, PAA, and Quora often surface paraphrases of the same underlying question.
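A minimal sketch of the embed-and-deduplicate step, assuming OpenAI's embeddings endpoint; the similarity threshold and the shape of the prompt records are assumptions, while the tie-break rule (higher relevance wins) follows the description above.

```python
import numpy as np
from openai import OpenAI

def dedupe(prompts: list[dict], threshold: float = 0.90) -> list[dict]:
    """Greedy near-duplicate removal; higher-relevance prompts win ties.

    Each prompt dict is assumed to carry 'text' and 'relevance' (0.0-1.0)."""
    client = OpenAI()
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=[p["text"] for p in prompts],
    )
    vecs = np.array([item.embedding for item in resp.data])
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit norm: dot = cosine

    # Visit prompts from highest to lowest relevance; keep one, drop its near-dupes.
    order = sorted(range(len(prompts)), key=lambda i: prompts[i]["relevance"], reverse=True)
    kept: list[int] = []
    for i in order:
        if all(float(vecs[i] @ vecs[j]) < threshold for j in kept):
            kept.append(i)
    return [{**prompts[i], "embedding": vecs[i]} for i in kept]
```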
Stage 4: Two-axis classification and ranking. Every surviving prompt is tagged on two axes: an awareness stage (Problem Unaware, Problem Aware, or Solution Aware) and one of nine intent buckets (informational, jobs-to-be-done, how-to, direct brand evaluation, category exploration, comparison, proof, implementation, pricing). Each prompt also gets a qualifier count — how many user-side constraints (persona, company size, budget, tool stack, compliance) it carries — and a quality score for specificity and intent clarity.
This two-axis tagging is non-negotiable in our view. A prompt like "what is GEO" and a prompt like "how do I track LLM visibility" are both informational under a flat taxonomy, but they live at completely different funnel positions. Knowing the difference is what lets a brand decide whether they have a top-of-funnel awareness gap or a bottom-of-funnel evaluation gap.
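A hedged sketch of the record shape this tagging implies. The enum values mirror the stages and nine intent buckets named above; the field names are illustrative.

```python
from dataclasses import dataclass
from enum import Enum

class Stage(str, Enum):
    PROBLEM_UNAWARE = "problem_unaware"
    PROBLEM_AWARE = "problem_aware"
    SOLUTION_AWARE = "solution_aware"

class Intent(str, Enum):
    INFORMATIONAL = "informational"
    JOBS_TO_BE_DONE = "jobs_to_be_done"
    HOW_TO = "how_to"
    BRAND_EVALUATION = "brand_evaluation"
    CATEGORY_EXPLORATION = "category_exploration"
    COMPARISON = "comparison"
    PROOF = "proof"
    IMPLEMENTATION = "implementation"
    PRICING = "pricing"

@dataclass
class TaggedPrompt:
    text: str
    stage: Stage
    intent: Intent
    qualifier_count: int  # persona, company size, budget, tool stack, compliance...
    quality: float        # specificity and intent clarity, 0.0-1.0
    source: str           # reddit | paa | quora | fanout

# Both "informational" under a flat taxonomy, very different funnel positions here:
a = TaggedPrompt("what is GEO", Stage.PROBLEM_UNAWARE, Intent.INFORMATIONAL, 0, 0.6, "paa")
b = TaggedPrompt("how do I track LLM visibility", Stage.PROBLEM_AWARE, Intent.HOW_TO, 0, 0.8, "reddit")
```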
Stage 5: Clustering and canonical selection. We run k-means with k-means++ initialization on the embedded prompts. Each cluster is named by Claude with one explicit constraint: the label cannot be the brand name and cannot be the seed topic, because a label that fits everything tells you nothing. The label has to identify what makes that cluster different from the others. For each cluster, we compute a coverage score — the average cosine similarity between a candidate prompt and every other member. The prompt with the highest coverage is the canonical: the one that most efficiently represents the entire cluster's intent. This is what you track. The rest are variations you'd see organically.
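A minimal sketch of Stage 5, assuming unit-normalized embeddings from the dedup step and scikit-learn's KMeans; how k is chosen is out of scope here.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_pick_canonicals(vecs: np.ndarray, prompts: list[str], k: int):
    """k-means (k-means++ init) on unit-normalized embeddings, then pick the
    highest-coverage prompt in each cluster as its canonical."""
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit_predict(vecs)
    canonicals = {}
    for c in range(k):
        idx = np.where(labels == c)[0]
        members = vecs[idx]
        # Coverage: a candidate's average cosine similarity to every other member.
        sims = members @ members.T
        coverage = (sims.sum(axis=1) - 1.0) / max(len(idx) - 1, 1)
        canonicals[c] = prompts[idx[int(np.argmax(coverage))]]
    return labels, canonicals
```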
What you get back
Cold start mode produces twenty-five prompts across five topics, each tagged with awareness stage and intent bucket. The columns are designed for direct import into a tracking system — topic, prompt, stage, intent, with a one-sentence rationale per topic.
Deep research mode produces two CSVs: one at the prompt level and one at the cluster level.
- The first is one row per prompt: every member of every cluster, tagged with stage, intent bucket, qualifier count, source, quality score, coverage score, and whether it's the canonical.
- The second is one row per cluster: cluster size, share of total volume, awareness stage breakdown, average qualifier count, and the canonical prompt. This is the executive view — at a glance you can see whether your category conversation is happening at top of funnel ("what is X") or solution stage ("X vs Y, pricing for X").
For most B2B SaaS brands we've tested this on, a deep research run produces 12–20 clusters from a starting pool of 200–400 raw prompts. Cold start completes in under a minute. Deep research completes in minutes, not days.
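For orientation, here is a minimal sketch of how the cluster-level summary relates to the prompt-level rows, using pandas. The file and column names are assumptions that mirror the fields described above, and the stage breakdown is omitted for brevity.

```python
import pandas as pd

# prompts.csv: the prompt-level output described above (assumed column names).
prompts = pd.read_csv("prompts.csv")

clusters = (
    prompts.groupby("cluster_label")
    .agg(
        size=("prompt", "count"),
        avg_qualifiers=("qualifier_count", "mean"),
        canonical=("prompt", lambda s: s[prompts.loc[s.index, "is_canonical"]].iloc[0]),
    )
    .assign(share_of_volume=lambda df: df["size"] / df["size"].sum())
    .sort_values("share_of_volume", ascending=False)
)
clusters.to_csv("clusters.csv")
```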
Why we think this matters
The next twelve months of AI visibility will reward operators who can move from prompt to insight quickly. The current alternative — either pay for proprietary datasets you can't audit, or spend a week assembling a manual prompt set — doesn't scale to the cadence at which markets are now shifting.
Our position at Indexly is straightforward: prompt research should be a one-command operation, the methodology should be transparent, the source mix should be diverse enough to survive scrutiny, the language model behind it should have access to current real-world conversation rather than a training cutoff, and the output should be structured enough to drive content and product decisions, not just dashboards.
The future of AI visibility tracking belongs to whoever compresses the loop between "what is the market asking right now" and "what should we do about it" most tightly. We think we're building it.
Indexly is an AI visibility platform helping brands track and improve their presence across ChatGPT, Claude, Gemini, Perplexity, and other answer engines. Prompt research is shipping next.
Frequently Asked Questions
What is prompt research in AI visibility tracking?
Prompt research is the process of identifying which prompts to track when measuring how a brand appears in answer engines like ChatGPT, Claude, Gemini, and Perplexity. It sits upstream of prompt tracking. Tracking measures performance against a defined set of prompts; research determines which prompts belong in that set. Without rigorous research, every downstream metric — share of voice, citation rate, sentiment — measures the wrong thing.
How is prompt research different from prompt tracking?
Prompt tracking monitors a defined set of prompts across LLMs and reports brand mentions, share of voice, and sentiment over time. Prompt research is the upstream step that answers a different question: which prompts should be tracked in the first place? Tracking is operational and recurring. Research is foundational and usually runs at onboarding or when a brand pivots its category positioning.
What is cold start mode and when should I use it?
Cold start mode is Indexly's onboarding pathway. It takes a single input — the brand's domain — and generates a starter prompt set of five topics with five prompts each, distributed across the buyer journey, in under a minute. Use it when you're setting up AI visibility tracking for a brand and don't yet know which specific topic to investigate. The output is ready to import directly as the brand's initial tracking configuration. Most brands run cold start first, then run deep research on whichever topic from the output looks most strategic.
Why does the prompt set matter more than the tracking platform?
Every AI visibility metric is downstream of the prompt list. If the prompts being measured don't reflect what real users ask, the resulting numbers describe an imagined market rather than the actual one. Two platforms running on the same accurate prompt set will produce similar insights. Two platforms running on different prompt sets will disagree even if both are technically correct. The prompt set is the foundation that determines whether the entire measurement is valid.
What sources does Indexly use for prompt research?
In deep research mode, Indexly retrieves prompts from four independent sources in parallel — each capturing a different user voice.
Reddit contributes question-shaped thread titles via OAuth, capturing raw phrasing from technical and professional communities.
Google People Also Ask, fetched through DataForSEO with click-depth expansion, reflects how Google's algorithm summarizes natural follow-on questions. Quora, surfaced via site-restricted SERP queries, contributes a more professional and career-oriented voice.
The fourth source is web-grounded LLM fan-outs: Claude generates stage-balanced prompts with live web search enabled, issuing real searches against the open web to anchor synthetic prompts in current real-world conversation rather than training-data inference.
The four sources are deliberately diverse — a Reddit-only set over-indexes on technical frustration, a PAA-only set over-indexes on algorithmic intent, a training-data-only fan-out misses recent shifts. Together they sample the full voice of how real users ask questions.
How does Indexly use Claude's web search?
Claude's web search tool is enabled at the most strategically loaded generation steps in both modes. In cold start mode, it grounds topic discovery in current discussion of the brand and category. In deep research mode, it grounds the seed-topic fan-out generation in current forum threads, news, and competitor framings. This matters because language models without web access generate prompts based entirely on training data, which has a cutoff. Web search lets the model observe what users are actually asking right now rather than what they were asking when the model was trained.
What is awareness stage classification and why does Indexly use it?
Awareness stage classification tags each prompt as Problem Unaware (learning the topic), Problem Aware (exploring how to solve it), or Solution Aware (evaluating specific vendors). The framework comes from Eugene Schwartz's five-stage awareness model adapted for LLM queries by Omniscient Digital. Indexly applies it because flat intent labels like "informational" or "transactional" hide critical funnel position. Two informational prompts can sit at completely different points in a buyer's journey and require very different content responses.
How long does an Indexly prompt research run take?
Cold start mode completes in under a minute — site crawl plus a single web-grounded generation call. Deep research mode completes in minutes rather than days, even with full multi-source retrieval, embedding, clustering, and ranking. The pipeline is fully automated: brand context derivation, four-source retrieval, filtering, classification, clustering, and canonical selection all run from a single command. Most B2B SaaS brands produce 12 to 20 clusters from a starting pool of 200 to 400 raw prompts in deep research mode.
What does Indexly's prompt research output look like?
Cold start produces one CSV with twenty-five starter prompts across five topics, tagged with awareness stage and intent bucket, ready for direct import into a tracking system. Deep research produces two CSVs. The first contains one row per prompt with awareness stage, intent bucket, qualifier count, source, quality score, coverage score, and a flag indicating which prompt is canonical for its cluster. The second is a cluster-level summary showing share of volume, stage distribution, average qualifier density, and the canonical prompt for each cluster. The canonical prompt is the most representative prompt in a cluster — measured by average cosine similarity to all other cluster members — and is the one a brand would track.
How does Indexly compare to Profound and Omniscient Digital for prompt research?
Profound builds prompt research on a proprietary dataset of real user prompts captured directly from answer engines, which is powerful but locked to their data pipeline and pricing tier. Omniscient Digital documented a rigorous manual methodology grounded in awareness-stage modeling, but the workflow requires a human analyst to assemble prompt sets across multiple sources over several days. Indexly compresses both approaches into a single automated pipeline that any brand can run on its own domain, with a transparent methodology, a diverse public source mix, and live web search grounding so the language model behind the synthesis observes current real-world conversation rather than relying on training data alone.