Every safety rating and review on CompanionWise starts with evidence. Not opinions, not hunches, not a quick scroll through an app’s marketing page. We read the privacy policies, comb through terms of service, search for regulatory actions, and analyze hundreds of app store reviews before we assign a single score. These standards feed directly into our guide on evaluating AI companion safety, where we turn evidence into a practical checklist for consumers. This page explains exactly how we do it.
TL;DR: We classify sources into four tiers. Tier 1 (official documents and regulatory filings) and Tier 2 (major news outlets and academic research) are the foundation of every rating. Tier 3 (user reviews from platforms like Google Play, as in our Talkie AI review, which analyzed 150 user review patterns) adds context but can’t support strong claims on its own. Tier 4 (single unverified reports) is never published. Every source URL is verified, date-stamped, and re-checked on a set schedule.
Why Evidence Standards Matter
AI companion apps change their policies more often than most people realize. A privacy policy that looked reasonable in January might add broad data-sharing clauses by March. Terms of service get rewritten. Safety features appear and disappear. If we’re going to publish scores that parents, adults, and journalists rely on, those scores need to reflect what’s actually happening right now.
We’ve seen other review sites publish claims based on outdated screenshots or a single Reddit post. That’s not good enough. When someone reads our safety rating for an app and decides whether to let their teenager use it, the evidence behind that score needs to be verifiable, current, and honest about its limitations.
These standards exist so you can check our work. Every rating links back to the sources we used. If we got something wrong, our corrections process is open to anyone, including the app developers themselves.
How We Gather Evidence
For every AI companion app we rate, we collect evidence from multiple independent sources. No single document tells the full story, so we pull from at least five categories before we start scoring.
- Privacy policies: The full text, not a summary. We record what data the app collects, who it shares data with, how long it keeps your information, and whether it uses your conversations to train AI models.
- Terms of service: We look at who owns the content you create in conversations, under what conditions your account can be terminated, whether disputes go to arbitration, and what age restrictions exist.
- App store listings: Both Google Play and Apple App Store. We check the privacy nutrition labels, age ratings, recent update notes, and the reviews themselves.
- Official safety pages: Some apps publish dedicated trust or safety pages. When they exist, we document their crisis response mechanisms, content moderation descriptions, and parental controls.
- Regulatory filings and news: We run mandatory searches for fines, FTC complaints, GDPR violations, COPPA issues, government bans, and safety incidents. This applies to every app, even ones that look clean from their own documentation.
Every source URL goes through a verification check. We confirm each link returns an HTTP 200 status code. Dead links get flagged and we find alternative sources. We also record the date each policy was last updated by the company and the date we last accessed it. If a privacy policy hasn’t been updated since 2023, that tells us something too.
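The link-verification step above can be sketched in a few lines. The function name, record shape, and user-agent string here are illustrative assumptions, not CompanionWise's actual tooling:

```python
from datetime import date
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def verify_source(url: str, timeout: float = 10.0) -> dict:
    """Check that a source URL is still live and record the access date.

    Illustrative sketch of the verification step: a link passes only if
    it ultimately resolves to HTTP 200; anything else is flagged.
    """
    record = {"url": url, "accessed": date.today().isoformat(),
              "status": None, "live": False}
    try:
        req = Request(url, method="HEAD",
                      headers={"User-Agent": "evidence-check"})  # hypothetical UA
        with urlopen(req, timeout=timeout) as resp:
            record["status"] = resp.status
            record["live"] = resp.status == 200
    except HTTPError as e:
        record["status"] = e.code   # dead or restricted link: flag for replacement
    except URLError:
        record["status"] = None     # DNS/network failure: flag for retry
    return record
```

A real pipeline would also retry with GET (some servers reject HEAD) and store the company's own "last updated" date alongside the access date.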
Source Quality Tiers
Not all evidence is created equal. A direct quote from an app’s privacy policy carries more weight than a Reddit thread. A documented FTC fine matters more than a YouTube commentary video. We use a four-tier system to classify every piece of evidence.
Tier 1: Primary Sources
These are the foundation of every safety rating. Tier 1 sources include the app’s own official documents (privacy policy, terms of service, safety pages), regulatory actions from government agencies, and official press releases from the company. Once future methodology versions add direct behavioral testing, screenshots and documented test results will also qualify as Tier 1 evidence. If a claim can be supported by Tier 1 evidence, that’s what we use first.
Tier 2: Major Reporting and Research
Reporting from established outlets like the New York Times, Wired, the BBC, and the Washington Post falls into Tier 2. So does peer-reviewed academic research, government publications, and reports from organizations like the Center for Humane Technology. We cite the source directly and link to it when we use Tier 2 evidence. These sources can support strong safety claims when combined with Tier 1 evidence or when Tier 1 sources are unavailable for a specific issue.
Tier 3: User Review Patterns
App store reviews and community discussions on Reddit can reveal real problems. But a single angry review isn’t evidence of a systemic issue. We set minimum thresholds: at least 10 app store reviews documenting the same problem, or at least 5 independent Reddit posts describing the same experience. When Tier 3 evidence meets those thresholds, we publish it using “pattern of reports” language. We never present user patterns as definitive fact.
For app store review analysis specifically, we pull 200 or more reviews per store within a 12-month window. We categorize reviews by topic (privacy concerns, billing issues, content problems, technical bugs) and track the frequency of each category. This tells us whether a complaint is a one-off or something dozens of people are experiencing.
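The threshold rule above (at least 10 app store reviews or 5 independent Reddit posts describing the same problem) reduces to a simple check. The data model, a list of (source, topic) pairs, is a hypothetical shape chosen for illustration:

```python
from collections import Counter

# Thresholds stated in the standards above.
THRESHOLDS = {"app_store": 10, "reddit": 5}

def pattern_status(reports: list[tuple[str, str]]) -> dict[str, str]:
    """Label each complaint topic by whether Tier 3 evidence meets threshold.

    `reports` is a list of (source, topic) pairs,
    e.g. ("app_store", "billing").  Hypothetical data model.
    """
    counts = Counter(reports)
    status: dict[str, str] = {}
    for (source, topic), n in counts.items():
        label = "pattern of reports" if n >= THRESHOLDS[source] else "not publishable"
        # A topic keeps the stronger label if any source crosses its threshold.
        if status.get(topic) != "pattern of reports":
            status[topic] = label
    return status
```

Topics labeled "pattern of reports" are published with exactly that hedged language; everything below threshold stays internal.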
Tier 4: Not Publishable
Single social media posts, lone Reddit complaints, anonymous claims, and unverified screenshots never make it into our published content on their own. They might prompt us to investigate further, but they don’t count as publishable evidence regardless of how alarming they sound. This protects both the apps we review and the people reading our reviews.
What We Analyze in Each Source
Collecting documents is only the starting point. Here’s what we look for in each category of evidence.
Privacy Policy Deep Dive
We read the full privacy policy and document specific answers to these questions: What personal data does the app collect? Does it share data with third parties, and if so, which ones? How long does it retain your data after you stop using the app? Can you request deletion, and how? Does the app use your conversations to train its AI models? What’s its stated compliance with GDPR and CCPA? Vague language gets noted as a concern. If a policy says it “may” share data with “partners” without naming them, we flag that ambiguity in our rating.
Terms of Service Review
Terms of service documents can run long. Some exceed 30,000 words. We focus on the clauses that directly affect users: Who owns the intellectual property in conversations you have with the AI? Under what conditions can the company terminate your account without notice? Is there a mandatory arbitration clause that limits your legal options? What age requirement does the app enforce, and how strictly? What content restrictions exist? When a ToS is too large for complete single-pass analysis, we document what we confirmed and what remains unverified.
App Store Review Analysis
We use a structured approach to app store reviews. For each app, we pull reviews from both Google Play and the Apple App Store using automated collection tools. Our target is 200 or more reviews per store within a 12-month window, giving us 400 or more data points per app. We sort reviews by topic, identify recurring complaints, and calculate what percentage of reviews mention each issue. A privacy concern that shows up in 3 out of 200 reviews is different from one that shows up in 40. Both get documented, but they carry different weight in our analysis.
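The frequency calculation described above is straightforward once reviews have been categorized. This sketch assumes each review already carries a list of assigned topics; the categorization step itself (keyword matching or manual coding) is out of scope here:

```python
def topic_frequencies(reviews: list[dict]) -> dict[str, float]:
    """Percentage of reviews mentioning each topic.

    Each review is a dict with a pre-assigned "topics" list
    (hypothetical shape for illustration).
    """
    total = len(reviews)
    counts: dict[str, int] = {}
    for review in reviews:
        # set() so a review mentioning a topic twice counts once
        for topic in set(review["topics"]):
            counts[topic] = counts.get(topic, 0) + 1
    return {t: round(100 * n / total, 1) for t, n in counts.items()}
```

On a 200-review pull, a topic flagged in 40 reviews surfaces as 20.0%, versus 1.5% for a topic flagged in 3, which is the distinction the analysis weights.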
Regulatory and Incident Search
This is where some of the most consequential findings surface. For every app we rate, we run at least two mandatory searches: one for fines, penalties, investigations, and bans, and another specifically for GDPR, FTC, and COPPA actions. These searches have uncovered multi-million-euro fines, formal government complaints, and legislative inquiries that don’t appear anywhere in an app’s own documentation. An app can have a polished safety page and still have a history of regulatory violations. We check both.
How We Handle Incomplete Evidence
Not every app publishes a safety page. Not every privacy policy explains data retention clearly. Some companies operate from jurisdictions where regulatory filings aren’t publicly accessible. We don’t pretend these gaps don’t exist.
When evidence is unavailable, we document it as “Investigated, not found” along with the specific searches we attempted. This matters because missing information is not the same as positive information. If an app doesn’t disclose how long it retains your conversation data, it doesn’t get credit for having a good retention policy. The absence of evidence in areas where we’d expect transparency is itself a data point that affects the score.
Partial evidence gets its own label. If a terms of service document was too large for complete analysis, we note exactly which sections we verified and which remain unconfirmed. Our readers deserve to know the boundaries of what we’ve checked.
Refresh Cadence: How Often We Re-Gather Evidence
Evidence goes stale. A privacy policy reviewed three months ago might have changed yesterday. We maintain a tiered refresh schedule based on each app’s traffic and profile.
- Tier 1 apps (the most popular, including Replika, Character.AI, Nomi, Kindroid, and Candy AI): Full evidence re-check every month.
- Tier 2 apps (apps ranked 6 through 15 by traffic): Quarterly evidence refresh.
- Tier 3 apps (long-tail and lower-traffic apps): Semi-annual review.
- Breaking updates for any app: Within 48 hours of a pricing change, safety incident, regulatory action, terms or privacy policy change, or app shutdown or acquisition. For an example of how documented security incidents shape our scoring, see our AI Dungeon safety rating.
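As a sketch, the cadence above reduces to a lookup table plus date arithmetic. The numeric day counts are approximations of "monthly", "quarterly", and "semi-annual", and the function names are illustrative, not internal tooling:

```python
from datetime import date, timedelta

# Approximate re-check intervals for the three app tiers above, in days.
CADENCE_DAYS = {1: 30, 2: 90, 3: 182}

def next_review_due(tier: int, last_reviewed: date) -> date:
    """Next scheduled evidence refresh for an app in the given tier."""
    return last_reviewed + timedelta(days=CADENCE_DAYS[tier])

def is_overdue(tier: int, last_reviewed: date, today: date) -> bool:
    """True if the app has missed its scheduled refresh window."""
    return today > next_review_due(tier, last_reviewed)
```

Breaking updates bypass this schedule entirely: a qualifying event triggers a re-check within 48 hours regardless of tier.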
Every review and safety rating page on CompanionWise shows a “Last Reviewed” date above the fold. When a score changes, we display the previous score, the new score, and the reason for the change. You shouldn’t have to guess whether our information is current.
The Hard Rule for Strong Claims
Some claims carry more weight than others. When we say an app “manipulates” users, is “dangerous,” is “unsafe for minors,” or is “exploitative,” we’re making a serious assertion that could affect a company’s reputation and a user’s trust. Those claims require Tier 1 or Tier 2 evidence. Period.
Tier 3 evidence (user review patterns) alone cannot support a strong claim. It can add context to a Tier 1 finding. It can strengthen a pattern already documented by a Tier 2 source. But it can’t stand on its own for language that carries serious implications.
Our full source hierarchy, from strongest to weakest: direct policy text, official safety documentation, Tier 1 regulatory actions, Tier 2 major news reporting, academic research, Tier 3 app store patterns, Tier 3 Reddit patterns, and YouTube commentary. When sources conflict, we go with the higher-tier source and note the discrepancy.
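The conflict rule above ("go with the higher-tier source") amounts to ranking sources by the hierarchy and taking the strongest. The labels and record shape here are illustrative assumptions:

```python
# The hierarchy above, strongest first (lower index = stronger).
HIERARCHY = [
    "direct policy text", "official safety documentation",
    "regulatory action", "major news reporting", "academic research",
    "app store pattern", "reddit pattern", "youtube commentary",
]
RANK = {name: i for i, name in enumerate(HIERARCHY)}

def resolve_conflict(sources: list[dict]) -> dict:
    """Pick the highest-ranked source when sources disagree.

    Each source dict needs a "kind" key matching HIERARCHY
    (hypothetical shape).  The discrepancy itself is still noted
    in the published rating; this only decides which claim wins.
    """
    return min(sources, key=lambda s: RANK[s["kind"]])
```

So a documented regulatory action outranks a Reddit pattern, and the rating reflects the regulator's finding while the review notes that user reports conflict.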
How This Connects to Our Ratings
Evidence standards are the input layer. They determine what data feeds into our scoring process, but they don’t determine the scores themselves. For how we translate evidence into the six safety dimensions and an overall rating, see our safety rating methodology. For how evidence and scores shape the full written review of each app, see how we review.
If you believe we’ve gotten a fact wrong or used outdated evidence, our corrections page explains how to submit a factual dispute. For a comparison of apps based on this evidence, see our Character AI alternatives. We take corrections seriously. Our credibility depends on it.