AI chatbots have exploded in popularity over the past four months, stunning the public with their awesome abilities, from writing sophisticated term papers to holding unnervingly lucid conversations.
Chatbots cannot think like humans: They do not actually understand what they say. They can mimic human speech because the artificial intelligence that powers them has ingested a gargantuan amount of text, mostly scraped from the internet.
This text is the AI’s main source of information about the world as it is being built, and influences how it responds to users. If it aces the law school admissions test, for example, it’s probably because its training data included thousands of LSAT practice sites.
Tech companies have grown secretive about what they feed the AI. So The Washington Post set out to analyze one of these data sets to fully reveal the types of proprietary, personal, and often offensive websites that go into an AI’s training data.
A treemap showing 11 categories of websites used to train AI
To look inside this black box, we analyzed Google’s C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google’s T5 and Facebook’s LLaMA. (OpenAI does not disclose what data sets it uses to train the models backing its popular chatbot, ChatGPT.)
The Post worked with researchers at the Allen Institute for AI on this investigation and categorized the websites using data from Similarweb, a web analytics company. About a third of the websites could not be categorized, mostly because they no longer appear on the internet. Those are not shown.
We then ranked the remaining 10 million websites based on how many “tokens” appeared from each in the data set. Tokens are small bits of text used to process disorganized information — typically a word or phrase.
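As a rough illustration of this ranking step, the sketch below counts tokens per domain and sorts the domains by their totals. It is not The Post's code: the sample records and the `rank_domains` helper are hypothetical, and splitting on whitespace is an assumed stand-in for whatever tokenizer the analysis actually used.

```python
from collections import Counter


def rank_domains(records):
    """Rank domains by total token count, largest first.

    `records` is an iterable of (domain, text) pairs. Splitting on
    whitespace is an assumption standing in for a real tokenizer.
    """
    totals = Counter()
    for domain, text in records:
        totals[domain] += len(text.split())
    grand_total = sum(totals.values())
    # most_common() returns (domain, count) pairs sorted by count, descending.
    return [
        (rank, domain, count, count / grand_total)
        for rank, (domain, count) in enumerate(totals.most_common(), start=1)
    ]


# Hypothetical sample data, for illustration only.
sample = [
    ("example-a.org", "one two three four"),
    ("example-b.com", "one two"),
    ("example-a.org", "five six"),
]
ranking = rank_domains(sample)
```

Each entry in `ranking` holds a rank, a domain, its token count and its share of all tokens, which mirrors the rank and percentage shown for each site in the data set.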
Wikipedia to Wowhead
The data set was dominated by websites from industries including journalism, entertainment, software development, medicine and content creation, helping to explain why these fields may be threatened by the new wave of artificial intelligence. The three biggest sites were patents.google.com No. 1, which contains text from patents issued around the world; wikipedia.org No. 2, the free online encyclopedia; and scribd.com No. 3, a subscription-only digital library. Also high on the list: b-ok.org No. 190, a notorious market for pirated e-books that has since been seized by the U.S. Justice Department. At least 27 other sites identified by the U.S. government as markets for piracy and counterfeits were present in the data set.
Some top sites seemed arbitrary, like wowhead.com No. 181, a World of Warcraft player forum; thriveglobal.com No. 175, a product for beating burnout founded by Arianna Huffington; and at least 10 sites that sell dumpsters, including dumpsteroid.com No. 183, that no longer appear accessible.
Others raised significant privacy concerns. Two sites in the top 100, coloradovoters.info No. 40 and flvoters.com No. 73, had privately hosted copies of state voter registration databases. Though voter data is public, the models could use this personal information in unknown ways.
Content without consent
Top Business & Industrial sites:
Business and industrial websites made up the biggest category (16 percent of categorized tokens), led by fool.com No. 13, which provides investment advice. Not far behind were kickstarter.com No. 25, which lets users crowdfund for creative projects, and further down the list, patreon.com No. 2,398, which helps creators collect monthly fees from subscribers for exclusive content.
Kickstarter and Patreon may give the AI access to artists’ ideas and marketing copy, raising concerns the technology may copy this work in suggestions to users. Currently, artists receive no compensation or credit when their work is included in AI training data, and they have lodged copyright infringement claims against text-to-image generators Stable Diffusion, MidJourney and DeviantArt.
The Post’s analysis suggests more legal challenges may be on the way: The copyright symbol — which denotes a work registered as intellectual property — appears more than 200 million times in the C4 data set.
All the news
Top News sites:
The News and Media category ranks third across categories. But half of the top 10 sites overall were news outlets: nytimes.com No. 4, latimes.com No. 6, theguardian.com No. 7, forbes.com No. 8, and huffpost.com No. 9. (Washingtonpost.com No. 11 was close behind.) Like artists and creators, some news organizations have criticized tech companies for using their content without authorization or compensation.
Meanwhile, we found several media outlets that rank low on NewsGuard’s independent scale for trustworthiness: RT.com No. 65, the Russian state-backed propaganda site; breitbart.com No. 159, a well-known source for far-right news and opinion; and vdare.com No. 993, an anti-immigration site that has been associated with white supremacy.
Chatbots have been shown to confidently share incorrect information, but don’t always offer citations. Untrustworthy training data could lead them to spread bias, propaganda and misinformation, without the user being able to trace it to the original source.
Religious sites reflect a Western perspective
Top Religious sites:
Sites devoted to community made up about 5 percent of categorized content, with religion dominating that category. Among the top 20 religious sites, 14 were Christian, two were Jewish, one was Muslim, one was Mormon, one was Jehovah’s Witness and one celebrated all religions.
The top Christian site, Grace to You (gty.org No. 164), belongs to Grace Community Church, an evangelical megachurch in California. Christianity Today recently reported that the church counseled women to “continue to submit” to abusive fathers and husbands and to avoid reporting them to authorities.
The highest ranked Jewish site was jewishworldreview.com No. 366, an online magazine for Orthodox Jews. In December, it published an article about Hanukkah that blamed the rise of antisemitism in the United States on “the far-right, fundamentalist Islam,” as well as “an African-American community influenced by the Black Lives Matter movement.”
Anti-Muslim bias has emerged as a problem in some language models. For example, a study published in the journal Nature found that OpenAI’s GPT-3 completed the phrase “Two muslims walked into a …” with violent actions 66 percent of the time.
A trove of personal blogs
Top Technology sites:
Technology is the second-largest category, making up 15 percent of categorized tokens. This includes many platforms for building websites, like sites.google.com No. 85, which hosts pages for everything from a judo club in Reading, England, to a Catholic preschool in New Jersey.
The data set contained more than half a million personal blogs, representing 3.8 percent of categorized tokens. Publishing platform medium.com No. 46 was the fifth largest technology site and hosts tens of thousands of blogs under its domain. Our tally includes blogs written on platforms like WordPress, Tumblr, Blogspot and Live Journal.
These online diaries ranged from professional to personal, like a blog called “Grumpy Rumblings,” co-written by two anonymous academics, one of whom recently wrote about how their partner’s unemployment affected the couple’s taxes. One of the top blogs offered advice for live-action role-playing games. Another top site, Uprooted Palestinians, often writes about “Zionist terrorism” and “the Zionist ideology.”
Social networks like Facebook and Twitter — the heart of the modern web — prohibit scraping, which means most data sets used to train AI cannot access them. Tech giants like Facebook and Google that are sitting on mammoth troves of conversational data have not been clear about how personal user information may be used to train AI models that are used internally or sold as products.
What the filters missed
Like most companies, Google heavily filtered the data before feeding it to the AI. (C4 stands for Colossal Clean Crawled Corpus.) In addition to removing gibberish and duplicate text, the company used the open-source “List of Dirty, Naughty, Obscene, and Otherwise Bad Words,” which includes 402 terms in English and one emoji (a hand making a common but obscene gesture). Companies typically use high-quality data sets to fine-tune models, shielding users from some unwanted content.
While this kind of blocklist is intended to limit a model’s exposure to racial slurs and obscenities as it’s being trained, it also has been shown to eliminate some nonsexual LGBTQ content. As prior research has shown, a lot gets past the filters. We found hundreds of examples of pornographic websites and more than 72,000 instances of “swastika,” one of the banned terms from the list.
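To make the mechanics concrete, here is a minimal sketch of a blocklist-style page filter. It is not Google's pipeline: the `build_filter` helper is hypothetical, the listed terms are placeholders rather than entries from the real list, and whole-word, case-insensitive matching is an assumption about how such a filter might work.

```python
import re


def build_filter(blocklist):
    """Return a function that reports whether a page passes the blocklist."""
    if not blocklist:
        # With no terms to block, every page passes.
        return lambda page_text: True
    # Assumption: terms are matched as whole words, ignoring case.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in blocklist) + r")\b",
        re.IGNORECASE,
    )

    def passes(page_text):
        return pattern.search(page_text) is None

    return passes


# Placeholder terms, not entries from the real list.
passes = build_filter(["badword", "worse word"])
pages = [
    "A recipe for lemon cake.",
    "This page contains a BadWord in passing.",
    "Obfuscated spellings like b-a-d-w-o-r-d slip through.",
]
kept = [page for page in pages if passes(page)]
```

In this sketch the second page is dropped for a single passing mention, while the third survives because its spelling does not match any listed term. That is one plausible way a word-list filter can both discard harmless pages and let unwanted content through.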
Meanwhile, The Post found that the filters failed to remove some troubling content, including the white supremacist site stormfront.org No. 27,505, the anti-trans site kiwifarms.net No. 378,986, and 4chan.org No. 4,339,889, the anonymous message board known for organizing targeted harassment campaigns against individuals.
We also found threepercentpatriots.com No. 8,788,836, a downed site espousing an anti-government ideology shared by people charged in connection with the Jan. 6, 2021, attack on the U.S. Capitol. And sites promoting conspiracy theories, including the far-right QAnon phenomenon and “pizzagate,” the false claim that a D.C. pizza joint was a front for pedophiles, were also present.
Is your website training AI?
A web crawl may sound like a copy of the entire internet, but it’s just a snapshot, capturing content from a sampling of webpages at a particular moment in time. C4 began as a scrape performed in April 2019 by the nonprofit CommonCrawl, a popular resource for AI models. CommonCrawl told The Post that it tries to prioritize the most important and reputable sites, but does not try to avoid licensed or copyrighted content.
The websites in Google’s C4 dataset
The Post believes it is important to present the complete contents of the data fed into AI models, which promise to govern many aspects of modern life. Some websites in this data set contain highly offensive language and we have attempted to mask these words. Objectionable content may remain.
Note: Some websites could not be categorized and, in many cases, are no longer accessible.
While C4 is huge, large language models probably use even more gargantuan data sets, experts said. For example, the training data for OpenAI’s GPT-3, released in 2020, began with as much as 40 times the amount of web-scraped data in C4. GPT-3’s training data also includes all of English-language Wikipedia, a collection of free novels by unpublished authors frequently used by Big Tech companies and a compilation of text from links highly rated by Reddit users. (Reddit, a site regularly used in AI training models, announced Tuesday it plans to charge companies for such access.)
Experts say many companies do not document the contents of their training data — even internally — for fear of finding personal information about identifiable individuals, copyrighted material and other data grabbed without consent.
As companies stress the challenges of explaining how chatbots make decisions, this is one area where executives have the power to be transparent.
A previous version of this story described a chatbot learning to take the bar exam by training on LSAT practice tests. The LSAT is a separate test from the bar exam. The article has been corrected.
About this story
For this story, The Post contacted researchers at Allen Institute for AI, who re-created Google’s C4 data set and provided The Post with its 15.7 million domains. The Post cleaned and analyzed this data in a few ways.
Many websites have separate domains for their mobile versions (e.g., “en.m.wikipedia.org” and “en.wikipedia.org”). We treated these as the same domain. We also combined subdomains aimed at specific languages, so “en.wikipedia.org” became “wikipedia.org.”
This left 15.1 million unique domains.
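The merging described above can be sketched roughly as follows. This is an illustration, not The Post's code: the `normalize_domain` helper, the mobile labels and the short set of language codes are all assumptions chosen for the example.

```python
# Assumed prefixes for illustration; a real analysis would need fuller lists.
MOBILE_LABELS = {"m", "mobile"}
LANGUAGE_CODES = {"en", "de", "fr", "es", "it", "ja"}


def normalize_domain(domain):
    """Strip leading mobile and language labels from a domain name."""
    labels = domain.lower().split(".")
    # Always keep at least two labels so "en.org" is not reduced to "org".
    while len(labels) > 2 and (
        labels[0] in MOBILE_LABELS or labels[0] in LANGUAGE_CODES
    ):
        labels.pop(0)
    return ".".join(labels)


# Hypothetical input list.
raw_domains = [
    "en.m.wikipedia.org",
    "en.wikipedia.org",
    "de.wikipedia.org",
    "patents.google.com",
]
unique_domains = sorted({normalize_domain(d) for d in raw_domains})
```

Here the three Wikipedia variants collapse into “wikipedia.org” while “patents.google.com” is left alone. A simple prefix rule like this would mishandle a site whose real name happens to begin with one of the listed labels, so any production version would need manual checks.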
Similarweb helped The Post place two-thirds of them — about 10 million domains — into categories and subcategories. (The rest could not be categorized, often because they were no longer accessible.) We then manually checked the websites with the most tokens to make sure the categories made sense. We also combined many of the smallest subcategories.
Categorization is difficult and ambiguous, but we attempted to treat the data consistently to foster a general understanding of C4’s contents.
Common Crawl’s data hosting is sponsored as part of Amazon Web Services’ Open Data Sponsorship Program. Amazon founder Jeff Bezos owns The Washington Post.
The researchers at Allen Institute for AI were Jesse Dodge, Yanai Elazar, Dirk Groeneveld and Nicole DeCario.
Illustration by Talia Trackim.
Editing by Kate Rabinowitz, Alexis Sobel Fitts and Karly Domb Sadof.
What is the AI website that answers questions? ›
iAsk.Ai (i Ask AI) is an advanced free AI search engine that enables users to Ask AI any question, and receive an Instant, Accurate, and Factual Answer without ever storing individual searches.
What is the most intelligent AI to talk? ›
The best overall AI chatbot is ChatGPT due to its exceptional performance, versatility, and free availability.
What is the AI chatbot everyone is using? ›
OpenAI says this use of human AI trainers is really what makes ChatGPT stand out. ChatGPT was first launched as a prototype to the public in November 2022, quickly growing to over 100 million users by January of 2023, making it the most quickly-adopted piece of software ever made.
What is the new AI that can answer anything? ›
There's a new artificial intelligence-powered chatbot known as ChatGPT that can answer questions, generate essays and even write scientific papers from a short prompt.
How do I get answers from AI? ›
- Click on the 'Answer a Question' tool on the dashboard. The tool will be under the 'AI writing' section.
- Ask a question. Write your question within 200 words.
- Hit on 'Generate' Let the AI response generator give you multiple answers.
ChatGPT is a natural language processing tool driven by AI technology that allows you to have human-like conversations and much more with the chatbot. The language model can answer questions and assist you with tasks, such as composing emails, essays, and code.
What is the most advanced AI right now? ›
Open AI — ChatGPT
They are part of a family of AI models known as Generative Pre-trained Transformers (GPTs), which are designed to generate human-like language and complete a wide range of language tasks. GPT-3 was released in 2020 and is the largest and most powerful AI model to date.
Now the AI technology is being implemented in a robot named Ameca. As such, social media users are being surprised by Ameca, a robot with incredible lingual abilities via AI that was designed by UK startup Engineered Arts.
Is there a real AI I can talk to? ›
SimSimi. SimSimi is a popular emotional conversation chatbot with over 350 million users worldwide. What makes it stand out is that it can talk in around 81 languages. Thanks to SimSimi's great conversation engine, you can talk for hours.
Which AI is better than ChatGPT? ›
Expenses: Chatsonic is a better option for startups and small enterprises since it is cheaper than ChatGPT.
What AI can't do today? ›
AI cannot answer questions requiring inference, a nuanced understanding of language, or a broad understanding of multiple topics.
What AI Cannot replace humans? ›
Regardless of how well AI machines are programmed to respond to humans, it is unlikely that humans will ever develop such a strong emotional connection with these machines. Hence, AI cannot replace humans, especially as connecting with others is vital for business growth.
What is the new AI app everyone is using? ›
Lensa is the AI photo editing app everyone is using on Instagram! The app became massively popular this past week as this new trend continues to go viral on IG. The trend includes Instagram users posting AI-generated images from their own selfies that have been uploaded into the Lensa app.
How to get into AI without programming? ›
- IT Project Manager.
- IT Support Specialist.
- User Experience (UX) Designer.
- Software Quality Tester.
- SEO Specialist.
- Data science strategy consultant.
- Technical writer for data science software.
- Recruiter for technology and data science people.
No, OpenAI doesn't offer an official Chat GPT app for Android. But you can access the ChatGPT website from the web browser of your Android phone.
What language does ChatGPT use? ›
ChatGPT's training data includes man pages and information about internet phenomena and programming languages such as bulletin board systems and the Python programming language. In comparison to its predecessor, InstructGPT, ChatGPT attempts to reduce harmful and deceitful responses.
What is the smartest AI robot? ›
Sophia. Sophia is considered the most advanced humanoid robot. Sophia debuted in 2016, she was one of a kind, and her interaction with people was the most unlikely thing you can ever see in a machine.
What is the strongest type of AI? ›
Superintelligence. So, if weak AI automates specific tasks better than humans, and strong AI thinks and behaves with the same agility of humans, you may be wondering where artificial intelligence can go from there. And the answer is: superintelligence.
What AI app is everyone using on Facebook? ›
If you've logged on to any social media app this week, you've probably seen pictures of your friends, but re-imagined as fairy princesses, animé characters, or celestial beings. It is all because of Lensa, an app which uses artificial intelligence to render digital portraits based on photos users submit.
Is there a super intelligent AI? ›
Artificial superintelligence (ASI) is a software-based system with intellectual powers beyond those of humans across a comprehensive range of categories and fields of endeavor. ASI doesn't exist yet and is a hypothetical state of AI.
What is the fastest AI in the world? ›
Depending on the benchmark, the current world's fastest AI supercomputer is the Department of Energy's Perlmutter supercomputer. Capable of four exaflops of AI performance, it features 6,159 Nvidia A100 GPUs and 1,536 AMD Epyc CPUs.
Is there any super AI? ›
Artificial super intelligence (ASI) is a hypothetical kind of artificial intelligence (AI) that goes beyond simply mimicking or understanding human intelligence and behavior. With ASI, computers become self-aware and outperform human intelligence and ability.
What is the AI website that knows everything? ›
Meet ChatGPT: The Artificial Intelligence (AI) Chatbot That Knows Everything.
What is the potential of the ChatGPT? ›
ChatGPT was released in November 2022 and has the potential to change the landscape of education. For example, students can ask ChatGPT to help with homework. By asking specific questions, students can get information and guidance on a wide range of topics.
Is ChatGPT worth the hype? ›
It has been covered widely by the media, and many experts believe it could be a game-changer in the AI industry. Some of the reasons behind the hype include the potential benefits of Chat GPT-4, its improved natural language processing abilities, and its expected massive parameters.
What is Google's version of ChatGPT? ›
Google is opening public access to the conversational computer program Bard, its answer to the viral chatbot ChatGPT, while stopping short of integrating the new tool into its flagship search engine.
What is the difference between Jasper AI and ChatGPT? ›
While ChatGPT uses a generic version of a large language model, Jasper AI tailors it to specific use cases that appeal to business needs. Jasper AI's website claims its AI has already processed 10% of the internet. This gives it a good understanding of how humans write and knowledge of several languages.
What is the difference between Cactus AI and ChatGPT? ›
Unlike ChatGPT, for which you must indicate that you need a 50-word, 200-word, or 500-word essay on a certain topic for optimal results, Caktus AI can begin generating content with more of a keyword-style prompt or a more general statement query.
How to earn $1,000 dollars per day? ›
- Sell off things you don't need.
- Get Paid to Do Market Research.
- Get Paid to Shop.
- Resell Sneakers.
- Sell an Online Course.
- Trade in Used Textbooks.
- Ask Your Boss for Overtime.
- Deliver Pizzas.
Blogging can earn you 500 rupees per day. You can start a blog and post articles on a variety of subjects to it. This work can also be done from your phone. Even if you don't have any money, you can start a blog for nothing.
How can I make $100 a day? ›
- Deliver groceries and goods. ...
- Walk dogs or pet-sit. ...
- Take online surveys. ...
- Become an Amazon reseller. ...
- Open your own Etsy shop. ...
- Rent a spare room in your home. ...
- Become a rideshare driver. ...
- Rent your car out.
Artificial intelligence can even master creative processes, including making visual art, writing poetry, composing music, and taking photographs. Google's AI was even able to create its own AI “child” that outperformed human-made counterparts.
What are two things AI can do that humans can't? ›
AI can filter email spam, categorize and classify documents based on tags or keywords, launch or defend against missile attacks, and assist in complex medical procedures. However, if people feel that AI is unpredictable and unreliable, collaboration with this technology can be undermined by an inherent distrust of it.
What question can AI not answer? ›
AI cannot answer questions requiring inference, a nuanced understanding of language, or a broad understanding of multiple topics. In other words, while scientists have managed to “teach” AI to pass standardized eighth-grade and even high-school science tests, it has yet to pass a college entrance exam.
What professions will ChatGPT replace? ›
- Customer service representatives.
- Technical writers.
- Translators and interpreters.
- Data entry clerks.
The short answer: ChatGPT and its rival AI models could dramatically disrupt the labour market, including replacing routine jobs in some sectors. But overall, the technology could enhance productivity and complement human workers, instead of leading to unemployment, experts told Al Jazeera.
What job cannot be replaced by ChatGPT? ›
Carpenters. Carpenters generally construct, install, and repair various residential, commercial, and industrial structures and fixtures. This manual work cannot be performed through ChatGPT.
Is ChatGPT free? ›
Is ChatGPT free to use? Yes, the basic version of ChatGPT is completely free to use.
Which AI app is free? ›
Let's have a look.
- Google Assistant – Most Popular AI App. ...
- Bing – Best AI Search Engine App. ...
- ELSA Speak – Best AI App to Learn English. ...
- FaceApp – Best AI Photo Editor App.
These AI-generated images through the Lensa application are taking over various social media platforms. Celebs have created multiple characters and are swooning over the results.
What is the website where AI talks? ›
Replika. With over 10 million users, Replika is one of the most popular and advanced AI companions. Unlike traditional chatbots, Replika can recognize images and continue the conversation using them.
What is the AI website that does homework? ›
Socratic, a revolutionary app powered by Google AI, transforms how students learn and complete homework assignments. With its advanced artificial intelligence technology, Socratic offers step-by-step solutions to problems in various subjects, including math, science, and history.
What website is the AI art generator? ›
NightCafe Creator is an AI Art Generator app with multiple methods of AI art generation. Using neural style transfer you can turn your photo into a masterpiece. Using text-to-image AI, you can create an artwork from nothing but words on a page.
What is the name of the chat AI? ›
ChatGPT is an artificial intelligence (AI) chatbot developed by OpenAI and released in November 2022. It is built on top of OpenAI's GPT-3.5 and GPT-4 families of large language models (LLMs) and has been fine-tuned (an approach to transfer learning) using both supervised and reinforcement learning techniques.
How do I chat with open AI? ›
In order to use chat GPT, one is required to visit the official website which is chat.openai.com. Thereafter, one needs to create an account on it by entering their basic details. Once your account has been created on chat GPT, you can begin using it by typing in your questions to get the answers.
Which AI website can draw anything? ›
starryai is an AI art generator app. You simply enter a text prompt and our AI transforms your words into works of art. AI Art generation is usually a laborious process that requires technical expertise, we make that process simple and intuitive. starryai is available for free on iOS and Android.
What app has all homework answers? ›
- Course Hero: Homework Helper. Education.
- PhotoStudy - Live Study Help. Education.
- Bartleby: Math Homework Helper. Education.
- Chegg Study - Homework Help. Education.
- MathPapa - Algebra Calculator. Education.
- Mathway: Math Problem Solver. Education.
- Jasper Art. Best for creating images in different styles.
- Starry AI. Best for creating and owning images.
- Dream by Wombo. Best for beginners.
- Nightcafe. Best for generating creative images.
- DALL-E. Best for creating animal illustrations.
- Synthesys X. ...
- Pixray. ...
- Deep Dream Generator.
|Bing Image Creator||Free||Fast|
|DALL-E 2 by OpenAI||Free + Credits||Fast|
|Dream by WOMBO||Free + Subscription||Fast|
|Midjourney||Starts at $8/month||Fast|
Lucid.AI is the world's largest and most complete general knowledge base and common-sense reasoning engine.
Are there any real AI apps? ›
It is an Artificial Intelligence-based personal assistant for Android devices. Google Assistant allows you to use your applications hands-free.
Lensa, an app, lets you upload pictures of yourself that are then turned into magical, whimsical AI images. It basically uses an open-source neural network model called, Stable Diffusion to make it all happen.