Company type | Non-profit |
---|---|
Industry | Artificial intelligence |
Founder | |
Website | laion |
LAION (acronym for Large-scale Artificial Intelligence Open Network) is a German non-profit that produces open-source artificial intelligence models and datasets. [1] It is best known for releasing several large datasets of images and captions scraped from the web, which have been used to train high-profile text-to-image models, including Stable Diffusion and Imagen. [2] [3]
In February 2023, LAION was named in the Getty Images lawsuit against Stable Diffusion as a non-party. [4] In April 2023, LAION was directly sued by a German photographer who wanted to have his images removed from the training set. [5]
On April 15, 2023, LAION and contributors released OpenAssistant, an open-source AI assistant chatbot, to the public.
LAION has publicly released a number of large datasets of image-caption pairs which have been widely used by AI researchers. The data is derived from the Common Crawl, a dataset of scraped web pages. The developers searched the crawled HTML for <img> tags and treated their alt attributes as captions. They used CLIP to identify and discard images whose content did not appear to match their captions. [6] LAION does not host the content of scraped images themselves; rather, the dataset contains URLs pointing to images, which researchers must download themselves. [7]
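The extraction and filtering steps described above can be illustrated with a minimal sketch. This is not LAION's actual code: the names `ImgAltExtractor`, `extract_pairs` and `filter_pairs` are hypothetical, the `score` callable stands in for a CLIP image-text similarity function, and the threshold is an illustrative parameter rather than a value taken from the sources cited here. In keeping with the description above, the sketch stores only image URLs and captions, not the images themselves.

```python
from html.parser import HTMLParser
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (image URL, caption taken from the alt attribute)


class ImgAltExtractor(HTMLParser):
    """Collect (URL, alt text) pairs from <img> tags in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.pairs: List[Pair] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        alt = (attr_map.get("alt") or "").strip()
        # Keep only images that have both a URL and a non-empty caption.
        if src and alt:
            self.pairs.append((src, alt))


def extract_pairs(html: str) -> List[Pair]:
    """Return every (URL, caption) pair found in the given HTML string."""
    parser = ImgAltExtractor()
    parser.feed(html)
    parser.close()
    return parser.pairs


def filter_pairs(
    pairs: List[Pair],
    score: Callable[[str, str], float],
    threshold: float,
) -> List[Pair]:
    """Discard pairs whose image-caption similarity falls below the threshold.

    `score` is a placeholder (assumption) for a CLIP-style similarity model.
    """
    return [(url, cap) for url, cap in pairs if score(url, cap) >= threshold]


if __name__ == "__main__":
    page = (
        '<img src="https://example.com/cat.jpg" alt="A cat">'
        '<img src="https://example.com/banner.png">'
    )
    # Toy scorer used only for demonstration; a real pipeline would embed
    # the downloaded image and the caption and compare the embeddings.
    toy_score = lambda url, caption: 1.0 if "cat" in caption.lower() else 0.0
    print(filter_pairs(extract_pairs(page), toy_score, threshold=0.5))
```

Passing the scorer in as a function keeps the extraction logic independent of any particular similarity model.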
The first such dataset, LAION-400M, was released in August 2021 and consisted of 400 million image-caption pairs. The pairs were extracted from a random subset of webpages scraped by Common Crawl between 2014 and 2021. [8] It was an attempt to recreate the process used by OpenAI to collect the 400 million image-caption pairs used to train the CLIP model; OpenAI had chosen to open-source the model's code and weights, but not its training dataset. [6] Imagen, a text-to-image model announced by Google Brain in 2022, was trained on LAION-400M in combination with private internal datasets. [9]
A successor of more than 5 billion pairs, LAION-5B, was released in March 2022. [10] At the time of its release, it was the largest freely available dataset of image-caption pairs in existence. [6] Its creation was funded by Doodlebot, Hugging Face and Stability AI, the company that funded the Stable Diffusion text-to-image model, which was trained on LAION-5B. [11]
Several studies have found that LAION-5B contains problematic image-text pairs involving rape, pornography, malign stereotypes, racist and ethnic slurs, and other extremely problematic content. [12] [13]
An investigation by Bayerischer Rundfunk showed that LAION's datasets, hosted on Hugging Face, contain large amounts of private and sensitive data. [14]
In December 2023, the Stanford Internet Observatory released a report on LAION-5B that found 3,226 suspected instances of links to child sexual abuse material, of which 1,008 were externally validated. In response, LAION temporarily removed LAION-5B and LAION-400M, citing its "zero tolerance policy for illegal content" and "an abundance of caution". [15]
Developer(s) | LAION and contributors |
---|---|
Initial release | 15 April 2023 |
Type | |
License | Apache License 2.0 |
Website | open-assistant |
OpenAssistant is an open-source, chat-based artificial intelligence (AI) assistant that understands tasks, can interact with third-party systems, and can retrieve information dynamically to do so. The project is developed by a group of volunteers in collaboration with LAION. One of its development goals is free access to large language models that can be run locally on consumer hardware. [16] [17] The project is backed by a worldwide crowdsourcing effort involving over 13,500 volunteers who have created 600k human-generated data points. [17] [18]