Data is key.
Progress in model performance is gated on high-quality data.
Human-generated data is important, but collecting it is slow and expensive. More importantly, superintelligence won't come from mimicking humans.
Formal proof systems, simulators, executable tests, and oracle databases can produce more trustworthy data than humans can.
Models pre-trained on noisy web-scraped text need information-dense supplements about the natural world. We're projecting the laws of physics, biological facts, self-consistent logic, and more into natural language.
Our background is in foundation model training across LLMs, image, video, and speech. We're selective about the engagements we take on.