I like to train large deep neural nets too 🧠🤖💥 | First Paper (AutoAgents: A Framework for Automatic Agent Generation) Accepted @ IJCAI 2024 | Role Model Karpathy
Minimal single-script implementation of knowledge distillation in LLMs. In this implementation, we use GPT-2 (124M) as the student model and GPT-2 Medium (340M) as the teacher, with reverse Kullback-Leibler (KL) divergence as the distillation loss, trained on a small chunk of OpenWebText.
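For anyone curious what "reverse KL" means here, below is a minimal sketch in plain Python, not the actual training script. It assumes the loss is KL(student || teacher) computed over one position's vocabulary logits; the function names (`log_softmax`, `reverse_kl`, `forward_kl`) are made up for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher): the expectation is taken under the student."""
    log_q = log_softmax(student_logits)  # student log-probs
    log_p = log_softmax(teacher_logits)  # teacher log-probs
    return sum(math.exp(lq) * (lq - lp) for lq, lp in zip(log_q, log_p))


def forward_kl(student_logits, teacher_logits):
    """KL(teacher || student), shown only for contrast with reverse KL."""
    return reverse_kl(teacher_logits, student_logits)


if __name__ == "__main__":
    student = [2.0, 0.0, 0.0]
    teacher = [0.0, 0.0, 0.0]
    print("reverse KL:", reverse_kl(student, teacher))
    print("forward KL:", forward_kl(student, teacher))
```

Because the expectation is taken under the student, reverse KL penalizes the student for putting mass where the teacher has little, which tends to make it mode-seeking rather than mass-covering.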
damn I love NVIDIA's bullish stance on taking AI to the edge - from being the overlord of compute to cutting-edge physical AI with SOTA multiverse simulation engines that bring the scaling laws under your control!!
My favorite: Cosmos - a fully open-source, open-weight, physics-based video gen platform, what an incredible way to start off the year✨
nanoBLT: Simplified lightweight implementation of a character-level Byte Latent Transformer model (under 500 lines of code). The model is 2x4x2 layers deep (n_layers_encoder, n_layers_latent, n_layers_decoder), trained on ~1M bytes of Tiny Shakespeare with a patch size of 4.
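To make "patch size of 4" concrete, here is a tiny sketch of grouping raw bytes into fixed-size patches. It assumes fixed-size patching and zero padding of the last patch; `to_patches` and `pad_byte` are hypothetical names, and the actual nanoBLT code may patch differently.

```python
def to_patches(text, patch_size=4, pad_byte=0):
    """Split the UTF-8 bytes of `text` into fixed-size patches.

    The last patch is right-padded with `pad_byte` (an assumption made
    for this sketch) so that every patch has exactly `patch_size` bytes.
    """
    data = list(text.encode("utf-8"))
    remainder = len(data) % patch_size
    if remainder:
        data += [pad_byte] * (patch_size - remainder)
    return [data[i:i + patch_size] for i in range(0, len(data), patch_size)]


if __name__ == "__main__":
    for patch in to_patches("To be!"):
        print(patch)
```

In this picture, the encoder and decoder layers work at the byte level, while the latent layers see one vector per patch, so the latent sequence is 4x shorter than the byte sequence.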
Implements from first principles a discrete flow matching (DFM) model for code generation: trained a small 2D DFM model on two variations of binary search code. The result was amazing. Code: https://github.com/Jaykef/ai-algorithms/blob/main/dfm.ipynb
In Honour of This Year's NeurIPS Test of Time Paper Awardees
This year's NeurIPS Test of Time Paper Awards went to two groundbreaking papers:
1. Generative Adversarial Nets (Goodfellow et al.)
2. Sequence to Sequence Learning with Neural Networks (Sutskever et al.)
Let's explore how these papers helped pioneer breakthroughs in today's AI:
Lightweight implementation of the seminal paper “Sequence to Sequence Learning with Neural Networks”
Built, trained and evaluated a 2-layer seq2seq LSTM-based model (~10M params) on the German-English corpus of the Multi30K dataset. In honor of Ilya Sutskever et al. for winning this year’s NeurIPS Test of Time paper award 🫡
Rethinking Backpropagation: Thoughts on What's Wrong with Backpropagation
As a young researcher, I've often pondered the limitations of backpropagation, especially when compared with how learning occurs in the human brain. While backpropagation has been the workhorse of deep learning, it isn't without flaws. In this post, I aim to share some thoughts on these shortcomings from first principles.
Implements the compute-efficient DeepPCR algorithm, which parallelizes sequential operations, thus speeding up inference and training of neural networks. DeepPCR can significantly reduce the time complexity of operations such as denoising in latent diffusion space from O(L) to O(log2 L).
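To give a feel for how a chain of L dependent steps can shrink to about log2 L rounds, here is a toy sketch. It is not DeepPCR itself: it assumes a purely linear recurrence x_i = a_i * x_(i-1) + b_i and uses a doubling-style reduction, and the names `solve_sequential` and `solve_by_doubling` are made up for illustration. Each round's updates are independent of one another, so on parallel hardware a round could run in one step; plain Python just simulates that.

```python
def solve_sequential(a, b, x0):
    """Reference: apply the L steps one after another (O(L) depth)."""
    xs, x = [], x0
    for ai, bi in zip(a, b):
        x = ai * x + bi
        xs.append(x)
    return xs


def solve_by_doubling(a, b, x0):
    """Solve the same recurrence in ceil(log2 L) rounds.

    Entry j holds an affine map x -> A[j] * x + B[j] from an earlier state
    to x_(j+1). Each round composes it with the map `stride` entries back,
    doubling the span it covers, until every map starts at x0.
    """
    A, B = list(a), list(b)
    n, stride, rounds = len(A), 1, 0
    while stride < n:
        # All entries of a round are independent, hence parallelizable.
        new_A = [A[j] * A[j - stride] if j >= stride else A[j] for j in range(n)]
        new_B = [A[j] * B[j - stride] + B[j] if j >= stride else B[j] for j in range(n)]
        A, B = new_A, new_B
        stride *= 2
        rounds += 1
    return [A[j] * x0 + B[j] for j in range(n)], rounds


if __name__ == "__main__":
    a, b = [2, 3, 1, 2, 1], [1, 0, 2, 1, 3]
    print(solve_sequential(a, b, 1))
    print(solve_by_doubling(a, b, 1))
```

The trade-off is more total work per round in exchange for far fewer dependent rounds, which is what the O(L) to O(log2 L) claim is about.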
Here we implement the seminal RNN paper “Generating Text with Recurrent Neural Networks” - we train a character-level multiplicative recurrent neural network model (~250k params) for 1000 epochs with the Adam optimizer on 2Pac's "Hit 'Em Up", sample was fun lol.
Interesting Work on Reasoning 🤔
- Explores a new take on few-shot reasoning while challenging the assumption that program synthesis is necessary for abstract reasoning.
- Shows that test-time training + smart inference tricks can match average human performance, though at high computational cost.
Key insight: proper compute allocation matters more than method (whether symbolic or neural).
It's work like this that in some way signals the eventual “dominance” of AI over all the sciences.
“We train our model on the six-dimensional N-body phase space, predicting particle velocities as the time derivative of the model’s displacement outputs”
The emulator is capable of predicting the nonlinear displacement and velocity fields for 128^3 particles in half a second on a single GPU🤯
Triton nanoGPT now has a custom cross entropy loss kernel 🚀 Next: matmul, gradually overthrowing all major PyTorch ops:)
Simplified pseudocode for the parallel cross-entropy loss computation:
- init program: get pid, compute offsets, load targets.
- init row_max and row_sum.
- for-loop 1 (find max logits): update row_max with max logits.
- for-loop 2 (compute softmax and loss): compute row_sum, update loss.
- add log(row_sum) and store loss.
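The steps above can be sketched in plain Python for a single row of logits. This is not the Triton kernel, just the same two-pass arithmetic; `block_size` is a hypothetical stand-in for the kernel's block size, and in the kernel each program (pid) would handle one row in parallel.

```python
import math


def cross_entropy_row(logits, target, block_size=4):
    """Two-pass, block-wise cross-entropy for one row of logits."""
    row_max = float("-inf")
    row_sum = 0.0
    loss = 0.0

    # Loop 1: find the max logit, one block at a time.
    for start in range(0, len(logits), block_size):
        block = logits[start:start + block_size]
        row_max = max(row_max, max(block))

    # Loop 2: accumulate sum(exp(x - row_max)) and pick up the target term.
    for start in range(0, len(logits), block_size):
        block = logits[start:start + block_size]
        row_sum += sum(math.exp(x - row_max) for x in block)
        if start <= target < start + len(block):
            loss += row_max - logits[target]

    # Final step: add log(row_sum), giving -log softmax(logits)[target].
    return loss + math.log(row_sum)


def cross_entropy(batch_logits, targets, block_size=4):
    """Mean loss over rows; the kernel would run rows in parallel instead."""
    losses = [cross_entropy_row(r, t, block_size) for r, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    print(cross_entropy([[0.0] * 8, [1000.0, 0.0]], [3, 0]))
```

Subtracting row_max before exponentiating is what keeps large logits (like the 1000.0 above) from overflowing.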
Lightweight implementation of the newly introduced “Differential Transformer”: it proposes a differential attention mechanism that computes attention scores as the difference between two separate softmax attention maps, thereby reducing noise in attention blocks. Differential nanoGPT :)
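Here is a minimal single-head sketch of the "difference between two softmax attention maps" idea, in plain Python with nested lists. It assumes no causal mask and treats the weighting factor `lam` as a fixed scalar (in the paper it is learned); the function names are made up for illustration and this is not the Differential nanoGPT code.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(q, k):
    """softmax(Q K^T / sqrt(d)), computed row by row."""
    d = len(q[0])
    return [
        softmax([sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k])
        for qi in q
    ]


def diff_attention(q1, k1, q2, k2, v, lam=0.5):
    """(softmax(Q1 K1^T) - lam * softmax(Q2 K2^T)) V, scaled by 1/sqrt(d)."""
    a1 = attention_map(q1, k1)
    a2 = attention_map(q2, k2)
    diff = [[x - lam * y for x, y in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    n_cols = len(v[0])
    return [[sum(w * vj[c] for w, vj in zip(row, v)) for c in range(n_cols)] for row in diff]


if __name__ == "__main__":
    q1 = [[1.0, 0.0], [0.0, 1.0]]
    k1 = [[1.0, 0.0], [0.0, 1.0]]
    q2 = [[0.5, 0.5], [0.5, 0.5]]
    k2 = [[1.0, 1.0], [0.0, 0.0]]
    v = [[1.0, 2.0], [3.0, 4.0]]
    print(diff_attention(q1, k1, q2, k2, v))
```

Whatever both maps attend to in common gets cancelled by the subtraction, which is the noise-reduction intuition: with identical maps and lam=1 the output is exactly zero.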
Open-source AI creates healthy competition in a field where natural tendencies lead to extreme concentration of power. Imagine a world where only one or two companies could build software. This is the biggest risk and ethical challenge of them all IMO. Let's fight this!
Very few people realize that most of the successful AI startups became successful because they focused on open science and open-source for at least their first few years. To name but a few: OpenAI (GPT and GPT-2 were open-source), Runway & Stability (Stable Diffusion), Cohere, Mistral and of course Hugging Face!
The reasons are not just altruistic, it's also because sharing your science and your models pushes you to build AI faster (which is key in a fast-moving domain like AI), attracts the best scientists & engineers and generates much more visibility, usage and community contributions than if you were 100% closed-source. The same applies to big tech companies as we're seeing with Meta and Google!
More startups and companies should release research & open-source AI: it's not just good for the world, it also increases their probability of success!