The main bottleneck in building GUI agents is finding training data. GUI agent trajectories are not easy to come by. Crowdsourcing trajectories and then manually annotating them could be an option, but it is hard to do at scale.
You could use synthetic data generation (ask thousands of small existing GUI agents to solve tasks, and keep only the successful runs). But then it's hard to come up with many high-level tasks.
➡️ Well, a technique was just published that creates a promising new paradigm for synthetic data generation: Shanghai AI Lab researchers propose OS-Genesis, a way to create training data for GUI agents that flips the traditional approach on its head. Instead of starting with predefined tasks and having humans or machines execute them, OS-Genesis first explores the interface naturally, then derives meaningful tasks from those interactions.
Exploration-driven vs task-driven approach:
‣ Instead of starting with tasks, OS-Genesis first explores GUIs by clicking and interacting
‣ It then reverse-engineers high-level tasks from successful interaction patterns
‣ This leads to more natural and diverse training data than predefined tasks
Novel reward model for trajectory quality:
‣ Rather than discarding incomplete trajectories, OS-Genesis scores them based on coherence and completion
‣ This preserves valuable partial successes that would otherwise be wasted
Superior results across environments:
‣ Nearly doubles performance on AndroidWorld (9.8% → 17.4%)
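The exploration-first pipeline described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the OS-Genesis code: `explore`, `derive_task` and `score_trajectory` are hypothetical stand-ins (the paper works with real GUI environments and model-based annotation and scoring), and the scoring rule is invented for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Trajectory:
    actions: list          # e.g. ["click:Settings", "click:Wi-Fi"]
    task: str = ""         # high-level task, derived after the fact
    reward: float = 0.0    # quality score used as a training weight


def explore(ui_elements, n_steps, rng):
    """Hypothetical explorer: interact with random UI elements, no task given."""
    return Trajectory(actions=[f"click:{rng.choice(ui_elements)}" for _ in range(n_steps)])


def derive_task(traj):
    """Hypothetical reverse task synthesis: describe what the actions achieved."""
    targets = [action.split(":", 1)[1] for action in traj.actions]
    return "Navigate through " + " > ".join(targets)


def score_trajectory(traj, max_steps):
    """Toy reward in (0, 1]: penalize repeated actions and very short runs."""
    coherence = len(set(traj.actions)) / len(traj.actions)
    completion = len(traj.actions) / max_steps
    return coherence * completion


def build_dataset(ui_elements, n_trajectories, max_steps, seed=0):
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_trajectories):
        traj = explore(ui_elements, rng.randint(1, max_steps), rng)
        traj.task = derive_task(traj)
        traj.reward = score_trajectory(traj, max_steps)
        dataset.append(traj)  # nothing is discarded: the reward acts as a weight
    return dataset
```

The point of the sketch is the ordering: the task string is produced after the interaction, and low-quality trajectories are down-weighted rather than thrown away.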
By the way, this field of GUI agents is still in its infancy, so you can still make a difference with "low-cost" setups: their paper gets SOTA results with only 8xA100!
Since I published it on GitHub a few days ago, Hugging Face's new agentic library smolagents has gathered nearly 4k stars 🤯
➡️ But we are just getting started on agents, so we are hiring an ML Engineer to join me and double down on this effort!
The plan is to build GUI agents: agents that can act on your computer with mouse & keyboard, like Claude Computer Use.
After 6 years, BERT, the workhorse of encoder models, finally gets a replacement: Welcome ModernBERT!
We talk a lot about ✨Generative AI✨, meaning the decoder version of the Transformer architecture, but this is only one of the ways to build LLMs: encoder models, which turn a sentence into a vector, are maybe even more widely used in industry than generative models.
The workhorse for this category has been BERT since its release in 2018 (that's prehistory for LLMs).
It's not a fancy 100B-parameter supermodel (just a few hundred million parameters), but it's an excellent workhorse, kind of a Honda Civic for LLMs.
Many applications use BERT-family models: the top models in this category have accumulated millions of downloads on the Hub.
➡️ Now a collaboration between Answer.AI and LightOn has just introduced BERT's replacement: ModernBERT.
TL;DR:
Architecture changes:
First, standard modernizations:
- Rotary positional embeddings (RoPE)
- Replace GeLU with GeGLU
- Use Flash Attention 2
✨ The team also introduced innovative techniques like alternating attention instead of full attention, and sequence packing to get rid of padding overhead.
As a result, the model tops the game of encoder models: it beats the previous standard, DeBERTaV3, with 1/5th the memory footprint, and runs 4x faster!
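To make the "sequence packing to get rid of padding overhead" idea concrete, here is a minimal sketch. It is not ModernBERT's implementation: `pack_sequences` and `padding_waste` are hypothetical helpers, and the token ids are made up. The idea shown is simply that sequences are concatenated into one long sequence, and boundary offsets are kept so that attention can later be restricted to each original sequence.

```python
def pack_sequences(sequences):
    """Concatenate sequences and record cumulative boundaries between them."""
    packed, boundaries = [], [0]
    for seq in sequences:
        packed.extend(seq)
        boundaries.append(len(packed))
    return packed, boundaries


def padding_waste(sequences):
    """Fraction of a classic padded batch that would be padding tokens."""
    max_len = max(len(seq) for seq in sequences)
    total_slots = max_len * len(sequences)
    real_tokens = sum(len(seq) for seq in sequences)
    return (total_slots - real_tokens) / total_slots


# Made-up token ids for three sequences of different lengths.
batch = [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 7, 8, 9, 102]]
packed, boundaries = pack_sequences(batch)
print(len(packed), boundaries, round(padding_waste(batch), 2))
```

In this toy batch a third of the padded slots would hold padding, while the packed version contains only real tokens.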
🕰️ Llama-3.1-405B took 39 million GPU-hours to train, i.e. about 4.5 thousand years on a single GPU.
👴🏻 If they had needed all this time, we would have GPU stories from the time of Pharaoh: "Alas, Lord of Two Lands, the shipment of counting-stones arriving from Cathay was lost to pirates, this shall delay the building of your computing temple by many moons"
🛠️ But instead, they just parallelized the training on 24k H100s, which made it take just a few months. This required parallelizing across 4 dimensions: data, tensor, context, pipeline. And this is notoriously hard to do, making for bloated code repos that hold together only by magic.
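The arithmetic behind these figures is quick to check. The sketch below uses only the numbers quoted above and assumes perfectly linear scaling across GPUs, which real training runs do not achieve.

```python
GPU_HOURS = 39_000_000      # figure quoted above
N_GPUS = 24_000             # figure quoted above
HOURS_PER_YEAR = 24 * 365

# Time on a single GPU, in years.
years_on_one_gpu = GPU_HOURS / HOURS_PER_YEAR

# Wall-clock time with all GPUs working in parallel (ideal scaling assumed).
days_in_parallel = GPU_HOURS / N_GPUS / 24

print(f"{years_on_one_gpu:.0f} years on one GPU, {days_in_parallel:.0f} days on {N_GPUS} GPUs")
```

Under that idealized assumption the run fits in a bit more than two months, which is consistent with "a few months" once restarts and inefficiencies are counted.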
But now we don't need huge repos anymore! Instead of building mega-training codes, Hugging Face colleagues cooked in the other direction, towards tiny 4D parallelism libs. One team built Nanotron, already widely used in industry. And now another team has released Picotron, a radical approach that codes 4D parallelism in just a few hundred lines, a real engineering feat that makes it much easier to understand what's actually happening!
⚡ It's tiny, yet powerful: Measured in MFU (Model FLOPs Utilization, the fraction of the hardware's peak compute that the model actually uses), this lib reaches ~50% on the SmolLM-1.7B model with 8 H100 GPUs, which is really close to what huge libs reach. (Caution: the team is running further benchmarks to verify this)
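For readers who want to see how an MFU number is computed, here is a rough sketch. It relies on the common approximation of about 6 FLOPs per parameter per trained token (forward plus backward, attention cost ignored). The throughput is a hypothetical value chosen for illustration, not a Picotron measurement, and the per-GPU peak is an assumed figure for dense BF16 on an H100.

```python
def model_flops_utilization(n_params, tokens_per_second, n_gpus, peak_flops_per_gpu):
    """MFU = FLOPs the model actually performs per second / hardware peak."""
    achieved_flops = 6 * n_params * tokens_per_second  # ~6 FLOPs per param per token
    peak_flops = n_gpus * peak_flops_per_gpu
    return achieved_flops / peak_flops


mfu = model_flops_utilization(
    n_params=1.7e9,              # SmolLM-1.7B
    tokens_per_second=390_000,   # hypothetical throughput, for illustration only
    n_gpus=8,
    peak_flops_per_gpu=989e12,   # assumed dense BF16 peak of one H100
)
print(f"MFU: {mfu:.1%}")
```

With these assumed inputs the ratio lands around one half, which is the order of magnitude quoted above.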
Current LLMs process text by first splitting it into tokens. They use a module named "tokenizer" that -spl-it-s- th-e- te-xt- in-to- arbitrary tokens depending on a fixed dictionary. On the Hub you can find this dictionary in a model's files under tokenizer.json.
➡️ This process is called BPE tokenization. Everyone agrees it is suboptimal: it breaks text into predefined chunks that often fail to capture the nuance of language. But it has been a necessary evil in language models since their inception.
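To show what "splitting depending on a fixed dictionary" means in practice, here is a toy BPE encoder. It is a minimal sketch: the `merges` list is invented for the example (a real tokenizer learns tens of thousands of merge rules from data), and `bpe_encode` is a hypothetical helper, not a library function.

```python
def bpe_encode(word, merges):
    """Apply merge rules in priority order, starting from single characters."""
    symbols = list(word)
    for left, right in merges:
        i, merged = 0, []
        while i < len(symbols):
            # Merge every adjacent (left, right) pair into a single symbol.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Invented merge rules; the order encodes their priority.
merges = [("t", "h"), ("th", "e"), ("i", "n"), ("t", "o")]

for word in ["the", "into", "text"]:
    print(word, bpe_encode(word, merges))
```

Words covered by the merge rules collapse into a few tokens, while anything outside the dictionary falls back to small fragments, which is the rigidity the post complains about.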
In Byte Latent Transformer (BLT), Meta researchers propose an elegant solution by eliminating tokenization entirely, working directly with raw bytes while maintaining efficiency through dynamic "patches."
This had been tried before with different byte-level tokenizations, but it's the first time that an architecture of this type scales as well as BPE tokenization. And it could mean a real paradigm shift!
Architecture: Instead of a lightweight tokenizer, BLT has a lightweight encoder that processes raw bytes into patches. Then the patches are processed by the main heavy-duty transformer as we do normally (but for patches of bytes instead of tokens), before converting back to bytes.
Dynamic Patching: Instead of fixed tokens, BLT groups bytes based on their predictability (measured by entropy) - using more compute for complex sequences and efficiently handling simple ones. This allows efficient processing while maintaining byte-level understanding.
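A toy version of entropy-based patching can help build intuition. The sketch below is an assumption-laden illustration, not the BLT method: it swaps BLT's learned byte-level model for simple bigram counts over the input itself, and `entropy_patches` with its threshold rule is a hypothetical helper. A new patch starts wherever the next byte is hard to predict.

```python
import math
from collections import Counter, defaultdict


def next_byte_entropies(data):
    """Entropy (in bits) of a toy bigram model's next-byte prediction per position."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(data, data[1:]):
        counts[prev][nxt] += 1
    entropies = [0.0]  # the first byte has no context; it never triggers a split
    for prev in data[:-1]:
        total = sum(counts[prev].values())
        entropies.append(
            -sum(c / total * math.log2(c / total) for c in counts[prev].values())
        )
    return entropies


def entropy_patches(data, threshold):
    """Start a new patch whenever next-byte entropy exceeds the threshold."""
    entropies = next_byte_entropies(data)
    patches, start = [], 0
    for i in range(1, len(data)):
        if entropies[i] > threshold:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches


print(entropy_patches(b"the cat sat on the mat, the cat sat", threshold=1.0))
```

Predictable stretches end up in long patches and uncertain positions open new ones, so compute is spent where the text is harder to predict.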
I hope this breakthrough is confirmed and we can get rid of all the tokenizer stuff: it will make model handling easier!
Google releases Gemini 2.0, starting with a Flash model that steamrolls GPT-4o and Claude-3.6 Sonnet! And they start a huge effort on agentic capabilities.
The performance improvements are crazy for such a fast model:
‣ Gemini 2.0 Flash outperforms the previous 1.5 Pro model at twice the speed
‣ Now supports both input AND output of images, video, audio and text
‣ Can natively use tools like Google Search and execute code
➡️ If the price is on par with the previous Flash iteration ($0.30 / M tokens, compared with GPT-4o's $1.25), the competition will have a big problem with this 4x cheaper model that gets better benchmarks 🤯
What about the agentic capabilities?
‣ Project Astra: A universal AI assistant that can use Google Search, Lens and Maps
‣ Project Mariner: A Chrome extension that can complete complex web tasks (83.5% success rate on WebVoyager benchmark, this is really impressive!)
‣ Jules: An AI coding agent that integrates with GitHub workflows
I'll be eagerly awaiting further news from Google!
Scaling laws are not dead yet! New blog post suggests Anthropic might have an extremely strong Opus-3.5 already available, but is not releasing it to keep their edge over the competition.
Since the release of Opus-3.5 has been delayed indefinitely, there have been lots of rumors and articles about LLMs plateauing. According to these rumors, scaling laws, the main driver of the increase in LLM capabilities, may have stopped working, which would explain this stalling of progress.
These rumors were quickly denied by many people at the leading LLM labs, including OpenAI and Anthropic. But these people would be expected to hype the future of LLMs even if scaling laws really plateaued, so the jury is still out.
This new article by SemiAnalysis (generally a good source, specifically on hardware) provides a counter-rumor that I find more convincing:
➡️ Maybe scaling laws still work, and Opus-3.5 is ready and as good as planned, but Anthropic just doesn't release it because the synthetic data it helps produce can bring the cheaper/smaller models, Sonnet and Haiku, up in performance, without the risk of leaking this precious high-quality synthetic data to competitors.
Last week was crazy in open-source AI, with important model and dataset releases every day.
Here are the most important ones I've pinned:
Cohere released Global-MMLU, a multilingual version of MMLU, to evaluate AI models' world knowledge in many languages!
🦙 Meta released Llama-3.3-70B-Instruct, a 70B model that's on par with Llama-3.1-405B-Instruct, GPT-4o and Claude. Probably my new go-to for agentic workflows.
FishAudio released fish-speech-1.5, a multilingual text-to-speech model
Microsoft Research released TRELLIS, an extremely impressive image-to-3D model, which you can try here: JeffreyXiang/TRELLIS
Yesterday, Hugging Face released FineWeb 2, a new version that extends the previous FineWeb to over 1000 languages, including extended coverage in Russian, Mandarin, German, Japanese, Spanish and French: a huge, high-quality dataset of > 3 trillion words! HuggingFaceFW/fineweb-2
Now let's go build to make this week as productive as the last one!
A team from NUS and Microsoft just released an agent that can act on any UI (Desktop, Android, Web) without needing additional text information. It works extremely well: they applied their method to a tiny Qwen2-VL-2B, and they managed to beat methods that use much more powerful vision models (like GPT-4V), without using any additional info (e.g. the DOM of a webpage) as previous methods did!
They started from the idea that most existing methods rely heavily on text, which makes them less generalizable, while leaving aside the rich UI structure that users actually rely on when navigating these interfaces.
They put several good ideas to work:
💡 Simplify screenshots to the max: They heavily prune the visual content of UI screenshots by removing redundant image patches (for instance, any vast area of a single color is reduced to a small patch, while positional embeddings are maintained), then group patches from the same GUI elements together to simplify even further.
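The pruning idea can be illustrated with a small sketch. It is a simplification under stated assumptions, not the authors' code: each patch is reduced to a single color label, "redundant" means exactly identical neighbors, and `prune_redundant_patches` is a hypothetical helper that keeps one representative patch per connected region together with its original grid position.

```python
def prune_redundant_patches(grid):
    """Group 4-connected patches of identical color; keep one (row, col) per group."""
    n_rows, n_cols = len(grid), len(grid[0])
    seen, kept = set(), []
    for r in range(n_rows):
        for c in range(n_cols):
            if (r, c) in seen:
                continue
            kept.append((r, c))  # the representative keeps its original position
            seen.add((r, c))
            stack = [(r, c)]
            while stack:  # flood-fill the rest of the region
                y, x = stack.pop()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (
                        0 <= ny < n_rows
                        and 0 <= nx < n_cols
                        and (ny, nx) not in seen
                        and grid[ny][nx] == grid[y][x]
                    ):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
    return kept


# Toy "screenshot": a white background, one icon, and a blue toolbar.
screenshot = [
    ["white", "white", "white", "white"],
    ["white", "icon", "white", "white"],
    ["blue", "blue", "blue", "blue"],
]
kept = prune_redundant_patches(screenshot)
print(f"{len(kept)} patches kept out of {sum(len(row) for row in screenshot)}")
```

Keeping the original coordinates of the surviving patches is what lets the model retain positional information after pruning.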
💡 Build a truly generalist dataset: To train a general UI agent, you need trajectories from each possible UI, expressed in a common language. The authors merge datasets like OmniAct for Desktop, Mind2Web for websites and AMEX for Android trajectories to create a high-quality and diverse dataset.
➡️ Nice results ensued: They fine-tune a tiny Qwen2-VL-2B with their method, and it reaches SOTA on several tasks (element identification, web navigation), even beating methods that either use additional info from the DOM or use much bigger VLMs like GPT-4V!
And performance could certainly jump with a slightly bigger vision model. Let's hope the community builds this soon!
Adobe's code-generating agent reaches the top of GAIA leaderboard - and their paper cites my work!
💡 Reminder: In short, agentic systems are a vehicle in which you put your LLM to give it access to the outside world.
➡️ The team of researchers at Adobe started from the idea that current agentic systems lack the ability to define their own tools. So they decided to make an agent that writes actions as code, thus allowing it to write Python functions that can be reused later as tools!
Here's what the LLM generations can look like with the proper prompt:
Thought: I need to access the excel file using a different method.
Action:
def access_excel_file(file_path):
    ...  # rest of the code (the agent does write it, but I don't have room in this post)
    return rows
Then your system executes this and appends the observation to the agent's memory.
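The execute-and-observe step can be sketched as below. This is a minimal illustration, not the paper's or any library's implementation: `run_code_action` is a hypothetical helper, `tools` and `memory` are plain Python containers, and the bare `exec` call stands in for what should be a sandboxed interpreter in any real system.

```python
import contextlib
import io


def run_code_action(code, tools, memory):
    """Execute a code action, keep new functions as tools, log the observation."""
    namespace = dict(tools)  # previously defined tools are callable from the new code
    stdout = io.StringIO()
    try:
        with contextlib.redirect_stdout(stdout):
            exec(code, namespace)  # assumption: a real agent would sandbox this call
        observation = stdout.getvalue().strip()
    except Exception as err:
        observation = f"Error: {err!r}"
    # Any function the agent defined becomes a reusable tool.
    for name, value in namespace.items():
        if callable(value) and not name.startswith("__"):
            tools[name] = value
    memory.append({"action": code, "observation": observation})
    return observation


tools, memory = {}, []
run_code_action("def add(a, b):\n    return a + b\nprint(add(2, 3))", tools, memory)
run_code_action("print(add(10, 1))", tools, memory)  # reuses the tool defined earlier
print(sorted(tools), [step["observation"] for step in memory])
```

The second call succeeds only because the function written during the first action was stored, which is the "define your own tools" behavior described above.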
Why is this code formulation better than classical tool use formulation as JSON? The paper explains:
"Most existing work uses text or JSON as the representation of actions, which significantly lacks the two criteria mentioned earlier: generality and composability. In contrast, DynaSaur can utilize available actions or create new ones if necessary, using code as a unified representation. In principle, acting with code enables agents to solve any Turing-complete problem."
The idea of using code is not new: in fact, we do it in transformers.agents (thus the citation that I got). Their implementation adds further refinements, like using RAG to retrieve relevant functions before generating an action, which increases performance further.
And they observe that code agents perform much better, reaching the top of GAIA leaderboard!
Go take a look, it's really clear and informative!
Menlo Ventures surveyed 600 enterprise IT decision-makers for their 2024 report. They reveal that AI spending surged to $13.8 billion this year, more than 6x the $2.3 billion spent in 2023!
Companies are shifting from experimentation to serious implementation.
Top enterprise use cases by adoption:
‣ Code copilots (51%) - GitHub Copilot hit $300M revenue run rate
‣ Support chatbots (31%)
‣ RAG (28%)
‣ Data extraction/transformation (27%)
‣ Meeting summarization (25%)
Market dynamics:
‣ OpenAI's enterprise share dropped from 50% to 34%
‣ Anthropic doubled presence from 12% to 24%
‣ Open-source makes up 19% of usage
Implementation challenges:
‣ 26% failed due to unexpected implementation costs
‣ 21% failed due to data privacy issues
‣ 18% failed due to disappointing ROI
‣ 15% failed due to hallucinations
Made a new app to visualize the LLM race: No European company in the top 10 🇪🇺❌
The outcome is quite sad, as a Frenchman and European.
The top 10 is exclusively US 🇺🇸 and Chinese 🇨🇳 companies (after great Chinese LLM releases recently, like the Qwen2.5 series), with the notable exception of Mistral AI 🇫🇷.
American companies are making fast progress, Chinese ones even faster. Europe is at risk of being left behind. And the EU AI Act hasn't even come into force yet to slow down the EU market. We need to wake up
⚠️ Caution: This Chatbot Arena Elo ranking is not the most accurate, especially at high scores like this, because LLM makers can game it to some extent.
Evaluating systems is critical during prototyping and in production, and LLM-as-a-judge has become a standard technique to do it.
First, what is "LLM-as-a-judge"? It's a very useful technique for evaluating LLM outputs. If what you're evaluating cannot be properly assessed with deterministic criteria, like the "politeness" of an LLM output or how faithful it is to an original source, you can use an LLM judge instead: prompt another LLM with "Here's an LLM output, please rate this on criterion {criterion} on a scale of 1 to 5", then parse the number from its output, and voilà, you get your score.
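In code, the prompt-then-parse recipe looks roughly like this. It is a minimal sketch: `judge` is a hypothetical helper, `llm` is any callable that maps a prompt string to a reply string, and `fake_llm` is a stand-in so the example runs without a model.

```python
import re

JUDGE_PROMPT = (
    "Here's an LLM output, please rate this on criterion {criterion} "
    "on a scale of 1 to 5.\n\nOutput:\n{output}\n\nRating:"
)


def judge(llm, output, criterion):
    """Ask the judge model for a rating and parse the first standalone digit 1-5."""
    reply = llm(JUDGE_PROMPT.format(criterion=criterion, output=output))
    match = re.search(r"\b[1-5]\b", reply)
    return int(match.group()) if match else None  # None if no rating was found


def fake_llm(prompt):
    """Hypothetical stand-in for a real model call."""
    return "Rating: 4. The answer is polite but a bit curt."


print(judge(fake_llm, "Sure, here is the file you asked for.", "politeness"))
```

Returning `None` when parsing fails is a deliberate choice here: silently defaulting to a score would hide judge replies that did not follow the requested format.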
But who judges the judge? How can you make sure your LLM judge is reliable? You can have a specific dataset annotated with scores provided by human judges, and check how well the LLM judge's scores correlate with the human scores.
Before even running that benchmark, there's a new option to get you started: a leaderboard that measures how well different models perform as judges!
And the outcome is surprising: models come in quite a different order from what we're used to in general rankings. Probably some have much better bias mitigation than others!
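Checking a judge against human annotations comes down to a correlation computation. The sketch below uses Pearson correlation as one possible choice (rank correlations are another common option), and the two score lists are hypothetical numbers made up for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


human_scores = [1, 2, 3, 4, 5, 3]  # hypothetical human annotations
judge_scores = [2, 2, 3, 5, 5, 3]  # hypothetical LLM-judge ratings, same examples

print(f"Judge/human correlation: {pearson(human_scores, judge_scores):.2f}")
```

A value close to 1 suggests the judge ranks outputs much like the human annotators do; a low value is a signal to rework the judge prompt or pick another judge model.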
Meta teams use a fine-tuned Llama model to fix production issues in seconds
One of Meta's engineering teams shared how they use a fine-tuned small Llama (Llama-2-7B, so not even a very recent model) to identify the root cause of production issues with 42% accuracy.
Is 42% not too low?
➡️ Usually, whenever there's an issue in production, engineers dive into recent code changes to find the offending commit. At Meta's scale (thousands of daily changes), this is like finding a needle in a haystack.
💡 So when the LLM-based suggestion is right, it cuts incident resolution time from hours to seconds!
How did they do it?
Two-step approach:
‣ Heuristics (code ownership, directory structure, runtime graphs) reduce thousands of potential changes to a manageable set
‣ Fine-tuned Llama 2 7B ranks the most likely culprits
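The two steps can be sketched as a filter followed by a ranker. Everything below is illustrative and assumed: the change and incident fields, the directory-prefix heuristic, and `keyword_overlap`, which is a deliberately crude stand-in for the fine-tuned model's relevance score. Meta's actual heuristics and prompts are not shown in the post.

```python
def heuristic_filter(changes, incident, max_candidates=20):
    """Step 1: keep changes touching the incident's directory, most recent first."""
    related = [c for c in changes if c["path"].startswith(incident["directory"])]
    related.sort(key=lambda c: c["timestamp"], reverse=True)
    return related[:max_candidates]


def rank_candidates(candidates, incident, score_fn, top_k=5):
    """Step 2: order the shortlist by a model-provided relevance score."""
    return sorted(candidates, key=lambda c: score_fn(c, incident), reverse=True)[:top_k]


def keyword_overlap(change, incident):
    """Hypothetical stand-in for the fine-tuned LLM's score."""
    change_words = set(change["summary"].lower().split())
    incident_words = set(incident["title"].lower().split())
    return len(change_words & incident_words)


# Made-up data for illustration.
changes = [
    {"id": 1, "path": "ads/ranker.py", "timestamp": 10, "summary": "Refactor cache eviction"},
    {"id": 2, "path": "feed/ui.py", "timestamp": 11, "summary": "Fix cache eviction banner"},
    {"id": 3, "path": "ads/cache.py", "timestamp": 12, "summary": "Tune ranker timeout"},
]
incident = {"directory": "ads/", "title": "Cache eviction errors"}

shortlist = heuristic_filter(changes, incident)
print([c["id"] for c in rank_candidates(shortlist, incident, keyword_overlap)])
```

The cheap filter matters because it keeps the expensive ranking step down to a handful of candidates per incident.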
Training pipeline:
‣ Continued pre-training on Meta's internal docs and wikis
‣ Supervised fine-tuning on past incident investigations
‣ Training data mimicked real-world constraints (2-20 potential changes per incident)
Now future developments await:
‣ Language models could handle more of the incident response workflow (runbooks, mitigation, post-mortems)
‣ Improvements in model reasoning should boost accuracy further
How green is your model? Introducing a new feature in the Comparator tool: Environmental Impact for responsible #LLM research! open-llm-leaderboard/comparator Now, you can compare models not only by performance, but also by their environmental footprint!
The Comparator calculates CO₂ emissions during evaluation and shows key model characteristics: evaluation score, number of parameters, architecture, precision, type... 🛠️ Make informed decisions about your model's impact on the planet and join the movement towards greener AI!