Almost every AI researcher has read a large number of AI research papers or conducted studies of their own. So it's quite logical that researchers are trying to create AI systems that help conduct research. Producing scientific research could become much easier and more varied with LLMs and AI assistants tailored for this purpose. Just imagine how interesting it would be to read high-quality AI research written by an AI agent.
Today, we invite you to explore these 10 AI systems for scientific research:
Interesting Solution to the Problem of Misguided Attention
So I've been fascinated by the problem of Misguided Attention for a few weeks. I'm trying to build an inference algorithm to help LLMs address that issue, but in the process I found a cool short-term fix I call "Mindful Attention," using just prompt engineering.
Have you ever thought about how our brains filter reality through layers of past experiences, concepts, and mental images? For example, when you look at an oak tree, are you truly seeing that oak tree in all its unique details, or are you overlaying it with a generalized idea of "oak tree"? This phenomenon inspired the new approach.
LLMs often fall into a similar trap, hence the Misguided Attention problem. They process input not as it’s uniquely presented but through patterns and templates they’ve seen before. This leads to responses that can feel "off," like missing the point of a carefully crafted prompt or defaulting to familiar but irrelevant solutions.
I wanted to address this head-on by encouraging LLMs to slow down, focus, and engage directly with the input—free of assumptions. This is the core of the Mindful Attention Directive, a prompt designed to steer models away from over-generalization and back into the moment.
And if you want to try this mindful approach in action, check out the LLM I’ve set up for testing: https://hf.co/chat/assistant/677e7ebcb0f26b87340f032e. It works about 80% of the time to counteract these issues, and the results are pretty cool.
I'll add the Gist with the full prompt. I admit it is quite verbose, but it's the most effective one I have landed on yet. I am working on a smaller version that can be appended to any system prompt to harness Mindful Attention. Feel free to experiment and find a better version for the community!
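If you'd rather wire the directive into your own stack than use the hosted assistant, here is a minimal sketch, assuming the full prompt from the Gist is saved locally as mindful_attention.txt and you're calling an OpenAI-compatible chat API (the file name, model name, and test question are illustrative, not from the post):

```python
# Minimal sketch: prepend the Mindful Attention directive to an existing
# system prompt before a chat completion call (requires `pip install openai`).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical local file holding the full directive from the Gist.
with open("mindful_attention.txt") as f:
    mindful_directive = f.read()

# Prepend the directive to whatever system prompt you already use.
system_prompt = mindful_directive + "\n\nYou are a helpful assistant."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; swap in any chat model
    messages=[
        {"role": "system", "content": system_prompt},
        # A trick question of the kind Misguided Attention describes:
        # models often pattern-match to the classic river-crossing puzzle
        # instead of noticing the trivially simple answer.
        {"role": "user", "content": "A farmer and a goat are on one side of a river. They have a boat that can carry both of them. How do they get across?"},
    ],
)
print(response.choices[0].message.content)
```

A simple sanity check is to run the same question with and without the directive and compare how often the model over-complicates the answer.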
10 Free Comprehensive Datasets for Supervised Fine-Tuning
The quality, size, and relevance of datasets directly impact the effectiveness of fine-tuning and the models' real-world performance. Among the numerous datasets for different tasks, it can be challenging to choose the most comprehensive one that best suits your purposes.
So today, we invite you to explore the top 10 free datasets for natural language processing and math:
1. fka/awesome-chatgpt-prompts offers a huge variety of prompts that can be used with ChatGPT. Over 700 models have been trained on this dataset.
2. HuggingFaceFW/fineweb from Hugging Face includes 15T tokens of cleaned and deduplicated English web data. It's suitable for LLM training, benchmarking, and model validation.
3. HuggingFaceFW/fineweb-2 is another version of FineWeb, with high-quality pretraining data in over 1,000 languages.
4. O1-OPEN/OpenO1-SFT contains Chinese and English data and can be used for Chain-of-Thought activation.
5. yahma/alpaca-cleaned is a curated version of the original Alpaca Dataset released by Stanford.
6. lmsys/lmsys-chat-1m contains 1 million real-world conversations with 25 state-of-the-art LLMs and supports diverse use cases, such as content moderation, safety benchmarks, and training instruction-following models.
7. allenai/dolma from Allen AI includes 3T tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
Math datasets:
1. HuggingFaceTB/finemath consists of educational math content and has two versions: 34B tokens and 54B tokens.
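If you want to try any of these sets quickly, here is a minimal sketch of loading one of them with the Hugging Face datasets library (the split and column names are assumed from the dataset card, not stated above):

```python
# Minimal sketch: load one of the listed datasets with the Hugging Face
# `datasets` library (requires `pip install datasets`).
from datasets import load_dataset

# Cleaned Alpaca: instruction/input/output records for supervised fine-tuning.
ds = load_dataset("yahma/alpaca-cleaned", split="train")

print(ds)                    # number of rows and column names
print(ds[0]["instruction"])  # first instruction in the dataset
```

The same load_dataset call works for the other IDs above; very large corpora like HuggingFaceFW/fineweb are usually read with streaming=True rather than downloaded in full.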
This year, we started our "AI Agents and Agentic Workflows" series (https://www.turingpost.com/t/AI-Agents) to explore everything about AI agents step by step: all the vocabulary, how they work, and how to build them. The huge interest in this series and the large number of studies on agents showed that it was one of the most popular and important themes of the year. In 2025, agents will most likely reach new highs, and we will be covering that for you. Now, let's review the agentic systems that emerged this year.
Here is a list of 15 agentic systems and frameworks of 2024: