Agents can query databases: this has always been the dream of many data people! Sure, SQL resembles English, so it makes it relatively easy to express your questions and get answers. The key here is relatively. You still need to think about your schema, which tables need to be joined to gather the information you need, recall column names, and follow a precise syntax.
Agents save you that effort. You express your query in English, you get a result. Well, you might still have to wait if you have a lot of data and the query is complex, but during that time you can switch to other tasks. It feels like waving a magic wand and seeing your desires materialise… but it’s not exactly like that. Because both you and LLMs can make mistakes, especially if your schema isn’t clear.
And that’s why I tell people that their schemas should be prepared for agents. Let’s see what this means.
Security
The first thing to address is security. Make sure that agents and tools that read data can only access data they should be able to access, and they can only perform the authorised operations. For example, typically they must not be able to add, modify or delete any data. And if your database contains PII data (such as customer names, anagraphic data and contacts), you typically don’t want your AI to be able to query those data.
In some cases, the data that the agent or tool should be able to see depend on the user. A sales representative might be authorised to see her own customers and performance data, but not her colleagues data. A member of the legal team might be authorised to see contracts that other users shouldn’t see.
Databases have all the features you need to implement this level of security. It’s important that the agent or tool uses a dedicated database user, or if necessary, a database user that is dedicated to the AI’s current human user. Databases support different permissions for every type of supported operations – typically, you only want to grant AI the SELECT permission. That permission can be granted at schema, table, or column level. Implementing row-level permission is slightly more complex, but feasable. If the DBMS you use doesn’t support a specific way to do this, you can still implement row-based permissions using stored procedures and views.
Configuration
If you’re not using a database designed for analytical workloads (such as ClickHouse, Snowflake or Amazon Redshift) you should really configure it to optimise complex, long-running queries. We won’t dig into the details, because they greatly depend on which database you use.
Regardless which technology you use, it’s important to make sure that:
- Your queries timeouts are long enough to permit analytical queries. I often insist that, for OLTP, granting all queries a timeout that is longer than 5 seconds is unreasonable. The user has already left the page, and the only effect of keeping the query running is consuming more database resources. But for analytical workloads, it would be utopic to assume than 10 minutes is a long time.
- On the other hand, enough is enough. A query cannot run for days and make the database unresponsive or too slow for all other users. There should be a reasonable timeout. Some technologies or external tools allow to create more complex rules, and kill the queries that are using too many resources.
Database Design
Every table must express an entity or a relationship between entities, in a standard, clear, and clean way.
When a table is a sparse matrix, the relationships between the objects it contains are unclear. When a table contains multiple entities, its meaning itself is unclear.
Consider Odoo‘s schema. Odoo is a most important open source CRM, and it’s currently growing very fast in Europe. But – sorry to be blunt here – its PostgreSQL schema is poorly designed. Tables often have multiple meanings. Or they have one meaning for the developers, but this doesn’t match the customers’ business objects and processes.
For example, both customers and suppliers were stored in the res_partner table. In older versions, they could be distinguished thanks to boolean columns called customer and supplier. In newer versions, it got worse: to find customers we need to filter by customer_rank > 0, and to find suppliers we need to filter by supplier_rank > 0. While LLMs tend to know this, expect them to make many mistakes when you ask them to write complex queries that involve multiple unclear tables.
Clear Naming
Unclear naming can easily confuse models. If a column tells you how many products of a certain type are in a certain warehouse, that column should be named quantity. It’s also acceptable to call it qty, because it’s a common abbreviation that models have encountered many times during their training. Calling it count can occasionally lead to confusion. Calling it num or n is a great way to get wrong results – and yes, I’ve seen columns called num or n.
The model needs to know the type of columns, too, so it can produce correct syntax ( text_col='text' AND int_col=1). If the typing is confusing, it can lead to errors. When naming is acceptable but not very clear, wrong typing can mislead models. For example, dob usually means Date of Birth, but it can have many other meanings. At least one of which relates to people too: Degree of Burglary. Using the proper type can avoid funny mistakes.
See also my article Why Your Database Deserves Consistent Names and Types.
Comments
Tables, columns, views, and basically every schema object can have a comment. Use comments whenever the meaning of a column isn’t obvious. Use it to clarify acronyms, to indicate the format of text columns, and above all to clarify obscure cases.
Don’t use comments when the meaning of something cannot realistically confuse the reader, avoiding added noise. Remember: humans and models need to focus their attention towards the right things to avoid making mistakes. Noise makes this difficult.
Views
Views don’t eliminate complexity, they add up to existing complexity. The main undesirable consequences of this are that:
- The query planner / optimiser might ignore some indexes in the underlying tables, making your queries unnecessarily slow. In technical terms: views can prevent a predicate pushdown, forcing a view materialisation.
- If models still learn about the underlying tables, they can get confused by multiple levels of complexity.
That said, if used wisely, views can make some queries simpler to create. If the model can avoid writing a JOIN involving many tables, filters and aggregations by using a single view, this will increase the chances that the rest of the query is correct and clean.
You can also use views to hide sensitive information from a model.
Documentation and Data Dictionary
Schemas need to be documented. Agents didn’t eliminate this need, they made it more urgent.
Suppose you ask “which warehouses are critically full?”. Maybe your schema has confusing naming, and warehouses are in a table called store. Maybe “critically full” is a precise concept used by your team, but its explanation can’t be found in the data. Maybe it even has a different meaning for different teams (far from uncommon in medium/big companies).
In the simplest cases, it’s usually desirable to send some schema documentation to the models using the system prompt – the initial message that the agent sends to the model, followed by anything the user wrote. If you have dozens of tables, this is unlikely to be sufficient.
For more complex cases, the system prompt should only include:
- A generic explanation of the business domain and the schema.
- A dictionary of the business terms that might be found in the user’s request. If necessary, this dictionary should vary from team to team.
- Documentation of naming rules that are used consistently across the schema.
- Documentation of the most common tables – the ones that have a good chance of being referenced in a query.
- A list of entity groups – for example: customer entities, inventory entities, payment entities, etc. Entities are tables and views. Some entities might be part of multiple groups.
With this information, the model should understand:
- How to compose very common queries.
- Which groups it needs more information about to fulfill a user request.
The model will be instructed to let the agent know, in a machine-readable format, if it needs informaiton about some entity groups, and which ones. When this happens, the model will be provided with additional documentation. In the AI jargon, these pieces of additional documentation are called resources.
SQL Advanced Features and Dialects
SQL has many advanced features that are incredibly powerful, but not widely used. To make things worse, every dialect has non-standard variations that normally work on one DBMS only. This often confuses models. I’ve seen cases where they suggest using MariaDB temporal tables features on PostgreSQL, or where they use PostgreSQL type conversion syntax in a query for MySQL.
It is a good idea to use a set of agent skills or MCP resources to inform the model about the syntaxes that are supported by the DBMS you use. For example, MariaDB Foundation maintains a set of MariaDB skills that you can easily make available to your agents.
Semantic Layers
There is a movement in the data industry toward Semantic Layers (like dbt Semantic Layer, Cube, LookML). You don’t have to necessarily adopt one of them, as you can build your semantics in a more reliable, cheaper and cleaner way by following the good practices listed here. However, I want to mention that semantic layers exist for the sake of completeness.
Training
This is the last link of the chain, but make no mistake: you want to make sure it’s not the weakest! A technology can’t save humans from using it in a bad way.
It’s important to train AI users. They need to know:
- How to formulate a good question;
- How to make sure the model understands a question;
- How to make sure the model has enough information to emit an informed answer;
- How to make sure the model has made a correct reasoning;
- How to spot hallucinations.
This is important to avoid the trap of AI’s hallucinations.
AI training is not part of our services, but we can direct you to reliable partners.
Conclusions
Many schemas are far from having a clean, self-documenting structure. You need to make sure that models understand user requests and translate them into correct SQL by fixing the schema where needed, and by adding relevant documentation. Changing the schema is better in terms of correctness of the results and token saving, but the required effort is not always reasonable. With clear, precise, well-structured documentation it is still possible to obtain good SQL queries from a model.
We can help you prepare your schemas for agents. Contact us to discuss what we can do.
Federico Razzoli



0 Comments