Speaking SQL: Turning Natural Language into Database Dialogues

Jakia Kaoser
Akash

Engineering Lead

15 min read

Introduction

Bridging Language and Data in the Enterprise Sphere: In the realm of enterprise data management, SQL databases have long stood as the backbone, housing vast quantities of invaluable information. As enterprises increasingly lean on business intelligence (BI) tools to extract insights from these data repositories, the quest for more intuitive and accessible querying methods gains momentum. The emerging solution? Leveraging Large Language Models (LLMs) to interact with SQL databases using natural language.

This innovative approach, powered by the latest advancements in NLP, promises to transform our interaction with data. However, as groundbreaking as it sounds, integrating LLMs into SQL querying is riddled with challenges, making it a complex and intriguing endeavor.

The Challenges of LLMs for SQL Query Generation:

While LLMs possess an impressive grasp of SQL, enabling them to construct complex queries, several non-trivial issues arise in actual implementations:

  • The Hallucination Dilemma: A significant hurdle is the tendency of LLMs to 'hallucinate' - creating fictitious tables or fields, or generating SQL queries that are incompatible with the actual database structure. This misalignment between the model's output and database reality poses a major obstacle in ensuring the validity and reliability of the queries generated.
  • The Context Window Constraint: LLMs operate within a context window, a limitation on the amount of text they can process at a given time. This becomes particularly problematic in the context of SQL databases, which often contain extensive and complex schemas. The challenge lies in effectively grounding the LLM in the database's reality without overwhelming its context window.
  • Error Handling and Unpredictability: Despite their capabilities, LLMs are not infallible. The SQL queries they generate may occasionally be erroneous or yield unexpected results. This inconsistency raises critical questions about dependability and error mitigation strategies. Do we accept these limitations or devise mechanisms to counter them?

The High Level Solution

Mimicking Human Expertise for Enhanced LLM Performance: To effectively address the challenges faced by Large Language Models (LLMs) in generating SQL queries, we can draw inspiration from the strategies employed by human data analysts. By emulating the steps a data analyst takes in querying SQL databases, we can guide LLMs towards more accurate and efficient query generation. Let's explore these human-inspired strategies:

  • Sample Queries and Schema Familiarization: Like a data analyst who first familiarizes themselves with the database by executing sample queries and reviewing table schemas, an LLM can benefit from a similar approach. This involves exposing the LLM to the structure and a snapshot of the data it will be working with. By understanding the actual makeup of the database - its tables, fields, and typical data entries - the LLM can generate queries that are more aligned with the reality of the database's structure.
  • Dealing with Information Overload and Context Window Limitation: Data analysts typically don't analyze all data at once; they often start with a subset, examining top rows or summary statistics. This approach can be mimicked to overcome the LLM's context window limitations. Instead of feeding the LLM with the entire database, we can provide it with a distilled version, highlighting key aspects of the data. This selective exposure ensures the LLM remains 'grounded' in the database's reality without overloading its processing capabilities (a minimal sketch of this idea follows this list).
  • Error Handling and Iterative Learning: Just as a data analyst doesn't abandon their task after encountering an error, LLMs should also be equipped to learn from mistakes. When an LLM generates an incorrect SQL query, mechanisms can be implemented to analyze the error, provide feedback to the model, and iteratively improve the query. This process not only refines the LLM's query-generation capabilities but also enhances its ability to adapt to different databases and query requirements.
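
To make the first two strategies concrete, here is a minimal, framework-agnostic sketch of how such a "distilled" view of a SQLite database - table definitions plus a few sample rows per table - could be assembled for inclusion in an LLM prompt. The function name, row count, and usage are illustrative; the LangChain tooling introduced below automates this kind of grounding for you.

     import sqlite3

     def schema_snapshot(db_path: str, sample_rows: int = 3) -> str:
         """Build a compact, prompt-sized view of a database:
         CREATE statements plus a few sample rows per table."""
         conn = sqlite3.connect(db_path)
         cursor = conn.cursor()
         parts = []
         # Pull table definitions from SQLite's catalogue
         cursor.execute("SELECT name, sql FROM sqlite_master WHERE type = 'table'")
         for name, ddl in cursor.fetchall():
             parts.append(ddl)
             rows = cursor.execute(
                 f"SELECT * FROM {name} LIMIT {sample_rows}"
             ).fetchall()
             parts.append(f"-- first rows of {name}: {rows}")
         conn.close()
         return "\n".join(parts)

     # The resulting string can be prepended to the user's question before
     # the prompt is sent to the LLM.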

The Path Forward:

 

By adopting these human-inspired approaches, we can significantly improve the effectiveness of LLMs in SQL query generation. These strategies not only address the specific challenges of hallucination, context window constraints, and error handling but also pave the way for LLMs to operate more autonomously and reliably in complex data environments. The journey towards integrating LLMs into SQL querying, thus, becomes a blend of technological innovation and cognitive emulation, holding promise for a new era of data interaction.

LangChain Agents

LangChain has a SQL Agent which provides a more flexible way of interacting with SQL Databases than a chain. The main advantages of using the SQL Agent are:

  • It can answer questions based on the database's schema as well as its content (such as describing a specific table).
  • It can recover from errors by running a generated query, catching the traceback, and regenerating the query correctly.
  • It can query the database as many times as needed to answer the user's question.

To initialize the agent we'll use the create_sql_agent constructor. This agent uses the SQLDatabaseToolkit which contains tools to:

  • Create and execute queries
  • Check query syntax
  • Retrieve table descriptions
  • ... and more (the short snippet below prints the full tool set)
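
Once a database connection and an LLM are available (both are created in the walkthrough below), the toolkit's tool set can be inspected directly. A small sketch, assuming db and gpt are defined as in the following steps:

     from langchain.agents.agent_toolkits import SQLDatabaseToolkit

     # Assumes `db` (a SQLDatabase) and `gpt` (a chat model) already exist,
     # as created in the walkthrough below.
     toolkit = SQLDatabaseToolkit(db=db, llm=gpt)
     for tool in toolkit.get_tools():
         print(tool.name, "-", tool.description)
     # Typically lists tools such as sql_db_query, sql_db_schema,
     # sql_db_list_tables and sql_db_query_checker.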

Coding with the Agent

Github Link - here

    Step 1: Creating a Database Schema

    To simulate a real-world scenario, we will create a simple e-commerce database schema encompassing five entities: customer, seller, product, cart, and payment. Each entity will have its own attributes, and we will establish foreign key relationships to introduce complexity. This schema will serve as our testing ground for the LLM's SQL query generation.

    
     import sqlite3
    
    
     db_path = '/content/sample.db'
     conn = sqlite3.connect(db_path)
     cursor = conn.cursor()
    
     script='''
     CREATE TABLE Cart (
         Cart_id TEXT PRIMARY KEY NOT NULL
     );
    
     CREATE TABLE Customer (
         Customer_id TEXT PRIMARY KEY NOT NULL,
         c_pass TEXT NOT NULL,
         Name TEXT NOT NULL,
         Address TEXT NOT NULL,
         Pincode INTEGER NOT NULL,
         Phone_number_s INTEGER NOT NULL,
         Cart_id TEXT NOT NULL,
         FOREIGN KEY(Cart_id) REFERENCES Cart(Cart_id)
     );
    
     CREATE TABLE Seller (
         Seller_id TEXT PRIMARY KEY NOT NULL,
         s_pass TEXT NOT NULL,
         Name TEXT NOT NULL,
         Address TEXT NOT NULL
     );
    
     CREATE TABLE Seller_Phone_num (
         Phone_num INTEGER NOT NULL,
         Seller_id TEXT NOT NULL,
         PRIMARY KEY (Phone_num, Seller_id),
         FOREIGN KEY (Seller_id) REFERENCES Seller(Seller_id) ON DELETE CASCADE
     );
    
     CREATE TABLE Payment (
         payment_id TEXT PRIMARY KEY NOT NULL,
         payment_date DATE NOT NULL,
         Payment_type TEXT NOT NULL,
         Customer_id TEXT NOT NULL,
         Cart_id TEXT NOT NULL,
         total_amount REAL,
         FOREIGN KEY (Customer_id) REFERENCES Customer(Customer_id),
         FOREIGN KEY (Cart_id) REFERENCES Cart(Cart_id)
     );
    
     CREATE TABLE Product (
         Product_id TEXT PRIMARY KEY NOT NULL,
         Type TEXT NOT NULL,
         Color TEXT NOT NULL,
         P_Size TEXT NOT NULL,
         Gender TEXT NOT NULL,
         Commission INTEGER NOT NULL,
         Cost INTEGER NOT NULL,
         Quantity INTEGER NOT NULL,
         Seller_id TEXT,
         FOREIGN KEY (Seller_id) REFERENCES Seller(Seller_id) ON DELETE SET NULL
     );
    
     CREATE TABLE Cart_item (
         Quantity_wished INTEGER NOT NULL,
         Date_Added DATE NOT NULL,
         Cart_id TEXT NOT NULL,
         Product_id TEXT NOT NULL,
         purchased TEXT DEFAULT 'NO',
         PRIMARY KEY (Cart_id, Product_id),
         FOREIGN KEY (Cart_id) REFERENCES Cart(Cart_id),
         FOREIGN KEY (Product_id) REFERENCES Product(Product_id)
     );
    
     '''
    
     cursor.executescript(script)
     conn.commit()
     conn.close()
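
    The tables above start out empty; the sample rows that appear in the agent logs later were inserted separately (see the linked repository). A minimal, illustrative subset of such inserts, consistent with the logged rows, would look like this - the seller password and the cart-item date are placeholder values:

     # Illustrative subset of the seed data (the full inserts are in the linked repo).
     conn = sqlite3.connect(db_path)
     cursor = conn.cursor()
     cursor.executescript('''
     INSERT INTO Cart VALUES ('crt1011'), ('crt1012'), ('crt1013');

     INSERT INTO Customer VALUES
         ('cid100', 'ABCM1235', 'rajat', 'G-453', 632014, 9893135876, 'crt1011'),
         ('cid101', 'XYZ5678',  'john',  'F-789', 560023, 9876543210, 'crt1012'),
         ('cid102', 'PQRS9876', 'jane',  'A-321', 400001, 8765432109, 'crt1013');

     -- Seller password below is a placeholder value
     INSERT INTO Seller VALUES ('sid101', 's_pass_101', 'sara', 'mumbai abc');

     INSERT INTO Product VALUES
         ('pid1002', 'shirt', 'blue', 'L', 'F', 5, 5000, 15, 'sid101');

     -- Put the shirt in john's cart so his cart amount resolves to 5000
     INSERT INTO Cart_item VALUES (1, '2024-01-01', 'crt1012', 'pid1002', 'NO');
     ''')
     conn.commit()
     conn.close()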
                    

    Step 2: Installing Dependencies and Connecting to the Database

    We'll need to install several dependencies to facilitate the interaction between our code and the database, as well as to integrate the LLM:

    • langchain: A toolkit that aids in integrating LLMs with various applications.
    • openai: The OpenAI Python package, essential for interacting with GPT models.
    • langchain-openai: An extension to langchain for OpenAI-specific functionality.
    • langchain-experimental: Provides experimental langchain features and integrations.
    • faiss-cpu: A vector similarity search library, used later for the dynamic few-shot example selector.

    We will use langchain's SQLDatabase wrapper to connect to our SQLite database and ChatOpenAI to load the GPT model that will power the agent.

    
     !pip install -q langchain openai langchain-openai langchain-experimental faiss-cpu
                    
    
     from langchain.agents import create_sql_agent
     from langchain.agents.agent_toolkits import SQLDatabaseToolkit
     from langchain.sql_database import SQLDatabase
     from langchain.agents import AgentExecutor
     from langchain.agents.agent_types import AgentType
     from langchain_openai import ChatOpenAI
                    
    
     from google.colab import userdata
     OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
                    
    
     db = SQLDatabase.from_uri("sqlite:////content/sample.db")
     gpt = ChatOpenAI(openai_api_key=OPENAI_API_KEY,model="gpt-3.5-turbo", temperature=0)
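
    Before wiring up the agent, it is worth sanity-checking the connection; SQLDatabase exposes a few helpers for this:

     # Quick sanity checks on the connection
     print(db.dialect)                    # e.g. 'sqlite'
     print(db.get_usable_table_names())   # tables visible to the agent
     print(db.run("SELECT COUNT(*) FROM Customer;"))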
                    

    Step 3: Setting Up an Agent and Querying

    To utilize the LLM for SQL generation, we will set up an agent executor using langchain's create_sql_agent. This agent will serve as the intermediary between our natural language prompts and the SQL database, translating our queries into executable SQL statements.

    • Creating the SQL Agent: Utilize create_sql_agent to initialize our agent with the necessary configurations to interact with our e-commerce database schema.
    • Executing Queries: We will use the agent's run method (invoke in newer LangChain releases) to input natural language prompts. The agent will then process these prompts, generate SQL queries, and execute them against our database. The output will be the result of these queries, demonstrating the LLM's ability to interact with SQL databases.

    First, we wrap the database connection and the model in a SQLDatabaseToolkit, then pass both to create_sql_agent:

    
     toolkit = SQLDatabaseToolkit(db=db, llm=gpt)
                    
    
     agent_executor = create_sql_agent(
         llm=gpt,
         toolkit=toolkit,
         verbose=True,
         agent_type="openai-tools",
     )
                    
    
     question = "what is the cart amount for customer with customer id cid101"
     agent_executor.run(question)
                    
    
     Logs
    
     Entering new AgentExecutor chain...
    
     Invoking: `sql_db_list_tables` with `{}`
    
    
     Cart, Cart_item, Customer, Payment, Product, Seller, Seller_Phone_num
     Invoking: `sql_db_schema` with `Cart, Customer`
    
     /*
     3 rows from Cart table:
     Cart_id
     crt1011
     crt1012
     crt1013
     */
    
     /*
     3 rows from Customer table:
     Customer_id	c_pass	Name	Address	Pincode	Phone_number_s	Cart_id
     cid100	ABCM1235	rajat	G-453	632014	9893135876	crt1011
     cid101	XYZ5678	john	F-789	560023	9876543210	crt1012
     cid102	PQRS9876	jane	A-321	400001	8765432109	crt1013
     */
     Invoking: `sql_db_query` with `SELECT c.Cart_id, c.Customer_id, c.Name, c.Address, 
     c.Pincode, c.Phone_number_s, SUM(p.Price) AS Cart_Amount FROM Customer c JOIN 
     Cart_item ci ON c.Cart_id = ci.Cart_id JOIN Product p ON ci.Product_id = p.Product_id 
     WHERE c.Customer_id = 'cid101'` 
    
    
     Error: (sqlite3.OperationalError) no such column: p.Price
     [SQL: SELECT c.Cart_id, c.Customer_id, c.Name, c.Address, c.Pincode, 
     c.Phone_number_s, SUM(p.Price) AS Cart_Amount FROM Customer c JOIN Cart_item ci 
     ON c.Cart_id = ci.Cart_id JOIN Product p ON ci.Product_id = p.Product_id WHERE 
     c.Customer_id = 'cid101']
     (Background on this error at: https://sqlalche.me/e/20/e3q8)
     Invoking: `sql_db_schema` with `Product`
    
     /*
     3 rows from Product table:
     Product_id	Type	Color	P_Size	Gender	Commission	Cost	Quantity	Seller_id
     pid1001	jeans	red	32	M	10	10005	20	sid100
     pid1002	shirt	blue	L	F	5	5000	15	sid101
     pid1003	shoes	black	40	M	8	8000	25	sid102
     */
     Invoking: `sql_db_query` with `SELECT c.Cart_id, c.Customer_id, c.Name, c.Address, 
     c.Pincode, c.Phone_number_s, SUM(p.Cost) AS Cart_Amount FROM Customer c JOIN 
     Cart_item ci ON c.Cart_id = ci.Cart_id JOIN Product p ON ci.Product_id = p.Product_id 
     WHERE c.Customer_id = 'cid101'`
    
    
     [('crt1012', 'cid101', 'john', 'F-789', 560023, 9876543210, 5000)]The cart amount 
     for the customer with customer id 'cid101' is 5000.
    
     > Finished chain.
      
    
     "The cart amount for the customer with customer id 'cid101' is 5000."
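
    Note: newer LangChain releases deprecate AgentExecutor.run in favour of invoke, which takes and returns a dict; an equivalent call looks like this:

     # Equivalent call using the newer invoke API
     result = agent_executor.invoke({"input": question})
     print(result["output"])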
                    
                    

Coding Using a Dynamic Few-Shot Prompt

Github Link - here

Implementing a dynamic few-shot prompt involves several steps. This approach helps optimize the performance of our agent by incorporating domain-specific knowledge and context relevant to the user's input. Here's a breakdown of these steps:

    Step 1: Creating Example Pairs

    Firstly, we need a set of user input and corresponding SQL query examples. These examples should cover a range of typical queries related to our e-commerce schema, providing a base for the LLM to learn from.

    
     examples = [
        {"input": "List all customers.", "query": "SELECT * FROM Customer;"},
        {
            "input": "List all the products from a particular seller with seller_id 'sid100'",
            "query": "SELECT Product.Product_id, Product.Type, Product.Color, Product.P_Size, Product.Gender, Product.Commission, Product.Cost, Product.Quantity FROM Product WHERE Product.Seller_id = 'sid100';",
        },
        {
            "input": "List of all sellers  with name address and their phone number",
            "query": "SELECT Seller.Seller_id, Seller.Name AS Seller_Name, Seller_Phone_num.Phone_num AS Seller_Phone_Number FROM Seller JOIN Seller_Phone_num ON Seller.Seller_id = Seller_Phone_num.Seller_id;",
        },
        {
            "input": "List of all the items in the cart of a customer whose Name is 'jane'",
            "query": "SELECT Cart_item.Quantity_wished, Cart_item.Date_Added, Cart_item.Cart_id, Cart_item.Product_id, Cart_item.purchased FROM Cart_item JOIN Customer ON Cart_item.Cart_id = Customer.Cart_id WHERE Customer.Name = 'jane';",
        },
        {
            "input": "List all customers whose address is Canada.",
            "query": "SELECT * FROM Customer WHERE Address = 'Canada';",
        },
        {
            "input": "How many customers have purchased the product with Product_id pid1002",
            "query": "SELECT COUNT(DISTINCT Customer.Customer_id) AS Num_Customers_Purchased FROM Cart_item JOIN Customer ON Cart_item.Cart_id = Customer.Cart_id WHERE Cart_item.Product_id = 'pid1002' AND Cart_item.purchased = 'Y';",
        },
        {
            "input": "Find the total number of Products.",
            "query": "SELECT COUNT(*) FROM Product;",
        },
        {
            "input": "How many customers are there",
            "query": 'SELECT COUNT(*) AS Total_Customers FROM "Customer"',
        },
    ]
                    

    Step 2: Example Selector Creation

    We then create an ExampleSelector, specifically a SemanticSimilarityExampleSelector. This selector uses semantic search to find the most relevant examples from our set based on the user's input.

    • Semantic Search: The selector will analyze the semantics of the user's input and compare it with the example pairs to find the most similar ones.
    • Embeddings and Vector Store: It utilizes embeddings and a vector store to perform this semantic analysis efficiently.
    
     from langchain_community.vectorstores import FAISS
     from langchain_core.example_selectors import SemanticSimilarityExampleSelector
     from langchain_openai import OpenAIEmbeddings
    
     example_selector = SemanticSimilarityExampleSelector.from_examples(
         examples,
         OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY),
         FAISS,
         k=5,
         input_keys=["input"],
     )
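
    To check that the selector retrieves sensible examples, you can call select_examples directly; the sample question here is just an illustrative input:

     # Inspect which few-shot examples would be selected for a sample question
     selected = example_selector.select_examples(
         {"input": "How many products does seller sid100 have?"}
     )
     for example in selected:
         print(example["input"])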
                    

    Step 3: FewShotPromptTemplate Creation

    Next, we create a FewShotPromptTemplate. This template will integrate the selected examples into a structured prompt that the LLM can understand.

    • Components:
      • Example Selector: Incorporates the previously created SemanticSimilarityExampleSelector.
      • Example Prompt Formatting: A format for how each example should be presented within the prompt.
      • Prefix and Suffix: Text to be added before and after the formatted examples to frame the prompt contextually.
    
     from langchain_core.prompts import (
         ChatPromptTemplate,
         FewShotPromptTemplate,
         MessagesPlaceholder,
         PromptTemplate,
         SystemMessagePromptTemplate,
     )
    
     system_prefix = """You are an agent designed to interact with a SQL 
     database.
     Given an input question, create a syntactically correct {dialect} query 
     to run, then look at the results of the query and return the answer.
     Unless the user specifies a specific number of examples they wish to 
     obtain, always limit your query to at most {top_k} results.
     You can order the results by a relevant column to return the most 
     interesting examples in the database.
     Never query for all the columns from a specific table, only ask for the 
     relevant columns given the question. 
     You have access to tools for interacting with the database.
     Only use the given tools. Only use the information returned by the tools 
     to construct your final answer.
     You MUST double check your query before executing it. If you get an 
     error while executing a query, rewrite the query and try again.
     
     DO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP etc.) to 
     the database.
    
     If the question does not seem related to the database, just return "I 
     don't know" as the answer.
    
     Here are some examples of user inputs and their corresponding SQL 
     queries:"""
    
     few_shot_prompt = FewShotPromptTemplate(
         example_selector=example_selector,
         example_prompt=PromptTemplate.from_template(
             "User input: {input}\nSQL query: {query}"
         ),
         input_variables=["input", "dialect", "top_k"],
         prefix=system_prefix,
         suffix="",
     )
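
    To see what the few-shot portion of the prompt will look like for a given question, you can format the template directly; the question, dialect and top_k values here are illustrative:

     # Preview the system prefix plus the selected examples for one question
     print(
         few_shot_prompt.format(
             input="How many customers are there",
             dialect="SQLite",
             top_k=5,
         )
     )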
                    

    Step 4: Constructing the Full Prompt

    As we are using an OpenAI tools agent, our full prompt should be structured as a chat prompt.

    • Human Message Template: This is where the user's natural language input will be placed.
    • Agent Scratchpad: MessagesPlaceholder for the agent's responses and workings.
    • Integration of Few-Shot Prompt: The dynamically created few-shot prompt from Step 3 will be incorporated here, providing context and examples for the agent's processing.
    
     full_prompt = ChatPromptTemplate.from_messages(
         [
             SystemMessagePromptTemplate(prompt=few_shot_prompt),
             ("human", "{input}"),
             MessagesPlaceholder("agent_scratchpad"),
         ]
     )
                    
    
     # Example formatted prompt
     prompt_val = full_prompt.invoke(
         {
             "input": "List of all sellers  with name address and their phone number",
             "top_k": 5,
             "dialect": "SQLite",
             "agent_scratchpad": [],
         }
     )
     print(prompt_val.to_string())
                    
    
     agent = create_sql_agent(
         llm=gpt,
         db=db,
         prompt=full_prompt,
         verbose=True,
         agent_type="openai-tools",
     )
     agent.invoke({"input": "List of all sellers  with name address and their phone number"})
                    
    
    
     output
     > Entering new AgentExecutor chain...
    
     Invoking: `sql_db_query` with `SELECT Seller.Name, Seller.Address, Seller_Phone_num.Phone_num 
     FROM Seller JOIN Seller_Phone_num ON Seller.Seller_id = Seller_Phone_num.Seller_id`
    
    
     [('aman', 'delhi cmc', 9943336206), ('sara', 'mumbai abc', 9876543210), ('peter', 'kolkata xyz', 
     8765432109), ('lisa', 'bangalore pqr', 7654321098), ('mike', 'chennai lmn', 6543210987)]Here is a list 
     of all sellers with their name, address, and phone number:
    
    1. Seller Name: aman
       Address: delhi cmc
       Phone Number: 9943336206
    
    2. Seller Name: sara
       Address: mumbai abc
       Phone Number: 9876543210
    
    3. Seller Name: peter
       Address: kolkata xyz
       Phone Number: 8765432109
    
    4. Seller Name: lisa
       Address: bangalore pqr
       Phone Number: 7654321098
    
    5. Seller Name: mike
       Address: chennai lmn
       Phone Number: 6543210987
        
    
    > Finished chain.
                        

Conclusion

In conclusion, the integration of Large Language Models (LLMs) with SQL databases through dynamic few-shot prompting represents a significant advancement in the field of data management and natural language processing. By emulating human data analysis techniques, such as using example selectors and few-shot prompts, we can significantly enhance the accuracy and relevance of the SQL queries generated by these models. This approach not only addresses the challenges of hallucination and context window limitations but also leverages the evolving capabilities of LLMs to make data querying more intuitive and accessible for a wider range of users. As we continue to refine these methods, the potential for LLMs to revolutionize how we interact with and extract insights from vast data repositories becomes increasingly evident, paving the way for a future where complex data operations are seamlessly conducted through natural language interfaces.

Where AI meets SQL, queries transform into conversations, turning data exploration into a dialogue with knowledge itself.
