Introduction
Bridging Language and Data in the Enterprise Sphere: In the realm of enterprise data management, SQL databases have long stood as the backbone, housing vast quantities of invaluable information. As enterprises increasingly lean on business intelligence (BI) tools to extract insights from these data repositories, the quest for more intuitive and accessible querying methods gains momentum. The emerging solution? Leveraging Large Language Models (LLMs) to interact with SQL databases using natural language. This innovative approach, powered by the latest advancements in NLP, promises to transform our interaction with data. However, as groundbreaking as it sounds, integrating LLMs into SQL querying is riddled with challenges, making it a complex and intriguing endeavor.
The Challenges of LLMs for SQL Query Generation:
While LLMs possess an impressive grasp of SQL, enabling them to construct complex queries, several non-trivial issues arise in actual implementations:
- The Hallucination Dilemma: A significant hurdle is the tendency of LLMs to 'hallucinate' - creating fictitious tables or fields, or generating SQL queries that are incompatible with the actual database structure. This misalignment between the model's output and database reality poses a major obstacle in ensuring the validity and reliability of the queries generated.
- The Context Window Constraint: LLMs operate within a context window, a limitation on the amount of text they can process at a given time. This becomes particularly problematic in the context of SQL databases, which often contain extensive and complex schemas. The challenge lies in effectively grounding the LLM in the database's reality without overwhelming its context window.
- Error Handling and Unpredictability: Despite their capabilities, LLMs are not infallible. The SQL queries they generate may occasionally be erroneous or yield unexpected results. This inconsistency raises critical questions about dependability and error mitigation: do we accept these limitations, or devise mechanisms to counter them?
The High-Level Solution
Mimicking Human Expertise for Enhanced LLM Performance: To effectively address the challenges faced by Large Language Models (LLMs) in generating SQL queries, we can draw inspiration from the strategies employed by human data analysts. By emulating the steps a data analyst takes in querying SQL databases, we can guide LLMs towards more accurate and efficient query generation. Let's explore these human-inspired strategies:
- Sample Queries and Schema Familiarization: Like a data analyst who first familiarizes themselves with the database by executing sample queries and reviewing table schemas, an LLM can benefit from a similar approach. This involves exposing the LLM to the structure and a snapshot of the data it will be working with. By understanding the actual makeup of the database - its tables, fields, and typical data entries - the LLM can generate queries that are more aligned with the reality of the database's structure.
- Dealing with Information Overload and Context Window Limitations: Data analysts typically don't analyze all data at once; they often start with a subset, examining top rows or summary statistics. This approach can be mimicked to overcome the LLM's context window limitations. Instead of feeding the LLM the entire database, we can provide it with a distilled version highlighting key aspects of the data. This selective exposure keeps the LLM 'grounded' in the database's reality without overloading its processing capabilities (a minimal sketch of this idea follows this list).
- Error Handling and Iterative Learning: Just as a data analyst doesn't abandon their task after encountering an error, LLMs should also be equipped to learn from mistakes. When an LLM generates an incorrect SQL query, mechanisms can be implemented to analyze the error, provide feedback to the model, and iteratively improve the query. This process not only refines the LLM's query-generation capabilities but also enhances its ability to adapt to different databases and query requirements.
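To make the first two ideas concrete, here is a minimal sketch, not from the original post, of how one might distill a SQLite database into a compact context of table schemas plus a few sample rows; distill_database is a hypothetical helper written for illustration:

import sqlite3

def distill_database(db_path: str, sample_rows: int = 3) -> str:
    # Build a compact prompt context: each table's DDL plus a few sample rows
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute("SELECT name, sql FROM sqlite_master WHERE type = 'table'")
    parts = []
    for name, ddl in cursor.fetchall():
        rows = cursor.execute(f'SELECT * FROM "{name}" LIMIT {sample_rows}').fetchall()
        parts.append(f"{ddl}\n/* sample rows: {rows} */")
    conn.close()
    return "\n\n".join(parts)

# Prepend the returned string to the LLM prompt to ground it in the real
# schema without sending the whole database.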
The Path Forward:
By adopting these human-inspired approaches, we can significantly improve the effectiveness of LLMs in SQL query generation. These strategies not only address the specific challenges of hallucination, context window constraints, and error handling but also pave the way for LLMs to operate more autonomously and reliably in complex data environments. The journey towards integrating LLMs into SQL querying, thus, becomes a blend of technological innovation and cognitive emulation, holding promise for a new era of data interaction.
LangChain Agents
LangChain has a SQL Agent which provides a more flexible way of interacting with SQL Databases than a chain. The main advantages of using the SQL Agent are:
- It can answer questions based on the database's schema as well as on its content (like describing a specific table).
- It can recover from errors by running a generated query, catching the traceback and regenerating it correctly.
- It can query the database as many times as needed to answer the user question.
To initialize the agent we'll use the create_sql_agent constructor. This agent uses the SQLDatabaseToolkit, which contains tools to:
- List the tables in the database
- Fetch the schema and sample rows for given tables
- Execute SQL queries and return the results
- Check a query's syntax before executing it
(These tools are named sql_db_list_tables, sql_db_schema, sql_db_query, and sql_db_query_checker; the first three appear in the logs below.)
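As a quick illustration, here is a minimal sketch of inspecting the toolkit's tools directly; it assumes the db and gpt objects created in the coding section below:

toolkit = SQLDatabaseToolkit(db=db, llm=gpt)
for tool in toolkit.get_tools():
    print(tool.name)
# sql_db_query, sql_db_schema, sql_db_list_tables, sql_db_query_checker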
Coding with an Agent
Github Link - here
Step 1: Creating a Database Schema
To simulate a real-world scenario, we will create a simple e-commerce database schema built around five core entities: customer, seller, product, cart, and payment, plus supporting tables for seller phone numbers and cart items. Each entity has its own attributes, and foreign key relationships introduce realistic complexity. This schema will serve as our testing ground for the LLM's SQL query generation.
import sqlite3
db_path = '/content/sample.db'
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
script='''
CREATE TABLE Cart (
Cart_id TEXT PRIMARY KEY NOT NULL
);
CREATE TABLE Customer (
Customer_id TEXT PRIMARY KEY NOT NULL,
c_pass TEXT NOT NULL,
Name TEXT NOT NULL,
Address TEXT NOT NULL,
Pincode INTEGER NOT NULL,
Phone_number_s INTEGER NOT NULL,
Cart_id TEXT NOT NULL,
FOREIGN KEY(Cart_id) REFERENCES Cart(Cart_id)
);
CREATE TABLE Seller (
Seller_id TEXT PRIMARY KEY NOT NULL,
s_pass TEXT NOT NULL,
Name TEXT NOT NULL,
Address TEXT NOT NULL
);
CREATE TABLE Seller_Phone_num (
Phone_num INTEGER NOT NULL,
Seller_id TEXT NOT NULL,
PRIMARY KEY (Phone_num, Seller_id),
FOREIGN KEY (Seller_id) REFERENCES Seller(Seller_id) ON DELETE CASCADE
);
CREATE TABLE Payment (
payment_id TEXT PRIMARY KEY NOT NULL,
payment_date DATE NOT NULL,
Payment_type TEXT NOT NULL,
Customer_id TEXT NOT NULL,
Cart_id TEXT NOT NULL,
total_amount REAL,
FOREIGN KEY (Customer_id) REFERENCES Customer(Customer_id),
FOREIGN KEY (Cart_id) REFERENCES Cart(Cart_id)
);
CREATE TABLE Product (
Product_id TEXT PRIMARY KEY NOT NULL,
Type TEXT NOT NULL,
Color TEXT NOT NULL,
P_Size TEXT NOT NULL,
Gender TEXT NOT NULL,
Commission INTEGER NOT NULL,
Cost INTEGER NOT NULL,
Quantity INTEGER NOT NULL,
Seller_id TEXT,
FOREIGN KEY (Seller_id) REFERENCES Seller(Seller_id) ON DELETE SET NULL
);
CREATE TABLE Cart_item (
Quantity_wished INTEGER NOT NULL,
Date_Added DATE NOT NULL,
Cart_id TEXT NOT NULL,
Product_id TEXT NOT NULL,
purchased TEXT DEFAULT 'NO',
PRIMARY KEY (Cart_id, Product_id),
FOREIGN KEY (Cart_id) REFERENCES Cart(Cart_id),
FOREIGN KEY (Product_id) REFERENCES Product(Product_id)
);
'''
cursor.executescript(script)
conn.commit()
conn.close()
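The walkthrough below assumes the tables already contain data, but the original post does not show the inserts. Here is a minimal seeding sketch: the Customer, Product, and Seller values mirror the rows that appear in the agent logs later, while the passwords, the seller ids for 'lisa' and 'mike', and the Cart_item row are illustrative assumptions chosen to reproduce the logged results:

conn = sqlite3.connect(db_path)
cursor = conn.cursor()
cursor.executescript('''
INSERT INTO Cart VALUES ('crt1011'), ('crt1012'), ('crt1013');
INSERT INTO Customer VALUES
    ('cid100', 'ABCM1235', 'rajat', 'G-453', 632014, 9893135876, 'crt1011'),
    ('cid101', 'XYZ5678', 'john', 'F-789', 560023, 9876543210, 'crt1012'),
    ('cid102', 'PQRS9876', 'jane', 'A-321', 400001, 8765432109, 'crt1013');
-- Seller passwords and the ids for lisa/mike are assumptions
INSERT INTO Seller VALUES
    ('sid100', 'pass100', 'aman', 'delhi cmc'),
    ('sid101', 'pass101', 'sara', 'mumbai abc'),
    ('sid102', 'pass102', 'peter', 'kolkata xyz'),
    ('sid103', 'pass103', 'lisa', 'bangalore pqr'),
    ('sid104', 'pass104', 'mike', 'chennai lmn');
INSERT INTO Seller_Phone_num VALUES
    (9943336206, 'sid100'), (9876543210, 'sid101'), (8765432109, 'sid102'),
    (7654321098, 'sid103'), (6543210987, 'sid104');
INSERT INTO Product VALUES
    ('pid1001', 'jeans', 'red', '32', 'M', 10, 10005, 20, 'sid100'),
    ('pid1002', 'shirt', 'blue', 'L', 'F', 5, 5000, 15, 'sid101'),
    ('pid1003', 'shoes', 'black', '40', 'M', 8, 8000, 25, 'sid102');
-- One item in john's cart so that cid101's cart amount is 5000, as in the logs
INSERT INTO Cart_item VALUES (1, '2024-01-10', 'crt1012', 'pid1002', 'NO');
''')
conn.commit()
conn.close()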
Step 2: Installing Dependencies and Connecting to the Database
We'll need to install several dependencies to facilitate the interaction between our code and the database, as well as to integrate the LLM:
- langchain: A toolkit that aids in integrating LLMs with various applications.
- openai: The OpenAI Python package, essential for interacting with GPT models.
- langchain-openai: An extension to langchain for OpenAI-specific functionalities.
- langchain-experimental: Provides experimental langchain features that have not yet been promoted to the core library.
We will use the SQLDatabase wrapper provided by langchain to connect to our SQL database, and ChatOpenAI to load the model.
!pip install -q langchain openai langchain-openai langchain-experimental faiss-cpu
from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.sql_database import SQLDatabase
from langchain_openai import ChatOpenAI  # maintained import path for ChatOpenAI
from google.colab import userdata
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
db = SQLDatabase.from_uri("sqlite:////content/sample.db")
gpt = ChatOpenAI(openai_api_key=OPENAI_API_KEY,model="gpt-3.5-turbo", temperature=0)
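As an optional sanity check, we can confirm that the wrapper sees the tables created in Step 1:

print(db.dialect)                   # sqlite
print(db.get_usable_table_names())  # the seven tables from Step 1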
Step 3: Setting Up an Agent and Querying
To utilize the LLM for SQL generation, we will set up an agent executor using langchain's create_sql_agent. This agent will serve as the intermediary between our natural language prompts and the SQL database, translating our queries into executable SQL statements.
- Creating the SQL Agent: Utilize create_sql_agent to initialize our agent with the necessary configurations to interact with our e-commerce database schema.
- Executing Queries: We input natural language prompts by running the agent (via its run or invoke method). The agent then processes these prompts, generates SQL queries, and executes them against our database. The output is the result of these queries, demonstrating the LLM's ability to interact with SQL databases.
With the database connection and model in place, we create the toolkit and the agent:
toolkit = SQLDatabaseToolkit(db=db, llm=gpt)
agent_executor = create_sql_agent(
llm=gpt,
toolkit=toolkit,
verbose=True,
agent_type="openai-tools",
)
question = "what is the cart amount for customer with customer id cid101"
agent_executor.run(question)
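Note that on newer LangChain releases run is deprecated in favor of invoke; the equivalent call would be:

agent_executor.invoke({"input": question})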
Logs
> Entering new AgentExecutor chain...
Invoking: `sql_db_list_tables` with `{}`
Cart, Cart_item, Customer, Payment, Product, Seller, Seller_Phone_num
Invoking: `sql_db_schema` with `Cart, Customer`
/*
3 rows from Cart table:
Cart_id
crt1011
crt1012
crt1013
*/
/*
3 rows from Customer table:
Customer_id c_pass Name Address Pincode Phone_number_s Cart_id
cid100 ABCM1235 rajat G-453 632014 9893135876 crt1011
cid101 XYZ5678 john F-789 560023 9876543210 crt1012
cid102 PQRS9876 jane A-321 400001 8765432109 crt1013
*/
Invoking: `sql_db_query` with `SELECT c.Cart_id, c.Customer_id, c.Name, c.Address,
c.Pincode, c.Phone_number_s, SUM(p.Price) AS Cart_Amount FROM Customer c JOIN
Cart_item ci ON c.Cart_id = ci.Cart_id JOIN Product p ON ci.Product_id = p.Product_id
WHERE c.Customer_id = 'cid101'`
Error: (sqlite3.OperationalError) no such column: p.Price
[SQL: SELECT c.Cart_id, c.Customer_id, c.Name, c.Address, c.Pincode,
c.Phone_number_s, SUM(p.Price) AS Cart_Amount FROM Customer c JOIN Cart_item ci
ON c.Cart_id = ci.Cart_id JOIN Product p ON ci.Product_id = p.Product_id WHERE
c.Customer_id = 'cid101']
(Background on this error at: https://sqlalche.me/e/20/e3q8)
Invoking: `sql_db_schema` with `Product`
/*
3 rows from Product table:
Product_id Type Color P_Size Gender Commission Cost Quantity Seller_id
pid1001 jeans red 32 M 10 10005 20 sid100
pid1002 shirt blue L F 5 5000 15 sid101
pid1003 shoes black 40 M 8 8000 25 sid102
*/
Invoking: `sql_db_query` with `SELECT c.Cart_id, c.Customer_id, c.Name, c.Address,
c.Pincode, c.Phone_number_s, SUM(p.Cost) AS Cart_Amount FROM Customer c JOIN
Cart_item ci ON c.Cart_id = ci.Cart_id JOIN Product p ON ci.Product_id = p.Product_id
WHERE c.Customer_id = 'cid101'`
[('crt1012', 'cid101', 'john', 'F-789', 560023, 9876543210, 5000)]The cart amount
for the customer with customer id 'cid101' is 5000.
> Finished chain.
"The cart amount for the customer with customer id 'cid101' is 5000."
Coding Using a Dynamic Few-Shot Prompt
Github Link - here
Implementing a dynamic few-shot prompt involves several steps. This approach helps optimize the performance of our agent by incorporating domain-specific knowledge and context relevant to the user's input. Here's a breakdown of these steps:
Step 1: Creating Example Pairs
Firstly, we need a set of user input and corresponding SQL query examples. These examples should cover a range of typical queries related to our e-commerce schema, providing a base for the LLM to learn from.
examples = [
    {"input": "List all customers.", "query": "SELECT * FROM Customer;"},
    {
        "input": "List all the products from a particular seller with seller_id 'sid100'",
        "query": "SELECT Product.Product_id, Product.Type, Product.Color, Product.P_Size, Product.Gender, Product.Commission, Product.Cost, Product.Quantity FROM Product WHERE Product.Seller_id = 'sid100';",
    },
    {
        "input": "List of all sellers with name address and their phone number",
        "query": "SELECT Seller.Seller_id, Seller.Name AS Seller_Name, Seller.Address, Seller_Phone_num.Phone_num AS Seller_Phone_Number FROM Seller JOIN Seller_Phone_num ON Seller.Seller_id = Seller_Phone_num.Seller_id;",
    },
    {
        # Cart has no Customer_id column, so the join goes through Customer.Cart_id
        "input": "List of all the items in the cart of a customer whose Name is 'jane'",
        "query": "SELECT Cart_item.Quantity_wished, Cart_item.Date_Added, Cart_item.Cart_id, Cart_item.Product_id, Cart_item.purchased FROM Cart_item JOIN Customer ON Cart_item.Cart_id = Customer.Cart_id WHERE Customer.Name = 'jane';",
    },
    {
        # Customer has no Country column; Address is the closest match in this schema
        "input": "List all customers whose address is Canada.",
        "query": "SELECT * FROM Customer WHERE Address = 'Canada';",
    },
    {
        # Counting customers requires joining Cart_item back to Customer via Cart_id;
        # 'YES' assumes the affirmative counterpart of the schema's DEFAULT 'NO'
        "input": "How many customers have purchased the product with Product_id pid1002",
        "query": "SELECT COUNT(DISTINCT Customer.Customer_id) AS Num_Customers_Purchased FROM Cart_item JOIN Customer ON Cart_item.Cart_id = Customer.Cart_id WHERE Cart_item.Product_id = 'pid1002' AND Cart_item.purchased = 'YES';",
    },
    {
        "input": "Find the total number of Products.",
        "query": "SELECT COUNT(*) FROM Product;",
    },
    {
        "input": "How many customers are there",
        "query": 'SELECT COUNT(*) AS Total_Customers FROM "Customer";',
    },
]
Step 2: Example Selector Creation
We then create an ExampleSelector, specifically a SemanticSimilarityExampleSelector. This selector uses semantic search to find the most relevant examples from our set based on the user's input.
- Semantic Search: The selector will analyze the semantics of the user's input and compare it with the example pairs to find the most similar ones.
- Embeddings and Vector Store: It utilizes embeddings and a vector store to perform this semantic analysis efficiently.
from langchain_community.vectorstores import FAISS
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_openai import OpenAIEmbeddings
example_selector = SemanticSimilarityExampleSelector.from_examples(
examples,
OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY),
FAISS,
k=5,
input_keys=["input"],
)
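To see which examples the selector actually retrieves for a question, you can call it directly; this quick check uses a hypothetical input:

selected = example_selector.select_examples(
    {"input": "How many products does the store have?"}
)
for example in selected:
    print(example["input"])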
Step 3: FewShotPromptTemplate Creation
Next, we create a FewShotPromptTemplate. This template will integrate the selected examples into a structured prompt that the LLM can understand.
- Components:
- Example Selector: Incorporates the previously created SemanticSimilarityExampleSelector.
- Example Prompt Formatting: A format for how each example should be presented within the prompt.
- Prefix and Suffix: Text to be added before and after the formatted examples to frame the prompt contextually.
from langchain_core.prompts import (
ChatPromptTemplate,
FewShotPromptTemplate,
MessagesPlaceholder,
PromptTemplate,
SystemMessagePromptTemplate,
)
system_prefix = """You are an agent designed to interact with a SQL database.
Given an input question, create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer.
Unless the user specifies a specific number of examples they wish to obtain, always limit your query to at most {top_k} results.
You can order the results by a relevant column to return the most interesting examples in the database.
Never query for all the columns from a specific table, only ask for the relevant columns given the question.
You have access to tools for interacting with the database.
Only use the given tools. Only use the information returned by the tools to construct your final answer.
You MUST double check your query before executing it. If you get an error while executing a query, rewrite the query and try again.
DO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP etc.) to the database.
If the question does not seem related to the database, just return "I don't know" as the answer.

Here are some examples of user inputs and their corresponding SQL queries:"""
few_shot_prompt = FewShotPromptTemplate(
example_selector=example_selector,
example_prompt=PromptTemplate.from_template(
"User input: {input}\nSQL query: {query}"
),
input_variables=["input", "dialect", "top_k"],
prefix=system_prefix,
suffix="",
)
Step 4: Constructing the Full Prompt
As we are using an OpenAI tools agent, our full prompt should be structured as a chat prompt.
- Human Message Template: This is where the user's natural language input will be placed.
- Agent Scratchpad: MessagesPlaceholder for the agent's responses and workings.
- Integration of Few-Shot Prompt: The dynamically created few-shot prompt from Step 3 will be incorporated here, providing context and examples for the agent's processing.
full_prompt = ChatPromptTemplate.from_messages(
[
SystemMessagePromptTemplate(prompt=few_shot_prompt),
("human", "{input}"),
MessagesPlaceholder("agent_scratchpad"),
]
)
# Example formatted prompt
prompt_val = full_prompt.invoke(
{
"input": "List of all sellers with name address and their phone number",
"top_k": 5,
"dialect": "SQLite",
"agent_scratchpad": [],
}
)
print(prompt_val.to_string())
agent = create_sql_agent(
llm=gpt,
db=db,
prompt=full_prompt,
verbose=True,
agent_type="openai-tools",
)
agent.invoke({"input": "List of all sellers with name address and their phone number"})
Output
> Entering new AgentExecutor chain...
Invoking: `sql_db_query` with `SELECT Seller.Name, Seller.Address, Seller_Phone_num.Phone_num
FROM Seller JOIN Seller_Phone_num ON Seller.Seller_id = Seller_Phone_num.Seller_id`
[('aman', 'delhi cmc', 9943336206), ('sara', 'mumbai abc', 9876543210), ('peter', 'kolkata xyz',
8765432109), ('lisa', 'bangalore pqr', 7654321098), ('mike', 'chennai lmn', 6543210987)]Here is a list
of all sellers with their name, address, and phone number:
1. Seller Name: aman
Address: delhi cmc
Phone Number: 9943336206
2. Seller Name: sara
Address: mumbai abc
Phone Number: 9876543210
3. Seller Name: peter
Address: kolkata xyz
Phone Number: 8765432109
4. Seller Name: lisa
Address: bangalore pqr
Phone Number: 7654321098
5. Seller Name: mike
Address: chennai lmn
Phone Number: 6543210987
> Finished chain.
Conclusion
In conclusion, the integration of Large Language Models (LLMs) with SQL databases through dynamic few-shot prompting represents a significant advancement in the field of data management and natural language processing. By emulating human data analysis techniques, such as using example selectors and few-shot prompts, we can significantly enhance the accuracy and relevance of the SQL queries generated by these models. This approach not only addresses the challenges of hallucination and context window limitations but also leverages the evolving capabilities of LLMs to make data querying more intuitive and accessible for a wider range of users. As we continue to refine these methods, the potential for LLMs to revolutionize how we interact with and extract insights from vast data repositories becomes increasingly evident, paving the way for a future where complex data operations are seamlessly conducted through natural language interfaces.
Where AI meets SQL, queries transform into conversations, turning data exploration into a dialogue with knowledge itself.