Unlocking the Power of Your Unstructured Data: Building a RAG Application with Snowflake Cortex

Published on February 10, 2025

Information access is paramount in today’s fast-paced business environment, whether the underlying data is structured or unstructured. Imagine unlocking the insights hidden within your organization’s vast stores of unstructured data—from internal documentation and knowledge bases to customer communications and research reports!

Snowflake makes organizing and accessing the previously untapped goldmine of unstructured data possible, turning it into a powerful driver of efficiency and innovation. This post explores one application of this capability: building a Retrieval Augmented Generation (RAG) application using Snowflake Cortex. We’ll transform an internal knowledge base into an intelligent, readily available resource, demonstrating principles applicable to all your unstructured data.

Many organizations struggle with managing unstructured information. Whether internal documentation is scattered across various systems or valuable insights are buried within customer feedback, this data often remains inaccessible and underutilized. A RAG application offers a powerful solution by combining the capabilities of large language models (LLMs) with your data, creating a dynamic and easily searchable knowledge repository. In this post, we’ll walk through leveraging Snowflake’s robust platform to enable conversations with your unstructured data.

Our example RAG application, “DAS42 DocsBot,” follows a clear and compelling architecture:

1. Confluence Data Extraction: We use Confluence’s API to extract internal documentation (policies, procedures, etc.). The initial extraction pulls all Confluence pages as PDFs; subsequent runs retrieve only updated or changed pages, ensuring we work with the most current information while optimizing performance. We store the space_id, page_id, and page_title as filename elements when generating these PDFs. These are important for building what the app’s end user sees later on; a sketch of the incremental pull appears below.
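
For illustration, here is a minimal sketch of that incremental pull, assuming Confluence Cloud’s REST API with a CQL last-modified filter. The instance URL, credentials, and the PDF-export step itself are placeholders to adapt to your site:

import requests

# Hypothetical instance URL and credentials; adapt to your Confluence site.
BASE = "https://your_confluence_space_name.atlassian.net/wiki"
AUTH = ("you@example.com", "your_api_token")

def changed_pages(since: str, limit: int = 50):
    """Yield pages modified on or after `since` (a CQL date like '2025/02/09')."""
    start = 0
    while True:
        resp = requests.get(
            f"{BASE}/rest/api/content/search",
            params={
                "cql": f'type=page and lastmodified >= "{since}"',
                "expand": "space",
                "start": start,
                "limit": limit,
            },
            auth=AUTH,
        )
        resp.raise_for_status()
        data = resp.json()
        yield from data["results"]
        if data["size"] < limit:  # no more result pages
            return
        start += data["size"]

for page in changed_pages("2025/02/09"):
    # Encode space_id, page_id, and page_title into the file path so the
    # CREATE TABLE step can parse them back out of relative_path. Adjust the
    # segment layout to match the split_part indexes used in that SQL.
    filename = f'{page["space"]["id"]}/{page["id"]}/{page["title"]}.pdf'
    # The PDF export of each page is elided here; write it to `filename`.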

2. Staging PDFs in Snowflake: These PDFs are then stored in a named internal stage in Snowflake. Overwriting existing PDFs ensures only the latest versions are kept. We recognize there are use cases for maintaining a complete history of both current and past versions of Confluence pages, which would enable change tracking over time; for our use case, keeping only the current version met the requirement.
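
Staging the files is a single PUT per batch. Here is a minimal sketch using the Snowflake Python connector (connection parameters are placeholders); the final ALTER STAGE refresh updates the stage’s directory table, which step 4 reads from:

import snowflake.connector

# Hypothetical connection parameters; adapt to your account and role.
conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="...",
    warehouse="your_warehouse", database="your_db", schema="your_schema",
)
cur = conn.cursor()

# OVERWRITE=TRUE keeps only the latest version of each page;
# AUTO_COMPRESS=FALSE leaves the PDFs directly readable in the stage.
# In practice each file is uploaded under its space_id/page_id/... path
# so relative_path can be parsed later.
cur.execute(
    "PUT file:///local/exports/*.pdf @your_documents_stage_namespace "
    "AUTO_COMPRESS=FALSE OVERWRITE=TRUE"
)

# Refresh the directory table so directory(@stage) sees the new files.
cur.execute("ALTER STAGE your_documents_stage_namespace REFRESH")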

3. Text Extraction with Snowpark: The “CREATE OR REPLACE TABLE” statement we discuss below executes a Snowpark (Python) function that extracts the text content from the PDFs in the Snowflake stage, using Snowflake’s compute resources for efficient processing. Below is not the entire Snowpark function but an excerpt highlighting the core logic: it reads the contents of a PDF file, breaks larger pages into smaller segments, and stores those “chunks” of text as values in the table we later create.

import pandas as pd
# RecursiveCharacterTextSplitter comes from LangChain (available in Snowflake's
# Anaconda channel); it splits on natural boundaries before falling back to size.
from langchain.text_splitter import RecursiveCharacterTextSplitter

def process(self, file_url: str):
    # Extract the raw text from the staged PDF (see read_pdf below).
    text = self.read_pdf(file_url)

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size = 2000,  # Adjust this as needed
        chunk_overlap = 300, # Overlap to keep chunks contextual
        length_function = len
    )
    chunks = text_splitter.split_text(text)
    # Emit one row per chunk; Snowflake materializes these as table rows.
    df = pd.DataFrame(chunks, columns=['chunk'])
    yield from df.itertuples(index=False, name=None)
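
The read_pdf helper isn’t shown in the excerpt above. A minimal sketch of what it might look like, assuming PyPDF2 for parsing and Snowpark’s SnowflakeFile for reading staged files via their scoped URLs:

from io import BytesIO

from PyPDF2 import PdfReader
from snowflake.snowpark.files import SnowflakeFile

def read_pdf(self, file_url: str) -> str:
    # Open the staged file via its scoped URL and load it into memory.
    with SnowflakeFile.open(file_url, 'rb') as f:
        buffer = BytesIO(f.read())
    reader = PdfReader(buffer)
    # Concatenate the extracted text of every page.
    return "\n".join(page.extract_text() or "" for page in reader.pages)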

4. Table Creation and Link Generation: A “CREATE OR REPLACE TABLE” statement executes the Snowpark function referenced above, extracting each PDF’s text and storing it in a Snowflake table. Importantly, this step also parses page IDs and creates clickable links to the original Confluence pages, providing context and enabling users to verify the information quickly. In the example code below, you’ll see references to the space_id, page_id, and page_title elements discussed earlier:

create or replace table your_db.your_schema.your_table as
select
    relative_path,
    split_part(relative_path, '/', 1) as space_id,
    split_part(relative_path, '/', 2) as page_id,
    'https://your_confluence_space_name.atlassian.net/wiki/spaces/' || space_id || '/pages/' || page_id as clickable_url,
    replace(split_part(relative_path, '/', 4), '.pdf') as page_title,
    concat(relative_path, ': ', func.chunk) as chunk,
    'English' as language
from
    directory(@your_documents_stage_namespace),
    table(your_db.your_schema.your_text_extracting_function(
        build_scoped_file_url(@your_documents_stage_namespace, relative_path)
    )) as func;
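
A note on the select list above: clickable_url references the space_id and page_id aliases defined a few lines earlier. Snowflake supports these lateral column aliases, which keeps the statement compact; on a platform without them, you would repeat the split_part expressions inline.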

5. Snowflake Cortex Search Service: We then create a Snowflake Cortex Search Service on the table. This service parses and indexes the text data, enabling fast and relevant search. It’s like having a search engine for your unstructured data, built within Snowflake’s secure ecosystem!

create or replace cortex search service your_db.your_schema.your_search_service
on chunk
attributes language
warehouse = your_warehouse
target_lag = '24 hours' -- set to what works for you; we knew we'd only extract from Confluence daily anyway
as (
    select
        chunk,
        page_title,
        clickable_url,
        language
    from your_db.your_schema.your_table
);
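
With the service in place, any Python client can query it. A minimal sketch using the snowflake.core API, assuming an active Snowpark session and the placeholder names from above:

from snowflake.core import Root
from snowflake.snowpark.context import get_active_session

session = get_active_session()

# Handle to the search service created above.
svc = (
    Root(session)
    .databases["your_db"]
    .schemas["your_schema"]
    .cortex_search_services["your_search_service"]
)

# Retrieve the most relevant chunks, plus the columns the UI needs.
resp = svc.search(
    query="What is our policy on reimbursements?",
    columns=["chunk", "page_title", "clickable_url"],
    limit=5,
)
for result in resp.results:
    print(result["page_title"], result["clickable_url"])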

6. Streamlit in Snowflake Application: A Streamlit in Snowflake application serves as the user interface. Users ask questions in natural language (e.g., “What is our policy on reimbursements?”).

  • The user’s question is passed to a Snowflake Cortex LLM, which generates a search query.
  • That query is sent to the Cortex Search Service, which retrieves the most relevant text blocks from the Confluence page extracts.
  • The Streamlit application then sends the user’s question, any existing chat history, the retrieved text blocks, and a prompt to a Snowflake Cortex LLM. The prompt instructs Cortex to generate a concise and informative answer summary.
  • The summarized answer and links to the source Confluence pages (which we parsed via SQL in our CREATE OR REPLACE TABLE statement earlier) are presented to the user. A stripped-down sketch of this loop appears below.
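
Tying it all together, here is a minimal sketch of the chat loop, assuming the snowflake.cortex Complete helper; the model name, prompt wording, and single-turn flow (no chat history) are illustrative simplifications:

import streamlit as st
from snowflake.core import Root
from snowflake.cortex import Complete
from snowflake.snowpark.context import get_active_session

session = get_active_session()
# Same search-service handle as in the previous snippet.
svc = (
    Root(session)
    .databases["your_db"]
    .schemas["your_schema"]
    .cortex_search_services["your_search_service"]
)

st.title("DAS42 DocsBot")

if question := st.chat_input("Ask about our internal docs"):
    # Retrieve the most relevant chunks from the Cortex Search Service.
    hits = svc.search(
        query=question,
        columns=["chunk", "page_title", "clickable_url"],
        limit=5,
    ).results

    # Ground the LLM in the retrieved text; chat history is omitted for brevity.
    context = "\n\n".join(hit["chunk"] for hit in hits)
    prompt = (
        "Answer the question using only the context below. Be concise.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    answer = Complete("mistral-large2", prompt)

    st.write(answer)
    # Link back to the source Confluence pages parsed earlier in SQL.
    for hit in hits:
        st.markdown(f"[{hit['page_title']}]({hit['clickable_url']})")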

The Snowflake Advantage: Simplifying Unstructured Data

This solution showcases Snowflake’s power in handling unstructured data. In this case, we used Snowflake tools to build a robust and scalable RAG application to answer questions on our internal documentation. From storage to text extraction, indexing, and search, Snowflake’s Cortex Search Service and LLMs combine to unlock our unstructured data, making it a readily accessible resource for everyone.

Key Benefits

  • Simplicity and Maintenance: The Snowflake data platform keeps all the moving parts of a complex use case ‘under one roof’. There are fewer points of failure, and all relevant objects are defined in familiar SQL syntax.
  • Centralized Knowledge: All internal documentation is consolidated into a searchable repository.
  • Increased Efficiency: Employees can quickly locate needed information, saving time and boosting productivity.
  • Improved Accuracy: Grounding LLM responses in actual data ensures accuracy and minimizes hallucinations.
  • Scalability and Security: Snowflake’s platform provides scalability and robust security for sensitive data.

Call to Action

We encourage you to explore your potential with Snowflake and unlock the value in your unstructured data. Contact us to learn how we can help you become a truly data-driven organization.
