
Apache Doris and DeepSeek: Redefining Intelligent Data Analytics

Feb 23, 2025


Apache Doris, a high-performance real-time data warehouse, has become a standout in the OLAP (Online Analytical Processing) field in recent years. With its powerful query performance, real-time data processing capabilities, and flexible data models, it is widely applied in scenarios such as business intelligence, user behavior analysis, and real-time monitoring.

The rise of DeepSeek, on the other hand, has injected new vitality into data analysis. Through DeepSeek’s natural language processing capabilities, users can directly query data using everyday language, greatly reducing the threshold for data analysis.

This article focuses on the in-depth integration of Apache Doris and DeepSeek, analyzing its technical implementation, optimization strategies, application scenarios, and future trends. We hope it helps you understand the potential of this combination and find a practical path suited to your own business.

Apache Doris in-depth analysis: The “data accelerator” in the OLAP field

1.1 Core features of Apache Doris

Apache Doris is a real-time data warehouse based on the MPP (Massively Parallel Processing) architecture, designed specifically for high-concurrency complex queries and data analysis. The following are its core features:

  • High-performance query:
  • Doris adopts a distributed architecture, splitting query tasks and executing them in parallel across multiple nodes, greatly improving query speed.
  • It supports sub-second query response and is suitable for high-concurrency point queries and complex analytical queries.
  • Real-time data processing:
  • Supports real-time data writing and updating, with strong data timeliness.
  • Achieves real-time data import through methods such as Stream Load and Routine Load.
  • Ease of use:
  • Compatible with the MySQL protocol and supports standard SQL, making it easy for developers and analysts to get started (see the connection sketch after this list).
  • Provides a visual management tool, Doris Manager, for easy cluster operation and maintenance.
  • High availability:
  • A multi-replica mechanism ensures data reliability and supports automatic failure recovery.
  • The cluster scales flexibly and supports dynamic scaling.
  • Flexible data model:
  • Supports multiple data models, including Aggregate Key (aggregation model), Unique Key (unique key model), and Duplicate Key (detail model).
  • Adapts to different business scenarios, such as aggregation analysis and detailed queries.
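Because Doris is MySQL-protocol compatible, any MySQL client or driver can run queries against it. Below is a minimal sketch using the pymysql driver; the host, credentials, and database name are placeholders, and 9030 is assumed to be the default FE query port in your environment.

import pymysql

# Connect to the Doris FE through the MySQL protocol (placeholder connection info)
conn = pymysql.connect(host="doris-fe-host", port=9030,
                       user="root", password="", database="business_db")
try:
    with conn.cursor() as cursor:
        # Any standard SQL works; here we simply list the tables in the database
        cursor.execute("SHOW TABLES")
        for row in cursor.fetchall():
            print(row)
finally:
    conn.close()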

1.2 Technical architecture of Doris

The architecture of Doris is mainly divided into the following modules:

  • Frontend (FE):
  • Responsible for receiving SQL requests from users, parsing them, and optimizing query plans.
  • Manages metadata, including table structures, partition information, etc.
  • Backend (BE):
  • Responsible for data storage and calculation, and executes specific query tasks.
  • Data is stored in columnar format and supports vectorization and SIMD optimization to improve query efficiency.

1.3 Typical application scenarios of Doris

  • Business Intelligence (BI):
  • Generates real-time reports and dashboards and supports multi-dimensional analysis.
  • For example, analyzing indicators such as sales, user growth, and geographical distribution.
  • User behavior analysis:
  • Analyzes user behaviors such as clicks, purchases, and browsing to optimize the product experience.
  • For example, e-commerce platforms analyze user shopping paths to optimize recommendation systems.
  • Real-time monitoring:
  • Monitors system performance and business indicators and promptly detects anomalies.
  • For example, the manufacturing industry monitors production line status and triggers alarms.
  • Data lake acceleration:
  • Serves as a query acceleration layer for data lakes to improve query efficiency.
  • For example, in combination with the Hadoop ecosystem, it accelerates queries in Hive or Spark.

DeepSeek and Apache Doris integration: The future of intelligent data analysis

2.1 Why integrate DeepSeek and Doris?

In traditional data analysis, users need to be proficient in SQL to extract information from Doris. However, SQL has a fairly steep learning curve, and for business personnel the learning cost is significant. DeepSeek breaks down this barrier: it understands natural language and translates users’ colloquial questions into SQL queries, making data analysis as simple as chatting.

2.2 Integration methods of LLM and Doris

  • Natural Language Query (NL2SQL):
  • Users ask questions in everyday language, such as “What was the sales amount in the last month?”
  • The LLM automatically generates an SQL query to extract the answer from Doris (a minimal sketch follows this list).
  • Data insight generation:
  • The LLM analyzes the query results of Doris and generates understandable insights and suggestions.
  • For example, analyzing the reasons for the decline in sales and proposing optimization suggestions.
  • Automated reporting:
  • The LLM automatically generates a complete report, including charts and text descriptions, based on Doris data.
  • For example, generating a weekly sales report, including sales trends, popular products, etc.
  • Vector retrieval capability (under development):
  • With Doris’s own vector retrieval capabilities, there is no need to introduce separate vector libraries such as Faiss or Milvus.
  • This reduces the overall cost of the architecture and unifies the data in a single component.
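The NL2SQL flow described above can be sketched in a few lines: build a prompt containing the relevant schema and the user's question, ask DeepSeek for a SQL statement, and run it against Doris. This is only an illustrative sketch, not a reference implementation: the DeepSeek-compatible endpoint, the model name, the example schema, and the run_on_doris helper are assumptions.

from openai import OpenAI
import pymysql

# DeepSeek exposes an OpenAI-compatible API; endpoint and model name are assumptions here
llm = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.deepseek.com")

SCHEMA_HINT = "sales(order_date DATE, product_id INT, sales_amount DECIMAL)"  # assumed example table

def question_to_sql(question: str) -> str:
    # Ask the model to translate the question into a single Doris-compatible SQL statement
    resp = llm.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system",
             "content": f"Translate the question into one SQL query for Apache Doris. "
                        f"Schema: {SCHEMA_HINT}. Return only the SQL."},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content.strip()

def run_on_doris(sql: str):
    # Execute the generated SQL through the MySQL protocol (placeholder connection info)
    conn = pymysql.connect(host="doris-fe-host", port=9030,
                           user="root", password="", database="business_db")
    with conn.cursor() as cur:
        cur.execute(sql)
        return cur.fetchall()

print(run_on_doris(question_to_sql("What was the sales amount in the last month?")))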

2.3 Advantages after integration

  • Lower the threshold:
  • Business personnel can conduct data analysis without learning SQL.
  • Improve efficiency:
  • Automatically generating SQL and reports saves a lot of time.
  • Enhance insights:
  • The LLM can dig deeper insights from Doris data to assist in decision — making.
  • Unify the technology stack:
  • The combination of a real-time data warehouse and vector retrieval keeps the architecture simple and data updates efficient.

Technical implementation: External system integration based on the RAG architecture

3.1 System architecture design

3.1.1 Current version

3.1.2 Future version

3.2 Core workflow

3.3 Key implementation details

  • Unified storage architecture:
  • Create a metadata index in the Doris cluster.
CREATE TABLE metadata_index (
    table_name VARCHAR(64),
    column_name VARCHAR(64),
    description TEXT,
    INDEX idx_desc(description) USING INVERTED
)
DUPLICATE KEY(table_name, column_name)
DISTRIBUTED BY HASH(table_name) BUCKETS 20;
  • Metadata management (pseudo-code; a retrieval sketch follows below):
def sync_metadata():
    # Get metadata from the Doris system table
    metadata = doris.execute("""
        SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, COMMENT
        FROM information_schema.columns
        WHERE TABLE_SCHEMA = 'business_db'
    """)
    # Sync to the metadata index table
    doris.load_data(
        table="metadata_index",
        data=metadata,
        format="json"
    )
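With the metadata synchronized, the RAG step is to retrieve only the tables and columns relevant to a question and feed them to the model as context. The sketch below reuses the same pseudo doris handle and assumes the inverted index on description supports MATCH_ANY keyword matching; adjust the predicate to whatever retrieval style your Doris version supports, and do not interpolate untrusted input in production.

def retrieve_schema_context(question_keywords: str, limit: int = 20) -> str:
    # Use the inverted index on `description` to find columns related to the question
    rows = doris.execute(f"""
        SELECT table_name, column_name, description
        FROM metadata_index
        WHERE description MATCH_ANY '{question_keywords}'
        LIMIT {limit}
    """)
    # Turn the matching rows into a compact schema hint for the LLM prompt
    return "\n".join(f"{r[0]}.{r[1]}: {r[2]}" for r in rows)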

3.4 Error retry and self-healing mechanism

  • Intelligent retry process (pseudo-code):
MAX_RETRY = 3

def execute_with_retry(query, context):
    for _ in range(MAX_RETRY):
        sql = deepseek.generate_sql(query, context)
        try:
            result = doris.execute(sql)
            return result
        except DorisException as e:
            # Feed the error and the failing SQL back into the context for the next attempt
            error_msg = parse_error(e)
            context.append(f"Last error: {error_msg}\nError SQL: {sql}")
    # All attempts failed
    raise RetryExceededError()
  • Error type handling (a parse_error sketch follows this list):
  • Correction Prompt example:
Known error: Unknown column 'sales_amout' in field list
Original SQL: SELECT sales_amout FROM sales
Table structure: sales(sales_amount, order_date)
Please correct the SQL and explain the reason
  • Retry effect monitoring:
-- Record retry statistics information
CREATE TABLE retry_stats (
    query_id BIGINT,
    retry_count INT,
    error_type VARCHAR(32),
    final_state VARCHAR(16)
)
DUPLICATE KEY(query_id)
DISTRIBUTED BY HASH(query_id) BUCKETS 32;
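The parse_error helper referenced in the retry loop is not spelled out above. A plausible sketch is to classify the Doris error message into a coarse type, so that the correction prompt and the retry_stats table both get a consistent label; the message substrings used below are assumptions based on typical MySQL-style errors.

def parse_error(exc) -> str:
    # Map a Doris/MySQL-style error message to a coarse error type (assumed substrings)
    msg = str(exc)
    if "Unknown column" in msg:
        error_type = "unknown_column"
    elif "syntax error" in msg.lower():
        error_type = "syntax_error"
    elif "timeout" in msg.lower():
        error_type = "timeout"
    else:
        error_type = "other"
    # Return both pieces so the caller can log to retry_stats and build the correction prompt
    return f"[{error_type}] {msg}"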

3.5 Architecture evolution plan

It is expected that the vector retrieval capability will be fully integrated in July.

Performance optimization: Making intelligent analysis faster and more accurate

4.1 Query acceleration technology

  • Vectorized execution:
  • Utilize Doris’s vectorized engine to accelerate calculations.
  • Optimize operations such as aggregation and filtering through SIMD instructions.
  • Caching mechanism:
  • Establish a two-layer cache: a cache of SQL results generated by the LLM plus a cache of common user questions.
  • Use Redis to cache hot query results and set a TTL for automatic expiration (a caching sketch follows this list).
  • Pre-calculation optimization:
CREATE MATERIALIZED VIEW sales_summary
REFRESH COMPLETE ON SCHEDULE EVERY 1 minute
AS
SELECT product_id, SUM(sales_amount), COUNT(*)
FROM sales
GROUP BY product_id;
  • Create materialized views for high-frequency queries and refresh them every minute.
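A minimal sketch of the two-layer cache described above, using redis-py: the natural-language question and the generated SQL are hashed into keys, results are stored as JSON, and a TTL handles expiration. The key naming, the local Redis instance, and the 5-minute TTL are assumptions for illustration.

import hashlib
import json
import redis

r = redis.Redis(host="localhost", port=6379)  # placeholder Redis instance
TTL_SECONDS = 300  # assumed 5-minute expiration

def cached_query(question: str, generate_sql, run_sql):
    # First layer: has this exact question been answered recently?
    qkey = "nl:" + hashlib.sha256(question.encode()).hexdigest()
    hit = r.get(qkey)
    if hit:
        return json.loads(hit)
    # Second layer: has the generated SQL been executed recently?
    sql = generate_sql(question)
    skey = "sql:" + hashlib.sha256(sql.encode()).hexdigest()
    hit = r.get(skey)
    if hit:
        result = json.loads(hit)
    else:
        result = run_sql(sql)
        r.setex(skey, TTL_SECONDS, json.dumps(result))
    r.setex(qkey, TTL_SECONDS, json.dumps(result))
    return result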

4.2 LLM response optimization

  • Streaming output:
  • Start data pre-fetching while the SQL is being generated, so results stream out while computation is still in progress.
  • Result pagination:
def paginate_result(data, page_size=100):
    return [data[i:i + page_size] for i in range(0, len(data), page_size)]
  • Automatically paginate large result sets to avoid overloading single queries.
  • Multi-modal output:
  • Present results as a combination of text, charts (through Matplotlib/Plotly), and voice (a simple charting sketch follows this list).
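For the chart part of the multi-modal output, a small Matplotlib sketch is enough to turn a result set into an image; the column layout and the output path are placeholder assumptions.

import matplotlib.pyplot as plt

def render_bar_chart(rows, out_path="sales_by_product.png"):
    # rows is assumed to be (product_id, total_sales) pairs returned by a Doris query
    labels = [str(r[0]) for r in rows]
    values = [float(r[1]) for r in rows]
    plt.figure(figsize=(8, 4))
    plt.bar(labels, values)
    plt.xlabel("product_id")
    plt.ylabel("sales_amount")
    plt.tight_layout()
    plt.savefig(out_path)
    plt.close()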

4.3 Resource isolation

  • Workload management:
-- Set up a dedicated resource group
create workload group if not exists metadata_group
properties (
    "cpu_share" = "1024",
    "memory_limit" = "50%",
    "enable_memory_overcommit" = "false"
);
-- Set user permissions to use the resource group
GRANT USAGE_PRIV ON WORKLOAD GROUP 'metadata_group' TO 'deepseek_doris'@'%';
  • Allocate an independent resource group for LLM queries to avoid affecting regular BI queries.
  • Dynamic throttling:
  • Automatically adjust the LLM query concurrency based on cluster load.
  • Use the token bucket algorithm to control the request rate (a sketch follows this list).
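The token bucket throttle mentioned above can be sketched in a few lines of Python; the bucket capacity and refill rate are placeholder values that would in practice be tuned from cluster load metrics.

import time

class TokenBucket:
    def __init__(self, rate_per_sec: float = 10.0, capacity: int = 20):
        self.rate = rate_per_sec      # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill tokens based on elapsed time, then try to consume one
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket()
if not bucket.allow():
    raise RuntimeError("LLM query rate limit exceeded, please retry later")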

Application scenarios: Practical cases of intelligent data analysis

5.1 E-commerce scenario: Intelligent operation assistant

  • User question:
  • “Help me analyze the sales situation of the mobile phone category in East China last week and find out the reasons for the decline in sales.”
  • System response:
  1. Automatically generate SQL for period-over-period comparison and competitor analysis.
  2. After extracting data from Doris, the LLM identifies that “logistics delays have led to an increase in the return rate.”
  3. Generate visual charts + text reports.

5.2 Financial scenario: Real-time risk control Q&A

  • Implemented functions:
  • “How many abnormal transactions were there in the last hour?” → Real-time scanning of Doris stream data.
  • “Generate an anti — money laundering report” → Automatically associate multiple risk control tables.
  • Performance indicators:
  • The average response time from asking a question to generating a report is < 3 seconds.
  • Supports more than 50 concurrent risk queries.

5.3 IoT scenario: Intelligent device diagnosis

  • Typical application:
-- Automatically generated SQL for device failure analysis
SELECT device_id,
       AVG(temperature) AS avg_temp,
       COUNT(error_code) AS error_count
FROM iot_metrics
WHERE ts > NOW() - INTERVAL 1 HOUR
GROUP BY device_id
HAVING avg_temp > 100 OR error_count > 5;
  • Execution effect:
  • Query response for tens of millions of device data is < 500ms.
  • Automatically generate maintenance suggestion work orders.

Challenges and solutions

6.1 Natural language ambiguity handling

  • Problem example:
  • “Show Beijing sales data” → Need to clarify whether it refers to Beijing city or the Beijing branch.
  • Solution:
def clarify_intent(question):
    if "Beijing" in question:
        return ["Please confirm whether it refers to Beijing city or the Beijing branch?"]
  • Build a business term dictionary.
  • Design a clarification dialogue process.

6.2 Complex query optimization

  • Challenge:
  • Multi-table JOINs and nested queries may lead to inefficient SQL generation.
  • Optimization strategy:
  • Automatically detect query patterns and recommend creating materialized views.
  • Prompt the user to simplify the question when a query would join more than 3 tables.

6.3 Data security and privacy

  • Protection measures:
GRANT SELECT_PRIV(col1, col2) ON ctl.db.tbl TO marketing; -- The marketing department can only access some fields
  • Automatically mask sensitive fields (such as mobile phone numbers → 138****5678); a masking sketch follows this list.
  • Apply tiered authorization to query results, controlled through Doris’s permission system.
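Sensitive-field masking can happen after the data leaves Doris and before it reaches the LLM or the user. A minimal sketch for mobile phone numbers in the 138****5678 style, assuming 11-digit numbers embedded in result text:

import re

def mask_phone_numbers(text: str) -> str:
    # Keep the first 3 and last 4 digits of an 11-digit mobile number, mask the middle
    return re.sub(r"\b(\d{3})\d{4}(\d{4})\b", r"\1****\2", text)

print(mask_phone_numbers("Contact: 13812345678"))  # Contact: 138****5678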

Future outlook: The evolution direction of intelligent data analysis

7.1 Multi-modal interaction upgrade

  • Voice Q&A:
  • Support voice commands such as “Hey Doris, tell me about yesterday’s sales situation.”
  • AR data visualization:
  • Present 3D data maps through AR/MR devices.

7.2 Automated decision-making loop

  • Intelligent triggering:
if "sales have decreased by more than 10%" in analysis_result:
trigger_alert(email = "ceo@company.com")
generate_promotion_plan() # Automatically generate a promotion plan
  • When the LLM detects an anomaly, automatically trigger business processes.

7.3 Continuous learning mechanism

  • Feedback loop:
  • User 👍/👎 evaluations of answers are automatically used for model fine-tuning.
  • Schema dynamic adaptation:
  • When the Doris table structure changes, automatically update the LLM’s knowledge base.

7.4 More expectations

  • Automatic query profile analysis to optimize SQL.
  • Coverage of data governance and exploration.
  • Intelligent operation and maintenance of the Doris cluster.
  • ……

Conclusion

The in-depth integration of Apache Doris and DeepSeek is redefining the boundaries of data analysis. This combination not only realizes the transformation from “humans adapting to machines” to “machines understanding humans”, but also unleashes the true value of data assets through intelligent data processing and natural interaction. With the continuous evolution of technology, we look forward to seeing more enterprises achieve a qualitative leap in data-driven decision-making through this intelligent analysis combination.
