Preface: Creating a Personal Intelligence Officer

As a developer, keeping up with the rapidly changing tech world is a daily challenge. High-quality tech news, blogs, and research reports are scattered across the internet, while existing news aggregators often fall short: they are cluttered with ads, driven by opaque recommendation algorithms, or lack integration with personal knowledge bases.

So I decided to build my own AI-powered news intelligence officer: astro-news-bot. It isn't a simple web scraper but an end-to-end, fully automated, AI-driven news processing and publishing pipeline that delivers curated content straight to my Astro-built personal blog, geyuxu.com.

After several rounds of development and iteration, the system now runs stably, automatically processing about 31 articles a day across 7 tech categories. This article details astro-news-bot's design philosophy, technical implementation, Astro blog integration, and engineering practices.

Overall Architecture: A Modular Data Processing Pipeline

Design Principles

During the initial architecture design, I followed several core principles:

  1. Modularity: Each function (crawling, deduplication, AI processing, publishing) should be an independent module, easy to maintain and replace
  2. Automation: The entire process requires no manual intervention, achieving “set once, run forever”
  3. Idempotency: Running the same task again produces the same result, with no duplicate output or side effects
  4. Scalability: Easy to add new news sources or processing steps

Technical Architecture

Based on these principles, I designed a linear data processing pipeline:

RSS Sources → Fetcher → Vector Dedup → AI Summary → Markdown Generation → Git Publishing → Blog Deployment

Core module structure:

astro-news-bot/
├── news_bot/
│   ├── fetcher.py      # News fetching
│   ├── dedup.py        # Vector deduplication
│   ├── summarizer.py   # AI summarization
│   ├── writer.py       # Markdown generation
│   ├── selector.py     # News filtering
│   ├── publisher.py    # Git publishing
│   └── job.py          # Workflow orchestration
├── config.json         # Configuration file
├── requirements.txt    # Dependencies
└── run_daily_news.sh   # Execution script

Data Flow Design

  1. News Fetching → Multi-source crawling → raw_{date}.json
  2. Vector Deduplication → Semantic similarity filtering → dedup_{date}.json
  3. AI Summarization → GPT-4o generates Chinese summaries → summary_{date}.json
  4. Markdown Generation → Organized by category → news_{date}.md
  5. Git Publishing → Push to blog repository → Trigger automatic deployment
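
Each stage writes a date-stamped artifact and the next stage reads it, so re-running the pipeline for the same date simply regenerates the same files — this is where the idempotency principle cashes out. Below is a simplified sketch of the orchestration in job.py; the module layout matches the tree above, but the exact function names are my reconstruction:

# job.py — simplified orchestration sketch (function names are illustrative)
from news_bot.fetcher import fetch_articles
from news_bot.dedup import NewsDeduplicator
from news_bot.summarizer import NewsSummarizer
from news_bot.writer import write_markdown
from news_bot.publisher import NewsPublisher

def run(date: str, dry_run: bool = False):
    articles = fetch_articles(date)                    # -> raw_{date}.json
    unique = NewsDeduplicator().deduplicate(articles)  # -> dedup_{date}.json

    summarizer = NewsSummarizer()
    summaries = [{**a, **summarizer.process_article(a)} for a in unique]
    #                                                  # -> summary_{date}.json

    write_markdown(summaries, date)                    # -> news_{date}.md
    if not dry_run:
        NewsPublisher('/path/to/blog/repo').publish(f'news: daily digest {date}')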

Core Technical Implementation

1. Vector Deduplication: Beyond Simple Title Matching

In the early stages, I used article titles or URLs for deduplication, but quickly discovered issues:

  • Different news sources use different titles for the same event
  • URLs may differ due to tracking parameters

The solution is semantic-based vector deduplication:

# dedup.py core implementation
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

class NewsDeduplicator:
    def __init__(self, similarity_threshold=0.85):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.threshold = similarity_threshold
        
    def deduplicate(self, articles):
        if not articles:
            return []
            
        # Extract title text
        titles = [article['title'] for article in articles]
        
        # Generate vector embeddings
        embeddings = self.model.encode(titles)
        
        # Calculate similarity matrix
        similarity_matrix = cosine_similarity(embeddings)
        
        # Deduplication logic
        to_keep = []
        for i, article in enumerate(articles):
            is_duplicate = False
            for j in to_keep:
                if similarity_matrix[i][j] > self.threshold:
                    is_duplicate = True
                    break
            if not is_duplicate:
                to_keep.append(i)
                
        return [articles[i] for i in to_keep]

This method effectively identifies articles that “say the same thing in different ways,” far more accurate than keyword matching.
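
A quick sanity check with made-up headlines illustrates the intended behavior (the exact similarity scores depend on the embedding model, so treat the expected output as approximate):

# Hypothetical usage — headlines are invented for illustration
dedup = NewsDeduplicator(similarity_threshold=0.85)
articles = [
    {'title': 'OpenAI releases GPT-4o with real-time voice mode'},
    {'title': 'GPT-4o launched by OpenAI, featuring real-time voice'},
    {'title': 'Tesla expands robotaxi pilot to San Francisco'},
]
print(len(dedup.deduplicate(articles)))  # expected: 2 — the GPT-4o pair collapses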

2. AI Summarization and Classification: The Art of Prompt Engineering

The quality of AI summarization and classification directly determines the value of the final output. The key isn't which LLM you choose, but how you design the prompts:

# summarizer.py core prompt
import json

from openai import OpenAI

SUMMARY_PROMPT = """
You are a professional tech news editor specializing in organizing news summaries for developers.

Please generate for the following news:
1. A Chinese summary of no more than 100 words, summarizing core information
2. Select the most appropriate category from: Artificial Intelligence, Mobile Technology, Autonomous Driving, Cloud Computing, Chip Technology, Venture Capital, Cybersecurity, Blockchain, Scientific Research, Other Tech

News content:
Title: {title}
Description: {description}
Source: {source}

Please return in JSON format:
{{
  "summary": "Summary content",
  "category": "Category name",
  "tags": ["Tag1", "Tag2", "Tag3"]
}}
"""

class NewsSummarizer:
    def __init__(self):
        self.client = OpenAI()
        
    def process_article(self, article):
        prompt = SUMMARY_PROMPT.format(**article)
        
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=500,
            temperature=0.3
        )
        
        return json.loads(response.choices[0].message.content)

A clearly defined role, detailed instructions, and an explicit output format are what keep the AI's output stable and high quality.
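
One practical wrinkle: json.loads fails if the model wraps its reply in a Markdown code fence. OpenAI's JSON mode (response_format={"type": "json_object"}) largely prevents this; failing that, a defensive parser is a cheap safeguard — a hedged sketch, not necessarily what runs in production:

# Defensive JSON extraction for LLM replies — illustrative helper
import json
import re

def parse_llm_json(raw: str) -> dict:
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span, stripping code fences etc.
        match = re.search(r'\{.*\}', raw, re.DOTALL)
        if match:
            return json.loads(match.group(0))
        raise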

3. GitOps Publishing: Reliable Automated Deployment

Why choose Git instead of API operations on the blog backend?

  • Atomicity and Traceability: Every content update is a Git commit, providing clear change records
  • Decoupling and Security: The bot only needs Git repository write permissions, no need to expose blog backend credentials
  • Leverage Existing CI/CD: Reuse Git-triggered CI/CD pipelines from platforms like Vercel/Netlify

# publisher.py core implementation
import subprocess

class NewsPublisher:
    def __init__(self, blog_repo_path):
        self.repo_path = blog_repo_path

    def publish(self, commit_message):
        # Passing cwd= instead of calling os.chdir keeps the process
        # working directory untouched between runs
        repo = self.repo_path
        try:
            # Pull latest code
            subprocess.run(['git', 'pull'], cwd=repo, check=True)

            # Stage new files
            subprocess.run(['git', 'add', '.'], cwd=repo, check=True)

            # Check for staged changes (--exit-code returns 0 when there are none)
            result = subprocess.run(['git', 'diff', '--cached', '--exit-code'],
                                    cwd=repo, capture_output=True)
            if result.returncode == 0:
                print("No changes to commit")
                return

            # Commit and push
            subprocess.run(['git', 'commit', '-m', commit_message],
                           cwd=repo, check=True)
            subprocess.run(['git', 'push'], cwd=repo, check=True)

            print(f"Successfully published: {commit_message}")

        except subprocess.CalledProcessError as e:
            print(f"Git operation failed: {e}")

Astro Blog Integration Modifications

To achieve seamless integration between astro-news-bot and the Astro blog, minimal but crucial modifications to the blog project are needed:

1. Define News Content Collection

// src/content/config.ts
import { defineCollection, z } from 'astro:content';
import { glob } from 'astro/loaders';

const news = defineCollection({
  loader: glob({ base: './src/content/news', pattern: '**/*.md' }),
  schema: z.object({
    title: z.string(),
    description: z.string().optional(),
    date: z.string().optional(),
    pubDate: z.string().optional(),
    tags: z.array(z.string()).optional(),
    layout: z.string().optional(),
  }),
});

export const collections = {
  'news': news,
  // ... other collections
};

2. Create LatestNews Component

---
// src/components/LatestNews.astro
import { getCollection } from 'astro:content';

const newsEntries = await getCollection('news');

let latestNews = null;
if (newsEntries && newsEntries.length > 0) {
  latestNews = newsEntries
    .filter(entry => entry.data && (entry.data.date || entry.data.pubDate))
    .sort((a, b) => {
      const dateA = new Date(a.data.date || a.data.pubDate);
      const dateB = new Date(b.data.date || b.data.pubDate);
      return dateB.getTime() - dateA.getTime();
    })
    .slice(0, 1)[0];
}
---

{latestNews && (
  <div class="latest-news">
    <h3>📰 Latest News</h3>
    <div class="news-item">
      <h4>{latestNews.data.title}</h4>
      <p>{latestNews.data.description}</p>
      <a href={`/news/${latestNews.data.date || latestNews.data.pubDate}`}>
        Read More →
      </a>
    </div>
  </div>
)}

<style>
  .latest-news {
    border: 1px solid #e1e5e9;
    border-radius: 8px;
    padding: 1.5rem;
    margin: 1rem 0;
    background: #f8fafc;
  }
  
  .news-item h4 {
    margin: 0 0 0.5rem 0;
    color: #1a1a1a;
  }
  
  .news-item p {
    color: #666;
    margin: 0 0 1rem 0;
  }
  
  .news-item a {
    color: #2563eb;
    text-decoration: none;
    font-weight: 500;
  }
</style>

3. Fix Dynamic Route Rendering

Because the glob loader in Astro 5.x changes the structure of entry objects, the dynamic route needs a few adaptations:

---
// src/pages/news/[date].astro
import { getCollection, getEntry } from 'astro:content';

export async function getStaticPaths() {
  const newsEntries = await getCollection('news');
  
  return newsEntries
    .filter(entry => entry.data && (entry.data.date || entry.data.pubDate))
    .map(entry => ({
      params: { 
        date: entry.data.date || entry.data.pubDate 
      },
      props: { 
        entryId: entry.id,  // Use entry.id instead of slug
        dateParam: entry.data.date || entry.data.pubDate
      }
    }));
}

const { entryId } = Astro.props;
const post = await getEntry('news', entryId);

if (!post) {
  throw new Error(`No news entry found for entryId: ${entryId}`);
}
---

<html>
  <body>
    <main>
      <h1>{post.data.title}</h1>
      <!-- Use pre-rendered content -->
      <div set:html={post.rendered.html}></div>
    </main>
  </body>
</html>

Key modification points:

  • Use entry.id instead of entry.slug (slug is undefined in glob loader)
  • Use post.rendered.html to get pre-rendered content
  • Get complete entry object via getEntry during page rendering

Diverse Execution Methods

To adapt to different deployment environments, I designed multiple execution methods:

1. Direct Execution (Development & Debugging)

# Complete workflow
python -m news_bot.job --date $(date +%Y-%m-%d)

# Dry run mode (skip publishing)
python -m news_bot.job --date 2025-07-25 --dry-run
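
Under the hood, these flags map onto a small CLI entry point. Here is a minimal sketch of how news_bot/job.py might parse them — only the --date and --dry-run flags are confirmed above; the rest is illustrative:

# job.py CLI entry — minimal sketch
import argparse
from datetime import date

def main():
    parser = argparse.ArgumentParser(prog='news_bot.job')
    parser.add_argument('--date', default=date.today().isoformat(),
                        help='target date in YYYY-MM-DD format')
    parser.add_argument('--dry-run', action='store_true',
                        help='run the full pipeline but skip Git publishing')
    args = parser.parse_args()
    run(args.date, dry_run=args.dry_run)  # run() as sketched in the Data Flow section

if __name__ == '__main__':
    main()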

2. Shell Script Execution

#!/bin/bash
# run_daily_news.sh

cd "$(dirname "$0")"
source .env

# Create log directory
mkdir -p ~/logs

# Execute news processing workflow
DATE=$(date +%Y-%m-%d)
echo "=== Starting news processing for $DATE ===" >> ~/logs/news_bot.log

python -m news_bot.job --date $DATE >> ~/logs/news_bot.log 2>&1

echo "=== Completed at $(date -Iseconds) ===" >> ~/logs/news_bot.log

3. Daemon Mode Execution

# Start daemon (complete background execution)
./start_daemon.sh start

# Check running status  
./start_daemon.sh status

# View logs
./start_daemon.sh logs

# Graceful stop
./stop_daemon.sh

4. Cron Scheduled Tasks

# Edit crontab
crontab -e

# Execute daily at 8:05
5 8 * * * /Users/geyuxu/repo/astro-news-bot/run_daily_news.sh

Operations Experience and Best Practices

Configuration Management

{
  "output_config": {
    "blog_content_dir": "/Users/geyuxu/repo/blog/geyuxu.com/src/content/news",
    "filename_format": "news_{date}.md",
    "use_blog_dir": true
  },
  "git_config": {
    "target_branch": "gh-pages",
    "auto_switch_branch": true,
    "push_to_remote": true
  },
  "news_config": {
    "max_articles_per_day": 6,
    "token_budget_per_day": 4000,
    "similarity_threshold": 0.85
  },
  "llm_config": {
    "model": "gpt-4o",
    "max_tokens": 500,
    "temperature": 0.3
  },
  "scheduler_config": {
    "enabled": true,
    "timezone": "Asia/Shanghai",
    "cron_expression": "0 8 * * *"
  }
}
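
All modules read their knobs from this one file. A thin cached loader keeps the lookup in one place — my own minimal sketch; the key names match the JSON above, but the helper itself is illustrative:

# config.py — minimal cached loader sketch
import json
from functools import lru_cache

@lru_cache(maxsize=1)
def load_config(path: str = 'config.json') -> dict:
    with open(path, encoding='utf-8') as f:
        return json.load(f)

# e.g. dedup.py can pull its threshold straight from news_config:
threshold = load_config()['news_config']['similarity_threshold']  # 0.85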

Cost Control

  • Daily processing: ~6 articles
  • Estimated token consumption: ~4000 tokens/day
  • OpenAI cost: ~$0.01-0.05/day
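
The dollar figure follows directly from the token budget. As a back-of-the-envelope check (the per-token prices here are assumptions based on gpt-4o list pricing at the time of writing, not figures from this project):

# Rough daily cost estimate — the prices are assumptions, not official figures
input_tokens, output_tokens = 3_000, 1_000      # ~4,000 tokens/day total
usd = input_tokens / 1e6 * 2.50 + output_tokens / 1e6 * 10.00
print(f"${usd:.3f}/day")                        # ≈ $0.018, inside the $0.01-0.05 range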

Log Management

# View real-time logs
tail -f logs/daemon.log

# View scheduler logs
tail -f logs/scheduler.log

# View today's execution records
grep "$(date +%Y-%m-%d)" ~/logs/news_bot.log

Project Value and Results

Test Validation Results

Latest test (2025-07-26):

  • Fetcher: Retrieved 31 tech news articles (RSS sources)
  • Deduplicator: Vector deduplication, retained 31 unique articles
  • Summarizer: AI summary generation, used 10,681 tokens
  • Writer: Generated 188-line Markdown with 7 tech categories
  • Publisher: Successfully committed and pushed to blog repository

News Classification System

The system automatically categorizes news into 9 tech domains (plus the "Other Tech" catch-all defined in the prompt):

  • 🤖 Artificial Intelligence
  • 📱 Mobile Technology
  • 🚗 Autonomous Driving
  • ☁️ Cloud Computing
  • 💾 Chip Technology
  • 💰 Venture Capital
  • 🔒 Cybersecurity
  • ⛓️ Blockchain
  • 🔬 Scientific Research

Output Format Example

---
title: Daily News Digest · 2025-07-26
pubDate: '2025-07-26'
description: In 2025, the US semiconductor market experienced significant changes...
tags: [News, Daily, Chip Technology, Autonomous Driving, Mobile Technology]
layout: news
---

## Chip Technology

- **A timeline of the US semiconductor market in 2025**
  In 2025, the US semiconductor market experienced significant changes, including leadership transitions at traditional semiconductor companies and volatile chip export policies.
  *Tags: Semiconductor · US Market · Policy Changes*
  [Read Original](https://techcrunch.com/2025/07/25/...) | Source: TechCrunch

## Autonomous Driving

- **Tesla is reportedly bringing robotaxi service to San Francisco**
  Tesla plans to launch a limited version of its autonomous taxi service in San Francisco. Unlike the Austin service, this test will have employees in the driver's seat for safety.
  *Tags: Tesla · Autonomous Driving · Taxi Service*
  [Read Original](https://techcrunch.com/2025/07/25/...) | Source: TechCrunch
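
Assembling this Markdown from the summarizer's JSON is mostly string formatting. Below is a condensed sketch of the writer logic; the field names follow the summarizer's JSON schema above, while url and source are assumed to be carried through from the fetcher:

# writer.py — condensed sketch of the Markdown assembly
from collections import defaultdict

def write_markdown(summaries: list[dict], date: str) -> str:
    by_category = defaultdict(list)
    for item in summaries:
        by_category[item['category']].append(item)

    lines = ['---', f'title: Daily News Digest · {date}',
             f"pubDate: '{date}'", 'layout: news', '---', '']
    for category, items in by_category.items():
        lines.append(f'## {category}\n')
        for it in items:
            lines.append(f"- **{it['title']}**")
            lines.append(f"  {it['summary']}")
            lines.append(f"  *Tags: {' · '.join(it.get('tags', []))}*")
            lines.append(f"  [Read Original]({it['url']}) | Source: {it['source']}\n")
    # Persisting the result to news_{date}.md is omitted here
    return '\n'.join(lines)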

Future Plans

  1. Smarter Source Discovery: Enable the bot to automatically discover and recommend new high-quality news sources
  2. Trend Analysis & Topic Aggregation: Identify hot topics within specific time periods and aggregate related articles
  3. User Feedback Loop: Collect user feedback data for fine-tuning AI models
  4. Open Source Plan: Clean up code and open source to help others build their own AI news bots

Conclusion

astro-news-bot is a typical “use technology to solve your own problems” project. It combines AI, automation scripts, a modern web framework (Astro), and DevOps principles (GitOps) into an elegant automated system.

This project not only solves my information overload problem but also serves as an excellent testbed for practicing LLM applications, vector databases, GitOps, and other emerging technologies. If you want to build a similar system, I hope this article provides some inspiration and reference.

Key Tech Stack:

  • Backend: Python + OpenAI API + SentenceTransformers
  • Frontend: Astro + TypeScript + Content Collections
  • Deployment: GitOps + Shell Scripts + Cron Jobs
  • Data: JSON + Markdown + Git

The entire system embodies modern AI application development best practices: modular design, vectorized processing, automated deployment, and continuous operations.

Ge Yuxu • AI & Engineering

Data Desensitization Notice: All table names, field names, API endpoints, variable names, IP addresses, and sample data appearing in this article are fictitious and intended solely to illustrate technical concepts and implementation steps. The sample code is not actual company code. The proposed solutions are not complete or actual company solutions but are summarized from the author's memory for technical learning and discussion.
    • Any identifiers shown in the text do not correspond to names or numbers in any actual production environment.
    • Sample SQL, scripts, code, and data are for demonstration purposes only, do not contain real business data, and lack the full context required for direct execution or reproduction.
    • Readers who wish to reference the solutions in this article for actual projects should adapt them to their own business scenarios and data security standards, using configurations that comply with internal naming and access control policies.

Copyright Notice: The copyright of this article belongs to the original author. Without prior written permission from the author, no entity or individual may copy, reproduce, excerpt, or use it for commercial purposes in any way.
    • For non-commercial citation or reproduction of this content, attribution must be given, and the integrity of the content must be maintained.
    • The author reserves the right to pursue legal action against any legal disputes arising from the commercial use, alteration, or improper citation of this article's content.

Copyright © 1989–Present Ge Yuxu. All Rights Reserved.