How I Trained a Chatbot on GitHub Repositories Using an AI Scraper and LLM

I built an AI-powered chatbot that answers questions about GitHub repositories by extracting key insights from repository data. I used Bright Data’s Data for AI Web Scraper to collect structured data, then used Ollama’s Phi3 model to analyze that data and answer questions about it.

In this article, I’ll walk you through:

✅ How I obtained GitHub repository data using Bright Data’s GitHub Data for AI Web Scraper.
✅ Training a chatbot with Ollama’s Phi3 model.
✅ Implementing a Streamlit-based GitHub Insights Tool for real-time interactions.
✅ Lessons learned and the impact of using AI for repository analysis.

How I Obtained GitHub Datasets Using Bright Data

To train the chatbot, I needed a high-quality dataset containing key repository details. Instead of scraping GitHub manually, I used Bright Data’s AI Scraper, which provided a structured and automated way to collect repository data.

Bright Data’s Web Scrapers can be used in two ways: through the Scraper API, or through the No-Code Scraper, which requires no programming.

Steps to extract GitHub data using Bright Data Web Scraper

1. Sign up on Bright Data and click Web Scrapers in the left pane.

New users receive a free $5 credit to try the services for 7 days.

2. Search for “GitHub” in the search bar and click on the first result.

3. A list of GitHub scrapers will appear. Select “GitHub Repository — Collect by URL” for this use case.

4. Select the No-Code Scraper.

5. Click “Add Input” to add your required GitHub repository links, then click “Start Collecting”.

6. Once the status field shows “Ready”, click “Download” and choose CSV as the format.

Building a GitHub Insights Tool

This project uses Python for data processing and Streamlit for a simple UI.

Prerequisites

  • Any code editor of your choice.
  • Python installed (version 3.8+ recommended).

Step 1: Setting Up the Project

  1. Create the project folder:
mkdir github-insights-tool
cd github-insights-tool

2. Set up a virtual environment:

python -m venv venv

3. Activate the environment:

  • Windows:
venv\Scripts\activate
  • macOS/Linux:
source venv/bin/activate

4. Install dependencies:

pip install pandas streamlit langchain-ollama

  • Streamlit: for building the UI
  • Pandas: for handling dataset operations
  • LangChain (Ollama): for AI-driven repository analysis

Project structure:

github-insights-tool/
│── githubdata.csv  # your dataset from Bright Data
│── app.py

Step 2: Installing and Running the Chatbot (Ollama Phi3 Model) Locally

This AI-powered tool generates insights about GitHub repositories by analyzing their strengths, weaknesses, and usability. It also provides key repository details without requiring you to navigate multiple sections on GitHub.

Why Ollama?

  • Free and easy to set up
  • Runs locally, with no internet connection needed once the model is downloaded
  • Provides fast and customizable responses

Installing Ollama

Ollama provides a simple CLI tool to run large language models (LLMs) locally. Install it based on your operating system:

  • Windows:
Download and run the installer from https://ollama.com/download
  • Linux (curl):
curl -fsSL https://ollama.com/install.sh | sh
  • macOS (Homebrew):
brew install ollama

Download the Phi3 Model:

ollama pull phi3

Run the Ollama Model:

ollama run phi3

💡 Note: Make sure Ollama is running locally and the phi3 model has been pulled before executing your code. Otherwise, the app cannot reach the model.
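To make the note above concrete, here is a minimal sketch of a pre-flight check. It assumes Ollama is listening on its default local address (http://localhost:11434) and uses its /api/tags endpoint, which lists the models that have been pulled. The helper names (parse_model_names, ollama_models, describe_status) are hypothetical and not part of the original project.

```python
import json
import urllib.error
import urllib.request
from typing import List, Optional


def parse_model_names(tags_payload: str) -> List[str]:
    """Extract model names from the JSON body returned by /api/tags."""
    data = json.loads(tags_payload)
    return [model.get("name", "") for model in data.get("models", [])]


def ollama_models(base_url: str = "http://localhost:11434",
                  timeout: float = 2.0) -> Optional[List[str]]:
    """Return the pulled model names, or None if the server is unreachable.

    The default base_url is an assumption: it is Ollama's standard local port.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return parse_model_names(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        return None


def describe_status(models: Optional[List[str]], wanted: str = "phi3") -> str:
    """Turn the result of ollama_models() into a human-readable message."""
    if models is None:
        return "Ollama server not reachable; start it before running the app."
    # Model names usually carry a tag, for example "phi3:latest"
    if not any(name.split(":")[0] == wanted for name in models):
        return f"Ollama is running, but {wanted} has not been pulled yet."
    return f"Ollama is running and {wanted} is available."
```

Calling describe_status(ollama_models()) before launching the app gives a quick answer as to whether a later failure comes from the model server or from the app itself.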

Step 3: Implementing the GitHub Insights Tool

The tool is built from the following pieces:

Initializing Ollama

import streamlit as st
import pandas as pd
from langchain_ollama import OllamaLLM


# Initialize Ollama with the chosen model
llm = OllamaLLM(model="phi3")

Loading the GitHub Dataset

@st.cache_data
def load_github_data():
    df = pd.read_csv("githubdata.csv")
    df.columns = df.columns.str.strip().str.lower()  # Normalize column names to lowercase
    return df
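Since the app later reads specific columns from this DataFrame, it can help to confirm they exist right after loading. The sketch below is one possible check, not part of the original tool. The column list is taken from the Streamlit code in this article, so treat it as an assumption about what your export contains; missing_columns is a hypothetical helper.

```python
import io

import pandas as pd

# Column names used by the Streamlit code in this article (an assumption
# about the export; adjust the list if your CSV differs).
REQUIRED_COLUMNS = [
    "url", "code_language", "num_stared", "num_fork",
    "num_pull_requests", "last_feature", "latest_update", "user_name",
]


def missing_columns(df: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required columns that are absent after normalization."""
    normalized = set(df.columns.str.strip().str.lower())
    return [name for name in required if name not in normalized]


# Tiny in-memory CSV with untidy headers, standing in for the real file
sample_csv = io.StringIO(
    " URL ,Code_Language,num_stared\n"
    "https://github.com/example/repo,Python,10\n"
)
sample = pd.read_csv(sample_csv)
print(missing_columns(sample))
```

An empty list means every column the app reads is present. Anything else names the fields that would raise a KeyError later.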

Analyzing the Desired Repository Using AI

def analyze_repository(repo_data, llm):
    prompt = f"""
    Analyze the following GitHub repository data and provide insights:
    {repo_data.to_dict()}
    Focus on:
    1. Code quality and maintainability
    2. Popularity and engagement
    3. Potential use cases
    4. Key strengths and weaknesses
    """
    try:
        return llm.invoke(prompt)
    except Exception as e:
        return f"Error generating analysis: {e}"

This function generates insights based on code quality, engagement, and potential use cases.
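A scraped row can contain empty fields or very long text, and all of it ends up in the prompt through to_dict(). As a rough sketch of one way to keep the prompt compact, the hypothetical helper below drops missing values and truncates long strings before the record is formatted. The 300-character limit is an arbitrary assumption, not a tuned value.

```python
import pandas as pd


def compact_repo_record(repo_data: pd.Series, max_chars: int = 300) -> dict:
    """Drop missing fields and truncate long text before building a prompt."""
    record = {}
    for key, value in repo_data.items():
        if pd.isna(value):
            continue  # skip empty cells so the model is not shown "nan"
        text = str(value)
        if len(text) > max_chars:
            record[key] = text[:max_chars] + "..."
        else:
            record[key] = value
    return record


# Example row built by hand; the field names mirror the article's dataset
row = pd.Series({
    "url": "https://github.com/example/repo",
    "num_stared": 10,
    "last_feature": float("nan"),
    "readme": "x" * 500,
})
print(compact_repo_record(row))
```

The result could be passed into the prompt in place of repo_data.to_dict(). Whether truncation helps depends on the model's context window and on which fields matter for your questions.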

Interacting with the AI-Generated Analysis

def interact_with_analysis(analysis, query, llm):
    prompt = f"""
    Based on the following analysis:
    {analysis}
    Answer the user's query: {query}
    """
    try:
        return llm.invoke(prompt)
    except Exception as e:
        return f"Error processing query: {e}"

This allows users to interact with AI-generated analysis of the repository.
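Because the function only relies on the model object having an invoke(prompt) method, it can be exercised without Ollama running. The sketch below uses a hypothetical FakeLLM stand-in to show both the normal path and the error path. The function body is repeated from above only so the example is self-contained.

```python
class FakeLLM:
    """Hypothetical stand-in exposing the same invoke(prompt) method."""

    def __init__(self, reply=None, error=None):
        self.reply = reply
        self.error = error
        self.prompts = []  # records every prompt for inspection

    def invoke(self, prompt):
        self.prompts.append(prompt)
        if self.error is not None:
            raise self.error
        return self.reply


def interact_with_analysis(analysis, query, llm):
    # Same logic as the function above
    prompt = f"""
    Based on the following analysis:
    {analysis}
    Answer the user's query: {query}
    """
    try:
        return llm.invoke(prompt)
    except Exception as e:
        return f"Error processing query: {e}"


working = FakeLLM(reply="It is mostly Python.")
print(interact_with_analysis("Popular Python repo", "Which language?", working))
# prints: It is mostly Python.

failing = FakeLLM(error=ConnectionError("Ollama is not running"))
print(interact_with_analysis("Popular Python repo", "Which language?", failing))
# prints: Error processing query: Ollama is not running
```

Note that in the error case the caller receives a plain string, so the UI shows the error message in the same place it would show an answer.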

Step 4: Defining the Streamlit Application

Core Features

  • Lets users enter a GitHub URL. The URL must be one of those present in the CSV file, so the answers are tailored to that specific repository.
  • Initiates AI chatbot interaction based on the analysis.

Both are implemented in main():
def main():
    # Add GitHub logo next to the title
    st.markdown("""<h1 style='display: flex; align-items: center;'>
        <img src='https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png' width='40' style='margin-right:10px;'>
        GitHub Repository Insights Tool
        </h1>""", unsafe_allow_html=True)
   
    github_df = load_github_data()
   
    # User input field for entering a GitHub repository URL
    repo_url = st.text_input("Enter GitHub Repository URL")
    analysis_result = ""
   
    if repo_url:
        # Filter the dataset based on the entered URL
        repo_data = github_df[github_df["url"] == repo_url]
        if not repo_data.empty:
            repo_data = repo_data.iloc[0]
           
            # Display repository details
            st.subheader("Repository Details")
            st.write(f"Language: {repo_data['code_language']}")
            st.write(f"Stars: {repo_data['num_stared']}")
            st.write(f"Forks: {repo_data['num_fork']}")
            st.write(f"Pull Requests: {repo_data['num_pull_requests']}")
            st.write(f"Last Feature: {repo_data['last_feature']}")
            st.write(f"Latest Update: {repo_data['latest_update']}")
           
            # Display repository owner details
            st.subheader("Owner Details")
            st.write(f"Owner: {repo_data['user_name']}")
            st.write(f"URL: {repo_data['url']}")
           
            # AI-powered analysis of the repository
            st.subheader("AI Analysis")
            if st.button("Generate Analysis"):
                with st.spinner("Analyzing repository..."):
                    analysis_result = analyze_repository(repo_data, llm)
                    st.session_state["analysis"] = analysis_result  # Store analysis in session state
                    st.write(analysis_result)
        else:
            st.warning("Repository not found in the dataset. Please enter a valid URL.")
   
    # AI Chatbot interaction based on the generated analysis
    if "analysis" in st.session_state:
        st.subheader("Chat with AI about this Repository")
        user_query = st.text_input("Ask a question about the repository analysis")
        if user_query:
            with st.spinner("Processing query..."):
                response = interact_with_analysis(st.session_state["analysis"], user_query, llm)
                st.write(response)


# Run the Streamlit application
if __name__ == "__main__":
    main()
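The lookup in main() compares the entered URL to the dataset with exact string equality, so a trailing slash or a difference in letter case is enough to trigger the "Repository not found" warning. Here is a minimal sketch of a more forgiving match. Both normalize_repo_url and find_repo are hypothetical helpers, and the normalization rules (trim whitespace, lowercase, drop a trailing slash or .git suffix) are assumptions about what counts as the same repository.

```python
import pandas as pd


def normalize_repo_url(url: str) -> str:
    """Reduce cosmetic differences between two URLs for the same repository."""
    cleaned = url.strip().lower().rstrip("/")
    if cleaned.endswith(".git"):
        cleaned = cleaned[: -len(".git")]
    return cleaned


def find_repo(df: pd.DataFrame, repo_url: str):
    """Return the first matching row as a Series, or None if nothing matches."""
    wanted = normalize_repo_url(repo_url)
    normalized_urls = df["url"].astype(str).map(normalize_repo_url)
    matches = df[normalized_urls == wanted]
    return None if matches.empty else matches.iloc[0]


# Small hand-built frame standing in for the loaded dataset
repos = pd.DataFrame({
    "url": ["https://github.com/Example/Repo"],
    "num_stared": [5],
})
row = find_repo(repos, "  https://github.com/example/repo/ ")
print(None if row is None else row["num_stared"])
```

In the app, find_repo(github_df, repo_url) could stand in for the equality filter, with a None result leading to the existing warning.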

Running the application:

In your terminal, run this command:

streamlit run app.py

Step 5: Using the GitHub Insights Tool Application

1. Paste the repository URL and view the analytics.

2. Click “Generate Analysis” to produce a report on the repository.

3. Interact with the chatbot to gain further insights.

Conclusion

Building a chatbot on GitHub repositories using Data for AI Web Scraper from Bright Data and Ollama Phi3 proved effective for automating repository insights. This approach saves time and grounds the AI-powered responses in real repository data rather than in the model's general knowledge.

For developers looking for clean, structured GitHub datasets, Bright Data offers reliable, ready-made datasets and API integration to streamline data extraction and analysis.

🚀 Try it out and let me know your thoughts!

Frequently Asked Questions

What does the GitHub Insights Tool do?

The GitHub Insights Tool analyzes GitHub repositories by loading structured repository data, generating AI-powered analysis of code quality, popularity, potential use cases, strengths and weaknesses, and enabling interactive Q&A about that analysis via a chatbot interface.

How is GitHub repository data obtained for the tool?

Repository data is collected using Bright Data’s Data for AI Web Scraper (the GitHub Repository — Collect by URL scraper), using the No-Code Scraper to add repository URLs, start collection, and download results as CSV.

Which Bright Data scraping interfaces are mentioned for collecting data?

Bright Data’s Scraper API and the No-Code Scraper are mentioned as the two methods their web scrapers provide for scraping websites.

What file format is used to import repository data into the project?

The repository data is downloaded from Bright Data as a CSV file and loaded into the project. The loading code reads it from githubdata.csv in the project folder.

What are the project prerequisites and dependencies?

Prerequisites are a code editor and Python (version 3.8+ recommended). Dependencies installed via pip are pandas, streamlit, and langchain-ollama (the package that provides the langchain_ollama import used in the code).

What is the required project structure to run the tool?

The expected project structure is a folder named github-insights-tool containing the dataset file (githubdata.csv) and an app.py script that implements the Streamlit application and AI integration.

Why is Ollama used for the AI model in this project?

Ollama is used because it is free, easy to set up, runs locally without internet dependency, and provides fast, customizable responses suitable for local LLM inference.

How is the Ollama Phi3 model installed and run locally?

Ollama is installed by following the platform-specific instructions provided for Windows, Linux, or macOS. The Phi3 model is then pulled with 'ollama pull phi3' and started with 'ollama run phi3'.

How does the Streamlit app connect to the local Ollama model?

The Streamlit app initializes an OllamaLLM instance with model='phi3' (from langchain_ollama), and the app invokes the model via that client when generating analysis and answering follow-up queries.

How does a user generate an AI analysis for a specific repository in the app?

A user pastes a repository URL that exists in the CSV dataset into the app’s input field, the app filters the dataset for that URL, displays repository and owner details, and when the user clicks 'Generate Analysis' the app calls analyze_repository to invoke the LLM and produce the AI analysis.

What interactive capability does the app provide after generating analysis?

After generating analysis, the app stores the analysis in session state and provides a chatbot interface where users can type queries about the analysis; those queries are answered by calling the LLM with the analysis plus the user query.

How is the repository dataset loaded and normalized in the application code?

The dataset is loaded with pandas.read_csv inside a cached function, and column names are normalized by stripping whitespace and converting to lowercase via df.columns = df.columns.str.strip().str.lower().

What command starts the Streamlit application?

The Streamlit application is started from the terminal with the command 'streamlit run app.py'.

What happens if the entered repository URL is not found in the dataset?

If the entered repository URL is not found in the dataset, the application displays a warning stating 'Repository not found in the dataset. Please enter a valid URL.'

What error handling is implemented for LLM invocations in the code snippets?

LLM invocations in analyze_repository and interact_with_analysis are wrapped in try/except blocks that return an error message string formatted as 'Error generating analysis: {e}' or 'Error processing query: {e}' if an exception occurs.