George-API/ETL_Pipeline_Machine_Learning

Mindmodel.io

Project Overview

Mindmodel.io is an AI-driven cognitive assessment application designed to democratize access to personalized cognitive insights. The platform provides low-cost, accessible alternatives to traditional cognitive assessments, helping users understand their unique cognitive profiles without relying on conventional IQ metrics.

Our platform leverages a custom fine-tuned AI model specifically trained on cognitive science research. The model was created through a structured two-phase training approach to ensure deep domain expertise and accurate cognitive assessment capabilities.

Key Features

  • Interactive cognitive tests measuring memory, attention, processing speed, and executive functions
  • Personalized cognitive profiles highlighting individual strengths and areas for development
  • AI-driven analysis providing tailored strategies and recommendations
  • Progress tracking to monitor cognitive development over time

Purpose

Mindmodel.io empowers users with constructive, non-judgmental feedback and actionable recommendations for academic, professional, and personal optimization. By leveraging insights about different learning styles and cognitive capabilities, users can improve their learning, productivity, and problem-solving approaches.

Fine-Tuning Goals and Requirements

Training Approach

Our model training follows a structured two-phase approach:

Phase 1: Domain Adaptation (Unsupervised)

In this initial phase, we adapted the DeepSeek-R1-Distill-Llama-8B model to the cognitive science domain by exposing it to a large corpus of cognitive science texts. This phase helped the model learn domain-specific vocabulary, concepts, and patterns.

Key aspects:

  • Sequential processing of research papers to maintain context
  • 4-bit quantization and LoRA for memory-efficient training
  • Regular checkpointing and detailed metric monitoring
  • Resulted in our domain-adapted model: George-API/DeepSeek-Cognitive-Science
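The memory-efficient setup above can be sketched with Hugging Face `transformers` and `peft`. This is a minimal configuration sketch only; the specific hyperparameter values (rank, alpha, dropout, target modules) are illustrative assumptions, not the values used to train George-API/DeepSeek-Cognitive-Science:

```python
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization keeps the 8B base model within a modest GPU memory budget.
# All values below are illustrative assumptions, not the repo's actual settings.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

# LoRA trains small low-rank adapter matrices instead of the full model weights,
# which is what makes single-GPU domain adaptation of an 8B model feasible.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

Both configs would then be passed to `AutoModelForCausalLM.from_pretrained` and `get_peft_model` respectively before training begins.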

Phase 2: Supervised Fine-Tuning (SFT)

Building on Phase 1, we performed supervised fine-tuning to teach the model to handle complex, multidisciplinary question-answer pairs on cognitive science topics.

Key aspects:

  • Used instruction-response pairs covering various cognitive science topics
  • Lower learning rate (1e-5) for stable fine-tuning
  • Focus on question answering and explanation capabilities
  • Resulted in our final model: George-API/DeepSeek-Cognitive-Science-SFT
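The SFT stage needs each instruction-response pair rendered as a single training string. A minimal sketch follows; the chat template and all hyperparameters except the 1e-5 learning rate are assumptions, since the README does not document the exact prompt format:

```python
# Hypothetical chat template; the repo does not document its actual prompt format.
TEMPLATE = "<|user|>\n{instruction}\n<|assistant|>\n{response}"

def format_pair(instruction: str, response: str) -> str:
    """Render one instruction-response pair as a single SFT training string."""
    return TEMPLATE.format(instruction=instruction, response=response)

# Only the 1e-5 learning rate is stated in this README; the rest are assumptions.
SFT_HYPERPARAMS = {
    "learning_rate": 1e-5,   # lower than Phase 1 for stable fine-tuning
    "num_train_epochs": 3,   # assumption
    "warmup_ratio": 0.03,    # assumption
}
```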

Data Processing Pipeline

Our training data was processed through a comprehensive pipeline:

  1. Collection and Extraction: Raw research papers processed from PDFs to text
  2. Cleaning and ID Assignment: Structured text with unique identifiers
  3. Segmentation: Papers divided into manageable chunks (1,500-2,000 tokens)
  4. Metadata and Tag Assignment: Added detailed cognitive science tags and relationships
  5. Final Optimization: Converted to training-ready JSONL format
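Steps 3 and 5 can be sketched in plain Python. `chunk_tokens` and `to_jsonl` are hypothetical helper names, and the JSONL record schema is an assumption; the sketch only illustrates the 1,500-2,000 token window and unique-ID assignment described above:

```python
import json

def chunk_tokens(tokens, max_len=2000, min_len=1500):
    """Greedily split a token list into chunks of at most max_len tokens.
    A trailing fragment shorter than min_len is folded into the previous
    chunk, so the final chunk may modestly exceed max_len."""
    chunks = [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]
    if len(chunks) > 1 and len(chunks[-1]) < min_len:
        chunks[-2].extend(chunks.pop())
    return chunks

def to_jsonl(paper_id, chunks):
    """Serialize chunks as JSONL records with unique, traceable IDs."""
    lines = []
    for idx, chunk in enumerate(chunks):
        record = {"id": f"{paper_id}-{idx:04d}", "text": " ".join(chunk)}
        lines.append(json.dumps(record))
    return "\n".join(lines)
```

In the real pipeline the IDs assigned in step 2 would carry through to the metadata tagging in step 4, so each training record stays traceable to its source paper.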

The pipeline created a hierarchical knowledge graph with:

  • Primary categories: Cognitive Domains, Thinking Styles, Contexts, Neurodevelopmental Conditions, Applied Cognition
  • Complex relationships between concepts (hierarchical, directional, bidirectional)
  • Integration of multiple disciplinary perspectives
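One minimal way to represent the typed relationships above is as labeled edges. The names and structure here are assumptions for illustration; the repo's actual graph format is not shown in this README:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    """A typed edge in the cognitive-science knowledge graph."""
    source: str
    target: str
    kind: str  # "hierarchical", "directional", or "bidirectional"

def related_to(relations, concept):
    """Concepts reachable from `concept`; bidirectional edges work both ways."""
    out = set()
    for r in relations:
        if r.source == concept:
            out.add(r.target)
        elif r.kind == "bidirectional" and r.target == concept:
            out.add(r.source)
    return out
```

Distinguishing hierarchical from bidirectional edges lets downstream tagging treat "Memory is a Cognitive Domain" differently from mutual associations like "Attention ↔ Working Memory".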

Domain Focus

The training data focuses on interdisciplinary cognitive science research with emphasis on:

  • Cognitive processes and mental models
  • Neuroscience and brain function
  • Psychological frameworks
  • Educational applications
  • Neurodiversity understanding
  • Workplace/organizational applications

Core Objectives

  1. Deep Domain Expertise

    • Understanding complex cognitive and neural mechanisms
    • Integrating multiple theoretical frameworks
    • Maintaining scientific accuracy and rigor
    • Handling specialized terminology correctly
  2. Interdisciplinary Integration

    • Connecting insights across disciplines:
      • Cognitive Science ↔ Neuroscience
      • Psychology ↔ Education
      • Theory ↔ Practice
    • Understanding cross-domain implications
    • Synthesizing research findings
  3. Nuanced Personalized Insight

    • Distinguishing between similar concepts
    • Understanding context-dependent interpretations
    • Recognizing subtle differences in:
      • Cognitive processes
      • Learning approaches
      • Individual differences
      • Neurodiversity manifestations

Training Focus Areas

Based on metadata keywords:

  1. Cognitive Processes

    • Attention types (sustained, selective, divided)
    • Memory systems
    • Executive functions
    • Problem-solving approaches
    • Decision-making processes
  2. Individual Differences

    • Learning styles
    • Cognitive diversity
    • Neurodevelopmental variations
    • Personal strengths and challenges
  3. Practical Applications

    • Educational strategies
    • Workplace accommodations
    • Support systems
    • Assessment approaches
    • Intervention methods

Success Metrics

The fine-tuned model should demonstrate:

  1. Technical Accuracy

    • Correct use of scientific terminology
    • Accurate representation of research findings
    • Proper citation and reference understanding
  2. Contextual Understanding

    • Appropriate application of theories to specific situations
    • Recognition of individual differences
    • Awareness of contextual factors
  3. Practical Insight

    • Actionable recommendations
    • Evidence-based solutions
    • Real-world application understanding

Training Considerations

  1. Data Quality

    • Scientific papers and research findings
    • Peer-reviewed content
    • Current theoretical frameworks
    • Evidence-based practices
  2. Balance

    • Theory vs. practical application
    • Different disciplinary perspectives
    • Various cognitive domains
    • Diverse population considerations
  3. Ethical Considerations

    • Neurodiversity-affirming approach
    • Individual difference respect
    • Cultural sensitivity
    • Evidence-based recommendations

Implementation Notes

  • Training data has been cleaned and structured
  • Categories have been removed for more flexible learning
  • Focus on maintaining scientific rigor while ensuring practical applicability
  • Emphasis on integrating multiple perspectives and approaches

```mermaid
flowchart TD
    classDef dataProcessing fill:#d4f1f9,stroke:#05a,stroke-width:1px
    classDef phase1 fill:#d5f5d5,stroke:#070,stroke-width:1px
    classDef phase2Prep fill:#fff2cc,stroke:#d6b600,stroke-width:1px
    classDef phase2SFT fill:#ffe0cc,stroke:#d66b00,stroke-width:1px
    classDef webApp fill:#e1d4f9,stroke:#5503a9,stroke-width:1px
    classDef hidden fill:none,stroke:none

    subgraph DP ["Data Processing Pipeline"]
        A[Research Paper Collection] --> B[PDF Extraction & Cleaning]
        B --> C[Text Segmentation & Chunking]
        C --> D[Metadata & Tag Assignment]
        D --> E[Knowledge Graph Creation]
        E --> F[Training-Ready JSONL Format]
    end

    subgraph P1 ["Phase 1: Domain Adaptation"]
        G[DeepSeek-R1-Distill-Llama-8B Base Model] --> H[Unsupervised Training with Cognitive Science Corpus]
        H --> I[Domain-Adapted Model:<br>George-API/DeepSeek-Cognitive-Science]
    end

    subgraph P2P ["Phase 2 Preparation"]
        J[Question Collection for Cognitive Science Topics] --> K[Summarization Model Generates Expert Responses]
        K --> L[Format as Instruction-Response Pairs]
    end

    subgraph P2S ["Phase 2: Supervised Fine-Tuning (SFT) with LoRA"]
        SP[" "]:::hidden
        SP --> M[Load Domain-Adapted Model]
        M --> N[Supervised Fine-Tuning with Cognitive Science Q&A Pairs]
        N --> O[Final SFT Model:<br>George-API/DeepSeek-Cognitive-Science-SFT]
    end

    subgraph WA ["Web Application Integration"]
        P[API Integration & Deployment] --> Q[Interactive Cognitive Assessments]
        Q --> R[Cognitive Profile Analysis]
        Q --> S[Tailored Strategy & Advice]
        R --> T[User Dashboard & Progress Tracking]
        S --> T
    end

    %% Connecting the main subgraphs
    DP --> P1
    DP --> P2P
    P1 --> P2S
    P2P --> P2S
    P2S --> WA

    %% Apply classes
    class DP dataProcessing
    class P1 phase1
    class P2P phase2Prep
    class P2S phase2SFT
    class WA webApp
    class SP hidden
```

About

A comprehensive training pipeline for fine-tuning the DeepSeek-R1-Distill-Llama-8B model on cognitive science research.
