# 🏥 MEDICAL TRANSCRIPT ENTITY EXTRACTION SYSTEM
## Complete Implementation Guide

**Purpose**: Extract structured medical entities from doctor-patient conversation transcripts using local LLM (Llama 3.1 / Gemma) with sentiment analysis.

**Created for**: Analysis of diabetic retinopathy patient conversations

---

## 📋 WHAT THIS SYSTEM DOES

✅ **Extracts 25 medical entities** from conversations:
- Symptoms (5 types)
- Ophthalmic findings (7 types)
- Diagnostic tools (4 types)
- Risk factors (3 types)
- Treatment options (3 types)
- Demographics/history (3 types)

✅ **Performs sentiment analysis** for each entity (positive/neutral/negative)

✅ **Captures patient perspectives**:
- Concerns and worries
- Goals and hopes
- Occurrence patterns
- Severity assessments

✅ **Identifies doctor's questions** across conversations

✅ **Generates comprehensive outputs**:
- Excel spreadsheet with 5 sheets (main deliverable)
- JSON data for further processing
- Text report with statistics

✅ **Compares multiple LLM models** (Llama 3.1 vs Gemma)

✅ **Uses few-shot prompting** for accuracy

---

## 📂 FILES IN THIS PACKAGE

```
/home/claude/
├── medical_transcript_analyzer.py    # Main analysis engine
├── run_analysis.py                   # Simple runner script
├── quick_start.sh                    # Automated setup & run
├── SETUP_GUIDE.md                    # Detailed setup instructions
├── TROUBLESHOOTING.md                # Common issues & solutions
├── OUTPUT_DOCUMENTATION.md           # Expected output format
└── README.md                         # This file
```

---

## 🚀 QUICK START (3 METHODS)

### Method 1: Automated Script (Easiest)
```bash
./quick_start.sh
```
This will:
- Install Ollama
- Download model
- Install Python packages
- Start analysis automatically

### Method 2: Interactive Runner
```bash
python3 run_analysis.py
```
Follow the prompts to select model and start analysis.

### Method 3: Direct Execution
```bash
# Make sure Ollama is running
ollama serve &

# Run analysis with Llama 3.1
python3 medical_transcript_analyzer.py
```

---

## 📦 PREREQUISITES

### System Requirements
- **OS**: Linux, Mac, or Windows
- **RAM**: 8GB minimum (16GB recommended)
- **Storage**: 10GB free space
- **Python**: 3.8 or higher

### Software Requirements
1. **Ollama** (Local LLM runtime)
2. **Python packages**: PyPDF2, pandas, openpyxl, requests

---

## 🔧 INSTALLATION STEPS

### Step 1: Install Ollama
```bash
# Linux/Mac
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from: https://ollama.com/download/windows
```

### Step 2: Download LLM Model
```bash
# Llama 3.1 8B (Recommended - 4.7GB)
ollama pull llama3.1:8b

# OR Google Gemma 7B (Alternative - 5.0GB)
ollama pull gemma:7b
```

### Step 3: Install Python Packages
```bash
pip install PyPDF2 pandas openpyxl requests numpy
```

### Step 4: Prepare PDF Files
- Place all PDF files in one directory
- Default location: `/mnt/user-data/uploads`
- Or specify custom location when running

### Step 5: Start Ollama Server
```bash
ollama serve
```
Keep this running in a separate terminal!
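Before kicking off a long run, it can help to confirm the server is actually reachable. A minimal stdlib-only check (the URL is Ollama's default local port; adjust if you changed it):

```python
import json
import urllib.error
import urllib.request

def ollama_alive(base_url="http://localhost:11434"):
    """Return installed model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
            data = json.load(resp)
        return [m["name"] for m in data.get("models", [])]
    except (urllib.error.URLError, OSError):
        return None

models = ollama_alive()
if models is None:
    print("Cannot reach Ollama -- run `ollama serve` first.")
else:
    print("Ollama is running. Installed models:", models)
```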

### Step 6: Run Analysis
```bash
python3 run_analysis.py
```

---

## 📊 OUTPUT FILES

After analysis completes, you'll get **3 files**:

### 1. **Excel File** (Main Deliverable) ⭐
`ANALYSIS_TABULATION_[model]_[timestamp].xlsx`

**5 sheets**:
1. **Summary**: Overview statistics
2. **Entity_Matrix**: Yes/No presence table for all entities
3. **Detailed_Extractions**: Full context with quotes
4. **Doctor_Questions**: Questions asked by doctors
5. **Patient_Perspectives**: Concerns, goals, severity

**This answers all of your supervisor's questions!**

### 2. **JSON File** (Raw Data)
`extracted_data_[model]_[timestamp].json`

Complete structured data for custom analysis.

### 3. **Text Report** (Summary)
`ANALYSIS_REPORT_[model]_[timestamp].txt`

Human-readable statistics and insights.
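Once the Excel file exists, its sheets can be sliced further with pandas. A sketch using a toy in-memory stand-in for the Entity_Matrix layout (the real sheet would be loaded with `pd.read_excel(path, sheet_name="Entity_Matrix")`; column names here are illustrative):

```python
import pandas as pd

# Toy stand-in mirroring the Yes/No Entity_Matrix layout
matrix = pd.DataFrame({
    "Conversation": ["conv_01", "conv_02", "conv_03"],
    "Blurred vision": ["Yes", "No", "Yes"],
    "Floaters": ["No", "No", "Yes"],
})

# Count how many conversations mention each entity
counts = (matrix.drop(columns="Conversation") == "Yes").sum()
print(counts.sort_values(ascending=False))
```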

---

## 🎯 WHAT YOUR SUPERVISOR GETS (Their Requirements)

### ✅ Question 1: Total number of unique conversations
**Answer in**: Summary sheet → "Total Conversations"

### ✅ Question 2: Main questions asked of each patient
**Answer in**: Doctor_Questions sheet

### ✅ Question 3: All entities with patient perspectives
**Answer in**: 
- Entity_Matrix sheet (presence/absence)
- Detailed_Extractions sheet (full context)
- Patient_Perspectives sheet (concerns/goals/severity)

### ✅ Additional: Sentiment analysis
**Answer in**: All sheets have sentiment columns

### ✅ Additional: Model comparison
Run with both models, compare Excel outputs

---

## 🔍 EXAMPLE USAGE

### Basic Analysis
```bash
# With Llama 3.1
python3 run_analysis.py
# Select option 1
# Enter PDF directory: /path/to/pdfs
```

### Compare Both Models
```bash
python3 run_analysis.py
# Select option 3 (Both)
```

### Custom Configuration
```python
from medical_transcript_analyzer import MedicalTranscriptAnalyzer

# Initialize with specific model
analyzer = MedicalTranscriptAnalyzer(model_name="gemma:7b")

# Analyze specific directory
results = analyzer.analyze_all_transcripts("/custom/path/to/pdfs")

# Generate outputs
analyzer.create_tabulated_output(results, "output.xlsx")
```

---

## ⏱️ EXPECTED TIMELINE

| Task | Time |
|------|------|
| Initial setup (first time only) | 30 min |
| Model download (first time only) | 10-15 min |
| Processing 1 PDF | 1-2 min |
| Processing 25 PDFs | 25-30 min |
| **Total (first run)** | **60-75 min** |
| **Subsequent runs** | **25-30 min** |

With GPU: 50-70% faster

---

## 🛠️ CUSTOMIZATION

### Add More Entities
Edit `medical_transcript_analyzer.py`:
```python
self.entities = {
    "Symptoms": [
        "Blurred vision",
        "Your new symptom here",  # Add here
    ],
    # ... rest
}
```

### Adjust Model Parameters
```python
'temperature': 0.1,  # Lower = more consistent
'top_p': 0.9,        # Nucleus sampling
```
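The analyzer's exact request code isn't reproduced here, but these options reach the model through Ollama's `/api/generate` endpoint. A minimal stdlib sketch, assuming the default local port:

```python
import json
import urllib.request

def build_payload(prompt, model="llama3.1:8b", temperature=0.1, top_p=0.9):
    """Request body for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # one complete response instead of a token stream
        "options": {"temperature": temperature, "top_p": top_p},
    }

def ollama_generate(prompt, **kwargs):
    """Send a prompt to a locally running Ollama server and return the reply."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(build_payload(prompt, **kwargs)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["response"]
```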

### Add More Few-Shot Examples
Edit `create_few_shot_prompt()` method to include more examples.

---

## 🐛 TROUBLESHOOTING

Common issues and solutions are documented in:
- **`TROUBLESHOOTING.md`** - Comprehensive troubleshooting guide

Quick fixes:
- **"Cannot connect to Ollama"**: Run `ollama serve`
- **"Model not found"**: Run `ollama pull llama3.1:8b`
- **"No text extracted"**: Check if PDFs are scanned (need OCR)
- **"JSON parsing error"**: Check `/tmp/llm_debug.txt` for LLM output

---

## 📈 PERFORMANCE TIPS

1. **Use a GPU if available**: Ollama detects supported GPUs and uses them automatically; no extra flag is required. You can verify with:
   ```bash
   ollama ps   # the PROCESSOR column shows whether a model runs on GPU or CPU
   ```

2. **Process in batches** for large datasets

3. **Use a smaller model** if low on RAM:
   ```bash
   ollama pull llama3.2:3b  # 3B-parameter model instead of the 8B llama3.1
   ```

4. **Truncate long transcripts** to speed up processing
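For tip 4, a simple truncation sketch (the 12,000-character limit is an assumption; tune it to your model's context window):

```python
def truncate_transcript(text: str, max_chars: int = 12000) -> str:
    """Keep the start of the transcript; intake questions usually come early.
    max_chars is an assumed budget -- tune it for your model's context window."""
    if len(text) <= max_chars:
        return text
    # Cut at a sentence boundary so the prompt doesn't end mid-sentence
    cut = text.rfind(".", 0, max_chars)
    return text[: cut + 1] if cut != -1 else text[:max_chars]

print(len(truncate_transcript("A" * 20000)))
```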

---

## 📚 DOCUMENTATION

- **`SETUP_GUIDE.md`**: Step-by-step installation
- **`TROUBLESHOOTING.md`**: Common issues & fixes
- **`OUTPUT_DOCUMENTATION.md`**: Detailed output explanation
- **`README.md`**: This overview

---

## ✅ VALIDATION CHECKLIST

Before presenting results:

- [ ] All PDF files processed successfully
- [ ] Excel file opens without errors
- [ ] Entity_Matrix has data for all conversations
- [ ] Spot-check 3-5 random extractions for accuracy
- [ ] Sentiments are reasonable
- [ ] Patient concerns/goals captured
- [ ] Doctor questions extracted
- [ ] No missing or corrupted data

---

## 🎓 UNDERSTANDING THE APPROACH

### Few-Shot Prompting
The system includes example input-output pairs in the prompt to teach the LLM:
- What entities to look for
- How to format responses
- What level of detail to extract
- How to assess sentiment
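A hypothetical sketch of what such a prompt builder can look like (`FEW_SHOT_EXAMPLES` and the wording are illustrative, not the project's actual prompt):

```python
# Illustrative few-shot examples; real ones would be longer medical excerpts
FEW_SHOT_EXAMPLES = [
    {
        "transcript": 'Patient: "My vision gets blurry when I read."',
        "output": '{"entity": "Blurred vision", "sentiment": "negative", '
                  '"quote": "My vision gets blurry when I read."}',
    },
]

def create_few_shot_prompt(transcript: str) -> str:
    """Prepend example input/output pairs, then append the new transcript."""
    parts = ["Extract medical entities as JSON. Examples:\n"]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Transcript:\n{ex['transcript']}\nOutput:\n{ex['output']}\n")
    parts.append(f"Now extract from:\nTranscript:\n{transcript}\nOutput:")
    return "\n".join(parts)

print(create_few_shot_prompt('Doctor: "Any floaters?" Patient: "Yes, small dark spots."'))
```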

### Entity Extraction Strategy
1. Define all entities upfront
2. For each transcript:
   - Extract text from PDF
   - Build prompt with few-shot examples
   - Call LLM to extract entities
   - Parse JSON response
   - Validate and structure data
3. Aggregate results across all transcripts
4. Generate tabulated outputs
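Step 4 (parsing the JSON response) is where runs most often fail, because models sometimes wrap the JSON in prose or markdown fences. A tolerant parser sketch:

```python
import json
import re

def parse_llm_json(raw: str):
    """Pull the first JSON object out of an LLM reply, tolerating code fences
    and surrounding chatter."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        raise ValueError("No JSON object found in LLM output")
    return json.loads(match.group(0))

reply = 'Sure! Here is the result:\n```json\n{"entity": "Floaters", "sentiment": "negative"}\n```'
print(parse_llm_json(reply))
```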

### Sentiment Analysis
- **Positive**: Patient expresses hope, improvement
- **Neutral**: Factual statements, no emotion
- **Negative**: Concerns, worsening, pain, worry
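These labels come from the LLM; for spot-checking them during validation, a crude keyword heuristic can flag obvious mismatches (the cue lists below are illustrative, not exhaustive, and are no substitute for the model's judgment):

```python
# Rough cue words for sanity-checking LLM sentiment labels
NEGATIVE_CUES = {"worried", "worse", "pain", "scared", "blurry", "afraid"}
POSITIVE_CUES = {"better", "improved", "hopeful", "relieved"}

def heuristic_sentiment(quote: str) -> str:
    """Keyword-based guess; disagreement with the LLM label warrants review."""
    words = set(quote.lower().split())
    if words & NEGATIVE_CUES:
        return "negative"
    if words & POSITIVE_CUES:
        return "positive"
    return "neutral"

print(heuristic_sentiment("I'm worried it is getting worse"))  # → negative
```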

---

## 🔬 ACCURACY IMPROVEMENT TIPS

1. **Add domain-specific examples** to few-shot prompt
2. **Make entity definitions more specific**
3. **Use chain-of-thought prompting**
4. **Validate with multiple models**
5. **Manual review of sample outputs**
6. **Iteratively refine prompts based on results**

---

## 📞 SUPPORT & FEEDBACK

If you encounter issues:

1. Check `TROUBLESHOOTING.md`
2. Review system logs:
   ```bash
   journalctl -u ollama -f
   ```
3. Run in debug mode:
   ```bash
   python3 -u run_analysis.py 2>&1 | tee debug.log
   ```
4. Test with single PDF first

---

## 🎯 SUCCESS CRITERIA

Your implementation is successful when:

✅ All PDFs processed without errors
✅ Excel output matches manual verification
✅ Entities correctly marked as present/absent
✅ Sentiments align with conversation tone
✅ Patient perspectives accurately captured
✅ Output format matches your supervisor's requirements
✅ Can compare results from different models

---

## 📝 NEXT STEPS

After initial analysis:

1. **Review outputs** for accuracy
2. **Validate** with 5-10 random samples
3. **Compare models** (Llama vs Gemma)
4. **Present results** to your supervisor
5. **Iterate** based on feedback
6. **Scale up** to full dataset if satisfied

---

## 🚀 READY TO START?

Choose your method:

```bash
# Easiest: Automated
./quick_start.sh

# Interactive: Step-by-step
python3 run_analysis.py

# Advanced: Custom code
python3 medical_transcript_analyzer.py
```

**Good luck with your analysis! 🎉**

---

## 📄 LICENSE & ATTRIBUTION

This system uses:
- **Ollama**: Apache 2.0 License
- **Llama 3.1**: Meta License
- **Gemma**: Google License
- **Python packages**: Various open-source licenses

**Note**: Ensure compliance with data privacy regulations when processing medical transcripts.

---

**Version**: 1.0
**Last Updated**: January 28, 2026
**Author**: Medical Transcript Analysis System
