# COMPLETE SETUP GUIDE FOR MEDICAL TRANSCRIPT ANALYSIS
# Step-by-Step Instructions

## PREREQUISITES
Before starting, ensure you have:
- Python 3.8 or higher
- At least 8GB RAM (16GB recommended for LLMs)
- PDF files of doctor-patient conversations

## STEP 1: INSTALL OLLAMA (Local LLM Runtime)

### For Linux:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```

### For Windows:
Download from: https://ollama.com/download/windows

### For Mac:
```bash
brew install ollama
```

## STEP 2: INSTALL REQUIRED PYTHON PACKAGES

```bash
pip install --break-system-packages PyPDF2 pandas openpyxl requests numpy
```

## STEP 3: DOWNLOAD AND RUN LLM MODELS

### Option A: Llama 3.1 8B (Recommended - 4.7GB)
```bash
ollama pull llama3.1:8b
ollama run llama3.1:8b
```

### Option B: Google Gemma 7B (Alternative - 5.0GB)
```bash
ollama pull gemma:7b
ollama run gemma:7b
```

**IMPORTANT**: Keep the Ollama terminal running in the background!
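Before moving on, you can confirm the Ollama server is reachable. A minimal stdlib-only sketch using Ollama's `/api/tags` endpoint (which lists locally pulled models; 11434 is Ollama's default port):

```python
import json
import urllib.error
import urllib.request

def ollama_models(base_url="http://localhost:11434"):
    """Return the names of locally pulled models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=3) as resp:
            data = json.load(resp)
        return [m["name"] for m in data.get("models", [])]
    except (urllib.error.URLError, OSError):
        return None

if __name__ == "__main__":
    models = ollama_models()
    if models is None:
        print("Ollama is not reachable -- start it with 'ollama serve'")
    else:
        print("Available models:", models)
```

If your chosen model (e.g. `llama3.1:8b`) is missing from the list, pull it first.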

## STEP 4: PREPARE YOUR PDF FILES

1. Create a folder for your PDFs (or use existing)
2. Make sure all PDF files are in one directory
3. Note the directory path

Example structure:
```
/home/user/medical_transcripts/
    ├── dr_BN1103.pdf
    ├── dr_CJ0406.pdf
    ├── dr_CL0110.pdf
    └── ... (more PDFs)
```
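A quick stdlib check that the directory is laid out correctly (the path below is illustrative):

```python
from pathlib import Path

def list_pdfs(directory):
    """Return the PDF files in a directory, sorted by name (non-recursive)."""
    return sorted(Path(directory).glob("*.pdf"))

if __name__ == "__main__":
    pdfs = list_pdfs("/home/user/medical_transcripts")  # your actual path
    print(f"Found {len(pdfs)} PDFs")
    for p in pdfs[:5]:
        print("  ", p.name)
```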

## STEP 5: UPDATE CONFIGURATION

Edit the `medical_transcript_analyzer.py` file:

Find this line (around line 395):
```python
PDF_DIRECTORY = "/mnt/user-data/uploads"
```

Change it to your PDF directory:
```python
PDF_DIRECTORY = "/home/user/medical_transcripts"  # Your actual path
```

## STEP 6: RUN THE ANALYZER

### Method 1: Interactive Mode
```bash
python3 medical_transcript_analyzer.py
```

Follow the prompts:
1. Select model (1 for Llama, 2 for Gemma)
2. Press Enter to start analysis
3. Wait for completion

### Method 2: Quick Run (Llama 3.1)
```bash
cd /path/to/script_dir  # directory containing medical_transcript_analyzer.py
python3 -c "
from medical_transcript_analyzer import MedicalTranscriptAnalyzer
import os
from datetime import datetime

analyzer = MedicalTranscriptAnalyzer(model_name='llama3.1:8b')
results = analyzer.analyze_all_transcripts('/home/user/medical_transcripts')  # your PDF directory

timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
output_dir = './output'
os.makedirs(output_dir, exist_ok=True)

# Save outputs
import json
with open(f'{output_dir}/extracted_data_{timestamp}.json', 'w') as f:
    json.dump(results, f, indent=2)

analyzer.create_tabulated_output(results, f'{output_dir}/analysis_tabulation_{timestamp}.xlsx')
analyzer.generate_analysis_report(results, f'{output_dir}/analysis_report_{timestamp}.txt')

print(f'Analysis complete! Check {output_dir}/')
"
```

## STEP 7: UNDERSTAND THE OUTPUTS

You will get 3 files:

### 1. JSON File (`extracted_data_TIMESTAMP.json`)
- Raw extracted data in JSON format
- Contains all entities, sentiments, and details
- Useful for further processing
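To sanity-check the JSON in Python, you can tally which entities were marked present. A sketch assuming the file maps conversation IDs to `{entity: {"present": ..., "sentiment": ...}}` dicts (the exact schema depends on your analyzer version, so adjust as needed):

```python
from collections import Counter

def entity_counts(results):
    """Count, per entity, how many conversations marked it present."""
    counts = Counter()
    for extraction in results.values():
        for entity, info in extraction.items():
            if isinstance(info, dict) and info.get("present"):
                counts[entity] += 1
    return counts

if __name__ == "__main__":
    # Inline sample; in practice, json.load() your timestamped file instead
    sample = {
        "dr_BN1103": {"Symptoms: Faded colors": {"present": True, "sentiment": "negative"}},
        "dr_CJ0406": {"Symptoms: Faded colors": {"present": False}},
    }
    for entity, n in entity_counts(sample).most_common():
        print(f"{entity}: {n}")
```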

### 2. Excel File (`analysis_tabulation_TIMESTAMP.xlsx`)
**This is the main deliverable for your supervisor!**

Contains 5 sheets:

**Sheet 1 - Summary**: Overview statistics
- Total conversations
- Total documents
- Model used
- Analysis date

**Sheet 2 - Entity Matrix**: Yes/No presence table
- Rows: Each conversation/document
- Columns: Each entity (present/absent)
- Additional columns: Sentiment for each entity

**Sheet 3 - Detailed Extractions**: Full details
- Conversation ID
- Entity name
- Sentiment (positive/neutral/negative)
- Details (quotes from conversation)
- Occurrence patterns
- Severity levels

**Sheet 4 - Doctor Questions**: Questions asked
- All questions extracted per conversation
- Identifies common questions

**Sheet 5 - Patient Perspectives**: Patient info
- Patient concerns
- Patient goals/hopes
- Overall severity assessment

### 3. Text Report (`analysis_report_TIMESTAMP.txt`)
- Human-readable summary
- Statistics and insights
- Most common entities
- Sentiment distribution

## STEP 8: VERIFY ACCURACY

To ensure accuracy, check a few random samples:

1. Open a PDF transcript
2. Find the same conversation in Excel (Sheet 3)
3. Verify the extracted entities match the PDF
4. Check if sentiments are correct
5. Confirm patient concerns/goals are accurate

## TROUBLESHOOTING

### Problem: "Connection refused" or "Model not found"
**Solution**: Make sure Ollama is running
```bash
# In a separate terminal
ollama serve
# In another terminal
ollama run llama3.1:8b
```

### Problem: "No text extracted from PDF"
**Solution**: 
- Check if PDFs are text-based (not scanned images)
- If scanned, you need OCR (pytesseract)
- Verify PDF files are not corrupted
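One way to tell text-based from scanned PDFs is to look at how much text PyPDF2 actually extracts: a scanned page comes back empty or nearly so. A sketch of that heuristic (the 50-character threshold is an arbitrary guess; tune it for your transcripts):

```python
def looks_scanned(extracted_text, min_chars=50):
    """Return True if a page's extracted text is too short to be a real transcript,
    which usually means the PDF is a scanned image and needs OCR."""
    return len(extracted_text.strip()) < min_chars

# Typical usage with PyPDF2 (not run here):
#   reader = PyPDF2.PdfReader("dr_BN1103.pdf")
#   if all(looks_scanned(page.extract_text() or "") for page in reader.pages):
#       print("Scanned PDF -- run OCR (e.g. pytesseract) first")
```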

### Problem: "JSON parsing error"
**Solution**:
- The LLM response wasn't valid JSON
- Re-run the file; generations vary even at the low temperature already set (0.1)
- Try a different model (switch between Llama and Gemma)
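A common fix is to tolerate extra prose around the JSON instead of feeding the raw response straight to `json.loads`. A sketch that pulls the first balanced `{...}` block out of an LLM response (the brace matching is naive about braces inside strings, but that is usually good enough here):

```python
import json

def extract_json(text):
    """Parse the first balanced {...} block in an LLM response.
    Returns the parsed object, or None if no valid JSON object is found."""
    start = text.find("{")
    while start != -1:
        depth = 0
        for i in range(start, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start:i + 1])
                    except json.JSONDecodeError:
                        break  # malformed candidate; try the next "{"
        start = text.find("{", start + 1)
    return None
```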

### Problem: "Out of memory"
**Solution**: 
- Process fewer files at once
- Use a smaller model (e.g., gemma:2b)
- Increase system RAM

### Problem: Slow processing
**Solution**: 
- Normal for first run (model loading)
- Each PDF takes 30-60 seconds
- For 25 PDFs, expect ~20-30 minutes total
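The estimate above is simple arithmetic you can rerun with your own per-file timings:

```python
def estimated_minutes(num_pdfs, secs_per_pdf=45):
    """Rough runtime estimate using the guide's 30-60 s/PDF figure (midpoint 45 s)."""
    return num_pdfs * secs_per_pdf / 60

# 25 PDFs at 30-60 s each:
# estimated_minutes(25, 30) -> 12.5, estimated_minutes(25, 60) -> 25.0
```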

## STEP 9: COMPARE MODELS

To compare Llama 3.1 vs Gemma:

```bash
# Run with Llama
python3 medical_transcript_analyzer.py
# Select option 1

# Run with Gemma
python3 medical_transcript_analyzer.py
# Select option 2

# Compare the Excel outputs
# Check which model gives more accurate results
```
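For a quantitative comparison rather than eyeballing the spreadsheets, you can measure how often the two models agree on entity presence. A sketch over plain dicts with a hypothetical shape (conversation ID -> `{entity: bool}`); you would populate them from each model's Entity Matrix sheet:

```python
def agreement_rate(matrix_a, matrix_b):
    """Fraction of (conversation, entity) cells where two models agree.
    Only cells present in both matrices are compared."""
    total = agree = 0
    for conv in matrix_a.keys() & matrix_b.keys():
        for entity in matrix_a[conv].keys() & matrix_b[conv].keys():
            total += 1
            agree += matrix_a[conv][entity] == matrix_b[conv][entity]
    return agree / total if total else 0.0
```

High agreement means the spot-checks from Step 8 matter less; low agreement tells you which conversations to review by hand.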

## STEP 10: ADVANCED - IMPROVE ACCURACY WITH FEW-SHOT

The script already includes few-shot examples in the prompt.

To add more examples:
1. Open `medical_transcript_analyzer.py`
2. Find the `create_few_shot_prompt()` method
3. Add more examples following the same format

Example:
```text
EXAMPLE 3:
Doctor: "Have you noticed any changes in your color vision?"
Patient: "Yes, colors don't seem as bright. Everything looks washed out."

EXTRACTION:
{
  "Symptoms: Faded colors": {"present": true, "sentiment": "negative", "details": "Colors appear washed out, not as bright"}
}
```

## QUICK REFERENCE COMMANDS

```bash
# Install everything
pip install --break-system-packages PyPDF2 pandas openpyxl requests

# Start Ollama server
ollama serve

# Download model (in another terminal)
ollama pull llama3.1:8b

# Run analyzer
python3 medical_transcript_analyzer.py

# Check outputs
ls -lh ./output/
```

## EXPECTED TIMELINE

- Setup (first time): 30 minutes
- Model download: 10-15 minutes
- Processing 25 PDFs: 20-30 minutes
- **Total: ~60-75 minutes for complete setup and first run**

Subsequent runs will be faster (no download needed).

## WHAT YOUR SUPERVISOR WILL GET

✅ Total number of unique conversations
✅ Main questions asked of each patient
✅ All entities extracted with yes/no presence
✅ Sentiment analysis for each entity
✅ Patient concerns, goals, severity
✅ Tabulated Excel output for analysis
✅ Comparison between Llama 3.1 and Gemma models

This covers everything that was requested.
