Computer Vision Code Assessment

Automated Code Quality Analysis Using Machine Learning

Python, OpenCV, TensorFlow, PyTorch, Flask API

Project Overview

A computer vision system that analyzes screenshots of source code to detect common coding errors automatically and provide educational feedback. It combines OCR preprocessing with trained ML models to identify syntax errors, style violations, and logical mistakes in code images.

Key Features

  • Advanced OCR preprocessing pipeline for code extraction
  • Deep learning models trained on code error patterns
  • Multi-language support (Python, Java, C, JavaScript)
  • Syntax error detection and classification
  • Educational feedback system with explanations
  • RESTful API for integration with educational platforms

Technical Implementation

Computer Vision Pipeline
  • Image Preprocessing: Deskewing, noise reduction, contrast enhancement
  • Text Detection: Custom-trained EAST model for code region detection
  • OCR: Tesseract OCR with custom code-optimized configurations
  • Post-processing: Code formatting correction and syntax restoration
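
The OCR step mentions code-optimized Tesseract configurations without listing them. As a minimal sketch, a config string could be assembled as below. The `--oem` and `--psm` flags and the `preserve_interword_spaces` variable are standard Tesseract options, but the specific values and the helper name `build_tesseract_config` are assumptions made for illustration, not the project's actual settings.

```python
def build_tesseract_config(psm: int = 6, oem: int = 1, keep_spaces: bool = True) -> str:
    """Build a Tesseract config string for code screenshots.

    The defaults are illustrative assumptions, not the project's real settings.
    """
    parts = [
        f"--oem {oem}",  # 1 = LSTM engine only
        f"--psm {psm}",  # 6 = treat the image as a single uniform block of text
    ]
    if keep_spaces:
        # Ask Tesseract to keep runs of spaces between words instead of collapsing them
        parts.append("-c preserve_interword_spaces=1")
    return " ".join(parts)


config = build_tesseract_config()
# With pytesseract installed, the string would be passed as:
#   text = pytesseract.image_to_string(image, config=config)
```

Keeping the options in one function makes it easier to try different page segmentation modes against the same set of screenshots.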
ML Architecture
  • Error Detection Model: CNN-based architecture for identifying error patterns
  • Classification: Multi-class classifier for error type categorization
  • Training Data: Custom dataset of 10,000+ code screenshots with annotations
  • Frameworks: TensorFlow for production, PyTorch for experimentation
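
To illustrate how a multi-class classifier's raw outputs become an error type with a confidence score, here is a framework-free sketch of the final softmax step. The label names in `ERROR_TYPES` and the example scores are hypothetical; the project's real label set and network are not described here.

```python
import math

# Hypothetical error categories; the project's real label set is not given.
ERROR_TYPES = ["syntax_error", "style_violation", "logic_error"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits):
    """Return the most likely error type and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ERROR_TYPES[best], probs[best]


# Example scores such as a network's final layer might produce
label, confidence = classify([2.0, 0.5, -1.0])
```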
API Design
  • Flask RESTful API with authentication and rate limiting
  • Endpoints: Image upload, error analysis, batch processing
  • Response Format: JSON with detected errors, confidence scores, suggestions
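
The response format described above could look roughly like the following sketch. The field names (`line`, `error_type`, `confidence`, `suggestion`, `error_count`) are assumptions chosen for illustration, not the API's documented schema, and the Flask routing layer is left out so the example runs with the standard library alone.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class DetectedError:
    """One detected error; field names are hypothetical."""
    line: int
    error_type: str
    confidence: float
    suggestion: str


def build_response(errors: List[DetectedError]) -> str:
    """Serialize detected errors into a JSON response body."""
    payload = {
        "error_count": len(errors),
        "errors": [asdict(e) for e in errors],
    }
    return json.dumps(payload, indent=2)


response = build_response([
    DetectedError(
        line=3,
        error_type="syntax_error",
        confidence=0.92,
        suggestion="Add a colon at the end of the 'if' statement.",
    ),
])
```

In a Flask view, the same payload dictionary could be returned through `flask.jsonify` instead of `json.dumps`.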

Challenges & Solutions

Challenge: OCR Accuracy for Code

Solution: Fine-tuned Tesseract with code-specific character sets and implemented custom post-processing to fix common OCR errors in programming syntax.
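
A minimal sketch of what such post-processing might involve, assuming two typical failure modes: typographic characters substituted for ASCII symbols, and comparison operators split by a stray space. The substitution table and the helper name `fix_ocr_text` are illustrative, not the project's actual rules, and the operator rule is a heuristic that can misfire on some inputs.

```python
import re

# Typographic characters that OCR engines often emit in place of ASCII code symbols.
CHAR_FIXES = {
    "\u201c": '"', "\u201d": '"',  # curly double quotes
    "\u2018": "'", "\u2019": "'",  # curly single quotes
    "\u2014": "-", "\u2013": "-",  # em and en dashes
}


def fix_ocr_text(text: str) -> str:
    """Apply character substitutions, then rejoin split two-character operators."""
    for wrong, right in CHAR_FIXES.items():
        text = text.replace(wrong, right)
    # Rejoin operators split by one stray space, e.g. "= =" -> "==", "! =" -> "!="
    text = re.sub(r"([=!<>]) =", r"\1=", text)
    return text


fixed = fix_ocr_text("if x = = \u201cok\u201d:")
```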

Challenge: Handling Different Code Styles

Solution: Built a normalization layer that standardizes different indentation styles and formatting conventions before analysis.
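
One possible form of such a normalization step, sketched under the assumption that it converts leading tabs to spaces and strips trailing whitespace. The function name `normalize_indentation` and the 4-space tab width are hypothetical choices. Only the leading whitespace is expanded, so tabs inside string literals are left alone.

```python
def normalize_indentation(source: str, tab_width: int = 4) -> str:
    """Expand leading tabs to spaces and strip trailing whitespace on each line."""
    lines = []
    for line in source.splitlines():
        stripped = line.lstrip(" \t")
        indent = line[: len(line) - len(stripped)]
        # Expand tabs only in the indentation, then drop trailing whitespace
        lines.append((indent.expandtabs(tab_width) + stripped).rstrip())
    return "\n".join(lines)


normalized = normalize_indentation("def f():\n\treturn 1  \n")
```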

Challenge: Training Data Collection

Solution: Created a synthetic data generator that produces code screenshots with programmatically inserted errors, augmented with real student submissions.
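
As a sketch of programmatic error insertion, the example below removes the colon from one block header in a Python snippet and records which line was changed, so the annotation comes for free. The function names and the choice of a missing-colon error are assumptions for illustration; rendering the result to an image is a separate step that is not shown.

```python
import ast
import random


def drop_colon(source: str, rng: random.Random):
    """Remove the trailing colon from one randomly chosen line.

    Returns the modified source and the 1-based line number that was changed,
    or the original source and None if no line ends with a colon.
    """
    lines = source.splitlines()
    candidates = [i for i, line in enumerate(lines) if line.rstrip().endswith(":")]
    if not candidates:
        return source, None
    target = rng.choice(candidates)
    lines[target] = lines[target].rstrip()[:-1]
    return "\n".join(lines), target + 1


def is_valid_python(source: str) -> bool:
    """Check whether the source parses as Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


clean = "def add(a, b):\n    return a + b\n"
broken, line_no = drop_colon(clean, random.Random(0))
```

Checking both versions with `is_valid_python` confirms that the clean sample parses and the generated one does not, which guards against mislabeled training pairs.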

What I Learned

  • Deep understanding of OCR systems and preprocessing techniques
  • Computer vision model training and optimization
  • Data augmentation strategies for specialized domains
  • Building production-ready ML APIs with Flask
  • Working with both TensorFlow and PyTorch frameworks
  • Creating educational systems that provide constructive feedback