🍽️ Mangetamain, Garde l'Autre Pour Demain¶
"Eat this one, save the other for tomorrow!"
Mangetamain is a leader in B2C recipe recommendations powered by massive data analytics. We're sharing our best insights with the world through an interactive web platform where everyone can discover what makes recipes delicious... or not!
🌐 Visit Our Live Application
This Streamlit web application provides comprehensive data analysis and visualization of cooking recipes and user interactions based on the Food.com Kaggle dataset. Explore how recipes are rated, discover food trends, and understand user behavior through interactive dashboards powered by big data tools and advanced visualization techniques.
🚀 Features¶
- 📊 Recipe Analysis: Extract and visualize key recipe metrics (ingredients, steps, cooking time, nutritional info)
- 👥 User Interaction Insights: Analyze user ratings, reviews, and behavioral patterns
- 🔄 Automated Data Processing: Efficient data cleaning and transformation into Parquet format
- 📈 Interactive Dashboards: Built with Streamlit for real-time data exploration
- 🎨 Advanced NLP Visualizations:
  - Word clouds (frequency-based and TF-IDF)
  - Polar plots for ingredient analysis
  - Venn diagrams for method comparison
- 🏗️ Modular Architecture: Clean separation of backend, frontend, and utilities
- 🐳 Docker Support: Containerized deployment for easy scaling
- ⚡ High Performance: Leverages Polars for 10x faster DataFrame operations
🔧 Usage¶
Once the application is running, you can access multiple analytical pages:
📱 Available Dashboards¶
- Ratings Dashboard - Rating distributions and popularity analysis
- Trends - Temporal trends and popular recipe categories
- User Analysis - User behavior patterns and engagement metrics
- Recipe Analysis - Deep dive into recipe characteristics and ingredients
🎯 Key Capabilities¶
- Filter recipes by preparation time, rating, or specific ingredients (see the sketch after this list)
- Visualize rating distributions and popularity trends over time
- Explore word clouds of recipe reviews and ingredient frequencies
- Compare different NLP analysis methods (frequency vs. TF-IDF)
- Interactive polar plots for ingredient category analysis
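As an example of the first capability, a filter over the processed data might look like this with Polars. The column names here are assumptions for illustration, not the actual schema:

```python
import polars as pl

# Hypothetical example: "minutes" and "avg_rating" are assumed column names.
recipes = pl.read_parquet("data/processed/recipes.parquet")

quick_and_popular = recipes.filter(
    (pl.col("minutes") <= 30) & (pl.col("avg_rating") >= 4.5)
)
print(quick_and_popular.head())
```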
🛠️ Installation¶
Prerequisites¶
- Python 3.12+
- UV package manager
- Docker (optional, for containerized deployment)
Local Setup¶
# Clone repository
git clone https://github.com/shanoether/kit_big_data_mangetamain.git
cd kit_big_data_mangetamain
# Install dependencies
uv sync
# Download spaCy model
uv run python -m spacy download en_core_web_sm
Data Preparation¶
- Download the Food.com dataset
- Place RAW_interactions.csv and RAW_recipes.csv in data/raw/
- Run preprocessing:
uv run python src/mangetamain/backend/data_processor.py
This generates optimized .parquet files and a cached recipe_analyzer.pkl in data/processed/.
Launch Application¶
uv run streamlit run src/mangetamain/streamlit_ui.py
Access at http://localhost:8501
Docker Deployment¶
# Build and start all services
docker-compose -f docker-compose-local.yml up
# Access the application at http://localhost:8501
# Stop the services
docker-compose -f docker-compose-local.yml down
This will:
1. Build the Docker image locally
2. Spawn a preprocessing container to process the data
3. Launch the Streamlit webapp in a container
📁 Project Structure¶
kit_big_data_mangetamain/
├── src/mangetamain/
│   ├── streamlit_ui.py           # Main application entry
│   ├── backend/
│   │   ├── data_processor.py     # ETL pipeline
│   │   └── recipe_analyzer.py    # NLP analysis
│   ├── frontend/pages/           # Streamlit pages
│   │   ├── overview.py
│   │   ├── recipes_analysis.py
│   │   ├── users_analysis.py
│   │   ├── trends.py
│   │   └── dashboard.py
│   └── utils/
│       ├── logger.py             # Custom logging
│       └── helper.py             # Data loading utilities
├── tests/unit/                   # Unit tests
├── data/
│   ├── raw/                      # CSV input files
│   └── processed/                # Parquet & pickle outputs
├── docs/                         # MkDocs documentation
├── .github/workflows/deploy.yml  # CI/CD pipeline
├── docker-compose.yml            # Production deployment
├── Dockerfile                    # Multi-stage build
└── pyproject.toml                # Dependencies & config
💻 Development Functionalities¶
🎯 Object-Oriented Programming (OOP)¶
The project follows OOP best practices with a clean, modular architecture. The object-oriented design stays deliberately simple, though, because Streamlit implements its own execution model and is not designed to run inside a class hierarchy.
Core Classes¶
- DataProcessor (src/mangetamain/backend/data_processor.py)
  - Purpose: ETL pipeline for data cleaning and transformation
  - Key Methods:
    - load_data(): Load and validate raw CSV/ZIP files
    - drop_na(): Remove rows with missing or unrealistic values
    - split_minutes(): Categorize recipes by cooking time
    - merge_data(): Join interactions with recipe metadata
    - save_data(): Export processed data to Parquet format
  - Features: Type hints, comprehensive docstrings, error handling (usage sketch below)
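Taken together, the key methods above suggest a pipeline along these lines. This is a hedged sketch; constructor arguments and the exact call order in data_processor.py may differ:

```python
from mangetamain.backend.data_processor import DataProcessor

# Sketch of the ETL flow described above; signatures are assumptions.
processor = DataProcessor()
processor.load_data()      # validate raw CSV/ZIP inputs
processor.drop_na()        # drop missing/unrealistic rows
processor.split_minutes()  # bucket recipes by cooking time
processor.merge_data()     # join interactions with recipe metadata
processor.save_data()      # write Parquet outputs to data/processed/
```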
- RecipeAnalyzer (src/mangetamain/backend/recipe_analyzer.py)
  - Purpose: NLP analysis and visualization of recipe data
  - Key Methods:
    - preprocess_text(): Batch spaCy processing for a 5-10x speedup
    - frequency_wordcloud(): Generate frequency-based word clouds
    - tfidf_wordcloud(): Generate TF-IDF weighted word clouds
    - compare_frequency_and_tfidf(): Venn diagram comparison
    - plot_top_ingredients(): Polar plots for ingredient analysis
  - Features: LRU caching, figure memoization, streaming support
  - Serialization: Supports save() and load() for pickle persistence
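The batch speedup comes from spaCy's nlp.pipe, which streams texts through the pipeline in batches instead of one nlp() call per review. A minimal sketch of the technique, not the exact preprocess_text() implementation:

```python
import spacy

# Disable components not needed for tokenization/lemmatization.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def preprocess_texts(texts: list[str]) -> list[list[str]]:
    # nlp.pipe batches documents (100 at a time, matching the docs),
    # which is typically several times faster than per-text calls.
    return [
        [tok.lemma_.lower() for tok in doc if tok.is_alpha and not tok.is_stop]
        for doc in nlp.pipe(texts, batch_size=100)
    ]
```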
- BaseLogger (src/mangetamain/utils/logger.py)
  - Purpose: Centralized logging with colored console output
  - Features:
    - Rotating file handlers (5 MB max, 3 backups)
    - ANSI color codes for different log levels
    - Thread-safe singleton pattern
    - Separate log files per session
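An illustrative reduction of that setup using only the standard library; the real BaseLogger additionally handles ANSI colors, console output, and per-session file names:

```python
import logging
from logging.handlers import RotatingFileHandler

def build_logger(name: str, logfile: str) -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:  # configure once, singleton-style
        handler = RotatingFileHandler(
            logfile, maxBytes=5 * 1024 * 1024, backupCount=3  # 5 MB, 3 backups
        )
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.DEBUG)
    return logger
```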
Design Patterns Used¶
- Factory Pattern: Logger instantiation via get_logger()
- Singleton Pattern: Single logger instance per module
- Strategy Pattern: Different word cloud generation strategies (frequency vs. TF-IDF)
- Caching Pattern: LRU cache decorators for expensive operations (sketch below)
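The caching pattern in practice, as a generic sketch with a hypothetical workload; the project applies the same decorator to its word-cloud and plotting helpers:

```python
from collections import Counter
from functools import lru_cache

@lru_cache(maxsize=32)
def word_frequencies(text: str) -> tuple[tuple[str, int], ...]:
    # Hypothetical expensive step: repeated calls with the same text
    # return the memoized result instead of recomputing.
    return tuple(Counter(text.lower().split()).most_common(50))
```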
🏗️ Frontend/Backend Architecture¶
The application follows a separation of concerns architecture with distinct backend and frontend components:
Two-Stage Container Architecture¶
Our Docker deployment uses a sequential container orchestration pattern (see the compose sketch below):
Stage 1: Backend Processing Container¶
- Purpose: Heavy data preprocessing and transformation
- Process: Runs data_processor.py to:
  - Load raw CSV datasets from Kaggle
  - Clean and validate data (remove nulls, filter outliers)
  - Transform data into optimized formats
  - Generate .parquet files for fast columnar storage
  - Serialize NLP models to .pkl files
- Lifecycle: Automatically shuts down after successful completion
- Output: Persisted files in the data/processed/ volume
Stage 2: Frontend Application Container¶
- Purpose: Lightweight web interface for data visualization
- Process: Runs the Streamlit application
- Data Access: Reads preprocessed .parquet and .pkl files
- Lifecycle: Runs continuously to serve the web application
- Resources: Minimal CPU/memory footprint
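A minimal docker-compose sketch of this sequential pattern, with hypothetical service names and paths; docker-compose-local.yml holds the project's actual configuration:

```yaml
services:
  preprocessing:
    image: mangetamain:latest
    command: python src/mangetamain/backend/data_processor.py
    volumes:
      - ./data:/app/data
  webapp:
    image: mangetamain:latest
    command: streamlit run src/mangetamain/streamlit_ui.py
    ports:
      - "8501:8501"
    volumes:
      - ./data:/app/data
    depends_on:
      preprocessing:
        # Compose waits for the preprocessing container to exit
        # successfully before starting the webapp.
        condition: service_completed_successfully
```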
Architecture Benefits¶
- Separation of Concerns:
  - Backend handles computationally expensive ETL operations
  - Frontend focuses solely on visualization and user interaction
- Improved Stability:
  - Frontend never performs heavy preprocessing
  - No risk of UI crashes during data processing
  - Graceful failure isolation
- Resource Efficiency:
  - Backend container only runs when data updates are needed
  - Frontend container remains lightweight and responsive
  - Optimized resource allocation per workload type
- Faster Startup:
  - Frontend launches instantly with preprocessed data
  - No waiting for data processing on application start
  - Better user experience
🔄 Continuous Integration (Pre-Commit)¶
We maintain code quality through automated pre-commit checks:
Pre-Commit Hooks¶
Our .pre-commit-config.yaml includes:
- Code Quality
  - ✅ ruff: Fast Python linter and formatter
  - ✅ ruff-format: Code formatting (PEP 8 compliant)
  - ✅ mypy: Static type checking
- File Integrity
  - ✅ trailing-whitespace: Remove trailing whitespace
  - ✅ end-of-file-fixer: Ensure files end with a newline
  - ✅ check-merge-conflict: Detect merge conflict markers
  - ✅ check-toml: Validate TOML syntax
  - ✅ check-yaml: Validate YAML syntax
- Testing
  - ✅ pytest: Run unit tests before commit
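An abridged sketch of what such a config looks like; the rev values are placeholders, and the repository's .pre-commit-config.yaml is authoritative:

```yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.9  # placeholder
    hooks:
      - id: ruff
      - id: ruff-format
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0  # placeholder
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-merge-conflict
      - id: check-toml
      - id: check-yaml
  - repo: local
    hooks:
      - id: pytest
        name: pytest
        entry: uv run pytest
        language: system
        pass_filenames: false
```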
Running Pre-Commit Checks¶
# Pre-commit checks
uv run pre-commit install
# Run on all files manually
uv run pre-commit run --all-files
# Linting & formatting
uv run ruff check .
uv run ruff format .
# Type checking
uv run mypy src
For detailed pre-commit procedures, see our comprehensive guide:
📖 Pre-Commit Workflow Playbook
📚 Documentation¶
We use MkDocs with the Material theme for comprehensive documentation.
Documentation Structure¶
- Available on GitHub: Documentation is automatically updated during deployment and published on GitHub Pages.
- API Reference: Auto-generated from docstrings using mkdocstrings
- Playbooks: Step-by-step guides for common tasks
  - Environment setup
  - Pre-commit workflow
  - Troubleshooting
  - GCP deployment
- User Guides: How to use the application features
Serving Documentation Locally¶
# Start documentation server
uv run hatch run docs:serve
# Access at http://127.0.0.1:8000
Building Documentation¶
# Build static documentation site
uv run hatch run docs:build
# Deploy to GitHub Pages
uv run hatch run docs:deploy
Documentation Configuration¶
- Tool: MkDocs with Material theme
- Plugins:
  - mkdocstrings: API reference generation
  - gen-files: Dynamic content generation
  - section-index: Automatic section indexing
  - include-markdown: Markdown file inclusion
- Auto-generation: scripts/gen_ref_pages.py generates API docs from source code (see the sketch below)
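Such scripts usually follow the standard mkdocs-gen-files recipe. A condensed sketch of the approach; the actual scripts/gen_ref_pages.py may differ in details:

```python
from pathlib import Path

import mkdocs_gen_files

# Emit one mkdocstrings stub page per Python module found under src/.
for path in sorted(Path("src").rglob("*.py")):
    module = ".".join(path.relative_to("src").with_suffix("").parts)
    doc_path = path.relative_to("src").with_suffix(".md")
    with mkdocs_gen_files.open(Path("reference", doc_path), "w") as fd:
        print(f"::: {module}", file=fd)  # mkdocstrings directive
```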
📖 View Full Documentation (after running docs:serve)
📝 Logger¶
Our custom logging system provides structured, colorful logs for better debugging:
Features¶
- Color-Coded Output: Different colors for DEBUG, INFO, WARNING, ERROR, CRITICAL
- File Rotation: Automatic log rotation (5MB files, 3 backups)
- Dual Output: Console and file handlers
- Session-Based: Separate log files per application run
- Thread-Safe: Safe for concurrent operations
Usage¶
from mangetamain.utils.logger import get_logger
logger = get_logger()
logger.debug("Debugging information")
logger.info("Processing started")
logger.warning("Deprecated feature used")
logger.error("Failed to load data")
logger.critical("System shutdown required")
Log Storage¶
Logs are stored in logs/app/ with timestamped filenames:
logs/app/
├── app_20251028_143052.log
├── app_20251028_143052.log.1
└── app_20251028_143052.log.2
🚀 Continuous Deployment¶
We use GitHub Actions for automated deployment to our Google Cloud VM.
CI/CD Pipeline¶
Our .github/workflows/deploy.yml implements a two-stage pipeline:
Stage 1: Security Scan¶
- Tool: Safety CLI
- Purpose: Scan dependencies for known security vulnerabilities
- Trigger: Every push to the main branch
- Action: Fails the pipeline if vulnerabilities are found
Stage 2: Build & Deploy¶
Only runs if security scan passes:
1. Build Docker Image
   - Multi-stage build for optimized image size
   - Tags: latest and sha-<commit-sha>
2. Push to GitHub Container Registry (GHCR)
3. Deploy to VM
   - SSH into the Google Cloud VM
   - Pull the latest code and Docker images
   - Run docker compose up -d
   - Zero-downtime deployment with health checks
4. Deploy Documentation
   - Build documentation with MkDocs
   - Deploy to GitHub Pages automatically
   - Available at: https://shanoether.github.io/kit_big_data_mangetamain/
Deployment Flow¶
Push to main → Security Scan → Build Docker → Push to GHCR → Deploy to VM → Deploy Docs to GitHub Pages
Environment Variables & Secrets¶
Required GitHub secrets:
- SAFETY_API_KEY: Safety CLI API key
- SSH_KEY: Private SSH key for VM access
- GHCR_PAT: GitHub Personal Access Token
- SSH_HOST: VM IP address (environment variable)
- SSH_USER: VM SSH username (environment variable)
Manual Deployment¶
For manual deployment to Google Cloud, see the GCP Deployment Guide linked under Additional Resources below.
🧪 Tests¶
We maintain comprehensive test coverage across all modules.
Test Coverage¶
- Overall Coverage: ~90%+ across core modules
- Backend: 100% coverage on DataProcessor and RecipeAnalyzer
- Utils: 100% coverage on logger and helper functions
- Frontend: Core Streamlit functions tested with mocking (sketch after this list)
Running Tests¶
# Run all tests
uv run pytest
# With coverage
uv run pytest --cov=src --cov-report=html
# Specific test
uv run pytest tests/unit/mangetamain/backend/test_recipe_analyzer.py
🔒 Security¶
- Dependency Scanning: Automated Safety CLI checks on every commit. We still depend on some vulnerable libraries, but the risk is small: the application takes no user input, has no user accounts, and runs in an isolated environment. A summary of the last security review can be found in docs/security_review.png.
- Firewall: Only port 443 exposed
- SSH: Key-based authentication only, no passwords
- Docker: Non-root user, minimal base image
- Secrets: GitHub Secrets for credentials, no hardcoded values
- Error Pages: Users see only generic error messages, never detailed ones, so internals cannot be probed for exploitation.
- HTTPS Connection: Secure connection over HTTPS with a certificate issued via letsencrypt.
⚡ Performance¶
We currently have some unresolved performance issues: page loads remain slow despite the optimization techniques below.
Optimizations:
- Polars: 10-30x faster than Pandas for large datasets
- Batch Processing: spaCy processes 100 texts at a time
- Caching: @st.cache_data for data, @st.cache_resource for models (sketch below)
- Lazy Loading: Data loaded only when needed
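Caching and lazy loading combine in loaders along these lines. This is a sketch; the actual helpers live in utils/helper.py and may differ:

```python
import polars as pl
import streamlit as st

@st.cache_data
def load_recipes(path: str) -> pl.DataFrame:
    # Hypothetical loader: cached per path, so each parquet file is
    # read once and the result reused across Streamlit reruns.
    return pl.read_parquet(path)
```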
Profiling:
uv run py-spy record -o profile.svg -- python src/mangetamain/backend/data_processor.py
🧑‍💻 Contributing¶
Use our issue templates for bug reports or feature requests:
- Bug Report: issue_template/bug_report.md
- Feature Request: issue_template/feature_request.md
🌱 Future Improvements¶
We're continuously working to improve Mangetamain. Here are planned enhancements:
- 🔍 Recipe Clustering: ML-based similarity analysis to discover recipe patterns and group similar recipes
- 📊 Advanced Visualizations:
  - Network graphs for review relationships
  - Heatmaps for user behavior patterns
- ⚙️ Enhanced CI/CD Pipeline:
  - Add an automated testing stage
  - Manual approval gate before production deployment
- 🧮 Advanced Analytics:
  - Sentiment analysis on user reviews
  - Anomaly detection for unusual rating patterns (bots)
- 🗃️ Code Improvements:
  - Move from Parquet to PostgreSQL for better scalability, with an API endpoint for the frontend to connect to
  - Implement data versioning
  - Move from pyplot to Plotly
  - Migrate toward libraries without known vulnerabilities
📊 Project Metrics¶
- Test Coverage: 90%+
- Python Version: 3.12+
- Docker Image: ~1.5GB (multi-stage optimized)
- Lines of Code: ~5,000
🙏 Acknowledgments¶
- Dataset: Food.com Recipes and User Interactions from Kaggle
- Framework: Streamlit for the interactive web interface
- Data Processing: Polars for high-performance data operations
- NLP: spaCy for natural language processing
- Deployment: Google Cloud Platform for hosting
🔗 Additional Resources¶
- 📖 Environment Setup Playbook - Detailed environment configuration
- 📖 Pre-Commit Playbook - Code quality workflow
- 📖 Troubleshooting Guide - Common issues and solutions
- 📖 GCP Deployment Guide - Production deployment
📄 License¶
MIT License - see LICENSE file
📞 Contact¶
- Issues: GitHub Issues
- Email: gardelautrepourdemain@mangetamain.ai
- Live App: https://mangetamain.duckdns.org/