Multi-Layer Perceptrons: From Mathematics to Implementation
A complete guide to understanding neural networks from first principles. Learn MLP mathematics with matrices, forward propagation, backpropagation, and implement everything from scratch.
rishabh.mondal@iitgn.ac.in
Hello, friends! I am Rishabh Mondal (ঋষভ মণ্ডল). In Bengali, “Rishabh” means superior and also refers to the second note (Re) of the Indian classical music scale: a sound associated with harmony and grace.
My research lies at the intersection of Earth Observation and Computer Vision, with a focus on environmental monitoring, geographical domain generalization, and foundation models for remote sensing. I am currently a Ph.D. scholar at the Sustainability Lab, IIT Gandhinagar, supervised by Prof. Nipun Batra. I hold an M.Tech (2023) in Information Technology from the Indian Institute of Engineering Science and Technology (IIEST), Shibpur, where I worked under the guidance of Dr. Prasun Ghosal in the domain of TinyML, and a B.Tech (2021) in Computer Science and Engineering from The Neotia University, Kolkata.

Deep dives into machine learning, computer vision, and geospatial AI. From foundational concepts to cutting-edge research.
5 Articles · 3 Topics
A walkthrough of the ICLR 2024 paper that brings century-old physics into modern geospatial ML, explaining how spherical harmonics improve location encoding for neural networks.
A complete guide to understanding neural networks from first principles. Learn MLP mathematics with matrices, forward propagation, backpropagation, and implement everything from scratch.
A comprehensive Matplotlib tutorial in question-answer format that progresses from basics to advanced plotting design.
Understanding how modern multimodal models like Flamingo compress visual information for language models using learned latent queries and cross-attention.
A comprehensive tutorial covering data manipulation, analysis, and visualization using Pandas.
Coming Soon
An overview of foundation models in remote sensing and their applications for environmental monitoring.
Student-to-Student Research Collaboration in AI for Earth Observation
I believe in peer-to-peer learning and collaborative research. Research is not just about publishing papers. It’s about asking bold questions, learning through failure, and growing together. If you’re a student passionate about solving real-world problems using AI, let’s connect!
Send me an email or a LinkedIn message with your background and interests
We discuss ideas and find a good project fit
Work together on research with regular sync-ups
Aim for a conference/journal publication together
The Problem: Self-supervised learning powers modern computer vision, but its standard recipes were designed around ImageNet-style data: rich colors, clear objects, strong contrast. What happens when the visual world is subtle?
Think Mars terrain, medical tissue, and radar imagery. These are domains where every patch looks nearly identical, contrast is faint, and color barely exists.
We find that SSL methods train normally, loss converges, and checkpoints look healthy, but the learned features are quietly useless. We call this silent collapse.
This project diagnoses exactly where and why standard SSL objectives break on homogeneous imagery. We then propose fixes including frequency-aware masking, contrast-amplified augmentations, and hard-negative mining. These will be validated across Mars science tasks and medical imaging benchmarks.
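As a toy illustration of the contrast-amplified augmentation idea, the sketch below stretches a nearly flat intensity patch to full range before applying a gamma curve. The function name and the percentile/gamma defaults are illustrative choices for this sketch, not the project's actual augmentation pipeline:

```python
def amplify_contrast(pixels, low_pct=2.0, high_pct=98.0, gamma=0.7):
    """Percentile-stretch intensities to [0, 1], then apply a gamma
    curve (gamma < 1 brightens mid-tones), exaggerating faint texture.
    Hypothetical sketch; parameter values are illustrative."""
    ranked = sorted(pixels)
    n = len(ranked)
    lo = ranked[int(n * low_pct / 100)]
    hi = ranked[min(int(n * high_pct / 100), n - 1)]
    span = max(hi - lo, 1e-8)  # guard against fully flat patches
    out = []
    for p in pixels:
        x = min(max((p - lo) / span, 0.0), 1.0)  # clip to [0, 1]
        out.append(x ** gamma)
    return out

# A nearly flat patch (intensities within 0.1 of each other), as in
# homogeneous Mars or radar imagery, is stretched to the full range:
patch = [0.50, 0.52, 0.55, 0.58, 0.60]
stretched = amplify_contrast(patch)
```

The point of the augmentation is that two crops of a homogeneous scene become distinguishable to the contrastive objective only after their faint texture is amplified.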
The Problem: Current spatio-temporal forecasting models assume nearby regions behave similarly. But climate doesn’t respect borders: Mumbai’s monsoon dynamics may better predict Chennai than geographically closer Hyderabad.
We propose GeoRAG, a retrieval-augmented forecasting framework. Instead of memorizing spatial patterns during training, we encode each region’s geographic identity using satellite imagery, geo-images, and LLM-derived text descriptions into a unified embedding.
At inference, we retrieve the k most climatologically similar regions from anywhere on Earth and feed their historical time series as in-context examples to a lightweight transformer forecaster.
The key technical contribution is a learned multimodal geographic similarity metric, trained so that regions with correlated meteorological behavior cluster together in embedding space, regardless of physical distance.
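The retrieval step can be sketched as a plain nearest-neighbour lookup in that embedding space. The helper names, toy three-dimensional embeddings, and city vectors below are invented for illustration and are not GeoRAG's actual encoder outputs:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve_analogues(query_emb, region_embs, k=2):
    """Return the k region names most similar to the query embedding,
    ignoring geographic distance entirely."""
    scored = sorted(region_embs.items(),
                    key=lambda kv: cosine(query_emb, kv[1]),
                    reverse=True)
    return [name for name, _ in scored[:k]]

# Toy embeddings in which Mumbai's monsoon profile sits closer to
# Chennai's (and even Singapore's) than to nearby Hyderabad's:
regions = {
    "Chennai":   [0.9, 0.1, 0.3],
    "Hyderabad": [0.1, 0.9, 0.2],
    "Singapore": [0.7, 0.3, 0.3],
}
mumbai = [0.85, 0.15, 0.35]
analogues = retrieve_analogues(mumbai, regions, k=2)
```

The retrieved regions' historical time series would then be concatenated as in-context examples for the transformer forecaster.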
The Problem: HD map construction requires expensive LiDAR-equipped vehicles to drive every road. What if we could generate lane-level HD maps directly from satellite imagery, for any location on Earth?
We propose SatBEV, a cross-view geospatial foundation model that learns dense lane-level correspondence between satellite imagery and street-level driving observations. Instead of treating satellite images as auxiliary features for HD map construction, we train a dual-encoder architecture that aligns overhead and perspective views in a shared geometric embedding space at instance-level lane granularity.
The model is pretrained on OpenSatMap’s 38k satellite images across 60 cities using a masked autoencoder objective, then fine-tuned on the spatially aligned nuScenes and Argoverse 2 subsets with three joint objectives: contrastive satellite-BEV alignment, lane-level cross-view correspondence prediction via GPS-supervised cross-attention, and a map transfer objective that predicts ego-vehicle-frame HD maps from satellite tiles alone.
The key technical contribution is a cross-view transformer with GPS-derived positional encoding that produces dense correspondence fields mapping satellite lane pixels to BEV lane pixels, handling projective distortion and partial observability. At inference, this enables HD map generation for any location with satellite coverage — no driving data or onboard sensors required.
We evaluate on zero-shot HD map prediction (IoU on nuScenes val for divider/crossing/boundary), few-shot city adaptation with 10 street-level samples, and ablations on whether OpenSatMap’s instance-level annotations improve transfer over coarse semantic labels. The baseline to beat is SatforHDMap at 50.9 IoU; the target is generalization to cities entirely absent from driving datasets.
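The contrastive satellite-BEV alignment term can be sketched as a standard InfoNCE loss over paired embeddings: the i-th satellite tile and i-th BEV observation in a batch are positives, every other pairing a negative. The function, temperature value, and two-dimensional embeddings below are a minimal illustration, not the model's actual objective:

```python
import math

def info_nce(sat_embs, bev_embs, temperature=0.1):
    """Mean cross-entropy of matching each satellite embedding to its
    paired BEV embedding among all BEV embeddings in the batch.
    Hypothetical sketch; temperature is an illustrative choice."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def norm(v):
        n = math.sqrt(dot(v, v))
        return [x / n for x in v]

    sat = [norm(v) for v in sat_embs]
    bev = [norm(v) for v in bev_embs]
    loss = 0.0
    for i, s in enumerate(sat):
        logits = [dot(s, b) / temperature for b in bev]
        log_z = math.log(sum(math.exp(l) for l in logits))
        loss += log_z - logits[i]  # negative log-softmax of the true pair
    return loss / len(sat)

# Correctly paired views give a lower loss than mismatched ones:
sats = [[1.0, 0.0], [0.0, 1.0]]
bevs_aligned = [[1.0, 0.1], [0.1, 1.0]]
bevs_shuffled = [[0.1, 1.0], [1.0, 0.1]]
low = info_nce(sats, bevs_aligned)
high = info_nce(sats, bevs_shuffled)
```

Minimizing this loss pulls each satellite tile toward its GPS-matched BEV observation in the shared embedding space, which is what makes satellite-only map prediction possible at inference.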
Have your own idea? I’m open to discussing it!
Send me an email with: