Problem Solver | Machine Learning Engineer | Data Scientist

About Me

Hi, I’m John! I love solving problems by extracting meaningful insights from data. By day, I work as a machine learning engineer, developing and managing machine learning solutions in the financial, healthcare, and telecommunications industries.

My passion is applying machine learning tools to unsolved problems and new domains. Across projects and academic research, I enjoy diving into the technical details, understanding the nuances of the problem, and strategizing novel solutions and improvements. I place particular focus on the human element of each problem: studying the ways in which humans solve the problem manually, designing human-in-the-loop production systems, and communicating with end users to optimize usability and fairness.

My specialty is framing and storytelling. I believe that clearly and succinctly framing the problem statement is the first and most important step of a successful project. When it comes to communicating completed work, I excel at distilling complex data science topics into relatable, digestible stories that achieve stakeholder alignment and investment.

I believe that LLM agents and multimodal foundation models are the next phase of the AI revolution. Having developed several internal LLM applications, I have witnessed the potential for LLMs to automate specific internal workflows and generate high-level content. These quick wins have been impactful and valuable to my company, but not paradigm-shifting. Providing LLMs with access to a wider toolkit and the agency to plan and execute workflows will lead to more dramatic improvements and efficiency gains in the near future. I believe that foundation models must incorporate visual information to develop higher-powered world models, and that computer vision systems ought to incorporate LLMs to expand their capabilities and increase their flexibility.

When I’m not working on my next project, I am seeking adventure by hiking, traveling, or exploring my city. I am fueled by sharing experiences with others and I believe that investing in the people around me is the most meaningful approach to life and work.

Project Spotlight

Upscaling Global Carbon Uptake

Accurate assessments of Gross Primary Productivity (GPP) are pivotal for evaluating the effectiveness of climate change mitigation efforts. However, current reliable estimates are limited to data from eddy covariance tower sites, which are sparsely distributed. This scarcity impedes the ability to accurately quantify GPP at scales ranging from regional to global. Previous attempts to upscale in situ GPP measurements to comprehensive global maps at sub-daily frequencies have encountered obstacles, including a paucity of input data at finer temporal resolutions and substantial amounts of missing information.

In response, our research introduces an innovative approach for upscaling, leveraging the Temporal Fusion Transformer (TFT), which allows for integrating diverse temporal data sources and can capture complex nonlinear relationships and temporal dynamics in GPP estimation. To enhance model performance, we incorporated additional machine learning techniques, namely the Random Forest Regressor (RFR) and XGBoost, culminating in a hybrid model that combines the strengths of TFT with tree-based algorithms. The most effective model demonstrated an NSE (Nash-Sutcliffe Efficiency) of 0.704 and an RMSE (Root Mean Square Error) of 3.54.
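For reference, the two reported metrics can be computed as below. This is a generic sketch of the metric definitions, not the project's evaluation code:

```python
import math

def nse(obs, pred):
    """Nash-Sutcliffe Efficiency: 1 minus the ratio of residual error to the
    variance of observations around their mean (1.0 indicates a perfect fit;
    0.0 means the model is no better than predicting the observed mean)."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - ss_res / ss_tot

def rmse(obs, pred):
    """Root Mean Square Error, in the units of the target variable (here, GPP)."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))
```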

A significant advancement of this study is the detailed analysis of feature importance within the encoder, segmented by temporal aspects and specific flux tower sites. This analysis significantly improved the interpretability of the multi-head attention mechanism and enabled researchers to develop a deeper understanding of the temporal dynamics that drive GPP variations.

Read the paper | Visit the website | Visit the repo

Podcast Topic Segmentation

Topic timestamps for podcasts allow users to identify which episode sub-topics are worthwhile and which are worth skipping, saving the user time and increasing the average quality of the content they consume. Yet most podcasters do not provide topic timestamps, likely due to time constraints. This research aims to provide a tool for podcasters to input their podcast transcript and receive a set of reliable topic segments.

Our solution uses a transformer architecture and SentenceBERT embeddings to classify topic transitions at the sentence level across the podcast. Unsupervised, sentence similarity-based methods were also explored, but with less success. To create our dataset, ~3.5k podcast episode transcripts and segment timestamps were scraped from YouTube. To enable an effective deep learning approach, 50k “synthetic” episodes were created by separating individual episode topics, shuffling, and recombining topics across the training episodes.
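The synthetic-episode step can be illustrated with a minimal sketch. The episode structure here (an episode as a list of topic segments) is a simplified placeholder, not the project's actual data pipeline:

```python
import random

def make_synthetic_episodes(episodes, n_synthetic, seed=0):
    """Pool topic segments from real episodes, then sample and recombine
    them into new 'synthetic' episodes with fresh topic boundaries.

    episodes: list of episodes, each a list of topic segments
              (a segment = list of sentences belonging to one topic).
    Returns a list of synthetic episodes in the same format.
    """
    rng = random.Random(seed)
    pool = [seg for ep in episodes for seg in ep]  # all topic segments
    synthetic = []
    for _ in range(n_synthetic):
        k = rng.randint(2, min(5, len(pool)))      # topics per synthetic episode
        synthetic.append(rng.sample(pool, k))       # sample without replacement
    return synthetic
```

The boundaries between the sampled segments then serve as positive "topic transition" labels for sentence-level classification.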

Though no baseline comparison is available in the podcast segmentation domain, our metrics (WindowDiff = 0.33, Pk = 0.37) are similar to those of a Facebook research approach (WindowDiff = 0.33, Pk = 0.33) in the meeting transcript segmentation domain (Solbiati et al. 2021).
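For intuition, WindowDiff can be computed as below. This is a simplified sketch of the metric (published variants differ slightly in window indexing), not the project's evaluation code:

```python
def window_diff(ref, hyp, k):
    """Simplified WindowDiff: slide a window of size k over two boundary
    strings ('1' = topic boundary after that sentence) and count the windows
    in which the reference and hypothesis boundary counts disagree.
    Lower is better; 0.0 means perfect agreement."""
    assert len(ref) == len(hyp)
    n = len(ref) - k
    disagreements = sum(
        ref[i:i + k].count("1") != hyp[i:i + k].count("1") for i in range(n)
    )
    return disagreements / n
```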

Read the paper | Visit the repo

Logo Detection

Logo recognition is a fundamental computer vision problem. Some use cases include:

  • Brand monitoring: tracking the usage of a brand's logo in public settings, applications, and video advertisements
  • Image indexing: retrieving images from large databases based on the presence of specific logos
  • Copyright and trademark compliance: detecting infringement by brands with similar or copied designs
  • Document classification: identifying and categorizing documents based on the publisher's logo

The objective of this project is to apply both deep learning (YOLO v6) and non-learned feature extraction methods (BoW SIFT, color, shape, texture) to the logo recognition task across 10 popular logo classes. The logo detection process can be split into two tasks: localizing the logo (bounding box) and classifying it. Both approaches achieved 0.89 accuracy and F1-score.
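As one example of a non-learned feature, a normalized color histogram can be computed directly from pixel values. This sketch uses coarse RGB binning and is illustrative only, not the project's exact feature set:

```python
def color_histogram(pixels, bins_per_channel=4):
    """Build a normalized RGB color histogram from (r, g, b) pixel tuples.
    Each channel (0-255) is quantized into `bins_per_channel` bins, yielding a
    fixed-length feature vector usable by a classical classifier."""
    step = 256 // bins_per_channel
    hist = [0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Flatten the 3-D bin coordinate into a single histogram index.
        idx = ((r // step) * bins_per_channel ** 2
               + (g // step) * bins_per_channel
               + (b // step))
        hist[idx] += 1
    total = len(pixels)
    return [count / total for count in hist]
```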

This project immersed me in both cutting-edge and classic approaches to object recognition, and taught me the importance of exploring both learned and hand-crafted feature extraction methods in computer vision.

Read the paper | Visit the repo

Galileo Vision

In 2020, the Deloitte AI Guild hosted an internal Kaggle-style competition using black-and-white images of galaxies from the Sloan Digital Sky Survey (SDSS). The objective was to build a machine learning model that accurately predicts the mass-to-luminosity (M/L) ratio for each galaxy. For each galaxy, images at 128px and 69px resolution were available, along with a set of size and distance metrics in tabular format.

My solution incorporates information from both the images and the metadata via a multimodal model trained on the image and tabular data jointly. The final model uses a fine-tuned VGG19 for the images, concatenated at the fully connected layer with a dense neural network for the tabular metadata. This solution achieved an MSE of 0.48 on the test set and took 2nd place in the competition.
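The joint architecture can be sketched abstractly as feature-level fusion: each branch produces an embedding, the embeddings are concatenated, and a final head produces the regression output. The toy branches and linear head below are placeholders for the fine-tuned VGG19 and dense network, not the actual model:

```python
def fuse_and_predict(image_embedding, tabular_features, head_weights, bias=0.0):
    """Concatenate the image-branch embedding with the tabular-branch
    features, then apply a single linear regression head (a stand-in for
    the model's final fully connected layers)."""
    joint = list(image_embedding) + list(tabular_features)
    assert len(joint) == len(head_weights), "head must match joint width"
    return sum(x * w for x, w in zip(joint, head_weights)) + bias
```

In the real model both branches and the head are trained end to end, so the image features learn to complement the tabular size and distance metrics.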

Visit the repo

Experience

ION Analytics

Machine Learning Engineer (March 2023 - Present)

  • Automated document classification and metadata extraction tasks with fine-tuned vision models and dynamic few-shot LLMs, decreasing processing times by 2+ days, eliminating 2 costly FTEs, and enabling scalability without additional overhead
  • Developed an LLM application to assist an RPA bot with interpreting notifications and dynamically navigating web portals, increasing automated document download rate by 20% and decreasing RPA development effort by 70%
  • Implemented an internal LLM prompt optimization and deployment framework using Microsoft Prompt Flow and AWS Lambda, significantly accelerating the rate of LLM application development and deployment

MojoJet LLC

Independent Machine Learning Contractor (February 2023 - November 2023)

  • Trained a multimodal model integrating a CNN and Transformer to estimate net carbon exchange with satellite imagery and remote sensing data for a forest preservation startup
  • Automated vendor normalization and classification with RAG-based LLMs for a procurement agency, saving 40 hours/month. Developed a web application to serve the models via AWS ECS with a React frontend and a FastAPI backend.

Deloitte Consulting

Consultant (May 2021 - February 2023)

  • Developed a Bi-LSTM in Tensorflow to identify likely instances of fraud in health records, improving audit selections and increasing expected payment error recovery for a government client by 75%
  • Held machine learning workshops and articulated complex model architecture designs to 6 business stakeholders to achieve buy-in throughout the model deployment process

Business Analyst (September 2019 - May 2021)

  • Engineered a custom Time-Aware LSTM model in PyTorch with skip-gram embeddings to predict unplanned hospital admissions with an AUC of 0.85 in the federally sponsored CMS AI Health Outcomes Challenge
  • Led the development, implementation, and automation of reporting tools in Tableau, which supported Deloitte’s win of a $70M, 5-year contract and continues to save 800+ hours/week across the 1,400 enabled users
  • Implemented a ServiceNow application for 1,400 users on a large SAP instance to improve PMO efficiency and centralize tools

Summer Scholar (June 2018 - August 2018)

  • Developed a K-Means clustering model in Python to improve classification of insurance claims in claim adjudication processes
  • Programmed scripts to automate reporting on Project Plan data and automatically email owners of past due items

R.J. O'Brien & Associates

Information Systems Intern (June 2017 - August 2017)

  • Configured XML schema to automate generation of PDF forms from Excel templates, saving 10 employees 2 hours of work per day
  • Developed automated data extraction and reporting processes for the expense platform and led training of BI team

Education

UC Berkeley, School of Information

(January 2022 - April 2023)

Master of Information and Data Science
Relevant Courses: Capstone, Computer Vision, Natural Language Processing, Machine Learning Systems Engineering, Experiments & Causal Inference, Data Engineering, Research Design, Statistical Inference

University of Notre Dame

(August 2015 - May 2019)

Bachelor of Business Administration, Business Analytics & Economics
Relevant Courses: Machine Learning, Unstructured Data Analytics, Data Storytelling, Econometrics, Probability, Linear Algebra, Health Economics, Statistical Inference

Skills

Contact Me

Cell: (312) 405-2048

Email: john@calzaretta.ai