Accomplished Principal Data Engineer with 10+ years of experience and a strong background in machine learning and data engineering. Proven track record of architecting and building scalable data lakes, warehouses, and pipelines using Snowflake, Databricks, HCL/Terraform, AWS, and dbt. Skilled in development, mentoring peers, and driving technical innovation. Experienced in machine learning and statistical analysis, with a strong emphasis on data modeling, NLP, entity resolution, and data processing. Proficient in multiple programming languages and frameworks, including Python, PySpark, and Terraform, with a demonstrated ability to lead complex projects and teams to success.
Machine Learning | Model development, Feature Engineering, Method selection | Data Engineering | Pipelines and data ingestion, Data Lakes, Architecture |
Test Design and Analysis | A/B Testing, Sample Design, Power Analysis | Research | Methods, Application, Development, Analysis |
Statistical Analysis | Regression Modeling, Data Mining, Time Series | Programming | Python Development, APIs, Serverless, ML Model Deployment |
Programming Languages | Python, SQL, PySpark, HCL/Terraform, R | Databases | Snowflake, Postgres, SQLite, MSSQL, MySQL, MongoDB, Redshift, DuckDB |
Cloud Providers | AWS, Azure | Other | Docker, Bash, Databricks, Markdown, HTML, Linux, dbt |
Consulting and Development Firm
Developing; LLM agent applications and RAG-based content generation for small businesses.
Research; distributed data lakes, data transfers over gRPC, and open-source LLM deployment and development.
Data Engineer Consulting
Data warehousing and architecting; architected and built data lakes using S3 and Snowflake, enabling fast migrations, disaster recovery, and external data processes (entity resolution, data pipelines). Migrated a Looker project to a data warehouse on Redshift using dbt and DMS, reducing complexity and increasing performance. Performed full warehouse modeling from staging to marts. Built a cost calculator for estimating the credit consumption and cost of Snowflake queries.
Data pipeline development; built low-cost, reliable data pipelines using Databricks, Python, Terraform, and AWS. A standout project was developing AWS Lambda-based pipelines using dbt, Terraform, Docker, and Python. Scoped and implemented a migration from MySQL Jenkins pipelines to Databricks using DMS for replication.
Project management; handled scoping and pointing of development work. Worked with PMs to identify project risks and remediation. Managed client communications on technical topics. Worked with direct manager to implement PERT-style analysis to improve client deliverable estimates.
Mentorship; mentored peers on problem solving and technical skills. A highlight was helping a peer learn PySpark and programming best practices, which led to their promotion to senior engineer and their Databricks certification.
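As a rough illustration of the PERT-style estimation mentioned above, a three-point estimate can be sketched as follows. The function name and the sample task figures are hypothetical and are not taken from any client engagement; the formulas are the standard PERT expected value and standard deviation.

```python
from math import sqrt

def pert_estimate(optimistic, most_likely, pessimistic):
    """Return (expected, std_dev) for a three-point PERT estimate."""
    expected = (optimistic + 4 * most_likely + pessimistic) / 6
    std_dev = (pessimistic - optimistic) / 6
    return expected, std_dev

# Hypothetical tasks: (optimistic, most likely, pessimistic) in days.
tasks = [(2, 4, 12), (1, 2, 3), (3, 5, 13)]
estimates = [pert_estimate(*task) for task in tasks]

# Expected durations add directly across sequential tasks.
total_expected = sum(expected for expected, _ in estimates)
# Variances add only under the assumption that tasks are independent.
total_std = sqrt(sum(std ** 2 for _, std in estimates))
```

Reporting the total together with its standard deviation gives a range for a deliverable instead of a single date.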
Home Services and Marketplace
Background Screening and Transportation
Ordering behavior analysis; created a data mart to pool data distributed across multiple systems. Used run-length encoding to normalize and compress order history. The run-length encodings were then analyzed in SQL for streaks and co-occurrences of order types. The analysis highlighted areas needing process intervention and system visibility.
Created algorithms for record linkage; designed algorithms to clean and match text strings. Methods included Jaccard and Jaro distances. Custom cleaning algorithms sourced first name and surname data from the US Census and Social Security Administration.
Deep learning driver risk modeling; implemented and explored weight of evidence (WOE), one-hot encoding, mixed-effects models for factor encoding, and automated regression trees for variable imputation to create a predictive model of high-risk drivers. Tested the effects of multiple layers, epochs, dropout, and regularization on performance. Final output was recalibrated using Platt scaling. Work was performed in conjunction with the Principal Data Scientist to deploy a risk model to production.
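The run-length encoding step in the ordering behavior analysis above can be sketched in a few lines. This is a minimal Python illustration only: the original streak analysis was done in SQL, and the function names and sample order types here are hypothetical.

```python
from itertools import groupby

def run_length_encode(events):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(events)]

def longest_streak(encoded, value):
    """Length of the longest consecutive run of `value`, or 0 if absent."""
    return max((length for v, length in encoded if v == value), default=0)

# Hypothetical order history for one customer, in chronological order.
orders = ["standard", "standard", "rush", "rush", "rush", "standard", "rush"]
encoded = run_length_encode(orders)
rush_streak = longest_streak(encoded, "rush")
```

Once the history is encoded, streak questions reduce to simple aggregates over the run lengths.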
Healthcare and HR
Directed graph analysis with sentiment model; examined comments for strong bigram relationships. Added sentiment to the graph to identify additional associations. Potential areas for improvement in communication were identified based on the common sentiment of keywords and the strength of keyword relationships.
Designed comment similarity method; created a method for finding the most representative comments using a combination of TF-IDF and cosine similarity to summarize responses.
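One way to read "most representative comment" is the comment with the highest average cosine similarity to all the others under TF-IDF weighting. The sketch below follows that reading using only the standard library. The whitespace tokenizer, the smoothed IDF variant, the function names, and the sample comments are all assumptions, not the original implementation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) using a smoothed IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_representative(docs):
    """Index of the document most similar, on average, to the rest."""
    vectors = tfidf_vectors(docs)

    def average_similarity(i):
        others = [cosine(vectors[i], v) for j, v in enumerate(vectors) if j != i]
        return sum(others) / len(others) if others else 0.0

    return max(range(len(docs)), key=average_similarity)

# Hypothetical survey comments.
comments = [
    "communication from managers is poor",
    "managers need better communication",
    "the parking lot is too small",
]
best_index = most_representative(comments)
```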
SaaS Startup and Lead Sourcing
Built and managed outsourced team; seeing the need for a larger and more flexible workforce, created and qualified a team of 20+ international workers.
Implemented algorithms for data cleansing and matching; trained a PAM ML model using the output from Monge-Elkan string distances. Final predictions were made with a 1-nearest-neighbor model for efficiency. The process had ~90% accuracy and allowed us to send targeted leads to our clients.
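For context on the Monge-Elkan measure mentioned above: it scores two multi-token strings by matching each token on the left to its best-scoring token on the right and averaging those best scores. The sketch below is a minimal version that assumes `difflib.SequenceMatcher` as the inner token similarity (a stand-in; Jaro-Winkler or edit distance are more typical choices). The function names and company names are hypothetical, and the PAM and nearest-neighbor stages are not shown.

```python
from difflib import SequenceMatcher

def token_similarity(a, b):
    """Inner similarity between two tokens, in [0, 1] (assumed stand-in)."""
    return SequenceMatcher(None, a, b).ratio()

def monge_elkan(left, right):
    """Average, over left tokens, of the best match among right tokens.

    Note that the measure is asymmetric: swapping the arguments can
    change the score.
    """
    left_tokens = left.lower().split()
    right_tokens = right.lower().split()
    if not left_tokens or not right_tokens:
        return 0.0
    best_scores = [
        max(token_similarity(a, b) for b in right_tokens) for a in left_tokens
    ]
    return sum(best_scores) / len(best_scores)

# Hypothetical company names from two lead lists.
close_match = monge_elkan("Acme Corporation", "acme corp")
poor_match = monge_elkan("Acme Corporation", "zenith bank")
```

Because tokens are matched independently, word order does not affect the score, which suits names that appear as "Acme Corp" in one source and "Corp Acme" in another.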
Minors: Physics and Computer Science
Serverless LLM Inference | https://github.com/graphicalmethods/serverless-llamas |
Lifetime Value Estimation | https://examples.benhoffman.net/lifetime_value |
What do you call a dinosaur that wears a cowboy hat? A Tyrannosaurus tex.