Lessons learnt while building an ETL pipeline for MongoDB & Amazon Redshift using Apache Airflow

Recently, I was involved in building a custom ETL (Extract-Transform-Load) pipeline using Apache Airflow, which extracted data from MongoDB collections and loaded it into Amazon Redshift tables.

Each ETL pipeline comes with specific business requirements around data processing that are hard to meet with off-the-shelf ETL solutions. This is why many ETL solutions are custom-built from scratch. In this blog, I am going to talk about the lessons I learnt while building an optimized, efficient, near real-time and fault-tolerant custom ETL solution using Apache Airflow to move data from MongoDB to Redshift.

Real Time Text Classification using Kafka and scikit-learn

Text classification is one of the important tasks in supervised machine learning (ML). Assigning categories to text, such as tweets, Facebook posts, web pages, library books, media articles and galleries, has many applications, including spam filtering and sentiment analysis.

In this blog, we build a text classification engine to classify topics in an incoming Twitter stream using Apache Kafka and scikit-learn, a Python-based machine learning library.

Web Scraping: Introduction, Best Practices & Caveats

Web scraping is the process of extracting data from websites. It has a wide variety of use cases: marketing and sales intelligence companies use it to fetch customer-related information, real estate tech companies use it to fetch real estate listings, and price comparison portals use it to fetch product and price information from various e-commerce sites. This blog is meant to be a primer on building highly scalable scrapers. It covers different ways to scrape, how to scrape at scale, and guidelines for writing scrapers.
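As a rough illustration of the price comparison use case mentioned above, here is a minimal sketch that pulls product names, links and prices out of a page using only Python's standard library. The HTML snippet, the `listing` and `price` class names, and the `ListingParser` and `scrape_listings` names are all hypothetical; a real scraper would first download the page and would need to match the target site's actual markup.

```python
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would download this HTML first.
SAMPLE_HTML = """
<html><body>
  <div class="listing"><a href="/item/1">Blue Widget</a><span class="price">$10.50</span></div>
  <div class="listing"><a href="/item/2">Red Widget</a><span class="price">$12.00</span></div>
</body></html>
"""


class ListingParser(HTMLParser):
    """Collects one record per <a href=...> link, plus the price span that follows it."""

    def __init__(self):
        super().__init__()
        self.items = []      # extracted records
        self._field = None   # record field currently being filled, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            # A link starts a new record; its text becomes the product name.
            self.items.append({"url": attrs["href"], "name": "", "price": ""})
            self._field = "name"
        elif tag == "span" and attrs.get("class") == "price" and self.items:
            self._field = "price"

    def handle_endtag(self, tag):
        if tag in ("a", "span"):
            self._field = None

    def handle_data(self, data):
        # Only keep text that appears inside a tag we are tracking.
        if self._field and self.items:
            self.items[-1][self._field] += data.strip()


def scrape_listings(html):
    """Parse an HTML string and return the extracted listing records."""
    parser = ListingParser()
    parser.feed(html)
    return parser.items


if __name__ == "__main__":
    for item in scrape_listings(SAMPLE_HTML):
        print(item)
```

Parsing is kept separate from fetching here so that the extraction logic can be tested against saved pages without hitting a live site.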

Building Stateless Bots using Rasa Stack

This blog explores the Rasa Stack to create a stateless chatbot. We will look into how the recently released Rasa Core, which provides machine learning based dialogue management, helps maintain the context of conversations efficiently.

We will also build a sample chatbot using Rasa Core.

Continuous Deployment with Azure Kubernetes Service, Azure Container Registry & Jenkins

This blog talks about AKS, Azure's Kubernetes-as-a-Service offering. I came across various issues while setting up AKS and its container registry, so I wanted to share some gotchas.

Finally, this blog provides the steps to set up a continuous deployment pipeline with Azure Kubernetes Service, Azure Container Registry & Jenkins.

Chatbots with Google DialogFlow: Build a fun Reddit chatbot in 30 minutes

Google DialogFlow (formerly api.ai) is a platform that provides use-case-specific, engaging voice and text-based conversations, powered by AI. In this blog, we will learn about DialogFlow and proceed to build a chatbot that can interact with Reddit.

Amazon Lex + AWS Lambda: Beyond Hello World

In my previous blog, I explained how to get started with Amazon Lex and build simple bots. This blog explores the Lambda functions that Amazon Lex uses as code hooks for validation and fulfillment. We will continue with the same example we created in the first blog, i.e. purchasing a book, and see in detail how the dots are connected.

This blog provides a detailed overview of how Lex & Lambda can be used to develop bots.

API Testing using Postman and Newman

In the last few years, we have seen an exponential increase in the development and use of APIs. We are in the era of API-first companies like Stripe, Twilio, Mailgun etc., where the entire product or service is exposed via REST APIs. Web applications today are also powered by REST-based web services. APIs today encapsulate critical business logic with high SLAs. Hence, it is important to test APIs as part of the continuous integration process to reduce errors, improve predictability and catch nasty bugs.

In the context of API development, Postman is a great REST client for testing APIs. However, Postman is not just a REST client: it contains a full-featured testing sandbox that lets you write and execute JavaScript-based tests for your API. This blog talks about using Postman and its CLI tool, Newman, to easily automate API tests.

Machine Learning for your Infrastructure: Anomaly Detection with Elastic + X-Pack

We need a practical and scalable approach to understand the cause-effect relationship between data sources and events across a complex infrastructure of VMs, containers, networks, micro-services, regions, etc. Machine learning is particularly useful for such problems where we need to identify “what changed”, since machine learning algorithms can analyze existing data to understand the patterns, making it easier to recognize the cause. This is known as unsupervised learning, where the algorithm learns from experience and identifies similar patterns when they come along again.
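To make the “what changed” idea concrete, here is a toy sketch that flags a metric value as anomalous when it deviates sharply from the recent history it has learned from. This is a simple trailing-window z-score, swapped in purely for illustration; it is not how X-Pack models data. The `find_anomalies` function, its `window` and `threshold` parameters, and the sample CPU readings are all hypothetical.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indices of values that deviate from the mean of the preceding
    `window` values by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: any different value counts as a change.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    # Hypothetical CPU utilisation samples (percent) with one obvious spike.
    cpu_percent = [20, 22, 21, 19, 20, 21, 95, 22, 20]
    print(find_anomalies(cpu_percent))
```

No labels are supplied here: the baseline is derived from the data itself, which is the sense in which this kind of detection is unsupervised.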

Let's see how you can set up Elastic + X-Pack to enable anomaly detection for your infrastructure & applications.

Tutorial: Developing complex plugins for Jenkins

Jenkins is the most popular Continuous Integration and Continuous Delivery (CI/CD) server. Jenkins is used for managing complex CI/CD pipelines that support building, deploying and automating software. Every team has different needs and CI/CD is a process that needs heavy customization.

Recently, I needed to develop a complex Jenkins plugin for a customer in the containers & DevOps space. In the process, I realized that there is a lack of good documentation on Jenkins plugin development. That’s why I decided to write this blog to share my knowledge on Jenkins plugin development.

A practical guide to deploying multi-tier applications on Google Container Engine (GKE)

In this blog, we look at how to deploy, scale & delete a multi-tier (Flask/Python and MySQL) application on Google Container Engine.

Elasticsearch 101: Fundamentals & Core Components

Elasticsearch is currently the most popular way to implement free text search in your application. This blog post is an introduction to Elasticsearch, including its components and data types. It covers some of the basic but important concepts of Clusters, different types of Nodes, Documents, Mappings, Indices, and Shards.