Google BigQuery is an enterprise data warehouse built using BigTable and Google Cloud Platform. It’s serverless and completely managed. BigQuery works great with all sizes of data, from a 100 row Excel spreadsheet to several Petabytes of data. Most importantly, it can execute a complex query on those data within a few seconds.
We need to note before we proceed, BigQuery is not a transactional database. It takes around 2 seconds to run a simple query like ‘SELECT * FROM bigquery-public-data.object LIMIT 10’ on a 100 KB table with 500 rows. Hence, it shouldn’t be thought of as OLTP (Online Transaction Processing) database. BigQuery is for Big Data!
BigQuery supports SQL-like query, which makes it user-friendly and beginner friendly. It’s accessible via its web UI, command-line tool, or client library (written in C#, Go, Java, Node.js, PHP, Python, and Ruby). You can also take advantage of its REST APIs and get our job` done by sending a JSON request.
Now, let’s dive deeper to understand it better. Suppose you are a data scientist (or a startup which analyzes data) and you need to analyze terabytes of data. If you choose a tool like MySQL, the first step before even thinking about any query is to have an infrastructure in place, that can store this magnitude of data.
Designing this setup itself will be a difficult task because you have to figure out what will be the RAM size, DCOS or Kubernetes, and other factors. And if you have streaming data coming, you will need to set up and maintain a Kafka cluster. In BigQuery, all you have to do is a bulk upload of your CSV/JSON file, and you are done. BigQuery handles all the backend for you. If you need streaming data ingestion, you can use Fluentd. Another advantage of this is that you can connect Google Analytics with BigQuery seamlessly.
BigQuery is serverless, highly available, and petabyte scalable service which allows you to execute complex SQL queries quickly. It lets you focus on analysis rather than handling infrastructure. The idea of hardware is completely abstracted and not visible to us, not even as virtual machines.
Architecture of Google BigQuery
You don’t need to know too much about the underlying architecture of BigQuery. That’s actually the whole idea of it - you don’t need to worry about architecture and operation.
However, understanding BigQuery Architecture helps us in controlling costs, optimizing query performance, and optimizing storage. BigQuery is built using the Google Dremel paper.
Quoting an Abstract from the Google Dremel Paper -
“Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel and explain how it complements MapReduce-based computing. We present a novel columnar storage representation for nested records and discuss experiments on few-thousand node instances of the system.”
You can read the paper here - Dremel: Interactive Analysis of Web-Scale Datasets.
Dremel was in production at Google since 2006. Google used it for the following tasks -
Analysis of crawled web documents.
Tracking install data for applications on Android Market.
Crash reporting for Google products.
OCR results from Google Books.
Debugging of map tiles on Google Maps.
Tablet migrations in managed Bigtable instances.
Results of tests run on Google’s distributed build system.
Disk I/O statistics for hundreds of thousands of disks.
Resource monitoring for jobs run in Google’s data centers.
Symbols and dependencies in Google’s codebase.
BigQuery is much more than Dremel. Dremel is just a query execution engine, whereas Bigquery is based on interesting technologies like Borg (predecessor of Kubernetes) and Colossus. Colossus is the successor to the Google File System (GFS) as mentioned in Google Spanner Paper.
How BigQuery Stores Data?
BigQuery stores data in a columnar format - Capacitor (which is a successor of ColumnarIO). BigQuery achieves very high compression ratio and scan throughput. Unlike ColumnarIO, now on BigQuery, you can directly operate on compressed data without decompressing it.
Columnar storage has the following advantages:
Traffic minimization - When you submit a query, the required column values on each query are scanned and only those are transferred on query execution. E.g., a query `SELECT title FROM Collection` would access the title column values only.
Higher compression ratio - Columnar storage can achieve a compression ratio of 1:10, whereas ordinary row-based storage can compress at roughly 1:3.
(Image source: Google Dremel Paper)
Columnar storage has the disadvantage of not working efficiently when updating existing records. That is why Dremel doesn’t support any update queries.
How the Query Gets Executed?
BigQuery depends on Borg for data processing. Borg simultaneously instantiates hundreds of Dremel jobs across required clusters made up of thousands of machines. In addition to assigning compute capacity for Dremel jobs, Borg handles fault-tolerance as well.
Now, how do you design/execute a query which can run on thousands of nodes and fetches the result? This challenge was overcome by using the Tree Architecture. This architecture forms a gigantically parallel distributed tree for pushing down a query to the tree and aggregating the results from the leaves at a blazingly fast speed.
(Image source: Google Dremel Paper)
BigQuery vs. MapReduce
The key differences between BigQuery and MapReduce are -
Dremel is designed as an interactive data analysis tool for large datasets
MapReduce is designed as a programming framework to batch process large datasets
Moreover, Dremel finishes most queries within seconds or tens of seconds and can even be used by non-programmers, whereas MapReduce takes much longer (sometimes even hours or days) to process a query.
Following is a comparison on running MapReduce on a row and columnar DB:
(Image source: Google Dremel Paper)
Another important thing to note is that BigQuery is meant to analyze structured data (SQL) but in MapReduce, you can write logic for unstructured data as well.
Comparing BigQuery and Redshift
In Redshift, you need to allocate different instance types and create your own clusters. The benefit of this is that it lets you tune the compute/storage to meet your needs. However, you have to be aware of (virtualized) hardware limits and scale up/out based on that. Note that you are charged by the hour for each instance you spin up.
In BigQuery, you just upload the data and query it. It is a truly managed service. You are charged by storage, streaming inserts, and queries.
There are more similarities in both the data warehouses than the differences.
A smart user will definitely take advantage of the hybrid cloud (GCE+AWS) and leverage different services offered by both the ecosystems. Check out your quintessential guide to AWS Athena here.
Getting Started With Google BigQuery
Following is a quick example to show how you can quickly get started with BigQuery:
There are many public datasets available on bigquery, you are going to play with ‘bigquery-public-data:stackoverflow’ dataset. You can click on the “Add Data” button on the left panel and select datasets.
2. Next, find a language which has the best community, based on the response time. You can write the following query to do that.
3. Now you can execute the query and get results -
How to do Machine Learning on BigQuery?
Now that you have a sound understanding of BigQuery. It’s time for some real action.
As discussed above, you can connect Google Analytics with BigQuery by going to the Google Analytics Admin panel, then enable BigQuery by clicking on PROPERTY column, click All Products, then click Link BigQuery. After that, you need to enter BigQuery ID (or project number) and then BigQuery will be linked to Google Analytics. Note - Right now BigQuery integration is only available to Google Analytics 360.
Assuming that you already have uploaded your google analytics data, here is how you can create a logistic regression model. Here, you are predicting whether a website visitor will make a transaction or not.
Create a model named ‘velotio_tutorial.sample_model’. Now set the ‘model_type’ as ‘logistic_reg’ because you want to train a logistic regression model. A logistic regression model splits input data into two classes and gives the probability that the data is in one of the classes. Usually, in “spam or not spam” type of problems, you use logistic regression. Here, the problem is similar - a transaction will be made or not.
The above query gets the total number of page views, the country from where the session originated, the operating system of visitors device, the total number of e-commerce transactions within the session, etc.
Now you just press run query to execute the query.
BigQuery is a query service that allows us to run SQL-like queries against multiple terabytes of data in a matter of seconds. If you have structured data, BigQuery is the best option to go for. It can help even a non-programmer to get the analytics right! Related Read - Lessons Learnt While Building an ETL Pipeline for MongoDB & Amazon Redshift Using Apache Airflow.
Did we miss anything? Do you have any questions about BigQuery and Machine Learning? Do let us know in the comments below! If you need help with using machine learning in product development for your organization, connect with experts at Velotio!
About the Author
Vipul is an R&D Engineer at Velotio. He is interested in the areas of Deep Learning and Distributed Systems. He has worked on a variety of technologies including containers, virtualization and machine learning. His hobbies include motorcycles, photography and playing the violin. Also, he is an amateur boxer.