Your Quintessential Guide to AWS Athena

Introduction

Serverless has become a new trend today and is here to stay for sure! Now when you think of wireless internet, you know that it still has some wires but you don’t need to worry about them as you don’t have to maintain them. Similarly, serverless has servers but you don’t have to keep worrying about handling or maintaining them. All you need to do is focus on your code and you’re good to go.

It has some more benefits, such as:

  • Zero administration: You can deploy code without provisioning anything beforehand, or managing anything later. There is no concept of a fleet, an instance, or even an operating system.

  • Auto-scaling: It lets your service providers manage the scaling challenges. You don’t need to fire alerts or write scripts to scale up and down. It handles quick bursts of traffic and weekend lulls the same way.

  • Pay-per-use: The function-as-a-service compute and managed services are charged based on usage rather than pre-provisioned capacity. You can have complete resource utilization without paying a cent for idle time. The results? 90% cost-savings over a cloud VM, and the satisfaction of knowing that you never pay for resources you don’t use.

What is AWS Athena?

AWS Athena is a similar serverless service. It is more of an interactive query service than a code deployment service.

Using Athena one can directly query the data stored in S3 buckets and using standard ANSI SQL.

As mentioned earlier, it works on the principle of serverless, that is, there is no infrastructure to manage, and you pay only for the queries that you run.

Athena is easy to use. You can simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there’s no need for complex ETL jobs to prepare your data for analysis. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets.

It is based on Facebook’s PrestoDB and can be used to query structured and semi-structured data.

Some Exciting Features of Athena are:

  • Serverless. No ETL - Not having to set up and manage any servers or data warehouses.

  • Only pay for the data that is scanned.

  • You can ensure better performance by compressing, partitioning, and converting your data into columnar formats.

  • Can also handle complex analysis, including large joins, window functions, and arrays.

  • Athena automatically executes queries in parallel.

  • Need to provide a path to the S3 folder and when new files added automatically reflects in the table.

  • Supports -

    • Support CSV, Json, Parquet, ORC, Avro data formats

    • Complex Joins and datatypes

    • View creation

  • Does not Support -

    • User-defined functions and stored procedures

    • Hive or Presto transactions

    • LZO (Snappy is supported)

Pricing of Athena

  • AWS Athena is priced $5 for each TB of data scanned.

  • Queries are rounded up to the nearest MB, with a 10 MB minimum.

  • Users pay for stored data at regular S3 rates.

  • Amazon advises users to use compressed data files, have data in columnar formats, and routinely delete old results sets to keep charges low. Partitioning data in tables can speed up queries and reduce query bills.

Athena vs. Redshift Spectrum

  • AWS also has Redshift as data warehouse service, and we can use redshift spectrum to query S3 data, so then why should you use Athena?

Advantages of Redshift Spectrum:

  • Allows creation of Redshift tables. You’re able to join Redshift tables with Redshift spectrum tables efficiently.

If you do not need those things then you should consider Athena as well Athena differences from Redshift spectrum:

  • Billing. This is a major difference and depending on your use case you may find one much cheaper than the other Performance.

  • Athena slightly faster. SQL syntax and features.

  • Athena is derived from presto and is a bit different to Redshift which has its roots in Postgres.

  • It's easy enough to connect to Athena using API, JDBC or ODBC but many more products offer "standard out of the box" connection to Redshift.

  • Athena has GIS functions and lambdas.

So in nutshell, if you have existing instances of redshift you would probably go for Redshift Spectrum, if not then you can opt for Athena for querying the data. In some cases, you can use both in tandem.

Example

Here is a sample query to create a sample database having 3 tables basic_details, contact_details and bill_details, Uploaded csv file to s3:


Basic_details:

Bill_details:

Contact_details:

Sample Query for - FirstNames of People from Minnesota with amount_due > $100

Output:

AWS Athena 1.png

Some Other Sample Queries:

1. Searching for Values in JSON

Output:

AWS Athena 2.png

2. Extracting properties

Output:

AWS Athena 3.png

3. Converting JSON to Athena Data Types

Output:

AWS Athena 4.png

Conclusion

Hence, we can easily say that AWS Athena gives us an efficient way to query our raw data present in different formats in S3 object storage, without spawning a dedicated infrastructure and at minimal cost.

Did we miss anything? Do you have any questions about AWS Athena? Do let us know in the comments below! Need help with setting up AWS Athena for your organization? Connect with the experts at Velotio!

Velotio Technologies is an outsourced software product development partner for technology startups and enterprises. We combine innovative ideas with business expertise and cutting-edge technology to drive business success for our customers, help them put their customers' needs at the core, and ultimately, win in their chosen markets.


About the Author

download+%281%29.jpg

Sagar is a Senior Software Engineer at Velotio. He has worked on various open source and enterprise technologies fulfilling multiple roles, mostly on Python and AWS stack (SQS, Athena, Kinesis, Lambda etc). In free time he likes to watch web series, standup comedy and comedy sketches.