Data scientists working in Python or R acquire data by way of REST APIs. Both environments provide libraries that help you make HTTP calls to REST endpoints, then transform JSON responses into dataframes. But that’s never as simple as we’d like. When you’re reading a lot of data from a REST API, you need to do it a page at a time, but pagination works differently from one API to the next. So does unpacking the resulting JSON structures. HTTP and JSON are low-level standards, and REST is a loosely-defined framework, but nothing guarantees absolute simplicity, never mind consistency across APIs.
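To make that grunt work concrete, here’s a minimal sketch of the pattern in Python. Every specific in it is a hypothetical stand-in: the endpoint URL, the data result key, and the next_token cursor all vary from one API to the next.

import requests
import pandas as pd

# A hypothetical paginated REST endpoint; real APIs differ in
# auth, parameter names, result shape, and pagination style.
URL = "https://api.example.com/v1/search"

def fetch_all(query):
    rows, token = [], None
    while True:
        params = {"q": query}
        if token:
            params["next_token"] = token   # cursor field varies per API
        resp = requests.get(URL, params=params)
        resp.raise_for_status()
        payload = resp.json()
        rows.extend(payload["data"])       # result key varies too
        token = payload.get("next_token")
        if not token:                      # no cursor: that was the last page
            break
    return pd.json_normalize(rows)         # flatten nested JSON into columns

df = fetch_all("pyspark")

Multiply that boilerplate by every API you consume, each with its own quirks, and the cost adds up.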
What if there were a way of reading from APIs that abstracted all the low-level grunt work and worked the same way everywhere? Good news! That is exactly what Steampipe does. It’s a tool that translates REST API calls directly into SQL tables. Here are three examples of questions that you can ask and answer using Steampipe.
1. Twitter: What are recent tweets that mention PySpark?
Here’s a SQL query to ask that question:
select
  id,
  text
from
  twitter_search_recent
where
  query = 'pyspark'
order by
  created_at desc
limit 5;
Here’s the answer:
+---------------------+------------------------------------------------------------------------------------------------>
| id                  | text                                                                                           >
+---------------------+------------------------------------------------------------------------------------------------>
| 1526351943249154050 | @dump Tenho trabalhando bastante com Spark, mas especificamente o PySpark. Vale a pena usar um >
| 1526336147856687105 | RT @MitchellvRijkom: PySpark Tip ⚡                                                             >
|                     |                                                                                                >
|                     | When to use what StorageLevel for Cache / Persist?                                             >
|                     |                                                                                                >
|                     | StorageLevel decides how and where data should be s…                                           >
| 1526322757880848385 | Solve challenges and exceed expectations with a career as a AWS Pyspark Engineer. https://t.co/>
| 1526318637485010944 | RT @JosMiguelMoya1: #pyspark #spark #BigData curso completo de Python y Spark con PySpark      >
|                     |                                                                                                >
|                     | https://t.co/qf0gIvNmyx                                                                        >
| 1526318107228524545 | RT @money_personal: PySpark & AWS: Master Big Data With PySpark and AWS                        >
|                     | #ApacheSpark #AWSDatabases #BigData #PySpark #100DaysofCode                                    >
|                     | -> http…                                                                                       >
+---------------------+------------------------------------------------------------------------------------------------>
The table that’s being queried here, twitter_search_recent, receives the output of Twitter’s /2/tweets/search/recent endpoint and formulates it as a table with these columns. You don’t need to make an HTTP call to that API endpoint, or unpack the results; you just write a SQL query that refers to the documented columns. One of those columns, query, is special: it encapsulates Twitter’s query syntax. Here, we are just looking for tweets that match PySpark, but we could as easily refine the query by pinning it to specific users, URLs, types (is:retweet, is:reply), properties (has:mentions, has:media), etc. For example, query = 'pyspark is:reply' would restrict the search to replies. That query syntax is the same no matter how you’re accessing the API: from Python, from R, or from Steampipe. It’s a lot to think about, and all you really need to know when crafting queries to mine Twitter data.
2. GitHub: What are repositories that mention PySpark?
Here’s a SQL query to ask that question:
select
  name,
  owner_login,
  stargazers_count
from
  github_search_repository
where
  query = 'pyspark'
order by stargazers_count desc
limit 10;
Here’s the answer:
+----------------------+-------------------+------------------+
| name                 | owner_login       | stargazers_count |
+----------------------+-------------------+------------------+
| SynapseML            | microsoft         | 3297             |
| spark-nlp            | JohnSnowLabs      | 2725             |
| incubator-linkis     | apache            | 2524             |
| ibis                 | ibis-project      | 1805             |
| spark-py-notebooks   | jadianes          | 1455             |
| petastorm            | uber              | 1423             |
| awesome-spark        | awesome-spark     | 1314             |
| sparkit-learn        | lensacom          | 1124             |
| sparkmagic           | jupyter-incubator | 1121             |
| data-algorithms-book | mahmoudparsian    | 1001             |
+----------------------+-------------------+------------------+
This looks very similar to the first example! In this case, the table that’s being queried, github_search_repository, receives the output of GitHub’s /search/repositories endpoint and formulates it as a table with these columns.
In both cases the Steampipe documentation not only shows you the schemas that govern the mapped tables, it also gives examples (Twitter, GitHub) of SQL queries that use the tables in various ways.
Note that these are just two of many available tables. The Twitter API is mapped to 7 tables, and the GitHub API is mapped to 41 tables.
3. Twitter + GitHub: What have owners of PySpark-related repositories tweeted lately?
To answer this question we need to consult two different APIs, then join their results. That’s even harder to do, in a consistent way, when you’re reasoning over REST payloads in Python or R. But this is the kind of thing SQL was born to do. Here’s one way to ask the question in SQL.
-- find pyspark repos
with github_repos as (
  select
    name,
    owner_login,
    stargazers_count
  from
    github_search_repository
  where
    query = 'pyspark' and name ~ 'pyspark'
  order by stargazers_count desc
  limit 50
),

-- find twitter handles of repo owners
github_users as (
  select
    u.login,
    u.twitter_username
  from
    github_user u
  join
    github_repos r
  on
    r.owner_login = u.login
  where
    u.twitter_username is not null
),

-- find corresponding twitter users
twitter_userids as (
  select
    id
  from
    twitter_user t
  join
    github_users g
  on
    t.username = g.twitter_username
)

-- find tweets from those users
select
  t.author->>'username' as twitter_user,
  'https://twitter.com/' || (t.author->>'username') || '/status/' || t.id as url,
  t.text
from
  twitter_user_tweet t
join
  twitter_userids u
on
  t.user_id = u.id
where
  t.created_at > now()::date - interval '1 week'
order by
  t.author
limit 5
Here is the answer:
+----------------+---------------------------------------------------------------+------------------------------------->
| twitter_user   | url                                                           | text                                >
+----------------+---------------------------------------------------------------+------------------------------------->
| idealoTech     | https://twitter.com/idealoTech/status/1524688985649516544     | Are you able to find creative soluti>
|                |                                                               |                                     >
|                |                                                               | Join our @codility Order #API Challe>
|                |                                                               |                                     >
|                |                                                               | #idealolife #codility #php          >
| idealoTech     | https://twitter.com/idealoTech/status/1526127469706854403     | Our #ProductDiscovery team at idealo>
|                |                                                               |                                     >
|                |                                                               | Think you can solve it? 😎           >
|                |                                                               | ➡️ https://t.co/ELfUfp94vB https://t>
| ioannides_alex | https://twitter.com/ioannides_alex/status/1525049398811574272 | RT @scikit_learn: scikit-learn 1.1 i>
|                |                                                               | What's new? You can check the releas>
|                |                                                               |                                     >
|                |                                                               | pip install -U…                     >
| andfanilo      | https://twitter.com/andfanilo/status/1524999923665711104      | @edelynn_belle Thanks! Sometimes it >
| andfanilo      | https://twitter.com/andfanilo/status/1523676489081712640      | @juliafmorgado Good luck on the reco>
|                |                                                               |                                     >
|                |                                                               | My advice: power through it + a dead>
|                |                                                               |                                     >
|                |                                                               | I hated my first few short videos bu>
|                |                                                               |                                     >
|                |                                                               | Looking forward to the video 🙂
When APIs frictionlessly become tables, you can devote your full attention to reasoning over the abstractions represented by those APIs. Larry Wall, the creator of Perl, famously said: “Easy things should be easy, hard things should be possible.” The first two examples are things that should be, and are, easy: each is just 10 lines of simple, straight-ahead SQL that requires no wizardry at all.
The third example is a harder thing. It would be hard in any programming language. But SQL makes it possible in a few nice ways. The solution is made of concise stanzas (CTEs, Common Table Expressions) that form a pipeline. Each phase of the pipeline handles one clearly-defined piece of the problem. You can validate the output of each phase before proceeding to the next. And you can do all this with the most mature and widely-used grammar for selection, filtering, and recombination of data.
Do I have to use SQL?
No! If you like the idea of mapping APIs to tables, but you’d rather reason over those tables in Python or R dataframes, then Steampipe can oblige. Under the covers it’s Postgres, enhanced with foreign data wrappers that handle the API-to-table transformation. Anything that can connect to Postgres can connect to Steampipe, including SQL drivers like Python’s psycopg2 and R’s RPostgres as well as business-intelligence tools like Metabase, Tableau, and PowerBI. So you can use Steampipe to frictionlessly consume APIs into dataframes, then reason over the data in Python or R.
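Here’s a minimal sketch of that workflow in Python using psycopg2 and pandas. The connection details below are placeholders: steampipe service start reports the real host, port, database, user, and password for your instance.

import pandas as pd
import psycopg2

# Steampipe exposes an ordinary Postgres endpoint. These values are
# placeholders; `steampipe service start` prints the real ones.
conn = psycopg2.connect(
    host="localhost",
    port=9193,
    dbname="steampipe",
    user="steampipe",
    password="your-password-here",
)

# Any Steampipe query works here; this is example 1 from above.
df = pd.read_sql(
    """
    select id, text
    from twitter_search_recent
    where query = 'pyspark'
    order by created_at desc
    limit 5;
    """,
    conn,
)
print(df)

The same move helps with pipelines like example 3: wrap any single stanza in its own query, pull it into a dataframe, and validate that stage before composing the whole thing.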
But if you haven’t used SQL in this way before, it’s worth a look. Consider this comparison of SQL to Pandas from How to rewrite your SQL queries in Pandas.
| SQL | Pandas |
|---|---|
| select * from airports | airports |
| select * from airports limit 3 | airports.head(3) |
| select id from airports where ident = 'KLAX' | airports[airports.ident == 'KLAX'].id |
| select distinct type from airport | airports.type.unique() |
| select * from airports where iso_region = 'US-CA' and type = 'seaplane_base' | airports[(airports.iso_region == 'US-CA') & (airports.type == 'seaplane_base')] |
| select ident, name, municipality from airports where iso_region = 'US-CA' and type = 'large_airport' | airports[(airports.iso_region == 'US-CA') & (airports.type == 'large_airport')][['ident', 'name', 'municipality']] |
We can argue the merits of one style versus the other, but there’s no question that SQL is the most universal and widely-implemented way to express these operations on data. So no, you don’t have to use SQL to its fullest potential in order to benefit from Steampipe. But you might find that you want to.
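The two styles also compose. Once a Steampipe table lands in a dataframe, the pandas side of that comparison applies directly; here’s a quick sketch reusing the hypothetical conn from the psycopg2 example above.

# Pull example 2's GitHub table into a dataframe...
repos = pd.read_sql(
    "select name, owner_login, stargazers_count "
    "from github_search_repository where query = 'pyspark'",
    conn,
)

# ...then slice it in either idiom: SQL upstream, or pandas downstream.
popular = repos[repos.stargazers_count > 2000]
top3 = repos.sort_values("stargazers_count", ascending=False).head(3)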