While at Pulse Secure I wrote a script in NodeJS to migrate data from a webchat application (LiveChat) to a marketing application (Marketo) once hourly, based on a conditional decision model, and log any errors encountered. Then, once per day, a second script would email the admin about any new errors.
[Diagram: a simple visual explanation of how the project worked, at a high level.]
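To make the cadence concrete, here's a hypothetical scheduling sketch using node-cron; the actual project may well have relied on a system cron job instead, and runMigration / emailErrorDigest are placeholder names, not functions from the real script.

```javascript
// Hypothetical scheduling sketch (node-cron); the real trigger mechanism isn't
// described here, so treat this as an illustration of the cadence only.
const cron = require('node-cron');

// Placeholders standing in for the two real scripts.
const runMigration = async () => { /* migrate LiveChat data to Marketo, log errors */ };
const emailErrorDigest = async () => { /* email the admin about any new errors */ };

// Once hourly: migrate new LiveChat data into Marketo.
cron.schedule('0 * * * *', () => {
  runMigration().catch((err) => console.error('migration failed:', err));
});

// Once daily: send the error digest email.
cron.schedule('0 8 * * *', () => {
  emailErrorDigest().catch((err) => console.error('digest failed:', err));
});
```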
The purpose of the project was to help keep all marketing data in one place: Marketo's database. However, I could easily do the same sort of API integration project with applications such as Salesforce, Netsuite, or SAP. It's simply a matter of using the API endpoints to return the data you're looking for.
APIs often have constraints, though, in order to make sure their services don't get overburdened by requests to the point that they shut down under the heavy load. Marketo, in particular, has a few constraints my project needed to stay within.
Since Marketo's API constraints are what really made the project interesting, I thought I'd share an overview of how the script overcame them (really, they're more like project criteria, to be rolled up with system requirements as part of systems analysis).
The constraints were interesting because adhering to them required control mechanisms that moderated the speed & volume of the data flow, for both get (data download) & post (data upload) requests:
Batching get requests:
The script begins by checking the date of the last time it ran, and uses that timeframe in its search for LiveChat data. It also refreshes its Marketo API key. Once that's done, it's ready to roll, and the first real step in the data flow is to get the LiveChat data. LiveChat's API doesn't have many constraints to worry about. So, once we have that data, we can compare it to Marketo's data. But first we have to get the data from Marketo in a way that optimizes API usage efficiency.
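As a rough illustration of those startup steps, here's a minimal sketch assuming axios, environment variables for credentials, and a lastRun.txt file holding the previous run's timestamp; the LiveChat endpoint and response shape shown are placeholders, not necessarily the exact API the script called.

```javascript
// Startup sketch: read the last-run timestamp, refresh the Marketo access token,
// then fetch LiveChat data for that window. Endpoint details are illustrative.
const fs = require('fs');
const axios = require('axios');

async function getMarketoToken() {
  // Marketo issues short-lived access tokens from its identity endpoint.
  const res = await axios.get(`${process.env.MARKETO_BASE}/identity/oauth/token`, {
    params: {
      grant_type: 'client_credentials',
      client_id: process.env.MARKETO_CLIENT_ID,
      client_secret: process.env.MARKETO_CLIENT_SECRET,
    },
  });
  return res.data.access_token;
}

async function getRecentLiveChatVisitors(sinceISO) {
  // Placeholder LiveChat call; the real endpoint depends on the API version used.
  const res = await axios.get('https://api.livechatinc.com/chats', {
    params: { date_from: sinceISO },
    auth: {
      username: process.env.LIVECHAT_LOGIN,
      password: process.env.LIVECHAT_API_KEY,
    },
  });
  return res.data.chats || [];
}

async function startRun() {
  // Use the previous run's timestamp as the search window (default: one hour ago).
  const lastRun = fs.existsSync('lastRun.txt')
    ? fs.readFileSync('lastRun.txt', 'utf8').trim()
    : new Date(Date.now() - 60 * 60 * 1000).toISOString();

  const token = await getMarketoToken();
  const visitors = await getRecentLiveChatVisitors(lastRun);

  fs.writeFileSync('lastRun.txt', new Date().toISOString());
  return { token, visitors };
}
```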
The script needed to limit get requests for Marketo data to groups of up to 300. This means each request can return up to 300 rows of Marketo data, based on the 300 email addresses from LiveChat passed into Marketo to look up those rows. (Once I had the LiveChat visitor data, I needed to cross-reference which of the visitors were already listed in our Marketo instance by looking them up by email, used as a unique id, which required batching into groups of 300.)
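Here's a sketch of that batched lookup, assuming Marketo's "get leads by filter type" endpoint (GET /rest/v1/leads.json) and the token from the previous step; the field list and helper names are mine.

```javascript
// Split the LiveChat email addresses into groups of up to 300 and look each
// group up in Marketo with a single GET request.
const axios = require('axios');

function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

async function lookupLeadsByEmail(emails, token) {
  const found = [];
  // e.g. 800 visitor emails become three requests instead of 800.
  for (const batch of chunk(emails, 300)) {
    const res = await axios.get(`${process.env.MARKETO_BASE}/rest/v1/leads.json`, {
      params: {
        access_token: token,
        filterType: 'email',
        filterValues: batch.join(','),
        fields: 'id,email,country,city', // whichever fields the merge step compares
      },
    });
    found.push(...(res.data.result || []));
  }
  return found;
}
```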
Batching post requests:
Once the LiveChat data was compared to the Marketo data, I knew which data needed to float up to Marketo, based on certain conditions. For example: if a visitor's geographical data already existed in Marketo, don't overwrite it; if it didn't exist, add it to the data object for upload.
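A minimal sketch of that conditional merge might look like this; the field names and record shapes are illustrative, not the project's real schema.

```javascript
// Build the upload object for one visitor, only filling in geographical fields
// that Marketo doesn't already have, so existing Marketo data is never overwritten.
function buildLeadUpdate(visitor, existingLead) {
  const update = { email: visitor.email }; // email is the lookup key
  if (!existingLead || !existingLead.country) {
    update.country = visitor.country;
  }
  if (!existingLead || !existingLead.city) {
    update.city = visitor.city;
  }
  return update;
}
```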
There's also a limit of 10 concurrent API requests per second. NodeJS is asynchronous, which makes it very fast, but also a bit tricky to control. Without controlling the stream of post requests (say there were 100 visitors in the past hour), the API constraints would quickly be overwhelmed and Marketo would refuse to accept all of our post requests.
That is, thanks to asynchronicity, those 100 units of visitor data would all be delivered to Marketo within perhaps half a second or so. The problem is that Marketo doesn't want to be pinged so many times, so quickly. It wants some breathing space.
The way I ended up controlling these batches of post requests, so that all 100 data objects don't fire all at once, was to use the NodeJS async library's queue() function, which lets the developer issue the post calls in small batches (adjustable) and at throttled increments (also adjustable). This allowed me to overcome both the rate limit and the concurrency limit.
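Here's a sketch of how that throttled queue can be wired up with async.queue (async v3); the batch size, delay, and the Marketo "sync leads" POST shown here are illustrative values rather than the exact production settings.

```javascript
// Throttled upload: each queued task is one small batch of lead objects, the
// queue runs one batch at a time, and the worker waits before marking itself done.
const async = require('async');
const axios = require('axios');

function pushLeadsToMarketo(leadUpdates, token) {
  const BATCH_SIZE = 4;   // leads per POST request
  const CONCURRENCY = 1;  // batches in flight at once
  const DELAY_MS = 500;   // breathing space before the next batch starts

  const q = async.queue((task, done) => {
    axios
      .post(
        `${process.env.MARKETO_BASE}/rest/v1/leads.json`,
        { action: 'createOrUpdate', lookupField: 'email', input: task.leads },
        { params: { access_token: token } }
      )
      .then(() => setTimeout(done, DELAY_MS)) // throttle between batches
      .catch(done);                           // surface errors to the queue
  }, CONCURRENCY);

  // Feed the queue one batch at a time.
  for (let i = 0; i < leadUpdates.length; i += BATCH_SIZE) {
    q.push({ leads: leadUpdates.slice(i, i + BATCH_SIZE) });
  }

  // async v3: drain() registers a callback for when the last batch completes.
  q.drain(() => console.log('All lead batches uploaded.'));
  return q;
}
```

All three knobs are adjustable: raising BATCH_SIZE or CONCURRENCY, or lowering DELAY_MS, speeds things up at the cost of getting closer to the limits.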
Why does this matter?
Basically, asynchronous functions all fire immediately. With a get request function, though, two main things happen: we fire our function (the "request"), and then we receive a response from the server we're communicating with. This entire process is one concurrent API call. With asynchronous function calls, the speed of the request/response cycle depends a lot on internet latency, which means it's hard to predict how long each request will take.
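As a toy illustration (not code from the project): firing requests in a plain loop starts them all immediately, and each one finishes whenever its own round trip happens to complete, so the number of calls in flight at any instant is unpredictable.

```javascript
// Toy example: all 100 requests start at once; completion order and timing
// depend entirely on network latency, so overlap is hard to predict.
const axios = require('axios');

const urls = Array.from({ length: 100 }, (_, i) => `https://example.com/visitor/${i}`);

urls.forEach((url) => {
  const started = Date.now();
  axios.get(url).then(() => {
    console.log(`${url} finished after ${Date.now() - started}ms`);
  });
});
```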
Now, the concurrency limit says there's a max of 10 per second. So, what we could do is run one batch of 10 per second, right? Wrong. Unfortunately, it's very possible that some of those requests will take a little longer than others. So I had to adjust the batches to something like two batches of four per second, just as an example, which adds up to eight per second. Another example that could work is one batch of seven per 750ms.
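For a quick back-of-the-envelope check on those example schedules (using only the numbers mentioned above):

```javascript
// Effective rate = (batch size * batches per window) / window length in seconds.
const effectiveRate = (batchSize, batchesPerWindow, windowSeconds) =>
  (batchSize * batchesPerWindow) / windowSeconds;

console.log(effectiveRate(4, 2, 1));    // two batches of four per second -> 8/sec
console.log(effectiveRate(7, 1, 0.75)); // one batch of seven per 750 ms  -> ~9.3/sec
// Both stay under a nominal ceiling of 10 per second.
```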
But again, the success of these schedules depends in part on internet latency, hence it's wise to err on the side of being conservative and reduce the API calls per second to whatever amount makes sense, so that it's very unlikely we surpass the limit. This is why asynchronous programming is challenging, fun, and interesting: because it is sometimes difficult to control precisely.