Dataworks
Hot-swap API endpoints and stream processors using Clojure and Dataworks
Install / Use
/learn @acgollapalli/DataworksREADME
- Dataworks ** Introduction Dataworks is a distributed-by-default stored-function engine with REST API and stream processing capabilities built on top of a bitemporal, graph-query enabled document-store. ** Some Examples Using the utils in the dataworks.dev.utils namespace, we can define a new endpoint (or collector as we call it) on a running dataworks cluster. (Make sure to point the url at your dataworks cluster.) #+BEGIN_SRC clojure (def-collector :hello-world "hello-world" {:id :hello :methods {:get {:produces #{text/plain} :response (fn [ctx] "Hello world!")}}}) (send-fn :collector/hello) #+END_SRC
Now your endpoint is available at <your-app-url>/user/hello-world. Try it at the command line: #+BEGIN_SRC shell curl localhost:3000/user/hello-world #+END_SRC
That endpoint will be added on all of the nodes in your cluster, without any intervention on your part.
Let's say you want to change it: #+BEGIN_SRC clojure (def-collector :hello-world "hello-world" {:id :hello :methods {:get {:produces #{text/plain} :response (fn [ctx] "Good-by cruel world!")}}}) (send-fn :collector/hello) #+END_SRC
Try and get that same endpoint after having made the changes, it should return the new value!
Let's say we want to do some stream processing. That's easy too. Let's take from a kafka topic, conveniently called "input ", increase the value, and then send it to another topic, conveniently called "output": #+BEGIN_SRC clojure (def-stream :kafka/input) (send-fn :stream/kafka/input)
(def-stream :stream/process {:buffer 5 :transducer (comp (map :value) (map inc)) :upstream #{:kafka/input}}) (send-fn :stream/stream/process)
(def-stream :kafka/output {:upstream #{:stream/process}}) (send-fn :stream/kafka/output) #+END_SRC
Want your process to decrement instead of increment? #+BEGIN_SRC clojure (def-stream :stream/process {:buffer 5 :transducer (comp (map :value) (map dec)) :upstream #{:kafka/input}}) (send-fn :stream/stream/process) #+END_SRC
All the messages from before you changed the stream processor will be incremented, but all the ones after you made the change will be decremented.
Now let's say you want to add another node. All you have to do is point it at a running kafka broker in your config.edn and start it up. Your database, stream-processors, and api endpoints will automagically stay in sync.
Easy as pi!
** Technical Case *** What does it do? And what is it good for?
- Dataworks is an easy and powerful way to build and deploy REST API's
- Dataworks is an easy and powerful way to build and deploy stream processors
- Dataworks stores your code in a database, and synchronizes with your code/database as it changes. The code is evaluated at runtime, and on change. This includes API endpoints and routing. Changes are propagated to all instances automatically. *** There are plenty of other easy and powerful ways to do those things. What makes this different?
- Normally, when you modify a REST API, unless each endpoint is it's own process, you have to redeploy the entire API or service. With Dataworks you can hot swap endpoints in a running application, without effecting any other endpoints on your service. You can add new functionality, without effecting any of the old functionality. It takes much of the complexity of continuous deployment out of the equation. When you modify, add, or delete code on one instance, the changes propagate to all the nodes. Which means deployment is essentially only ever done on upgrades of dataworks itself, and the rest is just updating data in a business application.
- You can use transducers on kafka topics as though they were clojure.core.async channels. If you're a clojurian, then you probably already have a feeling of how powerful that is. If not, then just think about it as being able to do stream processing with a very powerful programming language and next-to-no configuration.
- Your db, stream processing, and web api are seamlessly integrated, and are themselves just clojure functions. Dataworks ships with [[https://opencrux.com/][openCRUX]] for its database, meaning that your dataworks node is also a node of your distributed database, and your database is always close at hand.
- Configuration is as simple as pointing your dataworks instance to a kafka broker. *** For clojure hackers If you're a clojure hacker, then you're familiar with the existing way of creating and running clojure programs:
- create a namespace
- add dependencies
- write your functions one after the other, each depending on previous functions.
- recur
- When you have the functionality you want, then you create an uberjar, and deploy it in whatever manner you or your company see fit to do so. Dataworks... doesn't really follow that model. As we all know, "code is data", so dataworks does what we always do with data, which is to store it in a database, specifically in a graph database. (While we don't really exploit this capability to the fullest, we will be looking to in the future). Each stored-function get's it's own document and is treated as its own entity. It is evaluated at runtime, and reevaluated as code changes. This presents advantages and disadvantages. **** Advantages
- Instances do not have to be redeployed when code changes. Because instances are kept synchronized with the database as the application runs, number of deployments is minimalized, and operations complexity reduced.
- Your code can be queried. (we're working on this, give it time)
- Instead of working at the level of the file or namespace, you're really working at the level of the function. So far as the application is concerned, what is changing, all that is really changing, are functions. You're no longer managing services, or jar files. You're managing functions.
- You can easily make dataworks do things that you wouldn't expect, because in dataworks you can do anything. It's just a way of storing and deploying functions, with some handy utils built in. You might even say "It's just a library," although it really is more on the framework side of things. Actually, it's more like "the thing your code runs on", not even a part of /your/ code at all.
- Dependencies are handled on the level of the function, not the namespace. (and even that is a work in progress)
- You no longer have to worry about circular dependencies, because they're allowed.
- You no longer have to wait for
lein uberjarto create a build. - It's cool. **** Disadvantages
- We don't yet have records, protocols, or multimethods. If you really, really, need those things, then you might want to wait a couple releases.
- Dependencies are added by adding to the classpath. We don't yet have an automated way to handle this. (will be handled before 1.0 release)
- Dependencies are handled on the level of the function, not the namespace. As mentioned before, the way dependencies are handled in dataworks is slightly different. We believe this is actually an advantage in the long run, but some may disagree. It's worth noting that so long as functions/classes are on the same classpath of your process, the code is always accessible. There isn't really any isolation of dependencies, but this is true of clojure in general, to the best of my knowledge.
- You can't just
lein uberjaryour build. You have to send your code to dataworks via REST API. This is also, debateably, an advantage, and we did it because we believe that it is one.
*** Caveat Hacker This release (0.5) is a naive release. If premature optimization is the root of all evil, then we shall be good's greatest friend, as in these paren-wrapped files there is no optimization in danger of being premature. All implementation of functionality is thoroughly naive, and sometimes downright crude. As we have done our best to choose good bits of code and get out of their way, I would not be surprised if you got good performance right out of the gate. But I would be even more unsurprised if you didn't, and it were entirely my fault. So don't use it in critical production applications yet. And if you do choose to use it in a critical production application, do read the source code and judge it for yourself. If you believe in test driven development, then I should warn you that there are no tests. Writing tests, writing optimizations, and capturing edge-cases/corner-cases are all things that will come as we proceed to 1.0. This code is at version 0.5 for a reason: it is only halfway to where it needs to be. ** Business Case For many years the way of managing the business logic of enterprise systems was by using stored procedures in a SQL database (at the behest of DBA's primarily). For many businesses, the SQL database is the single most important part of their entire operation, the coordinating capstone, without which the enterprise would not be able to function. The management of business logic within the SQL database itself allowed for the management of access to the database, as well as optimization and management by database administrators in order to preserve the integrity and availability of the SQL database, and thus the information heart of the enterprise.
Due to the increasing requirements of programmers in order to create more powerful applications for the sake of the enterprise, such an architecture became infeasible, as the stored procedure language, SQL presented insuficient capabilities for creating abstractions, resulting in productivity loss and lengthy, expensive development of new features and business functionality. Thus programmers began creating applications which called the SQL database, but were not contained externally. This resulted in multiple codebases, multiple projects, multiple project managers, and many different pipelines to developing business functionality, all of which increases com
