Saturday, 19 September 2015

Exclusive Interview: Matei Zaharia, creator of Apache Spark, on Spark, Hadoop, Flink, and Big Data in 2020

 

Apache Spark, a fast general engine for Big Data processing, is one of the hottest Big Data technologies of 2015. It was created by Matei Zaharia, a brilliant young researcher, when he was a graduate student at UC Berkeley around 2009. Since....[More]

Apache Spark 1.5 presented by Databricks co-founder Patrick Wendell

 

Spark 1.5 ships Project Tungsten, a cross-cutting performance initiative that uses binary memory management and code generation to dramatically improve the latency of most Spark jobs. This release also includes several updates to Spark's DataFrame API and SQL optimizer, along with new Machine Learning algorithms and feature transformers, and several new features in Spark's native streaming engine.


Spark DataFrames: Simple and Fast Analysis of Structured Data



This session will provide a technical overview of Spark’s DataFrame API. First, we’ll review the DataFrame API and show how to create DataFrames from a variety of data sources, such as Hive tables, relational databases, and structured file formats like Avro. We’ll then give example user programs that operate on DataFrames and point out...[More]
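
As a rough illustration of that API, here is a minimal PySpark sketch using the Spark 1.5-era DataFrame reader; the file name, column names, and app name ("people.json", age, city) are placeholder assumptions, not taken from the session:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="dataframe-demo")
sqlContext = SQLContext(sc)

# Load a DataFrame from a structured file (JSON here; Avro works the
# same way via the spark-avro package and its format name).
df = sqlContext.read.json("people.json")

# The same reader covers other sources, e.g. a relational database:
# df = sqlContext.read.format("jdbc").options(url=..., dbtable=...).load()

# DataFrame operations are declarative and run through the SQL optimizer.
df.filter(df["age"] > 21).groupBy("city").count().show()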

Saturday, 5 September 2015

Advanced Spark

A good talk from Spark Summit...

New Features in Machine Learning Pipelines in Spark 1.4 

 

Spark 1.2 introduced Machine Learning (ML) Pipelines to facilitate the creation, tuning, and inspection of practical ML workflows. Spark’s latest release, Spark 1.4, significantly extends the ML library. In this post, we highlight several new features in the....[More]
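
A minimal sketch of what such a pipeline looks like in PySpark (Spark 1.4-era spark.ml API), assuming DataFrames training and test with "text" and "label" columns:

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

# Three stages: tokenize text, hash tokens into feature vectors,
# then fit a logistic regression on those features.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fitting runs each stage in order; the fitted model replays the
# same stages on new data.
model = pipeline.fit(training)
predictions = model.transform(test)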

ML Pipelines: A New High-Level API for MLlib

MLlib’s goal is to make practical machine learning (ML) scalable and easy. Besides the new algorithms and performance improvements that we have seen in each release, a great deal of time and effort has been spent on making MLlib easy to use. Similar to Spark Core, MLlib provides APIs in three languages: Python, Java, and Scala, along with user guides and example code, to ease the learning curve for users...[More]


Simplify Machine Learning on Spark with Databricks 

 

As many data scientists and engineers can attest, the majority of the time is spent not on the models themselves but on the supporting infrastructure. Key issues include the ability to easily visualize, share, deploy, and schedule jobs. More disconcerting is the need for data engineers to re-implement the models developed by data scientists for production. With Databricks, data scientists and engineers can simplify these....[More]

Scalable Collaborative Filtering with Spark MLlib 

 

Recommendation systems are among the most popular applications of machine learning. The idea is to predict whether a customer would like a certain item: a product, a movie, or a song. Scale is a key concern for recommendation systems, since computational complexity increases with the size of a company’s customer base. In this blog post, we discuss how Spark MLlib enables building recommendation  .....[More]

Spark MLlib - Use Case


In this chapter, we will use MLlib to make personalized movie recommendations tailored for you. We will work with 10 million ratings from 72,000 users on 10,000 movies, collected..[More]
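
For flavor, here is a minimal PySpark sketch of the collaborative-filtering approach the chapter describes, using MLlib's ALS; the file name "ratings.dat" and the "::" delimiter are assumptions based on the MovieLens format, and sc is an existing SparkContext:

from pyspark.mllib.recommendation import ALS, Rating

# Parse MovieLens-style lines: userId::movieId::rating::timestamp.
data = sc.textFile("ratings.dat")
ratings = data.map(lambda l: l.split("::")) \
              .map(lambda p: Rating(int(p[0]), int(p[1]), float(p[2])))

# Factor the user-movie rating matrix with alternating least squares;
# rank and iterations are illustrative knobs, not tuned values.
model = ALS.train(ratings, rank=10, iterations=10)

# Predict how user 1 would rate movie 42.
print(model.predict(1, 42))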

Apache Spark - MLlib Introduction

 

In one of our earlier posts we mentioned that we use Scalding (among others) for writing MR jobs. Scala/Scalding simplifies the implementation of many MR patterns and makes it easy to implement quite complex jobs like machine learning algorithms. MapReduce is a mature and widely used framework, and a good choice for processing large amounts of data – but not as great if you’d like to use it for fast, iterative algorithms and processing. This is a use case...[More]




Friday, 4 September 2015

Demo: Apache Spark on MapR with MLlib 

 Editor's Note: In this demo we are using Spark and PySpark to process and analyze the data set, calculate aggregate statistics about the user base in a PySpark script, persist all of that back into MapR-DB for use in Spark and Tableau, and finally use MLlib to build ...[more] 


Big Data Ecosystem – Spark and Tableau

 

In this article we give you the big picture of how Big Data fits in your actual BI architecture and how to connect Tableau to Spark to enrich your current BI reports and dashboards with data ...Read More....

Realtime Advanced Analytics: Spark Streaming+Kafka, MLlib/GraphX, SQL/DataFrames

We'll present a real-world, open source, advanced analytics and machine learning pipeline using all "15" Open Source technologies listed below.
This Meetup is based on my recent "Top-5" Hadoop Summit/Data Science talk called "Spark After Dark". Spark After Dark is a mock online dating site...[Read More]

 

Tips to Create a Proposal


If you're in the services or consulting business, you know all about RFPs: Requests for Proposal are how many professional agencies win new work. NMC receives a lot of them from organizations around the world wanting either to upgrade their existing web presence or start from scratch with a new one. Some of them are clear, detailed, and provide the right kind of information to help us quickly write a great proposal. Others, not so much! Keeping up with web technologies that change daily is a full-time job, which is probably why they're looking for help.....

Spark Streaming

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or plain old TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join, and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark’s machine learning algorithms, ......
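
To make that concrete, here is a minimal PySpark streaming word count; the TCP socket source, host/port, and 10-second batch interval are arbitrary choices, and sc is an existing SparkContext:

from pyspark.streaming import StreamingContext

# One micro-batch every 10 seconds, fed from a plain TCP socket.
ssc = StreamingContext(sc, 10)
lines = ssc.socketTextStream("localhost", 9999)

# The same high-level functions (flatMap, map, reduceByKey) used on
# ordinary RDDs apply to each batch of the stream.
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)

counts.pprint()         # push each batch's counts to the console
ssc.start()             # start ingesting
ssc.awaitTermination()  # run until stopped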

Saturday, 17 January 2015

Part four of the API design best practices series. Read part one: Plan Your API.
Or jump to part one of the hypermedia sub-series.

A Road Trip

First off, let me apologize for the delay in this third part of the hypermedia sub-series. Christmas meant a warm trip back to Minnesota, a road trip through the Texas panhandle, and numerous snow storms in between — until finally I had the chance to cut through the mountainous desert of Southern California on my way back to beautiful San Francisco.
Now I understand some of you are probably wondering what any of that has to do with this post, other than it’s about 3 weeks after promised. One of the greatest challenges of the drive was battling my way through the snow and construction, and just hoping that the interstate would stay open (they literally close the interstates if it’s bad enough). But the one thing I could be sure of was that at every turn, between my steady GPS and road signs, I knew where I was going, and I knew when my path was being detoured or I couldn’t take a certain road… I knew this, because everything was nice and uniform.
In a lot of ways, APIs are like roads — they are designed to help us transport data from one point to another. But unfortunately, unlike the DOT system that spans the country, the directions (hypermedia) aren’t always uniform, and depending on the API we use, we’ll probably have to utilize a different hypermedia spec — one that may or may not provide the same information as others.
What essentially every hypertext linking spec does provide is a name for the link and a hypertext reference, but outside of that, it’s a crapshoot. As such, it’s important to understand the different specs that are out there, which ones are leading the industry, and which ones meet your needs. We may not be able to get it down to one spec, but at least we’ll be able to provide our users with a uniform response that they can easily incorporate into their application:

Collection+JSON

Collection+JSON is a JSON-based read/write hypermedia type designed by Mike Amundsen back in 2011 to support the management and querying of simple collections. It’s based on the Atom Publishing and Syndication specs, defining both in a single spec and supporting simple queries through the use of templates. While originally widely used among APIs, Collection+JSON has struggled to maintain its popularity against newer specs such as JSON API and HAL.
{ "collection" :
{
"version" : "1.0",
"href" : "http://example.org/friends/",
"links" : [
{"rel" : "feed", "href" : "http://example.org/friends/rss"},
{"rel" : "queries", "href" : "http://example.org/friends/?queries"},
{"rel" : "template", "href" : "http://example.org/friends/?template"}
],
"items" : [
{
"href" : "http://example.org/friends/jdoe",
"data" : [
{"name" : "full-name", "value" : "J. Doe", "prompt" : "Full Name"},
{"name" : "email", "value" : "jdoe@example.org", "prompt" : "Email"}
],
"links" : [
{"rel" : "blog", "href" : "http://examples.org/blogs/jdoe", "prompt" : "Blog"},
{"rel" : "avatar", "href" : "http://examples.org/images/jdoe", "prompt" : "Avatar", "render" : "image"}
]
}
]
}
}
Strengths: strong choice for collections, templated queries, early wide adoption, recognized as a standard
Weaknesses: JSON only, lack of identifier for documentation, more complex/difficult to implement

JSON API

JSON API is a newer spec, created in 2013 by Steve Klabnik and Yehuda Katz. It was designed to ensure separation between clients and servers (an important aspect of REST) while also minimizing the number of requests without compromising readability, flexibility, or discovery. JSON API has quickly become a favorite, receiving wide adoption, and is arguably one of the leading specs for JSON-based APIs. JSON API currently bears a warning that it is a work in progress and, while widely adopted, not necessarily stable.
{
  "links": {
    "posts.author": {
      "href": "http://example.com/people/{posts.author}",
      "type": "people"
    },
    "posts.comments": {
      "href": "http://example.com/comments/{posts.comments}",
      "type": "comments"
    }
  },
  "posts": [{
    "id": "1",
    "title": "Rails is Omakase",
    "links": {
      "author": "9",
      "comments": [ "5", "12", "17", "20" ]
    }
  }]
}
Strengths: simple versatile format, easy to read/implement, flat link grouping, URL templating, wide adoption, strong community, recognized as a hypermedia standard
Weaknesses: JSON only, lack of identifier for documentation, still a work in progress

HAL

HAL is an older spec, created in 2011 by Mike Kelly to be easily consumed across multiple formats including XML and JSON. One of the key strengths of HAL is that it is nestable, meaning that _links can be incorporated within each item of a collection. HAL also incorporates CURIEs, a feature that makes it unique in that it allows for the inclusion of documentation links in the response – although they are tightly coupled to the link name. HAL is one of the most supported and most widely used hypermedia specs out there today, and is surrounded by a strong and vocal community.
{
  "_links": {
    "self": { "href": "/orders" },
    "curies": [{ "name": "ea", "href": "http://example.com/docs/rels/{rel}", "templated": true }],
    "next": { "href": "/orders?page=2" },
    "ea:find": {
      "href": "/orders{?id}",
      "templated": true
    },
    "ea:admin": [{
      "href": "/admins/2",
      "title": "Fred"
    }, {
      "href": "/admins/5",
      "title": "Kate"
    }]
  },
  "currentlyProcessing": 14,
  "shippedToday": 20,
  "_embedded": {
    "ea:order": [{
      "_links": {
        "self": { "href": "/orders/123" },
        "ea:basket": { "href": "/baskets/98712" },
        "ea:customer": { "href": "/customers/7809" }
      },
      "total": 30.00,
      "currency": "USD",
      "status": "shipped"
    }, {
      "_links": {
        "self": { "href": "/orders/124" },
        "ea:basket": { "href": "/baskets/97213" },
        "ea:customer": { "href": "/customers/12369" }
      },
      "total": 20.00,
      "currency": "USD",
      "status": "processing"
    }]
  }
}
Strengths: dynamic, nestable, easy to read/implement, multi-format, URL templating, inclusion of documentation, wide adoption, strong community, recognized as a standard hypermedia spec, RFC proposed
Weaknesses: JSON/XML formats architecturally different, CURIEs are tightly coupled
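
To show what consuming CURIEs might look like, here is a small Python sketch (no HAL client library assumed) that expands a CURIE-prefixed link relation from a parsed response like the one above into its documentation URL; expand_curie is a hypothetical helper, not part of the HAL spec:

def expand_curie(hal_doc, rel):
    # Split "ea:find" into the curie prefix and the relation name.
    prefix, _, name = rel.partition(":")
    for curie in hal_doc.get("_links", {}).get("curies", []):
        if curie.get("name") == prefix and curie.get("templated"):
            # The href is a URI template; only {rel} is substituted here.
            return curie["href"].replace("{rel}", name)
    return None

# With the example above:
# expand_curie(doc, "ea:find") -> "http://example.com/docs/rels/find"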

JSON-LD

JSON-LD is a lightweight spec focused on machine to machine readable data. Beyond just RESTful APIs, JSON-LD was also designed to be utilized within non-structured or NoSQL databases such as MongoDB or CouchDB. Developed by the W3C JSON-LD Community group, and formally recommended by W3C as a JSON data linking spec in early 2014, the spec has struggled to keep pace with JSON API and HAL. However, it has built a strong community around it with a fairly active mailing list, weekly meetings, and an active IRC channel.
{
  "@context": "http://json-ld.org/contexts/person.jsonld",
  "@id": "http://dbpedia.org/resource/John_Lennon",
  "name": "John Lennon",
  "born": "1940-10-09",
  "spouse": "http://dbpedia.org/resource/Cynthia_Lennon"
}
Strengths: strong format for data linking, can be used across multiple data formats (Web API & Databases), strong community, large working group, recognized by W3C as a standard
Weaknesses: JSON only, more complex to integrate/ interpret, no identifier for documentation

Siren

Created in 2012 by Kevin Swiber, Siren is a more descriptive spec made up of classes, entities, actions, and links. It was designed specifically for Web API clients in order to communicate entity information, actions for executing state transitions, and client navigation/discoverability within the API. Siren was also designed to allow for sub-entities or nesting, as well as multiple formats including XML – although no example or documentation regarding XML usage is provided. Despite being well-intentioned and versatile, Siren has struggled to gain the same level of attention as JSON API and HAL. Siren is still listed as a work in progress.
{
  "class": [ "order" ],
  "properties": {
    "orderNumber": 42,
    "itemCount": 3,
    "status": "pending"
  },
  "entities": [
    {
      "class": [ "items", "collection" ],
      "rel": [ "http://x.io/rels/order-items" ],
      "href": "http://api.x.io/orders/42/items"
    },
    {
      "class": [ "info", "customer" ],
      "rel": [ "http://x.io/rels/customer" ],
      "properties": {
        "customerId": "pj123",
        "name": "Peter Joseph"
      },
      "links": [
        { "rel": [ "self" ], "href": "http://api.x.io/customers/pj123" }
      ]
    }
  ],
  "actions": [
    {
      "name": "add-item",
      "title": "Add Item",
      "method": "POST",
      "href": "http://api.x.io/orders/42/items",
      "type": "application/x-www-form-urlencoded",
      "fields": [
        { "name": "orderNumber", "type": "hidden", "value": "42" },
        { "name": "productCode", "type": "text" },
        { "name": "quantity", "type": "number" }
      ]
    }
  ],
  "links": [
    { "rel": [ "self" ], "href": "http://api.x.io/orders/42" },
    { "rel": [ "previous" ], "href": "http://api.x.io/orders/41" },
    { "rel": [ "next" ], "href": "http://api.x.io/orders/43" }
  ]
}
Strengths: provides a more verbose spec, query templating, incorporates actions, multi-format
Weaknesses: poor adoption, lacks documentation, work in progress

Other Specs

Along with some of the leading specs mentioned above, new specs are being created every day, including UBER, Mason, Yahapi, and CPHL. This raises an interesting question: are we reinventing the wheel, or is something missing in the specs above? I believe the answer is a combination of both, with developers being notorious for reinventing the wheel, but also because each developer looks at the strengths and weaknesses of other specs and envisions a better way of doing things.
You may recognize this issue from the last post, where some specs were modified by the companies using them to meet their individual needs. For example, PayPal wanted to include methods in their response, but you’ll notice that of the specs above, only Siren includes methods in the link definition.

The Future of Specs

Given that new specs are being created every day, each with different ideas and in different formats, it’s extremely important to keep your system as decoupled and versatile as possible, and it will be very interesting to see what the future of hypermedia specs will look like.
In the meantime, it’s best to choose the spec that meets your needs while also being recognized as a standard for easy integration by developers. Of the specs above, I would personally recommend sticking with HAL or JSON API, although each has its own strengths and weaknesses, and I believe the universal spec of the future has yet to be created. But by adhering to these common specs while the new specs battle things out, I think we will eventually arrive at that standard system of road signs and detours – a single solution that provides API clients with a standardized GPS.
For more on the different specs, I highly recommend reading Kevin Sookocheff’s review. I’d also love to hear your thoughts in the comments below.

The Harsh Reality of the State of Hypermedia Specs

Hypermedia sounds great in theory, but theory only goes so far. Where hypermedia really shines, or completely fails, is in implementation. Unfortunately, as hypermedia is still a relatively new aspect of web-based APIs, there isn’t one specified way of doing things. In fact, you’ll find that even some of the most popular APIs operate completely differently from each other.
After all, there are several different hypermedia formats available for API providers to choose from. Just for starters there are HAL, Collection+JSON, JSON-LD, JSON API, and Siren! But the list doesn’t stop there, as some popular APIs have even elected to create their own format.
For example, while PayPal’s API closely mimics the JSON API format, it goes a step further and adds a method property (not part of the JSON API spec), creating a more flexible spec and transforming it from being resource driven to being action driven:
"links" : [
{
"href" : "https://api.sandbox.paypal.com/v1/payments/payment/PAY-2XR800907F429382MKEBWOSA",
"rel" : "self",
"method" : "GET"
}, {
"href" : "https://api.sandbox.paypal.com/v1/payments/payment/PAY-2XR800907F429382MKEBWOSA/execute",
"rel" : "update",
"method" : "POST"
}
]
This has the potential to let developers create a more agile client based on the actions (and methods) available to them. However, for developers not familiar with PayPal’s format, but familiar with JSON API this may cause slight confusion (although it should be quickly remedied by reading their docs).
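
As a sketch of what such an action-driven client might look like, here is a hypothetical Python helper built on the requests library; follow is my illustration, not part of PayPal's SDK:

import requests

def follow(links, rel, **kwargs):
    """Find the link with the given rel and invoke it with its own method."""
    for link in links:
        if link["rel"] == rel:
            # The HTTP verb comes from the response, not the client,
            # so the server can change an action without breaking callers.
            return requests.request(link["method"], link["href"], **kwargs)
    raise KeyError("no link with rel %r" % rel)

# follow(payment["links"], "update", json=payer_info) would POST to the
# .../execute URL without the client hard-coding either the URL or the verb.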
VerticalResponse, on the other hand, has taken a different, albeit interesting, approach. For their API they likewise start with the basic JSON API format, but for some reason decided against the universally accepted “href” or Hypertext Reference property, instead opting to use “url” or the uniform resource locator as the link URI identifier:
"links": {
"up": {
"url": "https://vrapi.verticalresponse.com/api/v1/contacts"
},
"lists": {
"url": "https://vrapi.verticalresponse.com/api/v1/contacts/1099513934863/lists"
},
"messages": {
"url": "https://vrapi.verticalresponse.com/api/v1/contacts/1099513934863/messages"
},
"stats": {
"url": "https://vrapi.verticalresponse.com/api/v1/contacts/1099513934863/stats"
}
}
Personally, I would recommend staying with the uniform “href” attribute, as it denotes a reference to a hypertext link and is not as exclusive as a URL – which is commonly (but should not be) confused with a URI. But you can read more on that here.
On the other hand, Amazon’s AppStream API, Clarify, Microsoft’s Lync, and FoxyCart all prefer to follow HAL, or the Hypertext Application Language format. HAL provides a simple format for nestable links, but like the other specs it omits the methods property included by PayPal, making PayPal’s approach truly unique in that sense:
{
  "_links": {
    "self": {
      "href": "https://api.foxycart.com/taxes/31588",
      "title": "This Tax"
    },
    "https://api.foxycart.com/rels/store": {
      "href": "https://api.foxycart.com/stores/66",
      "title": "This Store"
    },
    "https://api.foxycart.com/rels/tax_item_categories": {
      "href": "https://api.foxycart.com/taxes/31588/tax_item_categories",
      "title": "Tax Item Category relationships"
    }
  }
}
However, FoxyCart takes it one step further, not only taking advantage of hypermedia but also offering multiple formats for its clients to choose from, including HAL+JSON, HAL+XML, and Siren.
This, however, highlights once again one of the biggest challenges with hypermedia-driven APIs: the abundance of ideas and specs available for execution. On one hand, I believe that by supporting both XML and JSON, as well as multiple JSON formats, FoxyCart is by far the most flexible (format-wise) of these APIs. On the other, the lack of a singular standard for each language forces developers (and hypermedia clients) to support multiple formats as they integrate more and more hypermedia APIs, with the understanding that no one format meets every API’s needs.
The good news is that despite these growing pains, we are starting to see companies adopting certain specs over others, while also identifying areas for improvement (such as PayPal’s addition of methods to JSON API). Next week we’ll take a look at some of the most popular formats out there in-depth, keying in on the strengths and weaknesses of each.
But it’s important that as you build your API, you understand WHY you are building it the way you are. And this extends into how you build your hypermedia links, and whether or not you choose to take advantage of a standardized format (recommended), or venture off on your own to meet your developers’ needs. One of the best ways to do this is to explore what others have done with their APIs, and learn from their successes, and their mistakes.
It’s also important to consider where technology is going. As more and more formats become available and change in popularity, it may be smart to follow FoxyCart’s lead – taking advantage of the spec that best meets your developers’ needs, while also keeping the link format decoupled enough from your data that you are able to return multiple formats based on the content type received. That will allow you to take advantage of this best practice while also being prepared for whatever the future may hold.
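
As a closing sketch of that decoupling, here is a hypothetical Python renderer that keeps links in a neutral internal form and serializes them per the requested content type; the renderings are simplified illustrations, not complete implementations of either spec:

def render_links(links, accept):
    """Serialize one neutral list of links for the requested media type."""
    if accept == "application/hal+json":
        return {"_links": {l["rel"]: {"href": l["href"]} for l in links}}
    if accept == "application/vnd.siren+json":
        return {"links": [{"rel": [l["rel"]], "href": l["href"]} for l in links]}
    # Fall back to a plain JSON map of rel -> href.
    return {"links": {l["rel"]: l["href"] for l in links}}

links = [{"rel": "self", "href": "/orders/42"},
         {"rel": "next", "href": "/orders/43"}]
print(render_links(links, "application/hal+json"))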