Saturday, 19 September 2015
Apache Spark 1.5 presented by Databricks co-founder Patrick Wendell
Spark 1.5 ships Spark's Project Tungsten initiative, a cross-cutting performance update that uses binary memory management and code generation to dramatically reduce the latency of most Spark jobs. This release also includes several updates to Spark's DataFrame API and SQL optimizer, along with new machine learning algorithms and feature transformers, and several new features in Spark's native streaming engine.
Spark DataFrames: Simple and Fast Analysis of Structured Data
This session will provide a technical overview of Spark’s DataFrame API. First, we’ll review the DataFrame API and show how to create DataFrames from a variety of data sources such as Hive, relational databases, or structured file formats like Avro. We’ll then give example user programs that operate on DataFrames and point out...[More]
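The select/filter pattern that excerpt describes can be illustrated with a minimal pure-Python sketch. This is an analogy, not actual Spark code (real programs would use pyspark.sql.SparkSession and DataFrame.filter/select); the rows, column names, and threshold below are hypothetical.

```python
# Pure-Python sketch of the DataFrame select/filter idiom.
# In real Spark, these would be DataFrame.filter and DataFrame.select.
rows = [
    {"name": "ads", "events": 120},
    {"name": "search", "events": 45},
    {"name": "feed", "events": 300},
]

def df_filter(rows, predicate):
    """Analogue of DataFrame.filter: keep rows matching a predicate."""
    return [r for r in rows if predicate(r)]

def df_select(rows, *cols):
    """Analogue of DataFrame.select: project a subset of columns."""
    return [{c: r[c] for c in cols} for r in rows]

busy = df_select(df_filter(rows, lambda r: r["events"] > 100), "name")
print(busy)  # [{'name': 'ads'}, {'name': 'feed'}]
```

The same chaining style carries over directly to the real API, where the optimizer can additionally push filters down into the data source.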
Saturday, 5 September 2015
New Features in Machine Learning Pipelines in Spark 1.4
Spark 1.2 introduced Machine Learning (ML) Pipelines to facilitate the creation, tuning, and inspection of practical ML workflows. Spark’s latest release, Spark 1.4, significantly extends the ML library. In this post, we highlight several new features in the....[More]
ML Pipelines: A New High-Level API for MLlib
MLlib’s goal is to make practical machine learning (ML) scalable and easy. Besides new algorithms and performance improvements that we have seen in each release, a great deal of time and effort has been spent on making MLlib easy. Similar to Spark Core, MLlib provides APIs in three languages: Python, Java, and Scala, along with user guide and example code, to ease the learning curve for users...[More]
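The pipeline abstraction the two posts above describe is a chain of stages, each exposing fit/transform, applied in order. Here is a hedged pure-Python sketch of that pattern; the stage classes and data are hypothetical, and real code would use pyspark.ml.Pipeline with MLlib transformers and estimators.

```python
# Sketch of the ML Pipelines pattern: stages chained via fit/transform.
class Scaler:
    """Toy estimator: learns the max, then scales values into [0, 1]."""
    def fit(self, data):
        self.max = max(data) or 1.0
        return self
    def transform(self, data):
        return [x / self.max for x in data]

class Shifter:
    """Toy transformer: shifts values down by a constant."""
    def fit(self, data):
        return self
    def transform(self, data):
        return [x - 0.5 for x in data]

class Pipeline:
    """Runs each stage's fit then transform, feeding output to the next."""
    def __init__(self, stages):
        self.stages = stages
    def fit_transform(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return data

out = Pipeline([Scaler(), Shifter()]).fit_transform([2.0, 4.0])
print(out)  # [0.0, 0.5]
```

Because every stage shares the same interface, stages can be swapped or reordered without changing the surrounding code, which is what makes pipeline tuning and inspection practical.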
Simplify Machine Learning on Spark with Databricks
As many data scientists and engineers can attest, the majority of their time is spent not on the models themselves but on the supporting infrastructure. Key issues include the ability to easily visualize, share, deploy, and schedule jobs. More disconcerting is the need for data engineers to re-implement the models developed by data scientists for production. With Databricks, data scientists and engineers can simplify these...[More]
Scalable Collaborative Filtering with Spark MLlib
Recommendation systems are among the most popular applications of machine learning. The idea is to predict whether a customer would like a certain item: a product, a movie, or a song. Scale is a key concern for recommendation systems, since computational complexity increases with the size of a company’s customer base. In this blog post, we discuss how Spark MLlib enables building recommendation .....[More]
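The latent-factor idea behind MLlib's recommender (ALS) can be sketched in a few lines: each user and item gets a small factor vector, and the predicted rating is their dot product. The factors below are hand-picked for illustration rather than learned, and all names are hypothetical; real code would use pyspark.mllib.recommendation.ALS to fit the factors from the ratings matrix.

```python
# Minimal latent-factor sketch of the collaborative-filtering model
# ALS fits: predicted rating = dot(user_factors, item_factors).
user_factors = {"alice": [1.0, 0.2], "bob": [0.1, 1.0]}
item_factors = {"matrix": [0.9, 0.1], "titanic": [0.2, 0.8]}

def predict(user, item):
    """Predicted rating as the dot product of the two factor vectors."""
    u, v = user_factors[user], item_factors[item]
    return sum(a * b for a, b in zip(u, v))

def recommend(user):
    """Rank all items for a user by predicted rating, best first."""
    return sorted(item_factors, key=lambda i: predict(user, i), reverse=True)

print(recommend("alice"))  # ['matrix', 'titanic']
```

Scale enters through the factorization itself: instead of scoring every user-item pair directly, ALS alternates least-squares solves over the (much smaller) factor matrices, which is what makes the approach tractable on a large customer base.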
Spark MLlib - Use Case
In this chapter, we will use MLlib to make personalized movie recommendations tailored for you. We will work with 10 million ratings from 72,000 users on 10,000 movies, collected..[More]
Apache Spark - MLlib Introduction
In one of our earlier posts we have mentioned that we use Scalding (among others) for writing MR jobs. Scala/Scalding simplifies the implementation of many MR patterns and makes it easy to implement quite complex jobs like machine learning algorithms. Map Reduce is a mature and widely used framework and it is a good choice for processing large amounts of data – but not as great if you’d like to use it for fast iterative algorithms/processing. This is a use case...[More]
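The MapReduce pattern the post contrasts with iterative processing can be sketched in pure Python: a map phase emitting (key, value) pairs, a shuffle grouping by key, and a reduce phase aggregating each group. The input text is hypothetical; this is the classic word-count shape, not Scalding or Spark code.

```python
# Pure-Python sketch of the MapReduce word-count pattern:
# map -> shuffle (group by key) -> reduce.
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the grouped counts for each word."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["spark spark scala", "scala"])))
print(counts)  # {'spark': 2, 'scala': 2}
```

The pain point the post raises is that an iterative algorithm must run this whole map-shuffle-reduce cycle (including disk I/O between jobs) once per iteration, which is exactly the overhead Spark's in-memory model avoids.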
Friday, 4 September 2015
Demo: Apache Spark on MapR with MLlib
Editor's Note: In this demo we are using Spark and PySpark to process and analyze the data set, calculate aggregate statistics about the user base in a PySpark script, persist all of that back into MapR-DB for use in Spark and Tableau, and finally use MLlib to build ...[more]
Big Data Ecosystem – Spark and Tableau
In this article we give you the big picture of how Big Data fits in your actual BI architecture and how to connect Tableau to Spark to enrich your current BI reports and dashboards with data ...Read More....
Realtime Advanced Analytics: Spark Streaming+Kafka, MLlib/GraphX, SQL/DataFrames
We'll present a real-world, open source, advanced analytics and machine learning pipeline using all 15 open source technologies listed below. This Meetup is based on my recent Top-5 Hadoop Summit/Data Science talk called "Spark After Dark". Spark After Dark is a mock online dating site...[Read More]
Tips to Create Proposal
If you're in the services or consulting business, you know all about RFPs: Requests for Proposal are how many professional agencies win new work. NMC receives a lot of them from organizations around the world wanting either to upgrade their existing web presence or start from scratch with a new one. Some of them are clear, detailed, and provide the right kind of information to help us quickly write a great proposal. Others, not so much! Keeping up with web technologies that change daily is a full-time job, which is probably why they're looking for.....
Spark Streaming
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or plain old TCP sockets, and be processed using complex algorithms expressed with high-level functions like map, reduce, join, and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark’s machine learning algorithms, ......
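The windowed computation mentioned above can be sketched without Spark: treat the stream as a sequence of micro-batches and count events over a sliding window of the most recent batches. Batch contents and the window length are hypothetical; real code would use DStream.window() or countByWindow() on a StreamingContext.

```python
# Pure-Python sketch of a sliding-window count over micro-batches,
# the pattern Spark Streaming's window operations provide.
from collections import deque

def windowed_counts(batches, window_len):
    """Yield the total event count over the last `window_len` batches."""
    window = deque(maxlen=window_len)  # old batches fall off automatically
    for batch in batches:
        window.append(len(batch))
        yield sum(window)

batches = [["a", "b"], ["c"], ["d", "e", "f"]]
print(list(windowed_counts(batches, window_len=2)))  # [2, 3, 4]
```

In Spark the same idea is expressed declaratively (a window duration and slide interval over a DStream), and the engine handles batching, fault tolerance, and distribution.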