Scriptis is for interactive data analysis with script development (SQL, PySpark, HiveQL), task submission (Spark, Hive), UDF, function, resource management and intelligent diagnosis.
Linkis helps easily connect to various back-end computation/storage engines (Spark, Python, TiDB...) and exposes various interfaces (REST, JDBC, Java...), with multi-tenancy, high performance, and resource control.
The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
NLP, text mining, and machine learning starter code to solve real-world text data problems. Includes: Gensim Word2Vec, phrase embeddings, keyword extraction with TF-IDF, text classification with logistic regression, word count with PySpark, simple text pre
Example project implementing best practices for PySpark ETL jobs and applications.
State of the Art Natural Language Processing
This is a repo documenting the best practices in PySpark.
:truck: Agile Data Science Workflows made easy with PySpark
Microsoft Machine Learning for Apache Spark
Process Human Text in TensorFlow / Sklearn / PySpark
A boilerplate for writing PySpark Jobs
A guide on how to set up Jupyter with PySpark painlessly on AWS EC2 clusters, with S3 I/O support
Code base for the Learning PySpark book (in preparation)
Spark Gotchas. A subjective compilation of Apache Spark tips and tricks
A curated list of awesome Apache Spark packages and resources.
75+ DevOps CLI Tools - Spark, HBase, Hadoop, Log Anonymizer, Ambari Blueprints, AWS CloudFormation, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Elasticsearch, Solr, Hive, Impala, Pig, Travis CI, IPython - Python
Jupyter magics and kernels for working with remote Spark clusters
Apache Spark & Python (PySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
PySpark-Tutorial provides basic algorithms using PySpark
PySpark + Scikit-learn = Sparkit-learn
Finding policies that lead to optimal outcomes for an organization is one of the most difficult challenges facing its decision makers.
My website: https://www.datamaking.com/ My blog: https://www.datasciencewiki.com/ PySpark 101 Tutorial: ...
In this video I explain how to read Hive table data using HiveContext, which is a SQL execution engine. I explain it using the pyspark shell ...
Filmed at PyData Barcelona 2017 https://pydata.org/barcelona2017/schedule/presentation/42/ www.pydata.org PyData is an educational program of ...
PyData Amsterdam 2016 Description: This talk assumes you have a basic understanding of Spark (if not, check out one of the intro videos on YouTube ...