Linkis helps you easily connect to various back-end computation/storage engines (Spark, Python, TiDB, ...) and exposes various interfaces (REST, JDBC, Java, ...), with multi-tenancy, high performance, and resource control.
Scriptis is for interactive data analysis with script development (SQL, PySpark, HiveQL), task submission (Spark, Hive), UDF and function management, resource management, and intelligent diagnosis.
The Petastorm library enables single-machine or distributed training and evaluation of deep learning models on datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Starter code for solving real-world text data problems. Includes: Gensim Word2Vec, phrase embeddings, keyword extraction with TF-IDF, text classification with logistic regression, word count with PySpark, simple text preprocessing, pre-trained embeddings, and more.
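As a taste of the kind of task that repo covers, here is a minimal TF-IDF keyword-extraction sketch in plain Python (stdlib only, so it runs anywhere; the corpus and function name are illustrative — a real project would use Gensim or scikit-learn as the repo does):

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_n=3):
    """Score each word by term frequency x inverse document frequency,
    then return the top_n highest-scoring words per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for doc in tokenized for w in set(doc))
    results = []
    for doc in tokenized:
        tf = Counter(doc)
        # Words appearing in every document get log(n/n) = 0, i.e. no weight.
        scores = {w: (tf[w] / len(doc)) * math.log(n_docs / df[w]) for w in tf}
        top = sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]
        results.append([w for w, _ in top])
    return results

docs = [
    "spark makes big data processing fast",
    "python makes data science easy",
    "spark and python together power pyspark",
]
print(tfidf_keywords(docs))
```

Words shared across documents (e.g. "spark", "python") are down-weighted, so each document's distinctive terms surface as its keywords.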
Example project implementing best practices for PySpark ETL jobs and applications.
State of the Art Natural Language Processing
This is a repo documenting the best practices in PySpark.
:truck: Agile Data Science Workflows made easy with PySpark
Microsoft Machine Learning for Apache Spark
Code for Natural Language Processing and Text Generation in TensorFlow 2.x / 1.x
A boilerplate for writing PySpark Jobs
A guide on how to set up Jupyter with PySpark painlessly on AWS EC2 clusters, with S3 I/O support
Code base for the Learning PySpark book (in preparation)
Spark Gotchas: a subjective compilation of Apache Spark tips and tricks
A curated list of awesome Apache Spark packages and resources.
80+ DevOps & Data CLI Tools - AWS, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, Ambari, Blueprints, CloudFormation, Elasticsearch, Solr, Pig, IPython
Jupyter magics and kernels for working with remote Spark clusters
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
PySpark-Tutorial provides basic algorithms using PySpark
PySpark + Scikit-learn = Sparkit-learn
Welcome to DWBIADDA's PySpark tutorial for beginners. In this lecture we will see how to delete duplicate records from a DataFrame, how to delete ...
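In PySpark itself the idiom for this is `df.dropDuplicates()` (optionally with a column subset). A plain-Python sketch of the same keep-the-first-occurrence logic, runnable without Spark (function and sample data are illustrative, not from the tutorial):

```python
def drop_duplicates(records, keys=None):
    """Keep the first occurrence of each record, optionally comparing only
    a subset of keys - mirroring PySpark's DataFrame.dropDuplicates(subset=...)."""
    seen = set()
    out = []
    for rec in records:
        # Full-row dedup by default; dedup on selected columns if keys given.
        key = tuple(rec[k] for k in keys) if keys else tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            out.append(rec)
    return out

rows = [
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob"},
    {"id": 1, "name": "alice"},
    {"id": 1, "name": "alicia"},
]
print(drop_duplicates(rows))               # exact-duplicate rows removed
print(drop_duplicates(rows, keys=["id"]))  # first row per id kept
```

Note that, like `dropDuplicates` with a subset, deduplicating on `id` silently discards rows whose other columns differ.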
Finding policies that lead to optimal outcomes is one of the most difficult challenges facing decision makers within an organization.
PySpark 101 Tutorial. Website: https://www.datamaking.com/ Blog: https://www.datasciencewiki.com/ ...
In this video I explain how to read Hive table data using the HiveContext, Spark SQL's entry point for querying Hive tables, demonstrated in the PySpark shell ...
Filmed at PyData Barcelona 2017: https://pydata.org/barcelona2017/schedule/presentation/42/ (www.pydata.org). PyData is an educational program of ...
PyData Amsterdam 2016. This talk assumes you have a basic understanding of Spark (if not, check out one of the intro videos on YouTube) ...