How to Read a Text File From sc in Scala

Spark supports multiple input and output sources for loading and saving data. It can access data through the InputFormat and OutputFormat interfaces used by Hadoop MapReduce, which are available for many file formats and file systems such as text files, JSON files, CSV and TSV files, sequence files, object files, and Hadoop input and output formats. Let us begin.

File Formats

  • Text Files
  • JSON Files
  • CSV and TSV Files
  • Sequence Files
  • Object Files
  • Hadoop Input and Output Formats

File Systems

  • Structured Data with Spark SQL
  • Apache Hive
  • Databases

File Formats

Spark provides a simple way to load and save data files in a very large number of file formats. These formats range from unstructured, like text, to semi-structured, like JSON, to structured, like sequence files. The input file formats that Spark wraps transparently handle compressed formats based on the file extension specified.

Interested in learning Apache Spark? Click here to learn more from this Apache Spark and Scala Training!

Watch this video on 'Apache Spark Tutorial':

Loading and Saving Your Data in Spark

Text Files

Text files are very simple and convenient to load from and save to Spark applications. When we load a single text file as an RDD, each input line becomes an element of the RDD. Spark can also load multiple whole text files at the same time into a pair RDD, with the key being the name of each file and the value being the contents of that file (a sketch of this follows the list below).

  • Loading the text files: Loading a single text file is as simple as calling the textFile() function on our SparkContext with the path to the file, as shown below:
val input = sc.textFile("file:///home/holden/repos/spark/README.md")
  • Saving the text files: Spark provides a function called saveAsTextFile(), which takes a path and writes the contents of the RDD to files under that path. The path is treated as a directory, and Spark will produce multiple output files in that directory. This is how Spark is able to write output from multiple nodes in parallel.
  • Example:
result.saveAsTextFile(outputFile)
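
The whole-file loading mentioned above uses wholeTextFiles(), which returns a pair RDD of file name to file contents. A minimal sketch, assuming a hypothetical directory of log files:

// Load every file under a directory into a pair RDD of (fileName, fileContents).
// The paths here are hypothetical examples.
val files = sc.wholeTextFiles("file:///tmp/logs")
// For instance, count the lines per file, keyed by file name.
val lineCounts = files.mapValues(contents => contents.split("\n").length)
lineCounts.saveAsTextFile("file:///tmp/logs-line-counts")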

JSON Files

JSON stands for JavaScript Object Notation, which is a lightweight data-interchange format. It is plain text, which can be easily sent to and received from a server. Python has a built-in package named 'json' to support JSON.

  • Loading the JSON Files: For all supported languages, the approach of loading the data as text and parsing the JSON records can be adopted. Here, if the file contains multiple JSON records, the programmer will have to load the entire file and parse the records one by one (see the sketch after this list).
  • Saving the JSON Files: In comparison to loading JSON files, writing to them is much easier, as the developer does not have to worry about incorrectly formatted data values. The same libraries that were used to parse the JSON data can be used to write it back out; however, this time, RDDs of structured data are taken and converted into RDDs of strings.
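
A rough sketch of this load-as-text-and-parse approach in Scala, assuming the Jackson library with its Scala module; the Person case class and the file paths are illustrative, not part of the original tutorial:

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Hypothetical record type for the JSON data.
case class Person(name: String, lovesPandas: Boolean)

val jsonInput = sc.textFile("file:///tmp/people.json")
val parsed = jsonInput.mapPartitions { lines =>
  // One mapper per partition avoids repeated setup cost.
  val mapper = new ObjectMapper()
  mapper.registerModule(DefaultScalaModule)
  lines.flatMap { line =>
    try Some(mapper.readValue(line, classOf[Person]))
    catch { case _: Exception => None } // skip malformed records
  }
}

// Saving: serialize the structured records back to JSON strings.
val asJson = parsed.mapPartitions { people =>
  val mapper = new ObjectMapper()
  mapper.registerModule(DefaultScalaModule)
  people.map(mapper.writeValueAsString)
}
asJson.saveAsTextFile("file:///tmp/people-out")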

Want to gain more detailed knowledge of Hadoop? Read this extensive Spark Tutorial!


CSV and TSV Files

Comma-separated values (CSV) files are a very common format used to store tables. These files have a definite number of fields in each line, the values of which are separated by a comma. Similarly, in tab-separated values (TSV) files, the field values are separated by tabs.

  • Loading the CSV Files: The loading process for CSV and TSV files is quite similar to that for JSON files. To load a CSV/TSV file, its content is first loaded as text and then processed. Like JSON, CSV and TSV also have different parsing libraries, but it is suggested to use only the ones corresponding to each language.
  • Saving the CSV Files: Writing to CSV/TSV files is also quite easy. However, as the output cannot carry the field names, mapping is required for better results. One easy way to perform this is to write a function that converts the fields into positions in an array (see the sketch after this list).
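
A minimal Scala sketch of this load-then-split flow, assuming simple records with no quoted or embedded commas; a real application should use a proper CSV parsing library. The paths are hypothetical:

// Load a simple CSV file and split each line into an array of fields.
// This naive split breaks on quoted fields containing commas.
val csvInput = sc.textFile("file:///tmp/data.csv")
val records = csvInput.map(line => line.split(",").map(_.trim))

// Saving: join each record's fields back into a comma-separated line.
records.map(fields => fields.mkString(",")).saveAsTextFile("file:///tmp/csv-out")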

Sequence Files

A sequence file is a flat file that consists of binary key/value pairs and is widely used in Hadoop. The sync markers in these files allow Spark to seek to a particular point in a file and re-synchronize with the record boundaries.

  • Loading the Sequence Files: Spark comes with a specialized API that reads sequence files. All we have to do is call sequenceFile(path, keyClass, valueClass, minPartitions) on our SparkContext.
  • Saving the Sequence Files: To save sequence files, a pair RDD, along with the key and value types to write, is required. For several native types, implicit conversions between Scala types and Hadoop Writables are available. Hence, to write a native type, we simply save the pair RDD by calling the saveAsSequenceFile(path) function. If the conversion is not automatic, we have to map over the data and convert it before saving (both operations are sketched below).
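
A short sketch of both operations; the paths are hypothetical, and Text and IntWritable are Hadoop's Writable wrappers for strings and integers:

import org.apache.hadoop.io.{IntWritable, Text}

// Loading: sequenceFile(path, keyClass, valueClass) on the SparkContext,
// then converting the Writables back to native Scala types.
val seqData = sc.sequenceFile("/tmp/seq-in", classOf[Text], classOf[IntWritable])
  .map { case (key, value) => (key.toString, value.get) }

// Saving: implicit conversions let a pair RDD of native types be written directly.
val pairs = sc.parallelize(List(("panda", 3), ("kay", 6)))
pairs.saveAsSequenceFile("/tmp/seq-out")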

Object Files

Object files are a wrapper around sequence files that enables saving RDDs containing just values. Saving an object file is quite simple, as it just requires calling saveAsObjectFile() on an RDD, as sketched below.
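
A minimal round-trip sketch with hypothetical paths:

// Save an RDD of plain values; object files use Java serialization under the hood.
val numbers = sc.parallelize(1 to 100)
numbers.saveAsObjectFile("/tmp/obj-out")

// Read it back; the element type must be supplied explicitly.
val restored = sc.objectFile[Int]("/tmp/obj-out")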

Be familiar with these Top Spark Interview Questions and Answers and get a head start in your career!

Hadoop Input and Output Formats

The input split refers to the data present in HDFS. Spark provides APIs to implement the InputFormat of Hadoop in Scala, Python, and Java. The older APIs were hadoopRDD and hadoopFile, but the APIs have since been improved, and the new ones are known as newAPIHadoopRDD and newAPIHadoopFile.

For the Hadoop output format, Hadoop uses TextOutputFormat, in which each key and value pair is separated by a tab and saved in a part file. Spark wraps the Hadoop APIs of both generations, mapred and mapreduce.

  • File Compression: For most of the Hadoop outputs, a compression codec can be specified, which is easily accessible. It is used to compress the data (see the sketch below).
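
A short sketch combining the new-API input format with a compressed text output; the paths are hypothetical, and GzipCodec is one commonly available codec:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Loading with the new Hadoop API: newAPIHadoopFile[Key, Value, InputFormat].
val hadoopInput =
  sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("/tmp/in.txt")
val lines = hadoopInput.map { case (_, text) => text.toString }

// Saving with a compression codec specified.
lines.saveAsTextFile("/tmp/out-gz", classOf[GzipCodec])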

File Systems

A broad array of file systems are supported by Apache Spark. Some of them are discussed below:

  • Local/Regular FS: Spark can load files from the local file system, which requires the files to be available at the same path on all nodes.
  • Amazon S3: This file system is suitable for storing large amounts of data. It works faster when the compute nodes are inside Amazon EC2. However, at times, its performance goes down when accessed over the public network.
  • HDFS: It is a distributed file system that works well on commodity hardware. It provides high throughput.

If you want to know about the steps for the installation of Kafka, refer to this insightful Blog!

Structured Data with Spark SQL

Spark SQL works effectively on semi-structured and structured data. Structured data can be defined by a schema, i.e., a consistent set of fields.

Apache Hive

One of the common structured data sources on Hadoop is Apache Hive. Hive can store tables in a variety of formats, from plain text to column-oriented formats, inside HDFS, and it also supports other storage systems. Spark SQL can load any table supported by Hive.
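
An illustrative sketch using the HiveContext API from the same generation of Spark as this tutorial (newer versions use SparkSession with enableHiveSupport()); the table and column names are hypothetical:

import org.apache.spark.sql.hive.HiveContext

// Requires a Spark build with Hive support and a hive-site.xml on the classpath.
val hiveCtx = new HiveContext(sc)
val rows = hiveCtx.sql("SELECT name, age FROM users")
val firstRow = rows.first()
println(firstRow.getString(0)) // the name field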

Databases

Spark supports a broad range of databases with the help of Hadoop connectors or custom Spark connectors. Some of them are JDBC, Cassandra, HBase, and Elasticsearch. A JDBC example is sketched below.
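
A sketch of the JDBC route using Spark's built-in JdbcRDD; the connection URL, table, and columns are hypothetical:

import java.sql.{Connection, DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// Open a connection on each worker; the URL and credentials are hypothetical.
def createConnection(): Connection =
  DriverManager.getConnection("jdbc:mysql://localhost/test?user=spark")

// Map each ResultSet row to a Scala tuple.
def extractValues(r: ResultSet): (Int, String) = (r.getInt(1), r.getString(2))

// The query must contain two '?' placeholders for the partition bounds.
val data = new JdbcRDD(sc, createConnection _,
  "SELECT id, name FROM pandas WHERE ? <= id AND id <= ?",
  1, 100, 2, extractValues)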

Intellipaat provides the most comprehensive Cloudera Spark Course to fast-track your career!


Source: https://intellipaat.com/blog/tutorial/spark-tutorial/loading-and-saving-your-data/
